Having settled on a suitable definition of probability, we now focus on deriving some of probability’s important properties. Along with being useful computational tools, these properties help us better understand how probability functions behave. One remarkable consequence of Kolmogorov’s definition (Definition 3.4) is that all these properties can be derived from just three axioms. Since the naive definition of probability (Definition 1.4) is an example of a probability function (see Theorem 3.1), these properties apply to probabilities computed from the naive definition as well.
Throughout our derivations, we will consider a fixed probability function \(P\) and will often visualize sample spaces and events with diagrams like Figure 4.1 below.
In Figure 4.1, the size of the event \(A\) (the area of the region covered in blue) is meant to be representative of the probability \(P(A)\) that \(P\) assigns to \(A\). Since Kolmogorov’s second axiom (see Definition 3.4) requires that \(P(\Omega) = 1\), we’ll imagine that the area of the whole sample space is one. Therefore \(P(A)\) can be thought of as the proportion of the sample space’s area that the event \(A\) takes up.
To simplify our exposition, we will use diagrams similar to Figure 4.1 to justify simple set theoretic claims (e.g., an event \(A\) and its complement \(A^c\) are always disjoint). These claims can and should be verified with proofs, and we have actually already asked you to formally verify all the claims we use in Exercise 4.1.
Complement Rule
The complement rule tells us that the probabilities of an event \(A\) and its complement \(A^c\) sum to one.
Proposition 4.1 (The complement rule) For any event \(A\),
\[P(A^c) = 1 - P(A).\]
Intuitively the statement is true because, as Figure 4.2 displays, the sum of the area covered by \(A\) and \(A^c\) is equal to the area covered by the sample space \(\Omega\), which is one. We give a formal proof below.
Consider an event \(A\) and its complement \(A^c\), depicted below:
It is apparent that (1) \(A\) and \(A^c\) are disjoint and (2) the union of \(A\) and \(A^c\) is \(\Omega\). Since \(A\) and \(A^c\) are disjoint, Axiom 3 (see Definition 3.4) implies that \[
P(A \cup A^c) = P(A) + P(A^c).
\] Furthermore, since \(A \cup A^c = \Omega\) and Axiom 2 (see Definition 3.4) tells us that \(P(\Omega) = 1\), we have \[
P(A) + P(A^c) = P(A \cup A^c) = P(\Omega) = 1.
\] Subtracting \(P(A)\) from both sides, we get that \(P(A^c) = 1 - P(A)\).
A lot of the time, it is easier to compute the probability of an event’s complement than the probability of the event itself. In these situations, the complement rule is extremely handy. Although we didn’t refer to it explicitly, we actually secretly used it when we solved Cardano’s problem (Example 2.6) and the birthday problem (Example 2.7) in Chapter 2. Example 4.1 below provides another case where the complement rule greatly simplifies a problem.
Example 4.1 (At least one baby returned incorrectly) Consider again Example 3.3, where we randomly return \(n\) babies to their corresponding \(n\) sets of parents. As before, let \(E_i\) be the event that the \(i\)th couple gets their correct baby back. If each way of returning the babies is equally likely, what is the probability that at least one baby is returned incorrectly?
The complement of the event that at least one baby is returned incorrectly is the event that all babies are returned correctly, which clearly only has one outcome. Using the complement rule (Proposition 4.1) and recalling from Example 2.4 that the sample space has \(n!\) outcomes, we can easily compute that \[P(\text{at least one baby returned incorrectly}) = 1 - P(\text{all babies returned correctly}) = 1 - \frac{1}{n!}. \] Even for just \(n=4\), there’s already above a \(95\%\) chance that someone’s baby will be returned incorrectly!
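As a quick sanity check, the formula above is easy to evaluate numerically. Below is a short Python sketch (the function name is ours, not from the text) that computes \(1 - 1/n!\) for a few values of \(n\):

```python
from math import factorial

def p_at_least_one_wrong(n):
    # Complement rule: 1 - P(all babies returned correctly) = 1 - 1/n!
    return 1 - 1 / factorial(n)

for n in [2, 3, 4, 10]:
    print(n, p_at_least_one_wrong(n))
```

Running this confirms the claim in the example: already at \(n = 4\), the probability exceeds \(0.95\).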
We point out that the event of at least one baby being returned incorrectly can be written as \(\bigcup_{i=1}^n E_i^c\), the union of many simpler events. The fact that its complement is the event that all the babies are returned correctly, i.e., \[
\left( \bigcup_{i=1}^n E_i^c \right)^{c} = \bigcap_{i=1}^n E_i
\]
is an example application of De Morgan’s laws from Exercise 3.4. Whenever you’re trying to compute the probability of a union of many events, it’s a good idea to try applying the complement rule along with De Morgan’s laws.
Kolmogorov’s definition (Definition 3.4) never explicitly guarantees that probabilities are between zero and one. This, however, is an immediate consequence of the complement rule.
Proposition 4.2 (Probabilities are between zero and one) For any event \(A\),
\[0 \leq P(A) \leq 1.\]
Axiom 1 (see Definition 3.4) tells us that \(P(A) \geq 0\). The complement rule (Proposition 4.1) tells us that \(P(A) = 1 - P(A^c)\). Since Axiom 1 also tells us that \(P(A^c) \geq 0\), it must also be the case that \(P(A) \leq 1\).
Subset Rule
The subset rule tells us that if the event \(A\) happens whenever the event \(B\) happens, then the probability of \(A\) is at least as large as the probability of \(B\).
Proposition 4.3 (Subset rule) For any events \(A\) and \(B\) such that \(B \subseteq A\), \[P(B) \leq P(A).\]
The statement is intuitively true because, as Figure 4.3 displays, when \(B\) is a subset of \(A\) it cannot cover a larger portion of the sample space than \(A\). We give a formal proof below.
As we illustrate below, when \(B \subseteq A\), we can write \(A\) as the disjoint union of \(B\) and \(A \cap B^c\).
By Axiom 3 (see Definition 3.4) we have that \[P(A) = P(B) + P(A \cap B^c) \] Since Axiom 1 (see Definition 3.4) guarantees that \(P(A \cap B^c) \geq 0\), it must be the case that \(P(A) \geq P(B)\).
We give an example application of the subset rule below.
Example 4.2 (Software engineers with and without computer science degrees) Suppose we randomly select a person from the United States. Of the two events, which is more likely: (1) the person we select is a software engineer (SWE) or (2) the person we select has a degree in computer science (CS) and is a software engineer?
A common fallacy is to think the second event is more likely, because someone with a degree in computer science is more likely to become a software engineer. In reality, however, there are at least as many software engineers as there are software engineers with degrees in computer science: the latter group is a subset of the former. Likewise, the second event is a subset of the first one, and the subset rule (Proposition 4.3) tells us it cannot be more likely: \[
P(\text{SWE and CS degree}) \leq P(\text{SWE})
\]
Note that this statement is true for any probability function \(P\), meaning it remains true even if we’re more likely to select certain people from the United States population than others.
Now, it may certainly be the case that, given that we’ve selected someone with a computer science degree, the chance that they are a software engineer is larger than the chance of simply selecting a software engineer. This statement, however, is not the same as the one we made above. Rather, it is a statement involving conditional probabilities. We’ll continue to revisit this example as we cover conditional probabilities in the next few chapters.
Addition Rule
The addition rule tells us how to compute the probability of the union of two events, even when the events are not mutually exclusive (disjoint).
Proposition 4.4 (Addition rule) For any events \(A\) and \(B\),
\[P(A \cup B) = P(A) + P(B) - P(A \cap B).\]
Looking at the left hand side of Figure 4.4, \(P(A \cup B)\) can be thought of as the total area covered by the two circles. Intuitively we see that adding together \(P(A)\) and \(P(B)\) double-counts the area in the circles’ intersection. Subtracting off \(P(A\cap B)\) then corrects for this overcounting. We formalize this argument below.
To make use of Axiom 3, we need to express \(A \cup B\) as a union of sets that are necessarily disjoint. As illustrated below, \(A \cup B\) can be written as the disjoint union of \(A\) and \(A^c \cap B\), the part of \(B\) that is not in \(A\).
So by Axiom 3: \[P(A \cup B) = P(A \cup (A^c \cap B)) = P(A) + P(A^c \cap B).\]
From here, we just need to show that \(P(A^c \cap B) = P(B) - P(A \cap B)\). But again, we can write \(B\) as the disjoint union of \(A \cap B\), the part of \(B\) that is in \(A\), and \(A^c \cap B\), the part of \(B\) that is not in \(A\):
By Axiom 3, \(P(A \cap B) + P(A^c \cap B) = P(B)\). By moving \(P(A \cap B)\) to the other side, we obtain the desired result.
The addition rule describes a fundamental relationship between \(P(A)\), \(P(B)\), \(P(A \cap B)\), and \(P(A \cup B)\). If we know or can compute three of these quantities, we can immediately find the fourth. We give two examples of how to use the addition rule below.
Example 4.3 (At least one of first two babies returned correctly) Again consider the random babies example that we used earlier in Example 4.1. If all the ways of returning the babies are equally likely, what is the probability of the event \(E_1 \cup E_2\) that at least one of the first two babies is returned correctly?
According to the addition rule (Proposition 4.4)
\[P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2).\]
Thus, it suffices to compute the probabilities \(P(E_1)\), \(P(E_2)\), and \(P(E_1 \cap E_2)\). Because it will be relevant later in the chapter, we go ahead and show that the probability \(P(E_1 \cap \dots \cap E_k)\) of correctly returning the first \(k\) babies is \((n-k)!/n!\).
Recall from Example 2.4 there are \(n!\) different ways to return all the babies. For an outcome to be in \(E_{1} \cap \dots \cap E_{k}\), we know that the \(1\)st through \(k\)th babies must be returned to their biological parents. These outcomes can therefore be identified by how the remaining \(n-k\) babies are returned to the remaining \(n-k\) sets of parents. The same reasoning as in Example 2.4 tells us there are \((n-k)!\) ways of returning these remaining babies, each corresponding to a way of ordering them. There are therefore \((n-k)!\) outcomes in \(E_{1} \cap \dots \cap E_{k}\). Since all outcomes are equally likely,
\[P(E_{1} \cap \dots \cap E_{k}) = (n-k)!/n!.\]
By plugging in \(k=1\) and \(k=2\), we can use this intermediary result to compute \(P(E_1)\) and \(P(E_1 \cap E_2)\). And by symmetry, we know that \(P(E_1)\) and \(P(E_2)\) are equal.
Sometimes we will claim that probabilities are equal “by symmetry”. When we say that two probabilities are equal by symmetry, it means that the same argument we use to compute one of them can be directly applied to compute the other. In this case, we used the fact that \(P(E_1) = (n-1)!/n!\) to determine that \(P(E_2)\) must be \((n-1)!/n!\) also. Based on the problem set-up, there’s nothing special about the first baby. It’s “symmetric” with the second baby in the sense that we could do the identical analysis for the second baby and get the same result.
Noting that \((n-1)!/n! = 1/n\) and \((n-2)!/n! = 1/(n(n-1))\), we can now compute that
\[\begin{align}
P(E_1 \cup E_2) &= P(E_1) + P(E_2) - P(E_1 \cap E_2) & \text{(addition rule)}\\
&= \frac{1}{n} + \frac{1}{n} - \frac{1}{n(n-1)} & \text{(intermediary result)}\\
&= \frac{2n - 3}{n(n-1)} & \text{(simplify)}
\end{align}\]
As a sanity check, our answer gives \(P(E_1 \cup E_2) = 1/2\) when \(n=2\). Indeed there are two ways of returning the babies when \(n=2\), and only in one of them is at least one baby returned correctly (the babies are either both returned correctly or incorrectly).
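Because the sample space is small, we can also verify the formula by brute force. The Python sketch below (the function name is ours) enumerates all \(n!\) equally likely outcomes for \(n = 5\) and compares the answer against \((2n-3)/(n(n-1))\):

```python
from itertools import permutations

def p_first_two_correct(n):
    # Treat a permutation perm as "the i-th couple receives baby perm[i]";
    # baby i is returned correctly when perm[i] == i.
    outcomes = list(permutations(range(n)))
    favorable = sum(1 for perm in outcomes if perm[0] == 0 or perm[1] == 1)
    return favorable / len(outcomes)

n = 5
print(p_first_two_correct(n))        # brute-force probability of E1 ∪ E2
print((2 * n - 3) / (n * (n - 1)))   # formula from the addition rule
```

Both lines print the same number, as the addition rule predicts.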
Example 4.4 (Bi-linguality in India) According to the 2011 Census of India, \(57.1\%\) of the people in India can speak Hindi and \(10.6\%\) of them can speak English. If we think that the percentage of people who speak both is in the \(2\%\) to \(5\%\) range, what can we determine about the percentage of people who speak at least one of Hindi or English?
Imagine randomly selecting a person from the Indian population, with each person being equally likely to be selected. When each person is equally likely to be selected, the probability of the selected person having a certain characteristic is simply the proportion of people in the population with that characteristic (be sure to carefully convince yourself that this is the case!). Thus, we know the following probabilities regarding whether our randomly selected person will be able to speak certain (combinations of) languages: \[
P(\text{Hindi}) = 0.571, \ \ P(\text{English}) = 0.106, \ \ 0.02 \leq P(\text{Hindi and English}) \leq 0.05.
\] Since we have information about three of the four quantities in the addition rule (Proposition 4.4), we can use the addition rule to get a handle on the fourth:
\[\begin{align*}
&P(\text{Hindi or English}) & \\
&= P(\text{Hindi}) + P(\text{English}) - P(\text{Hindi and English}) & \text{(addition rule)}\\
&= 0.571 + 0.106 - P(\text{Hindi and English}) & \text{(substitute in known values)}\\
&= 0.677 - P(\text{Hindi and English}) & \text{(simplify)}
\end{align*}\]
Because \(0.02 \leq P(\text{Hindi and English}) \leq 0.05\), it must be the case that
\[
0.627 \leq P(\text{Hindi or English}) \leq 0.657.
\]
Essentially, we can determine that between \(62.7\%\) and \(65.7\%\) of the people in India can speak at least one of Hindi or English.
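The arithmetic above is simple enough to script. A minimal Python sketch, using the census figures from the example (variable names are ours):

```python
# Census figures from the example; the 2%-5% range on P(Hindi and English)
# is the assumption made in the problem statement.
p_hindi, p_english = 0.571, 0.106

for p_both in (0.05, 0.02):
    p_union = p_hindi + p_english - p_both  # addition rule
    print(p_both, round(p_union, 3))
```

Plugging in the two endpoints of the assumed range recovers the bounds \(0.627\) and \(0.657\).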
Union Bound and Inclusion-Exclusion
In essence, the addition rule relates the probability of a union of two events to the probability of their intersection. Since the probabilities of intersections of events are often easy to compute (we’ll see why in the next chapter), it can be a useful tool. Naturally, we may ask whether we can generalize it to the union of more than two events. To get some intuition, let’s look at the below collection of three events:
Conceptually we can think of the probability \(P(A \cup B \cup C)\) as the area of the region covered by the three circles. If we add together the areas of the circles \(A\), \(B\) and \(C\), this will double-count the area in the intersections \(A \cap B\), \(B \cap C\), and \(A \cap C\) and triple-count the area in the intersection \(A \cap B \cap C\). Therefore the sum of \(P(A)\), \(P(B)\), and \(P(C)\) should be an upper bound for \(P(A \cup B \cup C)\). This is the idea behind the union bound, which we state for the general case of \(n\) events below. We ask you to prove the union bound formally in Exercise 4.2.
Proposition 4.5 (Union bound) For a collection of events \(A_1, \dots, A_n\) \[P \left( \bigcup_{i=1}^n A_i \right) \leq \sum_{i=1}^n P(A_i).\]
In the following example, we show how the union bound helps us bound the probability of a bad event happening.
Example 4.5 (Quality control for medical devices) Suppose we’re manufacturing batches of \(n=50\) medical devices. We want to be sure that, at least \(99\%\) of the time, none of the devices in our batch fail. If each device we make has some failure probability \(p\), how low does \(p\) need to be to guarantee this?
We want to ensure that \(P(\text{no device fails}) \geq 0.99\). Intuitively we should be able to guarantee this by ensuring the probability that some device fails is small. Indeed, by the complement rule (Proposition 4.1)
\[ P(\text{no device fails}) = 1 - P(\text{some device fails}),\]
so \(P(\text{no device fails}) \geq 0.99\) if and only if \(P(\text{some device fails}) \leq 0.01\). Since the event of some device failing can be written as a union of the events \(F_i\) that the \(i\)th device fails, we can use the union bound to bound the probability of it happening:
\[
P(\text{some device fails}) = P \left(\bigcup_{i=1}^{50} F_i \right) \leq \sum_{i=1}^{50} P(F_i) = 50p
\]
Therefore, if we take \(p \leq 0.01/50 = 0.0002\), then we ensure that \(P(\text{some device fails}) \leq 0.01\).
We may ask: is it necessary to have \(p \leq 0.0002\)? Technically, it is. If the failure events \(F_i\) are mutually exclusive (i.e., no two devices in the same batch ever fail at the same time), then by Axiom 3
\[
P(\text{some device fails}) = P\left(\bigcup_{i=1}^{50} F_i \right) = \sum_{i=1}^{50} P(F_i) = 50p
\]
In this case, if \(p > 0.0002\), then \(P(\text{some device fails}) > 0.01\) and \(P(\text{no device fails}) < 0.99\).
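The arithmetic on both sides of this argument can be sketched in a few lines of Python (variable names are ours):

```python
n_devices, target = 50, 0.01

# Sufficiency via the union bound: P(some device fails) <= n_devices * p,
# so any p <= target / n_devices guarantees the 99% requirement.
p_max = target / n_devices
print(p_max)

# Necessity in the worst case (mutually exclusive failures):
# P(some device fails) = n_devices * p exactly, so a slightly larger p,
# e.g. p = 0.0003, would break the guarantee.
print(n_devices * 0.0003 > target)  # True
```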
You may rightfully feel that treating the failure events as mutually exclusive is unrealistic and yields an unfavorable worst case analysis. In Chapter 6 we’ll try to refine this analysis by making reasonable assumptions about how the failures of different devices do or do not relate to one another.
While the union bound was useful in the above example, it can often give inadequate results.
Example 4.6 (At least one baby returned correctly) Again consider the random babies example that we used in Example 4.1 and assume that each way of returning the babies is equally likely. Can we get a handle on the probability that at least one baby is returned correctly?
Because the event that at least one baby is returned correctly can be written as a union \(\bigcup_{i=1}^n E_i\) of simpler events, we should first try the strategy from Example 4.1 of applying the complement rule. The complement of the event that at least one baby is returned correctly is the event that no babies are returned correctly. Unfortunately there are many ways to return every baby incorrectly, and it is difficult to count all of them. In this case, the complement rule is not helpful.
Even though the complement rule didn’t work, we can at least try and bound the probability via the union bound (Proposition 4.5). The union bound tells us that
\[P \left(\bigcup_{i=1}^n E_i\right) \leq \sum_{i=1}^n P(E_i)\]
In Example 4.3 we effectively argued that \(P(E_i) = 1/n\) for any \(i\), so the union bound reduces to
\[P\left(\bigcup_{i=1}^n E_i\right) \leq \underbrace{\frac{1}{n} + \dots + \frac{1}{n}}_{n \text{ times}} = 1\]
Since we already know that all probabilities are less than or equal to one (see Proposition 4.2), this bound is totally useless!
To get a more satisfactory answer to the problem posed in Example 4.6 we’ll try and refine our union bound result. Let’s take a closer look at Figure 4.6, drawn again below:
As discussed before, adding together \(P(A)\), \(P(B)\), and \(P(C)\) double-counts the areas in the intersections \(A \cap B\), \(B \cap C\), and \(A \cap C\). We could try to adjust for this by subtracting off the probability of each pairwise intersection once:
\[P(A) + P(B) + P(C) - P(A \cap B) - P(B \cap C) - P(A \cap C) \]
Taking a closer look, we see that the area in the intersection \(A \cap B \cap C\) has now been added in three times by \(P(A)\), \(P(B)\), and \(P(C)\), but then also removed three times by \(-P(A \cap B)\), \(-P(B \cap C)\), and \(-P(A \cap C)\). Therefore, we have to add it back in one more time, giving a final answer of
\[P(A) + P(B) + P(C) - P(A \cap B) - P(B \cap C) - P(A \cap C) + P(A \cap B \cap C) \]
In our final answer, each sub-region of the region covered by the three circles has had its area accounted for exactly once.
The inclusion-exclusion principle extends this result to the union of \(n\) events. For the sake of brevity, we omit its proof, which is just a generalization of the argument we’ve sketched out for the case of three events.
Theorem 4.1 (Inclusion-exclusion principle) For a collection of events \(A_1, \dots, A_n\),
\[
\begin{align*} P \left( \bigcup_{i=1}^n A_i \right) &= \sum_{i=1}^n P(A_i) - \sum_{1 \leq i < j \leq n } P(A_i \cap A_j) \\
&\qquad + \sum_{1 \leq i < j < k \leq n} P(A_i \cap A_j \cap A_k) - \dots + (-1)^{n+1} P(A_1 \cap \dots \cap A_n) \end{align*}
\tag{4.1}\]
Theorem 4.1 is a notationally dense result, so let’s take a moment to parse it carefully, starting with the right hand side of Equation 4.1:
\[ \textcolor{orange}{\underbrace{\sum_{i=1}^n P(A_i)}_{(1)}} - \textcolor{magenta}{\underbrace{\sum_{1 \leq i < j \leq n } P(A_i \cap A_j)}_{(2)}} + \textcolor{purple}{\underbrace{\sum_{1 \leq i < j < k \leq n} P(A_i \cap A_j \cap A_k)}_{(3)}} - \dots + (-1)^{n+1} \textcolor{brown}{\underbrace{P(A_1 \cap \dots \cap A_n)}_{(n)}}\]
Since each of the terms \(\textcolor{orange}{(1)}, \textcolor{magenta}{(2)}, \textcolor{purple}{(3)}, \dots, \textcolor{brown}{(n)}\) is itself a sum, we can think of the whole expression as an alternating sum of sums. It’s alternating in the sense that we start with the sum \(\textcolor{orange}{(1)}\), then subtract the sum \(\textcolor{magenta}{(2)}\), then add the sum \(\textcolor{purple}{(3)}\), and so on. The \(\ell\)th term is essentially a sum over the different ways one can choose \(\ell\) events from the collection of \(n\) events in the union. As such, Corollary 2.3 tells us that the \(\ell\)th term is the sum of \({{n}\choose{\ell}}\) different probabilities (the last term \(\textcolor{brown}{(n)}\) is the sum of just \({{n}\choose{n}} = 1\) probability). As a concrete example, if \(n=4\), then the second term \(\textcolor{magenta}{(2)}\) can be written out as
\[
\begin{align*}
\textcolor{magenta}{\sum_{1 \leq i < j \leq n } P(A_i \cap A_j)} &= \textcolor{magenta}{P(A_1 \cap A_2) + P(A_1 \cap A_3)+ P(A_1 \cap A_4)} \\
&\qquad \textcolor{magenta}{ + P(A_2 \cap A_3)+ P(A_2 \cap A_4)+ P(A_3 \cap A_4)}.
\end{align*}
\]
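A short Python sketch makes this counting concrete: for \(n=4\), `itertools.combinations` generates exactly the \({4}\choose{2}\) \(= 6\) index pairs \((i, j)\) with \(1 \leq i < j \leq 4\) appearing in the second term:

```python
from itertools import combinations
from math import comb

n = 4
# All index pairs (i, j) with 1 <= i < j <= n, one per probability
# P(A_i ∩ A_j) in the second term of inclusion-exclusion.
pairs = list(combinations(range(1, n + 1), 2))
print(pairs)
print(len(pairs) == comb(n, 2))  # True
```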
We can use the inclusion-exclusion principle to exactly compute the probability that at least one baby is returned correctly in our random babies example.
Example 4.7 (At least one baby returned correctly) Let’s revisit Example 4.6 and try and exactly compute the probability of the event \(\bigcup_{i=1}^n E_i\) that at least one baby is returned correctly.
We’ll compute this probability directly using the inclusion-exclusion principle. Recall that in Example 4.3 we argued the probability \(P(E_{1} \cap \dots \cap E_{k})\) that the first \(k\) babies are returned correctly is \((n-k)!/n!\). By symmetry, the probability that some other specific collection of \(k\) babies is returned correctly should also be \((n-k)!/n!\). With this under our belt, we compute our quantity of interest:
\[\begin{align*}
&P \left( \bigcup_{i=1}^n E_i \right)\\
&= \sum_{i=1}^n P(E_i) - \sum_{1 \leq i < j \leq n } P(E_i \cap E_j) + \dots + (-1)^{n+1} P(E_1 \cap \dots \cap E_n) & \text{(inclusion-exclusion)} \\
&= \sum_{i=1}^n \frac{(n-1)!}{n!} - \sum_{1 \leq i < j \leq n } \frac{(n-2)!}{n!} + \dots + (-1)^{n+1} \frac{0!}{n!} & \text{(symmetry)} \\
&= {{n}\choose{1}} \frac{(n-1)!}{n!} - {{n}\choose{2}}\frac{(n-2)!}{n!} + \dots + (-1)^{n+1} \frac{1}{n!} & \text{(counting)} \\
&= \frac{n!}{1! (n-1)! } \frac{(n-1)!}{n!} - \frac{n!}{2!(n-2)! } \frac{(n-2)!}{n!} + \dots + (-1)^{n+1} \frac{1}{n!} & \text{(} n \text{ choose } k \text{ definition)}\\
&= 1- \frac{1}{2!} + \frac{1}{3!} - \dots + (-1)^{n+1} \frac{1}{n!} & \text{(cancellations)}
\end{align*}\]
To the trained eye, this final expression will look familiar. Recall the Taylor expansion of the function \(e^{x}\):
\[e^{x} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots \]
Plugging in \(x=-1\) gives that
\[\frac{1}{e} = \frac{1}{2!} - \frac{1}{3!} + \frac{1}{4!} - \dots \]
When \(n\) is large, it should therefore be the case that
\[ P \left( \bigcup_{i=1}^n E_i \right) \approx 1 - \frac{1}{e} \approx 0.632\]
with the approximation becoming more and more exact as \(n \rightarrow \infty\). It turns out that this approximation is very good, even for seemingly small values of \(n\): by the time \(n \geq 4\), the approximation is already within \(0.01\) of the exact value.
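A few lines of Python (the function name is ours) confirm both the exact alternating-sum formula and the quality of the \(1 - 1/e\) approximation:

```python
from math import e, factorial

def p_at_least_one_match(n):
    # Inclusion-exclusion result: 1 - 1/2! + 1/3! - ... + (-1)^(n+1)/n!
    return sum((-1) ** (k + 1) / factorial(k) for k in range(1, n + 1))

approx = 1 - 1 / e
for n in [2, 3, 4, 10]:
    exact = p_at_least_one_match(n)
    print(n, exact, abs(exact - approx))
```

For \(n = 2\) this recovers the \(1/2\) we found in Example 4.3, and for \(n = 4\) the gap to \(1 - 1/e\) is already under \(0.01\).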
Example 4.7 is a variation of the famous matching problem posed by French mathematician Pierre Rémond de Montmort in 1708. Montmort asked, if we shuffle a deck of \(n\) playing cards, what’s the probability that at least one card will end up in the same position that it started in? If we imagine that each baby represents a playing card, Example 4.7 exactly answers Montmort’s question.
Although it didn’t work for Montmort’s matching problem, we still emphasize that the complement rule (Proposition 4.1) will often allow us to easily compute the probability of unions, as was the case in Example 4.1. If you’re thinking about using the inclusion-exclusion principle, we always suggest trying the complement rule first and then turning to inclusion-exclusion as a last resort.
Exercises
Exercise 4.1 Coming soon!
Exercise 4.2 Coming soon!