This chapter covers the Law of Total Probability and Bayes’ Rule, two computational tools that use conditional probabilities. As the examples in this chapter will show, these tools are essential for computing probabilities, and they also elucidate probability in everyday life.
7.1 Law of Total Probability
In the 2021 NCAA women’s basketball tournament, Stanford defeated South Carolina to advance to the championship, where they would play the winner of the other semifinal between Arizona and UConn. If you were asked at this moment (after Stanford won its semifinal, but before Arizona and UConn’s semifinal) for the probability that Stanford wins the championship, how would you determine it?
It is most natural to think about the probability that Stanford will beat a particular opponent. Perhaps Stanford matches up well against UConn and has an \(85\%\) chance of winning if UConn advances to the championship game. But Arizona is a tougher opponent, and Stanford only has a \(60\%\) chance of winning if Arizona advances.
These are probabilities of Stanford winning conditional on either UConn or Arizona making the championship. The final answer, the unconditional probability of Stanford winning the tournament, should also account for the uncertainty in whether Stanford will play UConn or Arizona. If UConn is very likely to beat Arizona, then the probability should be close to \(85\%\). On the other hand, if Arizona is more likely to beat UConn, then the probability should be closer to \(60\%\).
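The Law of Total Probability (LoTP) allows us to compute an unconditional probability from the conditional probabilities. It says that we should take a weighted average of the conditional probabilities that Stanford wins against a specific opponent, where the weights are the probabilities that they play each opponent: \[\begin{align*}
P(\text{Stanford wins}) &= P(\text{UConn advances}) P(\text{Stanford wins} | \text{UConn advances}) \\
&\qquad + P(\text{Arizona advances}) P(\text{Stanford wins} | \text{Arizona advances}).
\end{align*}\]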
To be concrete, suppose that there is a \(90\%\) chance that UConn beats Arizona. Then, the Law of Total Probability says that the probability that Stanford wins the championship is
\[ P(\text{Stanford wins}) = 0.90 \cdot 0.85 + 0.10 \cdot 0.60 = 0.825,\] which is very close to the conditional probability that Stanford beats UConn.
On the other hand, if Arizona has a \(3/4\) chance of beating UConn, then the Law of Total Probability says that the probability that Stanford wins the championship is
\[ P(\text{Stanford wins}) = 0.25 \cdot 0.85 + 0.75 \cdot 0.60 = 0.6625,\] which is closer to \(60\%\).
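The two calculations above are the same weighted average with different weights. A minimal Python sketch makes this explicit; the helper name `total_probability` and the variable names are illustrative choices, not part of the text.

```python
def total_probability(weights, conditionals):
    """Weighted average of conditional probabilities (Law of Total Probability)."""
    # The weights are the probabilities of the partition events, so they must sum to 1.
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1 (the events must form a partition)")
    return sum(w * c for w, c in zip(weights, conditionals))


# Order of entries: [UConn advances, Arizona advances]
p_win_given_opponent = [0.85, 0.60]

p_if_uconn_favored = total_probability([0.90, 0.10], p_win_given_opponent)
p_if_arizona_favored = total_probability([0.25, 0.75], p_win_given_opponent)

print(round(p_if_uconn_favored, 4))    # 0.825
print(round(p_if_arizona_favored, 4))  # 0.6625
```

Changing only the weights moves the answer between the two conditional probabilities, \(0.60\) and \(0.85\).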
Prior to formally stating and proving the Law of Total Probability, we recall that a partition of the sample space \(\Omega\) is a disjoint collection of sets whose union makes up the whole sample space (see Figure 7.1 below).
Figure 7.1: Three events \(B_1\), \(B_2\), and \(B_3\) that partition the sample space \(\Omega\).
The Law of Total Probability tells us that, if \(B_1, B_2, \dots\) partition the sample space, then the probability \(P(A)\) of any event \(A\) is the weighted average of the conditional probabilities \(P(A|B_i)\) of \(A\) given \(B_i\), where the weights are the probabilities \(P(B_i)\) of each \(B_i\) happening.
Theorem 7.1 (Law of Total Probability) Consider a collection of positive probability events \(B_1, B_2, \dots\) that partition the sample space \(\Omega\). Then, for any event \(A\), \[
P(A) = \sum_i P(B_i) P(A | B_i). \tag{7.1}\]
The sets \(A \cap B_i\) are mutually exclusive, and their union is \(A\). Therefore, by Axiom 3 of Definition 3.4, \[
P(A) = \sum_i P(A \cap B_i).
\] But by the multiplication rule (Theorem 5.1), each term inside the sum can be expanded as \[
P(A \cap B_i) = P(B_i) P(A | B_i).
\] Substituting this into the sum above, we obtain Equation 7.1: \[
P(A) = \sum_i P(B_i) P(A | B_i).
\]
The Law of Total Probability is most useful when the conditional probability of \(A\) is known under every possible circumstance.
Example 7.1 (Probability the second card is a spade) Recall the two spades example from Example 5.5. It is clear that \(P(S_1)\), the probability that the first card is a spade, is \(13/52\). But what is the probability the second card is a spade?
This probability is easy to determine if we know whether or not the first card was a spade.
If the first card was a spade, then \(P(S_2 | S_1) = 12/51\).
If the first card was not a spade, then \(P(S_2 | S_1^c) = 13/51\).
But what is the unconditional probability the second card is a spade, \(P(S_2)\), if we have no knowledge about the first card?
The answer is \(P(S_2) = 13/52\). The answer must be this because every card is equally likely to be anywhere in the deck, so there is no reason the second card is any more or less likely to be a spade than the first card. This is called a symmetry argument.
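But in case you are not convinced by symmetry, here is a calculation using the Law of Total Probability. Note that \(S_1\) and \(S_1^c\) are a partition of the sample space. Therefore, by Equation 7.1: \[\begin{align*}
P(S_2) &= P(S_1) P(S_2 | S_1) + P(S_1^c) P(S_2 | S_1^c) \\
&= \frac{13}{52} \cdot \frac{12}{51} + \frac{39}{52} \cdot \frac{13}{51} \\
&= \frac{13}{52}.
\end{align*}\]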
One way to understand what the Law of Total Probability is doing is to sketch a probability tree. It calculates the total probability of all paths that lead to \(S_2\) (as opposed to \(S_2^c\)).
We will not draw probability trees for the remaining examples, but they can be useful when applying the Law of Total Probability to a partition with only a few events.
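The two-path calculation in Example 7.1 can be checked with exact rational arithmetic. This is only a sketch; the variable names are illustrative.

```python
from fractions import Fraction

# Probabilities from Example 7.1.
p_s1 = Fraction(13, 52)               # first card is a spade
p_s2_given_s1 = Fraction(12, 51)      # second is a spade, given the first was
p_s2_given_not_s1 = Fraction(13, 51)  # second is a spade, given the first was not

# Law of Total Probability over the partition {S_1, S_1^c}:
# add up the two paths of the probability tree that end in S_2.
p_s2 = p_s1 * p_s2_given_s1 + (1 - p_s1) * p_s2_given_not_s1

print(p_s2)  # 1/4
```

The result \(1/4\) equals \(13/52\), in agreement with the symmetry argument.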
In our next example, we use the Law of Total Probability to calculate the probability of winning a pass-line bet in craps. It illustrates the most common way to use the Law of Total Probability: conditioning on what you wish you knew.
Example 7.2 (Winning a pass-line bet) What’s the probability of winning a pass-line bet in craps (see relevant part of Chapter 1 for a review of craps)?
If we knew the come-out roll, then we’d already be done. We previously computed the probability of winning a pass-line bet conditional on each different come-out roll in Example 6.7. The table below recalls these values, and also gives the unconditional probability of each come-out roll (you can compute these by looking at Figure 1.7).
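\[
\begin{array}{l|ccccccccccc}
i & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \\ \hline
P(\text{come-out roll is } i) & \frac{1}{36} & \frac{2}{36} & \frac{3}{36} & \frac{4}{36} & \frac{5}{36} & \frac{6}{36} & \frac{5}{36} & \frac{4}{36} & \frac{3}{36} & \frac{2}{36} & \frac{1}{36} \\
P(\text{win} | \text{come-out roll is } i) & 0 & 0 & \frac{1}{3} & \frac{2}{5} & \frac{5}{11} & 1 & \frac{5}{11} & \frac{2}{5} & \frac{1}{3} & 1 & 0
\end{array}
\]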
Because the different come-out rolls partition the sample space, we can apply the Law of Total Probability: \[ P(\text{win}) = \sum_{i=2}^{12} P(\text{come-out roll is } i) P(\text{win} | \text{come-out roll is } i). \]
Somewhat magically, the Law of Total Probability allows us to condition on the come-out roll (what we wish we knew) despite us not yet knowing what it is!
This sum is most easily evaluated by a computer.
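One way to evaluate the sum is sketched below in Python. The conditional probabilities are the standard ones for craps (win on 7 or 11, lose on 2, 3, or 12, otherwise win if the point is rolled again before a 7); we assume they match the values from Example 6.7. The function and variable names are illustrative.

```python
from fractions import Fraction

# P(come-out roll is i): the number of ways two dice total i, out of 36.
p_roll = {i: Fraction(6 - abs(i - 7), 36) for i in range(2, 13)}


def p_win_given(i):
    """P(win | come-out roll is i), assuming the standard pass-line rules."""
    if i in (7, 11):
        return Fraction(1)
    if i in (2, 3, 12):
        return Fraction(0)
    # Otherwise i is the point: we win if i is rolled before a 7.
    return p_roll[i] / (p_roll[i] + p_roll[7])


# Law of Total Probability over the 11 possible come-out rolls.
p_win = sum(p_roll[i] * p_win_given(i) for i in range(2, 13))

print(p_win)         # 244/495
print(float(p_win))  # about 0.4929
```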
The probability is about \(49.29\%\). Therefore, the casino has a small but definite house edge!
In the next example we compute the probability that a COVID-19 antigen test gives the correct reading. Again, the Law of Total Probability allows us to condition on what we wish we knew.
The Law of Total Probability also comes in handy when we analyze a random process that repeats itself or starts over. By conditioning on the different things that can happen until it restarts, we can obtain a recursive formula that allows us to calculate a probability of interest.
Example 7.3 (Branching Process) Amoebas are single-celled organisms that reproduce asexually by dividing into two cells. Suppose the world starts with just one amoeba. At the end of one minute the amoeba either dies, stays alive, or splits into two. This process then continues with the remaining amoebas. If the three possibilities are always equally likely and each amoeba behaves independently of the other amoebas and its previous self, what’s the probability that the amoeba population will totally die out?
Let’s call \(p\) the probability of the event \(E\) that the entire amoeba population totally dies out. After one “turn” of this process, three events could have happened:
D(ie): The first amoeba dies and the population has died out.
L(ive): The first amoeba survives and the process starts over.
S(plit): The first amoeba splits into two children, and now two new versions of the process start from the beginning.
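Because these three events partition the sample space, we can apply the Law of Total Probability in hopes that we can get an expression for \(p\) in terms of itself: \[\begin{align*}
p = P(E) &= P(D)P(E|D) + P(L)P(E|L) + P(S)P(E|S) \\
&= \frac{1}{3}P(E|D) + \frac{1}{3}P(E|L) + \frac{1}{3}P(E|S).
\end{align*}\]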
Let’s examine each one of these conditional probabilities separately:
\(P(E|D)\): Given that the first amoeba dies on the first turn, the population has surely died out. This conditional probability is \(1\).
\(P(E|L)\): When the first amoeba lives, the process starts over. Because the first amoeba behaves independently of its previous self, there’s truly no difference between this process and the original process we started with, so the conditional probability that the whole population goes extinct should still be \(p\).
\(P(E|S)\): When the parent amoeba splits into two children, we now have two versions of the process that essentially start from the beginning. Let \(E_1\) be the event that the first child’s lineage dies out, and \(E_2\) be the same for the second child. Conditional on \(S\), the whole population goes extinct if and only if both lineages die out. Also conditional on \(S\), the two children behave independently of each other (as will their children, and so on), so the lineages dying out are conditionally independent events given \(S\). Therefore, \[\begin{align*}
P(E|S) &= P(E_1 \cap E_2 | S) & \text{(same event conditional on } S \text{)}\\
&= P(E_1 | S)P(E_2 | S) & \text{(conditional independence)}\\
&= p^2. & \text{(same reasoning as } P(E|L) \text{)}
\end{align*}\]
Plugging these values back into our earlier Law of Total Probability computation, we find that \[p = \frac{1}{3} + \frac{1}{3}p + \frac{1}{3}p^2,\] so the solution \(p\) must satisfy the polynomial equation \[\frac{1}{3}p^2 - \frac{2}{3}p + \frac{1}{3} = 0.\] The left-hand side factors as \(\frac{1}{3}(p - 1)^2\), so the only solution is \(p = 1\). The amoeba population will die out with probability one!
We ask you to explore what happens when we use different probabilities of dying, staying alive, or splitting into two in Exercise 7.1.
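The recursion can also be explored numerically. Let \(p_n\) be the probability that the population is extinct within \(n\) minutes. The same conditioning argument gives \(p_0 = 0\) and \(p_{n+1} = \frac{1}{3} + \frac{1}{3}p_n + \frac{1}{3}p_n^2\), and \(p_n\) increases toward \(p\). The sketch below iterates this recursion; the function name `extinction_by` is illustrative, and this is a numerical check rather than a proof.

```python
def extinction_by(n_minutes):
    """P(population is extinct within n_minutes), for equally likely die/live/split."""
    p = 0.0  # the single starting amoeba is alive at time 0
    for _ in range(n_minutes):
        # Condition on the first minute: die (1), live (p), or split (p * p).
        p = (1 + p + p * p) / 3
    return p


for n in (1, 10, 100, 10_000):
    print(n, extinction_by(n))
```

The printed values creep up toward \(1\), but slowly: extinction is certain in the limit, yet it can take a long time.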
Note on formally computing \(P(E|L)\)
Our computation of \(P(E|L)\) (and therefore also \(P(E|S)\)) was a little more informal than usual. We didn’t apply the definition of conditional probability (Definition 5.1) like we typically do, and instead said that \(P(E|L)\) should equal \(P(E)\) based on intuitive arguments. In such recursive problems, justifying things more formally is tricky and beyond the scope of this book. Arguments like the one we presented are sufficient for our sake, and they’ll get you the right answer!
The Law of Total Probability also offers a simple explanation of Simpson’s paradox, a paradox where a trend that appears in several sub-populations disappears or even reverses when you look at the whole population. The paradox was first described in detail by British codebreaker and statistician Edward H. Simpson in 1951.
Example 7.4 (Race and the death penalty in Florida) An important example of Simpson’s paradox can be found in the 1991 article “Choosing Those Who Will Die: Race and the Death Penalty in Florida”, which studies how race affects the chances of receiving the death penalty for homicide convictions in Florida.
At first glance, data from the study shows that many more White defendants (\(53\)) were given the death penalty than Black defendants (\(15\)). These sorts of aggregate statistics, however, are often used to mislead people. They don’t account for the fact that, because Florida’s White population is much larger than its Black population, a much larger number of defendants in the study are White. Hence White defendants can make up most of the death penalty recipients even if they receive the death penalty at a much lower rate than Black defendants.
According to the study, however, White defendants also received the death penalty at a higher rate than Black defendants. Table 7.1 shows the results.
Table 7.1: Florida’s observed death penalty rate for different racial groups in 1991. The racial group with higher death penalty rate is highlighted in green, and higher number of defendants in yellow.
[Table 7.1: for each racial group of defendants, the columns are # Defendants and % Death Penalty.]
So are White defendants being treated unfairly? Not quite. Although our most recent analysis is more refined, it still doesn’t paint a full picture. To get to the bottom of things, we’ll have to take an even more detailed look at the data. In Table 7.2, we organize cases by whether the homicide victim was Black or White.
Table 7.2: Florida’s observed death penalty rate for different racial groups in 1991, now further categorized by race of the homicide victim. For both races of homicide victims, the racial group with higher death penalty rate is highlighted in green, and higher number of defendants in yellow.
[Table 7.2: rows are Black Victim and White Victim; for each of Black Defendant and White Defendant, the columns are # Defendants and % Death Penalty.]
Table 7.2 tells a different story. When we specify the victim’s race as Black or White, Black defendants have a higher death penalty rate in both cases! How can this be possible if the overall death penalty rate for Black defendants is lower than for White defendants?
The driving force here is that cases with Black victims are much less likely to result in the death penalty. Because cases with Black defendants tend to have Black victims and cases with White defendants tend to have White victims, the overall rate at which Black defendants are given the death penalty ends up being lower.
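The Law of Total Probability (Theorem 7.1) illustrates what’s going on very clearly. If we randomly select a Black defendant from our pool, then \[\begin{align*}
P(\text{death penalty}) &= P(\text{Black victim}) P(\text{death penalty} | \text{Black victim}) \\
&\qquad + P(\text{White victim}) P(\text{death penalty} | \text{White victim}).
\end{align*}\] For Black defendants, most of the weight falls on \(P(\text{Black victim})\), which pulls the overall rate toward the much lower death penalty rate for cases with Black victims.
We see that even though a random Black defendant is more likely to get the death penalty whether the victim is Black or White, they’re less likely to get the death penalty overall.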
We point out in closing that, if anything, the group(s) being treated unfairly are Black defendants and Black victims, which is very counterintuitive given our initial analyses.
We provide another example of Simpson’s paradox as optional reading.
Example 7.5 (UC Berkeley’s 1973 graduate admissions) One of the most famous examples of Simpson’s paradox is from the 1975 article “Sex Bias in Graduate Admissions: Data from Berkeley”, which examined whether or not there was a bias against women in UC Berkeley’s 1973 graduate admissions. A preliminary look at the data (Table 7.3) suggested that men were significantly more likely to be accepted to UC Berkeley than women:
Table 7.3: Overall admissions data from Berkeley’s graduate programs in 1973. Gender with higher admissions percentage highlighted in green, and higher number of applicants in yellow.
[Table 7.3: for each gender, the columns are # Applicants and % Admitted.]
A more detailed look across all \(85\) departments painted a different picture. Researchers found that only \(4\) of the departments were significantly biased against women, while \(6\) were significantly biased against men. How can this be?
Let’s take a closer look at admissions data (Table 7.4) from the six largest departments:
Table 7.4: Admissions data from Berkeley’s top six most popular graduate programs (anonymized) in 1973. Gender with higher admissions percentage highlighted in green, and higher number of applicants in yellow. The two most applied to programs for each gender are bolded.
[Table 7.4: for each of the six departments and each gender, the columns are # Applicants and % Admitted.]
Looking at the six largest departments, we see that women actually have a higher acceptance rate in four of the six departments. Yet, their overall acceptance rate across the departments is still significantly lower than the men’s.
It turns out this is because women applied more frequently than men did to departments with lower overall admissions rates. To better understand what’s going on, let’s consider the following hypothetical admissions data, where we imagine Berkeley only has two departments, one that’s easy to get into and one that’s difficult:
Table 7.5: Same set-up as Table 7.4 but only the most applied to department for each gender is bolded.
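[Table 7.5: for each gender, # Applicants and % Admitted in hypothetical departments A' (easy) and B' (hard). The admission rates are \(80\%\) (A') and \(5\%\) (B') for men, and \(90\%\) (A') and \(15\%\) (B') for women.]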
In each department, women have a much higher acceptance rate than men. But we’ll see that because a much higher proportion of men applied to the department that’s easier to get into, a larger proportion of men will be accepted overall. Indeed, the Law of Total Probability (Theorem 7.1) tells us that if we randomly select a male applicant (each man is equally likely to be selected) then, \[\begin{align*}
P(\text{accepted}) &= P(\text{applied to A'})P(\text{accepted}| \text{applied to A'}) &\\
&\qquad + P(\text{applied to B'})P(\text{accepted}| \text{applied to B'}) & \text{(LoTP)}\\
&= 0.9 \cdot 0.8 + 0.1 \cdot 0.05 & \text{(equally likely outcomes)}\\
&= 0.725 & \text{(simplify)}
\end{align*}\]
Whereas if we randomly select a female applicant (each woman is equally likely to be selected) then, \[\begin{align*}
P(\text{accepted}) &= P(\text{applied to A'})P(\text{accepted}| \text{applied to A'}) &\\
&\qquad + P(\text{applied to B'})P(\text{accepted}| \text{applied to B'}) & \text{(LoTP)}\\
&= 0.1 \cdot 0.9 + 0.9 \cdot 0.15 & \text{(equally likely outcomes)}\\
&= 0.225 & \text{(simplify)}
\end{align*}\]
Looking at the total probability computation, it’s clear why the women have an overall lower probability of acceptance despite having higher acceptance rates in each department. Because a larger proportion of women applied to the department that’s harder to get into, when we compute the probability of a woman getting accepted we always multiply a big probability with a small one. In contrast, when we compute the probability of a man being accepted we multiply two big probabilities together.
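The reversal in the hypothetical two-department data can be reproduced in a few lines. The sketch below uses the proportions and admission rates from the two computations above; the dictionary layout and the name `overall_rate` are illustrative.

```python
# Hypothetical data: department -> (share of that gender's applicants, admission rate)
men = {"A'": (0.9, 0.80), "B'": (0.1, 0.05)}
women = {"A'": (0.1, 0.90), "B'": (0.9, 0.15)}


def overall_rate(group):
    """Law of Total Probability: average the admission rates, weighted by application shares."""
    return sum(share * rate for share, rate in group.values())


# Women have the higher admission rate within every department...
women_higher_in_every_dept = all(women[d][1] > men[d][1] for d in men)
print(women_higher_in_every_dept)  # True

# ...yet the lower admission rate overall.
print(round(overall_rate(men), 3), round(overall_rate(women), 3))  # 0.725 0.225
```

The conditional rates favor women in both departments; only the weights differ, and the weights are what reverse the comparison.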
In our last example, we see how applying the Law of Total Probability to conditional probability functions also helps us compute conditional probabilities. This example will also hopefully clarify any lingering confusion you may have from the boy-girl paradox (Example 5.3) we presented in Chapter 5.
Example 7.6 (Boy-girl paradox and law of total probability) Recall the scenario from Example 5.3 where we met a family at a dinner party who has two children, and we wanted to know the probability that they had two girls. We considered three scenarios:
Scenario One: We learn that at least one of the children is a girl (e.g., “We have two children. One of them is on the women’s swim team.”)
Scenario Two: We learn that the eldest child is a girl (e.g., “We have two children. Our eldest is on the women’s swim team.”)
Scenario Three: We meet a random one of the two children and learn that they are a girl (e.g., “One of our children happens to be walking over right now. She’s on the women’s swim team.”)
In each of these scenarios we computed the conditional probability that the couple had two girls to be
Scenario One: 1/3
Scenario Two: 1/2
Scenario Three: 1/2
and felt it was a bit surprising that the third scenario was more similar to the second than the first. We’ll use the law of total probability to further investigate why this is the case.
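In Scenario Three, we want to compute \(P(2 \text{ girls} | \text{meet girl} )\). Thinking about meeting a random child is tricky, and it’d perhaps be easier to further condition on whether we meet the eldest or youngest child. Because these two events partition the sample space, we can apply the Law of Total Probability to \(P(\cdot| \text{meet girl})\): \[\begin{align*}
P(2 \text{ girls} | \text{meet girl}) &= P(\text{meet eldest} | \text{meet girl}) P(2 \text{ girls} | \text{meet girl}, \text{meet eldest}) \\
&\qquad + P(\text{meet youngest} | \text{meet girl}) P(2 \text{ girls} | \text{meet girl}, \text{meet youngest}).
\end{align*}\]
The probability \(P(2 \text{ girls}| \text{meet girl}, \text{meet eldest})\) is essentially exactly what we wanted to compute in Scenario Two. We meet a random child who is a girl, but we also know we meet the eldest child, so we’ve learned exactly that the eldest is a girl. If you believe the answer in Scenario Two, it should not surprise you that \[ P(2 \text{ girls}| \text{meet girl}, \text{meet eldest}) = \frac{1}{2}. \] By the same reasoning with the roles of the two children swapped, \(P(2 \text{ girls}| \text{meet girl}, \text{meet youngest}) = \frac{1}{2}\) as well, so any weighted average of the two equals \(\frac{1}{2}\).
Now it should be clearer why Scenario Three is like Scenario Two. The probability of there being two girls when you learn that a random child is a girl is a mixture of the probability of there being two girls when you learn the eldest is a girl and the probability of there being two girls when you learn the youngest is a girl. Because both of these latter probabilities are \(1/2\), the former probability must be \(1/2\) as well.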
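The three scenarios can be checked by brute-force enumeration. The sketch below assumes, as we take Example 5.3 to assume, that each child is equally likely to be a girl or a boy, independently, and that the child we meet is equally likely to be the eldest or the youngest regardless of sex. The helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Outcomes: (sex of eldest, sex of youngest, which child we meet); all 8 equally likely.
outcomes = list(product("GB", "GB", ("eldest", "youngest")))


def prob(event, given=lambda o: True):
    """P(event | given), computed by counting equally likely outcomes."""
    conditioning = [o for o in outcomes if given(o)]
    return Fraction(sum(1 for o in conditioning if event(o)), len(conditioning))


def two_girls(o):
    return o[0] == "G" and o[1] == "G"


def meet_girl(o):
    return (o[0] if o[2] == "eldest" else o[1]) == "G"


scenario_one = prob(two_girls, lambda o: "G" in o[:2])    # at least one girl
scenario_two = prob(two_girls, lambda o: o[0] == "G")     # eldest is a girl
scenario_three = prob(two_girls, meet_girl)               # the child we meet is a girl

# Law of Total Probability applied to P(. | meet girl), split by which child we meet.
lotp = sum(
    prob(lambda o, w=w: o[2] == w, meet_girl)
    * prob(two_girls, lambda o, w=w: meet_girl(o) and o[2] == w)
    for w in ("eldest", "youngest")
)

print(scenario_one, scenario_two, scenario_three, lotp)  # 1/3 1/2 1/2 1/2
```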
7.2 Bayes’ Rule
In many applications, we know \(P(A | B)\) but want to know \(P(B | A)\). Bayes’ Rule is a tool for inverting conditional probabilities.
Theorem 7.2 (Bayes’ Rule) Let \(A\) and \(B\) be events with positive probabilities. Then: \[ P(B | A) = \frac{P(B) P(A | B)}{P(A)}. \tag{7.2}\]
In many cases, we do not know \(P(A)\) and need to calculate it using the Law of Total Probability (Theorem 7.1), in which case Equation 7.2 becomes:
\[ P(B | A) = \frac{P(B) P(A | B)}{P(B) P(A | B) + P(B^c) P(A | B^c)}. \tag{7.3}\]
Bayes’ Rule follows immediately from the two different ways of expanding \(P(A \cap B)\) using the multiplication rule (Corollary 5.1):
\[ P(B) P(A | B) = P(A \cap B) = P(A) P(B | A). \]
Now, divide both sides by \(P(A)\) to obtain Equation 7.2.
Example 7.7 (Two spades and Bayes’ rule) Revisiting the two spades example (Example 7.1), it is clear that \(P(S_2 | S_1) = \frac{12}{51}\), since, after the top spade is dealt, \(12\) of the remaining \(51\) cards in the deck are spades.
But what about \(P(S_1 | S_2)\)? What if we observed that the second card is a spade without observing the first card? Since every card is equally likely to be in any position, this probability must also be \(\frac{12}{51}\) by symmetry. We can use Bayes’ Rule (Theorem 7.2) to confirm: \[ P(S_1 | S_2) = \frac{P(S_1) P(S_2 | S_1)}{P(S_2)} = \frac{\frac{13}{52} \cdot \frac{12}{51}}{\frac{13}{52}} = \frac{12}{51}. \]
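As a quick check, Equation 7.2 can be evaluated with exact fractions. The function name `bayes` is illustrative, and \(P(S_2) = 13/52\) is taken from Example 7.1.

```python
from fractions import Fraction


def bayes(p_b, p_a_given_b, p_a):
    """Bayes' Rule (Equation 7.2): P(B | A) = P(B) P(A | B) / P(A)."""
    return p_b * p_a_given_b / p_a


# B = S_1 (first card is a spade), A = S_2 (second card is a spade).
p_s1_given_s2 = bayes(Fraction(13, 52), Fraction(12, 51), Fraction(13, 52))

print(p_s1_given_s2)  # 4/17, which is 12/51 in lowest terms
```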
In our next example, we use Bayes’ rule to calculate the probability that someone who tested positive actually has the disease.
Example 7.8 (Positive COVID-19 test) In Example 6.8, we considered a randomly selected person from New York City (NYC) taking a COVID-19 antigen test in March 2020. If this test comes back positive, what’s the probability that they have COVID-19?
Let \(T\) be the event that the test comes back positive and \(I\) be the event that the person is actually infected with COVID-19. To find \(P(I | T)\), the quantity of interest, we need information about the quality of the test and the base rate of COVID-19 in NYC:
COVID-19 base rate: In March 2020, the base COVID-19 rate in NYC was \(P(I) = .0005\). By the complement rule, \(P(I^c) = .9995\).
Test’s false positive rate: If a person does not have COVID-19, then the test will come back positive \(1\%\) of the time. So \(P(T|I^c) = .01\).
Test’s false negative rate: If a person does have COVID-19, then the test will come back negative \(20\%\) of the time. So, \(P(T^c | I) = .20\). By the complement rule, \(P(T | I) = 1 - P(T^c | I) = .80\).
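Since we know \(P(T| I)\), the natural way to calculate \(P(I | T)\) is to use Bayes’ rule. Since we do not know the denominator, \(P(T)\), we need the expanded version of Bayes’ rule (Equation 7.3): \[\begin{align*}
P(I | T) &= \frac{P(I) P(T | I)}{P(I) P(T | I) + P(I^c) P(T | I^c)} \\
&= \frac{.0005 \cdot .80}{.0005 \cdot .80 + .9995 \cdot .01} \\
&\approx .03848.
\end{align*}\]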
There is less than a \(4\%\) chance that the person actually has COVID-19, even though they just tested positive!
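The calculation is short enough to script. The sketch below implements the expanded form of Bayes’ rule for a positive test; the function and parameter names are illustrative.

```python
def posterior(prior, p_pos_given_infected, p_pos_given_healthy):
    """P(I | T) from the expanded form of Bayes' rule."""
    numerator = prior * p_pos_given_infected
    # The denominator is P(T), computed by the Law of Total Probability.
    return numerator / (numerator + (1 - prior) * p_pos_given_healthy)


# Base rate .0005, P(T | I) = .80, P(T | I^c) = .01, as in Example 7.8.
p_infected_given_positive = posterior(0.0005, 0.80, 0.01)

print(round(p_infected_given_positive, 5))  # 0.03848
```

Raising the `prior` argument shows how strongly the answer depends on the base rate.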
Why is this probability so low?!
How could a test with relatively low error rates of \(1\%\) and \(20\%\) be so inaccurate that a person who tests positive only has a \(4\%\) chance of being infected?
It is because the rate of COVID in the population is so low (\(P(I) = 0.05\%\)), that \(1\%\) of the \(99.95\%\) of people who don’t have COVID is much greater than \(80\%\) of the \(0.05\%\) of people who do have COVID.
The idea is illustrated in the figure below (not drawn to scale).
The sliver on the left represents the \(P(I) = 0.05\%\) of the population who are infected.
The rectangle on the right represents the \(P(I^c) = 99.95\%\) of the population who are not infected.
The shaded area represents the people who test positive. The shaded area covers most of the sliver on the left and only 1% of the rectangle on the right, but the rectangle is so much larger than the sliver that most of the people who test positive come from the rectangle, which represents the people who were not infected.
In other words, \(P(I | T)\) is the fraction of the shaded area that comes from the sliver on the left. This is only a small fraction of the total shaded area because the base rate of COVID-19 is so small.
Here is another way to think about the problem:
Before the person took the test, their probability of having COVID-19 was \(P(I)\).
After the person tested positive, their probability of having COVID-19 was updated to \(P(I | T)\).
Certainly, a positive test increases this probability, but if \(P(I)\) was very small to begin with, then \(P(I | T)\) may still be small. We cannot ignore the base rate \(P(I)\) when reasoning about \(P(I | T)\); to do so is to commit the base rate fallacy.
When it comes to medical diagnoses, it is a good idea to always seek a second opinion. Let’s suppose the person from Example 7.8 takes a second COVID-19 antigen test.
Example 7.9 (Two positive COVID-19 tests) If this second test also comes back positive, what is the probability that they are infected with COVID-19 now?
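Let \(T_1\) be the event that the first test is positive and \(T_2\) be the event that the second test is positive. As in Example 6.8, we will assume that \(T_1\) and \(T_2\) are conditionally independent given \(I\) and also given \(I^c\).
First, we calculate \(P(I | T_1, T_2)\) directly using Bayes’ rule. \[\begin{align*}
P(I | T_1, T_2) &= \frac{P(I) P(T_1, T_2 | I)}{P(I) P(T_1, T_2 | I) + P(I^c) P(T_1, T_2 | I^c)} & \text{(Bayes' rule)}\\
&= \frac{P(I) P(T_1 | I) P(T_2 | I)}{P(I) P(T_1 | I) P(T_2 | I) + P(I^c) P(T_1 | I^c) P(T_2 | I^c)} & \text{(conditional independence)}\\
&= \frac{.0005 \cdot .80^2}{.0005 \cdot .80^2 + .9995 \cdot .01^2} \\
&\approx .7620.
\end{align*}\]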
A second test helps a lot here; now, the person knows that they are quite likely to have COVID.
There is another way to approach this problem. In Example 7.8, we updated the probability of infection from \(P(I) = .0005\) to \(P(I | T_1) \approx .03848\) after the first positive test. We can use \(.03848\) as the new “prior” probability in Bayes’ rule to calculate \(P(I | T_1, T_2)\).
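Indeed, we can apply Bayes’ rule to the conditional probability function \(P(\cdot | T_1)\) to obtain the same answer: \[\begin{align*}
P(I | T_1, T_2) &= \frac{P(I | T_1) P(T_2 | I, T_1)}{P(I | T_1) P(T_2 | I, T_1) + P(I^c | T_1) P(T_2 | I^c, T_1)} & \text{(Bayes' rule)}\\
&= \frac{P(I | T_1) P(T_2 | I)}{P(I | T_1) P(T_2 | I) + P(I^c | T_1) P(T_2 | I^c)} & \text{(conditional independence)}\\
&\approx \frac{.03848 \cdot .80}{.03848 \cdot .80 + .96152 \cdot .01} \\
&\approx .7620.
\end{align*}\]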
We get the same answer, whether we apply Bayes’ rule to all the information at once (\(T_1, T_2\)) or one piece at a time (\(T_1\), then \(T_1, T_2\)). This demonstrates that Bayes’ rule is a logically coherent way of summarizing evidence.
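The agreement between the two approaches can be confirmed numerically. The sketch below assumes, as in Example 7.9, that both tests have the same error rates and are conditionally independent given infection status; the function and variable names are illustrative.

```python
def posterior(prior, like_infected, like_healthy):
    """Bayes' rule: probability of infection after evidence with the given likelihoods."""
    numerator = prior * like_infected
    return numerator / (numerator + (1 - prior) * like_healthy)


prior = 0.0005
p_pos_infected, p_pos_healthy = 0.80, 0.01

# All at once: by conditional independence, the likelihoods of two positives multiply.
batch = posterior(prior, p_pos_infected**2, p_pos_healthy**2)

# One test at a time: the posterior after the first test becomes the next prior.
after_first = posterior(prior, p_pos_infected, p_pos_healthy)
after_second = posterior(after_first, p_pos_infected, p_pos_healthy)

print(round(batch, 4), round(after_second, 4))  # 0.762 0.762
```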
As the person takes more tests, the probability of infection can continue to be updated in the same way, by applying Bayes’ rule after each test. The code below calculates this probability as a function of:
the prior probability \(P(I)\) of the person being infected,