7  Law of Total Probability and Bayes’ Rule

This chapter covers the Law of Total Probability and Bayes’ Rule, two computational tools that use conditional probabilities. As the examples in this chapter will show, these tools are not only essential for computing probabilities; they also shed light on probability in everyday life.

7.1 Law of Total Probability

In the 2021 NCAA women’s basketball tournament, Stanford defeated South Carolina to advance to the championship, where they would play the winner of the other semifinal between Arizona and UConn. If you were asked at this moment (after Stanford won its semifinal, but before Arizona and UConn’s semifinal) for the probability that Stanford wins the championship, how would you determine it?

It is most natural to think about the probability that Stanford will beat a particular opponent. Perhaps Stanford matches up well against UConn and has an \(85\%\) chance of winning if UConn advances to the championship game. But Arizona is a tougher opponent, and Stanford only has a \(60\%\) chance of winning if Arizona advances.

These are probabilities of Stanford winning conditional on either UConn or Arizona making the championship. The final answer, the unconditional probability of Stanford winning the tournament, should also account for the uncertainty in whether Stanford will play UConn or Arizona. If UConn is very likely to beat Arizona, then the probability should be close to \(85\%\). On the other hand, if Arizona is more likely to beat UConn, then the probability should be closer to \(60\%\).

The Law of Total Probability (LoTP) allows us to compute an unconditional probability from the conditional probabilities. It says that we should take a weighted average of the conditional probabilities that Stanford wins against a specific opponent, where the weights are the probabilities that they play each opponent:

\[\begin{align*} P(\text{Stanford wins}) &= P(\text{UConn advances})P(\text{Stanford wins} | \text{UConn advances}) \\ &\qquad + P(\text{Arizona advances})P(\text{Stanford wins} | \text{Arizona advances}) \end{align*}\]

To be concrete, suppose that there is a 90% chance that UConn beats Arizona. Then, the Law of Total Probability says that the probability that Stanford wins the championship is

\[ P(\text{Stanford wins}) = .90 \cdot .85 + .10 \cdot .60 = .825,\] which is very close to the conditional probability that Stanford beats UConn.

On the other hand, if Arizona has a \(3/4\) chance of beating UConn, then the Law of Total Probability says that the probability that Stanford wins the championship is

\[ P(\text{Stanford wins}) = 1/4 \cdot .85 + 3/4 \cdot .60 = .6625,\] which is closer to 60%.
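To make the weighted average concrete, here is a quick check in Python (an illustrative sketch using the made-up probabilities above):

```python
def p_stanford_wins(p_uconn_advances):
    """Law of Total Probability over the partition {UConn advances, Arizona advances}."""
    p_arizona_advances = 1 - p_uconn_advances
    return p_uconn_advances * 0.85 + p_arizona_advances * 0.60

print(p_stanford_wins(0.90))  # about 0.825
print(p_stanford_wins(0.25))  # about 0.6625
```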

Prior to formally stating and proving the Law of Total Probability, we recall that a partition of the sample space \(\Omega\) is a disjoint collection of sets whose union makes up the whole sample space (see Figure 7.1 below).

Figure 7.1: Three events \(B_1\), \(B_2\), and \(B_3\) that partition the sample space \(\Omega\).

The Law of Total Probability tells us that, if \(B_1, B_2 \dots\) partition the sample space, then the probability \(P(A)\) of any event \(A\) is the weighted average of the conditional probabilities \(P(A|B_i)\) of \(A\) given \(B_i\), where the weights are the probabilities \(P(B_i)\) of each \(B_i\) happening.

Theorem 7.1 (Law of Total Probability) Consider a collection of positive probability events \(B_1, B_2, ...\) that partition the sample space \(\Omega\). Then, for any event \(A\), \[ P(A) = \sum_i P(B_i) P(A | B_i) \tag{7.1}\]

Proof

The sets \(A \cap B_i\) are mutually exclusive, and their union is \(A\). Therefore, by Axiom 3 of Definition 3.1,

\[ P(A) = \sum_{i} P(A \cap B_i). \]

But by the multiplication rule (Theorem 5.1), each term inside the sum can be expanded as

\[ P(A \cap B_i) = P(B_i) P(A | B_i). \]

Substituting this into the sum above, we obtain Equation 7.1: \[ P(A) = \sum_{i} P(B_i) P(A | B_i). \]

The Law of Total Probability is most useful when the conditional probability of \(A\) is known under every possible circumstance.

Example 7.1 (Probability the second card is a spade) Recall the two spades example from Example 5.5. It is clear that \(P(S_1)\), the probability that the first card is a spade, is \(13/52\). But what is the probability the second card is a spade?

This probability is easy to determine if we know whether or not the first card was a spade.

  • If the first card was a spade, then \(P(S_2 | S_1) = 12/51\).
  • If the first card was not a spade, then \(P(S_2 | S_1^c) = 13/51\).

But what is the unconditional probability the second card is a spade, \(P(S_2)\), if we have no knowledge about the first card?

The answer is \(P(S_2) = 13/52\). The answer must be this because every card is equally likely to be anywhere in the deck, so there is no reason the second card is any more or less likely to be a spade than the first card. This is called a symmetry argument.

But in case you are not convinced by symmetry, here is a calculation using the Law of Total Probability. Note that \(S_1\) and \(S_1^c\) are a partition of the sample space. Therefore, by Equation 7.1:

\[ \begin{aligned} P(S_2) &= P(S_1) P(S_2 | S_1) + P(S_1^c) P(S_2 | S_1^c) \\ &= \frac{13}{52} \cdot \frac{12}{51} + \frac{39}{52} \cdot \frac{13}{51} \\ &= \frac{13 \cdot 12 + 39 \cdot 13}{52 \cdot 51} = \frac{13 \cdot 51}{52 \cdot 51} \\ &= \frac{13}{52} \end{aligned} \]
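If you would like to double-check the arithmetic, here is a short verification using exact fractions (an illustrative Python sketch):

```python
from fractions import Fraction

p_s1 = Fraction(13, 52)               # P(S1)
p_s2_given_s1 = Fraction(12, 51)      # P(S2 | S1)
p_s2_given_s1c = Fraction(13, 51)     # P(S2 | S1^c)

# Law of Total Probability over the partition {S1, S1^c}
p_s2 = p_s1 * p_s2_given_s1 + (1 - p_s1) * p_s2_given_s1c
print(p_s2)  # 1/4, which equals 13/52
```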

One way to understand what the Law of Total Probability is doing is to sketch a probability tree. It calculates the total probability of all paths that lead to \(S_2\) (as opposed to \(S_2^c\)).

We will not draw probability trees for the remaining examples, but they can be useful when applying the Law of Total Probability to a partition with only a few events.

In our next example, we use the Law of Total Probability to calculate the probability of winning a pass-line bet in craps. It illustrates the most common way to use the Law of Total Probability: conditioning on what you wish you knew.

Example 7.2 (Winning a pass-line bet) What’s the probability of winning a pass-line bet in craps (see the relevant part of Chapter 1 for a review of the rules)?

If we knew the come-out roll, then we’d already be done. We previously computed the probability of winning a pass-line bet conditional on each possible come-out roll in Example 6.7. The table below recalls these values and also gives the unconditional probability of each come-out roll (you can compute these by looking at Figure 1.7).

| \(i\) | \(P(\text{come-out roll is } i)\) | \(P(\text{win} \mid \text{come-out roll is } i)\) |
|---|---|---|
| 2 | \(\frac{1}{36}\) | \(0\) |
| 3 | \(\frac{2}{36}\) | \(0\) |
| 4 | \(\frac{3}{36}\) | \(\frac{3}{9}\) |
| 5 | \(\frac{4}{36}\) | \(\frac{4}{10}\) |
| 6 | \(\frac{5}{36}\) | \(\frac{5}{11}\) |
| 7 | \(\frac{6}{36}\) | \(1\) |
| 8 | \(\frac{5}{36}\) | \(\frac{5}{11}\) |
| 9 | \(\frac{4}{36}\) | \(\frac{4}{10}\) |
| 10 | \(\frac{3}{36}\) | \(\frac{3}{9}\) |
| 11 | \(\frac{2}{36}\) | \(1\) |
| 12 | \(\frac{1}{36}\) | \(0\) |

Because the different come-out rolls partition the sample space, we can apply the Law of Total Probability: \[ P(\text{win}) = \sum_{i=2}^{12} P(\text{come-out roll is } i) P(\text{win} | \text{come-out roll is } i). \]

Somewhat magically, the Law of Total Probability allows us to condition on the come-out roll (what we wish we knew), even though we do not yet know what it will be!

This sum is most easily evaluated by a computer.
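For example, here is one way to evaluate the sum exactly in Python (an illustrative sketch, using the values from the table above):

```python
from fractions import Fraction

# P(come-out roll is i) for i = 2, ..., 12
roll_probs = {i: Fraction(6 - abs(i - 7), 36) for i in range(2, 13)}

# P(win | come-out roll is i)
win_probs = {
    2: Fraction(0), 3: Fraction(0), 12: Fraction(0),   # immediate loss
    7: Fraction(1), 11: Fraction(1),                   # immediate win
    4: Fraction(3, 9), 10: Fraction(3, 9),             # point is 4 or 10
    5: Fraction(4, 10), 9: Fraction(4, 10),            # point is 5 or 9
    6: Fraction(5, 11), 8: Fraction(5, 11),            # point is 6 or 8
}

# Law of Total Probability over the possible come-out rolls
p_win = sum(roll_probs[i] * win_probs[i] for i in range(2, 13))
print(p_win, float(p_win))  # 244/495, about 0.4929
```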

The probability is about \(49.29\%\). Therefore, the casino has a small but definite house edge!

In the next example we compute the probability that a COVID-19 antigen test gives the correct reading. Again, the Law of Total Probability allows us to condition on what we wish we knew.

The Law of Total Probability also comes in handy when we analyze a random process that repeats itself or starts over. By conditioning on the different things that can happen until it restarts, we can obtain a recursive formula that allows us to calculate a probability of interest.

Example 7.3 (Branching Process) Amoebas are single-celled organisms that reproduce asexually by dividing into two cells. Suppose the world starts with just one amoeba. At the end of one minute, the amoeba either dies, stays alive, or splits into two. This process then continues with the remaining amoebas. If the three possibilities are always equally likely and each amoeba behaves independently of the other amoebas and of its previous self, what’s the probability that the amoeba population will totally die out?

Let’s call \(p\) the probability of the event \(E\) that the entire amoeba population totally dies out. After one “turn” of this process, three events could have happened:

  1. D(ie): The first amoeba dies and the population has died out.

  2. L(ive): The first amoeba survives and the process starts over.

  3. S(plit): The first amoeba splits into two children, and now two new versions of the process start from the beginning.

Because these three events partition the sample space, we can apply the Law of Total Probability in hopes that we can get an expression for \(p\) in terms of itself:

\[\begin{align*} p &= P(E)\\ &= P(D)P(E|D) + P(L)P(E|L) + P(S)P(E|S) & \text{(LoTP)}\\ &= \frac{1}{3}P(E|D) + \frac{1}{3}P(E|L) + \frac{1}{3}P(E|S) & \text{(plug-in probabilities)}\\ \end{align*}\]

Let’s examine each one of these conditional probabilities separately:

  1. \(P(E|D)\): Given that the first amoeba dies on the first turn, the population has surely died out. This conditional probability is \(1\).

  2. \(P(E|L)\): When the first amoeba lives, the process starts over. Because the first amoeba behaves independently of its previous self, there’s truly no difference between this process and the original process we started with, so the conditional probability that the whole population goes extinct should still be \(p\).

  3. \(P(E|S)\): When the parent amoeba splits into two children, we now have two versions of the process that essentially start from the beginning. Let \(E_1\) be the event that the first child’s lineage dies out, and \(E_2\) be the same for the second child. Conditional on \(S\), the whole population goes extinct if and only if both lineages die out. Also conditional on \(S\), the two children behave independently of each other (as will their children, so on and so forth) so the lineages dying out are conditionally independent events given \(S\). Therefore, \[\begin{align*} P(E|S) &= P(E_1 \cap E_2 | S) & \text{(same event conditional on } S \text{)}\\ &= P(E_1 | S)P(E_2 | S) & \text{(conditional independence)}\\ &= p^2. & \text{(same reasoning as } P(E|L) \text{)} \end{align*}\]

Plugging these values back into our earlier Law of Total Probability computation, we find that \[p = \frac{1}{3} + \frac{1}{3}p + \frac{1}{3}p^2,\] so the solution \(p\) must satisfy the polynomial equation \[\frac{1}{3}p^2 - \frac{2}{3}p + \frac{1}{3} = 0.\] The left-hand side factors as \(\frac{1}{3}(p-1)^2\), so the only solution is \(p = 1\). The amoeba population will die out with probability one!
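If this answer feels surprising, a small simulation can make it more believable. Here is a minimal Python sketch (it caps the number of generations, so it slightly undercounts extinctions):

```python
import numpy as np

rng = np.random.default_rng(0)

def lineage_dies_out(max_generations=500):
    """Simulate one amoeba lineage; return True if it goes extinct within the horizon."""
    population = 1
    for _ in range(max_generations):
        if population == 0:
            return True
        # each amoeba independently leaves 0, 1, or 2 descendants, each with probability 1/3
        population = rng.integers(0, 3, size=population).sum()
    return False  # still alive at the horizon

trials = 2000
extinctions = sum(lineage_dies_out() for _ in range(trials))
print(extinctions / trials)  # close to 1, consistent with p = 1
```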

We ask you to explore what happens when we use different probabilities of dying, staying alive, or splitting into two in Exercise 7.1.

Our computation of \(P(E|L)\) (and therefore also \(P(E|S)\)) was a little more informal than usual. We didn’t apply the definition of conditional probability (Definition 5.1) like we typically do, and instead argued intuitively that \(P(E|L)\) should equal \(P(E)\). In such recursive problems, justifying things more formally is tricky and beyond the scope of this book. Arguments like the one we presented are sufficient for our purposes, and they’ll get you the right answer!

In our last example, we see how applying the Law of Total Probability to conditional probability functions also helps us compute conditional probabilities. This example will also hopefully clarify any lingering confusion you may have from the boy-girl paradox (Example 5.3) we presented in Chapter 5.

Example 7.4 (Boy-girl paradox and law of total probability) Recall the scenario from Example 5.3 where we met a family at a dinner party who had two children, and we wanted to know the probability that they had two girls. We considered three scenarios:

  1. Scenario One: We learn that at least one of the children is a girl (e.g., “We have two children. One of them is on the women’s swim team.”)

  2. Scenario Two: We learn that the eldest child is a girl (e.g., “We have two children. Our eldest is on the women’s swim team.”)

  3. Scenario Three: We meet a random one of the two children and learn that they are a girl (e.g., “One of our children happens to be walking over right now. She’s on the women’s swim team.”)

In each of these scenarios we computed the conditional probability that the couple had two girls to be

  1. Scenario One: 1/3

  2. Scenario Two: 1/2

  3. Scenario Three: 1/2

and felt it was a bit surprising that the third scenario was more similar to the second than the first. We’ll use the law of total probability to further investigate why this is the case.

In Scenario Three, we want to compute \(P(2 \text{ girls} | \text{meet girl} )\). Thinking about meeting a random child is tricky, and it’d perhaps be easier to further condition on whether we meet the eldest or youngest child. Because these two events partition the sample space, we can apply the Law of Total Probability to \(P(\cdot| \text{meet girl})\):

\[\begin{align*} P(2 \text{ girls}| \text{meet girl}) &= P(2 \text{ girls}| \text{meet girl}, \text{meet eldest})P(\text{meet eldest} | \text{meet girl})\\ &\qquad + P(2 \text{ girls}| \text{meet girl}, \text{meet youngest})P(\text{meet youngest} | \text{meet girl}) & \end{align*}\]

The probability \(P(2 \text{ girls}| \text{meet girl}, \text{meet eldest})\) is essentially exactly what we wanted to compute in Scenario Two. We meet a random child who is a girl, but we also know we meet the eldest child, so we’ve learned exactly that the eldest is a girl. If you believe the answer in Scenario Two, it should not surprise you that

\[ P(2 \text{ girls}| \text{meet girl}, \text{meet eldest}) = \frac{1}{2}. \]

If you are not convinced, here is the direct calculation: \[\begin{align*} &P(2 \text{ girls}| \text{meet girl}, \text{meet eldest}) & \\ &= \frac{P(2 \text{ girls}, \text{meet girl}, \text{meet eldest})}{P(\text{meet girl}, \text{meet eldest})} & \text{(def. of conditional probability)}\\ &= \frac{P(\{\boldsymbol{G}G \})}{P(\{\boldsymbol{G}G, \boldsymbol{G}B \})} & \text{(write out outcomes in each event)}\\ &= \frac{1}{2} & \text{(equally likely outcomes)} \end{align*}\]

Note how similar this computation looks to the one we did for Scenario Two in Example 5.3.

We can make an identical argument for meeting the youngest, so by symmetry we should have

\[ P(2 \text{ girls}| \text{meet girl}, \text{meet youngest}) = \frac{1}{2}. \]

Plugging these values in, we find that

\[\begin{align*} &P(2 \text{ girls}| \text{meet girl}) \\ &= \frac{1}{2} P(\text{meet eldest} | \text{meet girl}) + \frac{1}{2} P(\text{meet youngest} | \text{meet girl}) & \text{(LoTP + plug in probabilities)}\\ &= \frac{1}{2} (P(\text{meet eldest} | \text{meet girl}) + P(\text{meet youngest} | \text{meet girl})) & \text{(factor)}\\ &= \frac{1}{2} & \text{(complement rule)} \end{align*}\]

Now it should be clearer why Scenario Three is like Scenario Two. The probability of there being two girls when you learn that a random child is a girl is a mixture of the probability of there being two girls when you learn the eldest is a girl and the probability of there being two girls when you learn the youngest is a girl. Because both of these latter probabilities are \(1/2\), the former probability must be \(1/2\) as well.
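A small simulation of Scenario Three (an illustrative sketch under the assumptions of Example 5.3: each child is a girl or boy with probability \(1/2\), independently, and we meet one of the two children uniformly at random) gives the same answer:

```python
import random

random.seed(1)
trials = 100_000
met_girl = 0
two_girls_and_met_girl = 0

for _ in range(trials):
    children = [random.choice("GB"), random.choice("GB")]  # (eldest, youngest)
    met = random.choice(children)                          # meet a random child
    if met == "G":
        met_girl += 1
        if children == ["G", "G"]:
            two_girls_and_met_girl += 1

print(two_girls_and_met_girl / met_girl)  # close to 1/2
```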

7.2 Bayes’ Rule

In many applications, we know \(P(A | B)\) but want to know \(P(B | A)\). Bayes’ Rule is a tool for inverting conditional probabilities.

Theorem 7.2 (Bayes’ Rule) Let \(A\) and \(B\) be events with positive probabilities. Then: \[ P(B | A) = \frac{P(B) P(A | B)}{P(A)}. \tag{7.2}\]

In many cases, we do not know \(P(A)\) and need to calculate it using the Law of Total Probability (Theorem 7.1), in which case Equation 7.2 becomes:

\[ P(B | A) = \frac{P(B) P(A | B)}{P(B) P(A | B) + P(B^c) P(A | B^c)}. \tag{7.3}\]

Proof

Bayes’ Rule follows immediately from the two different ways of expanding \(P(A \cap B)\) using the multiplication rule (Corollary 5.1):

\[ P(B) P(A | B) = P(A) P(B | A). \]

Now, divide both sides by \(P(A)\) to obtain Equation 7.2.

Example 7.5 (Two spades and Bayes’ rule) Revisiting the two spades example (Example 7.1), it is clear that \(P(S_2 | S_1) = \frac{12}{51}\), since, once the first card is dealt and is a spade, \(12\) of the remaining \(51\) cards in the deck are spades.

But what about \(P(S_1 | S_2)\)? What if we observed that the second card is a spade without observing the first card? Since every card is equally likely to be in any position, this probability must also be \(\frac{12}{51}\) by symmetry. We can use Bayes’ Rule (Theorem 7.2) to confirm:

\[ P(S_1 | S_2) = \frac{P(S_1) P(S_2 | S_1)}{ P(S_2) } = \frac{13/52 \cdot 12/51}{13/52} = \frac{12}{51}. \]

In our next example, we use Bayes’ rule to calculate the probability that someone who tested positive actually has the disease.

Example 7.6 (Positive COVID-19 test) In Example 6.8, we considered a randomly selected person from New York City (NYC) taking a COVID-19 antigen test in March 2020. If this test comes back positive, what’s the probability that they have COVID-19?

Let \(T\) be the event that the test comes back positive and \(I\) be the event that the person is actually infected with COVID-19. To find \(P(I | T)\), the quantity of interest, we need information about the quality of the test and the base rate of COVID-19 in NYC:

  1. COVID-19 base rate: In March 2020, the base COVID-19 rate in NYC was \(P(I) = .0005\). By the complement rule, \(P(I^c) = .9995\).

  2. Test’s false positive rate: If a person does not have COVID-19, then the test will come back positive \(1\%\) of the time. So \(P(T|I^c) = .01\).

  3. Test’s false negative rate: If a person does have COVID-19, then the test will come back negative \(20\%\) of the time. So, \(P(T^c | I) = .20\). By the complement rule, \(P(T | I) = 1 - P(T^c | I) = .80\).

Since we know \(P(T| I)\), the natural way to calculate \(P(I | T)\) is to use Bayes’ rule. Since we do not know the denominator, \(P(T)\), we need the expanded version of Bayes’ rule (Equation 7.3):

\[\begin{align*} P(I| T) &= \frac{P(I) P(T|I)}{P(I) P(T | I) + P(I^c) P(T| I^c)} & \text{(Bayes' rule)}\\ &= \frac{.0005 \cdot .8}{.0005 \cdot .8 + (1-.0005) \cdot .01} & \text{(plug in probabilities)}\\ &\approx .03848 & \text{(simplify)} \end{align*}\]

There is less than a \(4\%\) chance that the person actually has COVID-19, even though they just tested positive!

Why is this probability so low?!

How could a test with relatively low error rates of \(1\%\) and \(20\%\) be so inaccurate that a person who tests positive has only a \(4\%\) chance of being infected?

It is because the rate of COVID-19 in the population is so low (\(P(I) = 0.05\%\)) that \(1\%\) of the \(99.95\%\) of people who don’t have COVID-19 is a much larger group than \(80\%\) of the \(0.05\%\) of people who do.

The idea is illustrated in the figure below (not drawn to scale).

  • The sliver on the left represents the \(P(I) = 0.05\%\) of the population who are infected.
  • The rectangle on the right represents the \(P(I^c) = 99.95\%\) of the population who are not infected.

The shaded area represents the people who test positive. The shaded area covers most of the sliver on the left and only 1% of the rectangle on the right, but the rectangle is so much larger than the sliver that most of the people who test positive come from the rectangle, which represents the people who were not infected.

In other words, \(P(I | T)\) is the fraction of the shaded area that comes from the sliver on the left. This is only a small fraction of the total shaded area because the base rate of COVID-19 is so small.
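To attach numbers to this picture, here is a quick check in Python (an illustrative sketch):

```python
p_infected = 0.0005          # P(I)
sensitivity = 0.80           # P(T | I) = 1 - false negative rate
false_positive_rate = 0.01   # P(T | I^c)

from_sliver = p_infected * sensitivity                    # infected and test positive
from_rectangle = (1 - p_infected) * false_positive_rate   # not infected and test positive

print(from_sliver, from_rectangle)                   # about 0.0004 vs. about 0.009995
print(from_sliver / (from_sliver + from_rectangle))  # about 0.0385, i.e., P(I | T)
```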

Here is another way to think about the problem:

  • Before the person took the test, their probability of having COVID-19 was \(P(I)\).
  • After the person tested positive, their probability of having COVID-19 was updated to \(P(I | T)\).

Certainly, a positive test increases this probability, but if \(P(I)\) was very small to begin with, then \(P(I | T)\) may still be small. We cannot ignore the base rate \(P(I)\) when reasoning about \(P(I | T)\); to do so is to commit the base rate fallacy.

When it comes to medical diagnoses, it is a good idea to always seek a second opinion. Let’s suppose the person from Example 7.6 takes a second COVID-19 antigen test.

Example 7.7 (Two positive COVID-19 tests) If this second test also comes back positive, what is the probability that they are infected with COVID-19 now?

Let \(T_1\) be the event that the first test is positive and \(T_2\) the event that the second test is positive. As in Example 6.8, we will assume that \(T_1\) and \(T_2\) are conditionally independent given \(I\) and given \(I^c\).

First, we calculate \(P(I | T_1, T_2)\) directly using Bayes’ rule.

\[\begin{align*} P(I| T_1, T_2 ) &= \frac{P(I) P(T_1, T_2 | I)}{P(I) P(T_1, T_2 | I) + P(I^c) P(T_1, T_2 | I^c)} & \text{(Bayes' rule)} \\ &= \frac{P(I) P(T_1 |I) P( T_2 | I)}{P(I) P(T_1 | I) P(T_2 | I) + P(I^c) P(T_1 | I^c) P( T_2 | I^c)} & \text{(conditional independence)} \\ &= \frac{.0005 \cdot .80 \cdot .80}{.0005 \cdot .80 \cdot .80 + .9995 \cdot .01 \cdot .01 } & \text{(plug in probabilities)}\\ &\approx .762 \end{align*}\]

A second test helps a lot here; now, the person knows that they are quite likely to have COVID.

There is another way to approach this problem. In Example 7.6, we updated the probability of infection from \(P(I) = .0005\) to \(P(I | T_1) \approx .03848\) after the first positive test. We can use \(.03848\) as the new “prior” probability in Bayes’ rule to calculate \(P(I | T_1, T_2)\).

Indeed, we can apply Bayes’ rule to the conditional probability function \(P(\cdot | T_1)\) to obtain the same answer:

\[\begin{align*} P(I| T_1, T_2 ) &= \frac{P(I|T_1) P(T_2 |I, T_1)}{P(I|T_1) P(T_2 |I, T_1) + P(I^c|T_1) P(T_2 |I^c, T_1)} & \text{(Bayes' rule)}\\ &= \frac{P(I|T_1) P(T_2 |I)}{P(I|T_1) P(T_2 |I) + P(I^c|T_1) P(T_2 |I^c)} & \text{(conditional independence)}\\ &\approx \frac{.03848 \cdot .8}{.03848 \cdot .8 + (1 - .03848) \cdot .01} & \text{(plug in probabilities)}\\ &\approx .762. \end{align*}\]

We get the same answer, whether we apply Bayes’ rule to all the information at once (\(T_1, T_2\)) or one piece at a time (\(T_1\), then \(T_1, T_2\)). This demonstrates that Bayes’ rule is a logically coherent way of summarizing evidence.

As the person takes more tests, the probability of infection can continue to be updated in the same way, by applying Bayes’ rule after each test. The code below calculates this probability as a function of:

  1. the prior probability \(P(I)\) of the person being infected,
  2. the false positive rate \(P(T_1|I^c)\),
  3. the false negative rate \(P(T_1^c| I)\), and
  4. the number of positive test results.
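A minimal Python sketch of such a function is shown here (the function name and interface are illustrative, and the tests are assumed to be conditionally independent given infection status, as in Example 7.7):

```python
def prob_infected(prior, false_positive_rate, false_negative_rate, num_positive_tests):
    """Posterior probability of infection after several positive test results."""
    sensitivity = 1 - false_negative_rate      # P(T | I)
    p = prior
    for _ in range(num_positive_tests):
        # Bayes' rule applied to the current probability, one test at a time
        numerator = p * sensitivity
        denominator = p * sensitivity + (1 - p) * false_positive_rate
        p = numerator / denominator
    return p

print(prob_infected(.0005, .01, .20, 1))  # about 0.0385 (Example 7.6)
print(prob_infected(.0005, .01, .20, 2))  # about 0.762 (Example 7.7)
```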

7.3 Exercises

Exercise 7.1 Coming Soon!