5  Conditional Probability

Conditional probabilities, the subject of this chapter, allow us to accurately quantify uncertainty when we have partial information about the experiment’s outcome. Situations like this come up all the time, and conditioning is one of the most important and fundamental tools in probability theory.

As a motivating example, recall the tricky card game (Example 1.12) from Chapter 1. There are three cards: one red on both sides, one black on both sides, and one red on one side and black on the other. We randomly select one of the cards, place it on a table, and see that the side facing up is red. This gives us valuable information about the experiment’s outcome (e.g., the card that is black on both sides was definitely not selected). If someone offers to bet that we drew the card that’s red on both sides, what we’d like to know is: given that we’ve observed a red side facing up, what’s the probability that the bottom is also red? This is a conditional probability, and we’ll see how to compute it later in this chapter.

5.1 Definition of Conditional Probability

The conditional probability \(P(A|B)\) of \(A\) given \(B\) is the probability that \(A\) has happened given that we know that \(B\) has happened. Knowing that \(B\) has happened gives us important partial information about the experiment’s outcome (i.e., we know the outcome must be in \(B\)). To motivate our definition, consider the following updated version of Figure 4.1 from last chapter, where we’ve added the event \(B\):

Figure 5.1: Illustration of the proportion of \(\Omega\) taken up by \(A\), versus the proportion of \(B\) taken up by the part of \(A\) that’s in \(B\).

Like before, we’ll imagine that the probability \(P(A)\) is the proportion of the sample space taken up by \(A\). If we know, however, that \(B\) has happened, then we should restrict our attention to only outcomes in \(B\). The most natural definition of the conditional probability \(P(A|B)\) is then the proportion of \(B\) taken up by the part of \(A\) that’s in \(B\).

Definition 5.1 formalizes this idea.

Definition 5.1 (Conditional probability) For two events \(A\) and \(B\) with \(P(B) > 0\), we define the conditional probability of \(A\) given \(B\) as
\[ P(A | B) = \frac{ P(A \cap B)}{P(B)} \tag{5.1}\]

Events with zero probability

Our definition doesn’t apply when \(B\) has zero probability (i.e., \(P(B) = 0\)) because it would require dividing by \(0\). Intuitively, probability zero events never happen, so there’s no need to define probabilities conditional on them happening.

Definition 5.1 is typically not hard to work with, and in many cases it gives intuitive answers.

Example 5.1 (Fair versus loaded die) Consider rolling a fair die with the equally likely outcomes \(\Omega = \{1, 2, 3, 4, 5, 6\}\). What’s the probability that we rolled a six given that we rolled an even number?

Intuitively, the knowledge that we rolled an even number restricts us to the three outcomes \(B = \{ 2, 4, 6 \}\). Since these outcomes are equally likely, the answer should be \(1/3\).

Let’s calculate this formally using Definition 5.1.

\[\begin{align*} P(\text{six} | \text{even number}) &= \frac{P(\text{six} \cap \text{even number})}{P(\text{even number})} & \text{(def. of conditional probability)}\\ &= \frac{P(\text{six})}{P(\text{even number})} & \text{(\{six \} } \cap \text{ \{even number\} = \{six\})}\\ &= \frac{1/6}{1/2} & \text{(equally likely outcomes)}\\ &= \frac{1}{3} \end{align*}\]
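To connect this answer to repeated experiments, here is a minimal Monte Carlo sketch (the function name, trial count, and seed are our own choices, not from the text) that estimates \(P(\text{six} \mid \text{even number})\) by only counting rolls that landed even:

```python
import random

def estimate_conditional(trials=100_000, seed=0):
    """Estimate P(six | even) for a fair die by Monte Carlo."""
    rng = random.Random(seed)
    even = six_and_even = 0
    for _ in range(trials):
        roll = rng.randint(1, 6)
        if roll % 2 == 0:          # restrict attention to even rolls
            even += 1
            if roll == 6:
                six_and_even += 1
    # P(six | even) is approximated by (# even rolls that were six) / (# even rolls)
    return six_and_even / even

estimate = estimate_conditional()  # should be close to 1/3
```

With enough trials the estimate settles near \(1/3\), matching the formal calculation above.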

What if we instead consider the weighted die from Example 3.3? The probabilities of the different outcomes are now

\[\begin{align*} & P(\{1\}) = 0.10, \ \ P(\{2\}) = 0.15 , \ \ P(\{3\}) = 0.15, \\ & P(\{4\}) = 0.15, \ \ P(\{5\}) = 0.15, \ \ P(\{6\}) = 0.30,\\ \end{align*}\]

so it should now be more likely that we rolled a six given that we rolled an even number. Indeed, recalling from Example 3.3 that \(P(\text{even number}) = 0.6\), the same reasoning yields

\[P(\text{six} | \text{even number}) = \frac{P(\text{six})}{P(\text{even number})} = \frac{0.3}{0.6} = \frac{1}{2}.\]

Example 5.2 (Random babies and conditioning) Consider again the random babies example (Example A.1) from Appendix A, where we randomly return \(n\) babies to their corresponding \(n\) sets of parents. As before, let \(E_i\) be the event that the \(i\)th couple gets their correct baby back and suppose each way of returning the babies is equally likely. Recall we showed in Example 4.3 that the probability the first \(k\) babies are returned correctly is \((n-k)!/n!\).

What are the probabilities that…

  1. the first baby is returned correctly given that the second baby is returned correctly?
  2. the first baby is returned correctly given that the second baby is returned incorrectly?

Also, how do these probabilities compare to the unconditional probability \(P(E_1) = 1/n\) of the first baby being returned correctly?

First we compute the two conditional probabilities:

  1. If the second baby is returned correctly, there’s one less incorrect baby the first couple could get, so we should expect it’s more likely for the first couple to get their baby back. Since all ways of returning babies are equally likely, it’s reasonable to think the first couple is equally likely to receive any of the remaining \(n-1\) babies, so there should now be a \(1/(n-1)\) probability that they’ll get the correct one. Indeed, we can directly compute that

\[\begin{align*} P(E_1 | E_2) &= \frac{P(E_1 \cap E_2)}{P(E_2)} & \text{(def. of conditional probability)} \\ &= \frac{(n-2)!/n!}{(n-1)!/n!} & \text{(earlier result)}\\ &= \frac{1}{n-1}. & \text{(simplify)} \end{align*}\]

  2. If we know that the second baby is returned incorrectly, it’s relatively more likely that the first baby will be returned to the second baby’s parents. Intuitively we should expect it to be less likely that the first couple gets their baby back. Prior to computing the conditional probability, we first establish that \[ P(E_1 \cap E_2^c) = \frac{n-2}{n(n-1)} \] via a counting argument.

To compute this probability, we simply need to count the number of outcomes in \(E_1 \cap E_2^c\). There is a one-to-one correspondence between outcomes in \(E_1 \cap E_2^c\) and taking the actions

  1. return baby \(1\) to its parents,
  2. return one of babies \(3\) to \(n\) to the second set of parents,
  3. return the remaining \(n-2\) babies to the remaining parents.

There is \(1\) way of taking the first action and \(n-2\) ways of taking the second. Action three is exactly like running a random babies experiment with \(n-2\) babies, so there are \((n-2)!\) ways of taking action three (see Example 2.4). The fundamental theorem of counting (Theorem 2.1) now tells us that there are \((n-2)\cdot(n-2)!\) outcomes in \(E_1 \cap E_2^c\). Therefore, \[ P(E_1 \cap E_2^c) = \frac{(n-2)\cdot (n-2)!}{n!} = \frac{n-2}{n(n-1)}. \]

Using this we can compute

\[\begin{align*} P(E_1 | E_2^c) &= \frac{P(E_1 \cap E_2^c)}{P(E_2^c) }& \text{(def. of conditional probability)} \\ &= \frac{P(E_1 \cap E_2^c)}{1 - P(E_2)} & \text{(complement rule)} \\ &= \frac{ \frac{n-2}{n(n-1)} }{1 - \frac{1}{n}} & \text{(earlier results)}\\ &= \frac{n-2}{(n-1)^2}. & \text{(simplify)} \end{align*}\]

You can use the below code to confirm that our intuition was right. For all values of \(n\),

\[ P(E_1 | E_2) \geq P(E_1) \geq P(E_1 | E_2^c). \]
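Here is one possible simulation sketch (the function name, trial count, seed, and the choice \(n = 5\) below are our own) that estimates all three probabilities by repeatedly returning babies at random:

```python
import random

def baby_probabilities(n, trials=200_000, seed=0):
    """Estimate P(E1 | E2), P(E1), and P(E1 | E2^c) by simulation."""
    rng = random.Random(seed)
    e1 = e2 = e2c = e1_and_e2 = e1_and_e2c = 0
    for _ in range(trials):
        babies = list(range(n))
        rng.shuffle(babies)              # babies[i] = baby returned to couple i
        first_ok = babies[0] == 0        # event E1
        second_ok = babies[1] == 1       # event E2
        e1 += first_ok
        e2 += second_ok
        e2c += not second_ok
        e1_and_e2 += first_ok and second_ok
        e1_and_e2c += first_ok and not second_ok
    return e1_and_e2 / e2, e1 / trials, e1_and_e2c / e2c

# For n = 5 the exact values are 1/4, 1/5, and 3/16
cond_e2, uncond, cond_e2c = baby_probabilities(5)
```

Trying different values of \(n\) shows the three estimates obeying the inequality above and drawing closer together as \(n\) grows.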

If you experiment with the code, you’ll see that the three probabilities become more similar as \(n\) grows. As one might expect, as the number of babies and parents increases, the assignment of the second baby affects the assignment of the first baby less and less.

In some cases, like Example 5.3 below, conditional probabilities can be quite counterintuitive. To ensure that we correctly handle cases like these, it’s imperative to carefully apply Definition 5.1 when computing conditional probabilities rather than relying on intuition alone.

Example 5.3 (The boy-girl paradox) Suppose we meet a random family at a dinner party and we learn in conversation that they have two children. Consider the following two scenarios:

  1. Scenario One: We learn that at least one of the children is a girl (e.g., “We have two children. One of them is on the women’s swim team.”)
  2. Scenario Two: We learn that the eldest child is a girl (e.g., “We have two children. Our eldest is on the women’s swim team.”)

In each scenario, what’s the probability that both of the children are girls? In each scenario, it may feel that we’ve learned the same information: at least one child is a girl. In actuality, the scenarios lead to different answers!

Marking the gender of the older child first and the younger child second, there are four possible outcomes \(\Omega = \{BB, BG, GB, GG\}\). We’ll assume for simplicity’s sake that the outcomes are equally likely (disregarding our discussion from Example 3.1) and compute the relevant probability in each scenario.

  1. Solution to Scenario One: To find the solution in Scenario One, we directly apply Definition 5.1:

\[\begin{align*} P( 2 \text{ girls}| \geq 1 \text{ girl}) &= \frac{P( 2 \text{ girls and } \geq 1 \text{ girl}) }{P(\geq 1 \text{ girl}) } & \text{(def. of conditional probability)}\\ &= \frac{P(2 \text{ girls}) }{P(\geq 1 \text{ girl})} & \text{(\{} 2 \text{ girls\}} \cap \text{\{} \geq 1 \text{ girl\}} = \text{\{} 2 \text{ girls\})} \\ &= \frac{P( \{GG \}) }{P(\{GG, GB, BG \} )} & \text{(write out outcomes in each event)}\\ &= \frac{1}{3}. & \text{(equally likely outcomes)} \end{align*}\]

  2. Solution to Scenario Two: We can follow the same reasoning to find the solution in Scenario Two:

\[\begin{align*} P( 2 \text{ girls}| \text{eldest girl}) &= \frac{P( 2 \text{ girls and eldest girl}) }{P(\text{eldest girl}) } & \text{(def. of conditional probability)}\\ &= \frac{P(2 \text{ girls}) }{P(\text{eldest girl})} & \text{(\{} 2 \text{ girls\}} \cap \text{\{eldest girl\}} = \text{\{} 2 \text{ girls\})} \\ &= \frac{P( \{GG \}) }{P(\{GG, GB\} )} & \text{(write out outcomes in each event)}\\ &= \frac{1}{2}. & \text{(equally likely outcomes)} \end{align*}\]

As we suggested, the probabilities aren’t the same. In the first scenario, we know that at least one child is a girl. This leaves two ways that one child could be a boy (\(BG\) and \(GB\)). In the second scenario we learn that a specific child (the eldest) is a girl, leaving just one way for one child to be a boy (\(GB\)).

Now we show something perhaps even more counter-intuitive. Consider the following third scenario:

  3. Scenario Three: We meet a random one of the two children and learn that they are a girl (e.g., “One of our children happens to be walking over right now. She’s on the women’s swim team.”)

Since we meet a random one of the children (not a specific one, like the eldest), this may feel more like Scenario One, where all we learn is that at least one child is a girl. It is in fact more like the second, however, because by meeting the child we still unambiguously nail down the gender of one of the children (albeit, a random one). This is something we were unable to do in Scenario One.

  3. Solution to Scenario Three: In Scenario Three, there’s additional randomness due to the fact that we randomly meet one of the children. To properly analyze this scenario, we need to augment our sample space so that it also captures this randomness. We now use the sample space \[ \Omega = \{\boldsymbol{B}B, \boldsymbol{B}G, \boldsymbol{G}B, \boldsymbol{G}G, B\boldsymbol{B}, B\boldsymbol{G}, G\boldsymbol{B}, G\boldsymbol{G} \} \] where we still specify the gender of the older child first and the younger one second, but now the bolded letter indicates which child we meet. We’ll assume that these outcomes are equally likely (i.e., regardless of the children’s genders it’s fifty-fifty whether we meet the youngest or eldest child). With this set-up, we can compute

\[\begin{align*} P(2 \text{ girls}| \text{meet girl}) &= \frac{P(2 \text{ girls and meet girl}) }{P(\text{meet girl})} & \text{(def. of conditional probability)} \\ &=\frac{P( \{\boldsymbol{G}G, G\boldsymbol{G} \}) }{P(\{\boldsymbol{G}B, \boldsymbol{G}G, B\boldsymbol{G}, G\boldsymbol{G} \} )} & \text{(write out outcomes in each event)}\\ &= \frac{1}{2}. & (\text{equally likely outcomes}) \end{align*}\]
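Since the augmented sample space has only eight equally likely outcomes, all three scenarios can be verified by exact enumeration. Below is a short Python sketch (the helper names are our own) that applies Definition 5.1 directly:

```python
from fractions import Fraction
from itertools import product

# Augmented sample space: (older child, younger child, index of child we meet),
# with all 8 outcomes equally likely.
outcomes = [(older, younger, meet)
            for older, younger in product("BG", repeat=2)
            for meet in (0, 1)]

def prob(event):
    """P(event) under equally likely outcomes; `event` is a predicate."""
    return Fraction(sum(event(o) for o in outcomes), len(outcomes))

def cond(a, b):
    """P(A | B) = P(A and B) / P(B), per Definition 5.1."""
    return prob(lambda o: a(o) and b(o)) / prob(b)

two_girls    = lambda o: o[0] == "G" and o[1] == "G"
at_least_one = lambda o: "G" in (o[0], o[1])
eldest_girl  = lambda o: o[0] == "G"
meet_girl    = lambda o: o[o[2]] == "G"   # the child we meet is a girl

scenario_one   = cond(two_girls, at_least_one)   # 1/3
scenario_two   = cond(two_girls, eldest_girl)    # 1/2
scenario_three = cond(two_girls, meet_girl)      # 1/2
```

The enumeration reproduces all three answers exactly: \(1/3\), \(1/2\), and \(1/2\).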

Using what we’ve learned from Example 5.3, we can revisit the tricky card game (Example 1.12) we discussed in Chapter 1.

Example 5.4 (Revisiting the tricky card game) Suppose we mix three cards in a hat. One card is black on both sides, one is red on both sides, and one is black on one side and red on the other. We randomly pull out a card and put it on a table. Given that the side facing up is red, what’s the probability that the bottom is also red?

Mimicking our approach from the third scenario in Example 5.3, let’s consider the sample space

\[ \Omega = \{\text{ \textbf{\textcolor{red}{red}}-\textcolor{red}{red}, \textbf{\textcolor{red}{red}}-black, \textbf{black}-black, \textcolor{red}{red}-\textbf{\textcolor{red}{red}}, \textcolor{red}{red}-\textbf{black}, black-\textbf{black}}\} \] where bold indicates the side of the card that ends up facing up. Assuming that all the outcomes are equally likely (i.e., we’re equally likely to pick each of the cards and, for whatever card we pick, either side is equally likely to end up facing upwards), we can compute

\[\begin{align*} &P(\text{bottom red}\ |\ \text{red facing up}) & \\ &= \frac{P(\text{bottom red and red facing up}) }{P(\text{red facing up})} & \text{(def. of conditional probability)} \\ &= \frac{P(\text{\textbf{\textcolor{red}{red}}-\textcolor{red}{red}, \textcolor{red}{red}-\textbf{\textcolor{red}{red}}}) }{P(\{\text{\textbf{\textcolor{red}{red}}-\textcolor{red}{red}, \textcolor{red}{red}-\textbf{\textcolor{red}{red}}, \textbf{\textcolor{red}{red}}-black } \} )} & \text{(write out outcomes in each event)} \\ &= \frac{2}{3} & \text{(equally likely outcomes)} \end{align*}\]

What’s the intuition behind this result? We imagine repeating the game over and over and give a frequentist explanation. A red side is always facing up when we pick the red-red card, while it is only facing up half the time when we pick the red-black card. Because the two cards are picked roughly equally often, we will see a red side facing up twice as often from the red-red card as from the red-black card.
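This frequentist explanation is easy to check by simulation. Below is a sketch (the function name, trial count, and seed are our own choices) that repeatedly plays the game and tallies what’s on the bottom whenever a red side faces up:

```python
import random

def tricky_card_game(trials=100_000, seed=0):
    """Estimate P(bottom red | red facing up) by repeated play."""
    rng = random.Random(seed)
    cards = [("red", "red"), ("red", "black"), ("black", "black")]
    red_up = red_up_and_red_down = 0
    for _ in range(trials):
        card = rng.choice(cards)                               # pick a random card
        up, down = card if rng.random() < 0.5 else card[::-1]  # random side faces up
        if up == "red":
            red_up += 1
            red_up_and_red_down += down == "red"
    return red_up_and_red_down / red_up

estimate = tricky_card_game()  # should be close to 2/3
```

Roughly two-thirds of the red-up trials come from the red-red card, just as the frequentist argument predicts.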

Recall the gambler from Example 1.12, who was trying to convince us that it is equally likely for the bottom face to be red or black. We now see that it’s actually twice as likely to be red! Without even considering the gambler’s strategy, it’s now obvious that once we’ve observed a red side facing up, we shouldn’t wager \(\$1\) that the bottom is black to only win \(\$1\).

5.2 The Multiplication Rule

Conditional probabilities come in handy when we want to compute the probability of multiple events happening simultaneously. To build some intuition as to why, let’s start with a simple example. Say we draw the two cards at the top of a shuffled deck of \(52\) playing cards (see Example 1.5 for a refresher on what’s in a deck of playing cards). What’s the probability that both cards are spades?

Because there are \(13\) spades in the deck, the first card should be a spade \(13/52\) of the time. Given that the first card is a spade, there are \(12\) spades remaining in a deck of \(51\) cards, so the second card will also be a spade \(12/51\) of the time. For both cards to be spades, we need both of these events to happen. This should happen \(12/51\) of the \(13/52\) of the time that the first card is a spade, or in other words, \((13/52) \times (12/51)\) of the time.

This is the idea behind the multiplication rule, which results directly from rearranging terms in the definition of conditional probability (Definition 5.1).

Corollary 5.1 (Multiplication rule for two events) For two events \(A\) and \(B\) with positive probability (i.e., \(P(A) > 0\) and \(P(B) > 0\)), \[ P(A \cap B) = P(B) P(A | B) = P(A) P(B | A). \tag{5.2}\]

Probability zero events

If either \(A\) or \(B\) has zero probability (i.e., \(P(A)=0\) or \(P(B)=0\)), then \(P(A \cap B) = 0\) because \(P(A \cap B) \leq P(A)\) and \(P(A \cap B) \leq P(B)\) by the subset rule (Proposition 4.4).

Proof

The definition of conditional probability (Definition 5.1) tells us that \[P(A|B) = \frac{P(A \cap B)}{P(B)}.\] Multiplying both sides by \(P(B)\), we get that \(P(B) P(A|B) = P(A \cap B)\). Applying the same argument to \(P(B|A)\) gives that \(P(A) P(B|A) = P(A \cap B)\).

A nice way to visualize conditional probabilities and apply the multiplication rule is to draw a probability tree.

Example 5.5 (Selecting cards and probability trees) Suppose we draw two cards at the top of a shuffled deck of \(52\) playing cards. What’s the probability that both cards are spades?

Let \(S_1\) be the event that the first card is a spade and \(S_2\) be the event that the second card is a spade. The following probability tree illustrates the relevant possibilities for our two cards:

Written above or below each branch is the probability of the event the branch leads into, conditional on all the preceding events having happened. To see how to parse the information in the tree, it’s easiest to look at a couple of examples:

  1. Path through \(S_1\) and \(S_2\): The path through \(S_1\) and \(S_2\) tells us that \(P(S_1) = 13/52\) and \(P(S_2 | S_1) = 12/51\).
  2. Path through \(S_1^c\) and \(S_2\): The path through \(S_1^c\) and \(S_2\) tells us that \(P(S_1^c) = 39/52\) and \(P(S_2|S_1^c) = 13/51\).

At the end of a path, we write the product of all the probabilities along the path. The multiplication rule (Corollary 5.1) tells us that this final number is the probability of all the events on the path happening simultaneously. For example:

  1. Drawing two spades: To find the probability of drawing two spades, we can follow the path through \(S_1\) and \(S_2\) to find that \(P(S_1 \cap S_2) = \frac{13}{52} \frac{12}{51}\) at the end.

  2. Drawing no spades: Following the path through \(S^c_1\) and \(S^c_2\) we see that \(P(S^c_1 \cap S^c_2) = \frac{39}{52} \frac{38}{51}\).

We don’t make much more use of probability trees in this book, but they are a popular way of representing conditional probabilities and can be helpful for problem solving.

The multiplication rule can also be easily generalized to more than two events. Before introducing the generalization, however, we first have to introduce some new notation.

When we start dealing with the intersection of many events, it gets cumbersome to write \(P(A_1 \cap \dots \cap A_n)\) for the probability of \(A_1, \dots, A_n\) all happening. Instead, we will adopt the shorthand notation \[ P(A_1, \dots, A_n) = P(A_1 \cap \dots \cap A_n), \tag{5.3}\] where commas inside a probability \(P(\dots)\) are read as an “and”. For example,

  • \(P(A_1, A_2, A_3)\) is the probability of “\(A_1\) and \(A_2\) and \(A_3\)”.
  • \(P(A| B_1, \dots, B_n)\) is the probability of \(A\) conditional on “\(B_1\) and \(B_2\) and \(\dots\) and \(B_n\)” all having happened.

With this new notation, we can comfortably state a generalized version of the multiplication rule.

Theorem 5.1 (Multiplication rule) For events \(A_1, \dots, A_n\) with positive probability, \[ P(A_1, A_2, \dots, A_n) = P(A_1) P(A_2 | A_1) P(A_3 | A_1, A_2) \dots P(A_n| A_1, \dots, A_{n-1}) \tag{5.4}\]

The multiplication rule is formally proved by mathematical induction. Rather than give a formal proof, we’ll provide a more illustrative argument. Looking at the right-hand side of Equation 5.4, we have \[ \textcolor{blue}{P(A_1) P(A_2 | A_1)} P(A_3 | A_1, A_2) \dots P(A_n| A_1, \dots, A_{n-1}) \] The multiplication rule for two events (Corollary 5.1) tells us that the highlighted part can be written as
\[ P(A_1) P(A_2| A_1) = P(A_1, A_2). \] Substituting this back in, the right hand side is now \[ \textcolor{blue}{P(A_1, A_2) P(A_3 | A_1, A_2)} \dots P(A_n| A_1, \dots, A_{n-1}) \] Another application of the multiplication rule for two events (the first event is \(A_3\) and the second is \(A_1 \cap A_2\)) tells us that the new highlighted part can be written as

\[ P(A_1, A_2) P(A_3| A_1, A_2) = P(A_1, A_2, A_3). \]

Substituting this back in, the right hand side is now \[ P(A_1, A_2, A_3) P(A_4 | A_1, A_2, A_3) \dots P(A_n| A_1, \dots, A_{n-1}) \]

Continuing this argument, all of these probabilities will eventually combine to form \(P(A_1, \dots, A_n)\).

Ordering of events

Since there’s nothing special about the order of the events, we can apply the multiplication rule to any ordering of them (as was the case in Corollary 5.1). For example, it’s both true that
\[ P(A_1 , A_2, A_3) = P(A_3) P(A_2 | A_3) P(A_1|A_2, A_3), \] and that \[ P(A_1 , A_2, A_3) = P(A_2) P(A_1 | A_2) P(A_3|A_1, A_2). \] Since there are \(n!\) ways the events can be ordered, the multiplication rule is actually \(n!\) theorems for the price of one! You can pick the ordering that is most convenient for evaluating the probabilities.

In certain settings, the multiplication rule enables us to easily compute the probability of many events happening simultaneously.

Example 5.6 (Probability of a flush) Suppose we select a \(5\)-card poker hand (see the relevant part of Chapter 2 for a review of poker) from a standard deck of playing cards. Imagining that we select the cards like we did in Example 5.5, what’s the probability that we get a flush (five cards of the same suit)?

The event of getting a flush can be written as the disjoint union of getting a flush of spades, a flush of clubs, a flush of diamonds, or a flush of hearts. Therefore, Axiom 3 (see Definition 3.1) tells us that

\[ P(\text{flush})= P(\spadesuit \text{ flush}) + P(\clubsuit \text{ flush}) + P(\diamondsuit \text{ flush}) + P(\heartsuit \text{ flush}). \]

Let’s focus on computing \(P(\spadesuit \text{ flush})\), i.e., the probability that we draw all spades. The rest of the probabilities should be equal by symmetry (there’s nothing special about spades compared to the other suits). Letting \(S_i\) be the event that the \(i\)th card we draw is a spade, we can apply the multiplication rule (Theorem 5.1): \[\begin{align*} &P(\spadesuit \text{ flush}) &\\ &= P(S_1) P(S_2 | S_1) \dots P(S_5| S_4, S_3, S_2, S_1) & \text{(multiplication rule)}\\ &= \frac{13}{52} \frac{12}{51} \frac{11}{50} \frac{10}{49} \frac{9}{48} & \text{(equally likely to select any remaining card)} \end{align*}\]

Indeed, the same argument applies to every suit, implying that \[ P(\text{flush}) = 4 \times \frac{13 \times 12 \times 11 \times 10 \times 9}{52 \times 51 \times 50 \times 49 \times 48} \approx 0.002. \]
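We can reproduce this arithmetic exactly in a couple of lines of Python, using exact fractions rather than floating point:

```python
from fractions import Fraction
from math import prod

# Multiplication rule: P(S1) P(S2|S1) ... P(S5|S1,...,S4) for one suit,
# then multiply by 4 since each suit contributes equally by symmetry.
p_spade_flush = prod(Fraction(13 - k, 52 - k) for k in range(5))
p_flush = 4 * p_spade_flush

print(p_flush, float(p_flush))  # 33/16660, approximately 0.00198
```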

We could have solved this problem by drawing a probability tree as in Example 5.5 and multiplying together probabilities along the appropriate path (this is equivalent to applying the multiplication rule). But, because we drew five cards instead of two, a full probability tree like the one in Example 5.5 would have had \(2^5 = 32\) different paths. Probability trees quickly become unwieldy when there are many events in question! Usually, it’s better to just apply the multiplication rule directly.

5.3 Conditional Probability Functions

Roughly speaking, the job of a probability function \(P\) is to divvy up the total amount of “probability” among the different outcomes in the sample space \(\Omega\). Conditioning on \(B\) can be thought of as re-divvying up this probability to just the outcomes in \(B\). Outcomes outside of \(B\) now get no probability, while outcomes in \(B\) get probability proportional to what they had previously, but rescaled so that the total amount of probability is still one.

Theorem 5.2 formalizes this idea. In particular, it tells us that conditioning on an event exactly corresponds to using a new probability function (i.e., re-divvying up probability).

Theorem 5.2 (Conditional probabilities are probability functions) Consider a probability function \(P(\cdot)\) that assigns probability \(P(A)\) to the event \(A\). For any event \(B\) with positive probability \(P(B) > 0\), the conditional probability function \(P(\cdot| B)\) that assigns probability \(P(A|B)\) to the event \(A\) is itself a valid probability function. That is, it satisfies Kolmogorov’s axioms (Definition 3.1).

Proof

To show that \(P(\cdot|B)\) is a valid probability function, we check that it satisfies Definition 3.1:

  1. Axiom 1: Fix some event \(A\). Because \(P(\cdot)\) satisfies Axiom 1, we know that \(P(A) \geq 0\) and \(P(A \cap B) \geq 0\). Therefore, \(P(A|B) = P(A \cap B)/P(B) \geq 0\) as well. Thus, \(P(\cdot | B)\) satisfies Axiom 1.

  2. Axiom 2: Because \(B \subset \Omega\) we know that \(B \cap \Omega = B\) (see Figure 4.3). Therefore \(P(\Omega | B) = P(\Omega \cap B)/P(B) = P(B)/P(B) = 1\). Thus, \(P(\cdot | B)\) satisfies Axiom 2.

  3. Axiom 3: Consider countably many disjoint sets \(A_1, A_2, \dots\). Because \(A_i \cap B \subseteq A_i\), the countable collection of sets \(A_1 \cap B, A_2 \cap B, \dots\) must all also be disjoint (see Exercise A.1). Also, the set \(\left(\bigcup_{i=1}^{\infty} A_i \right) \cap B\) is the set of outcomes that are in at least one \(A_i\) and also in \(B\). This is the same as the set \(\bigcup_{i=1}^{\infty} (A_i \cap B)\) (see Exercise A.1). Therefore

\[\begin{align} P\left(\bigcup_{i=1}^{\infty} A_i \bigg| B \right) &= \frac{P\left( \left(\bigcup_{i=1}^{\infty} A_i \right) \cap B \right)}{P(B)} & \text{(def. of conditional probability)}\\ &= \frac{P( \bigcup_{i=1}^{\infty} (A_i \cap B) )}{P(B)} & \text{(above reasoning)}\\ &= \frac{\sum_{i=1}^{\infty} P(A_i \cap B)}{P(B)} & \text{(} P \text{ satisfies Axiom 3)}\\ &= \sum_{i=1}^{\infty} \frac{P(A_i \cap B)}{P(B)} & \text{(distribute)}\\ &= \sum_{i=1}^{\infty} P(A_i |B) & \text{(def. of conditional probability)} \end{align}\]

The most important consequence of Theorem 5.2 is that it allows us to apply all the properties from Chapter 4 to conditional probabilities. We know immediately, for example, that all conditional probabilities are also between zero and one (see Proposition 4.2). Here are a couple more examples that illustrate how we can apply the properties from Chapter 4 to conditional probabilities.

Example 5.7 (Random babies, conditioning, and complements) Recall the random babies from Example 5.2. What’s the probability that the first baby is returned incorrectly given that the second baby is also returned incorrectly?

We computed that \(P(E_1 | E_2^c) = (n-2)/(n-1)^2\) in Example 5.2 via counting. Computing \(P(E_1^c|E_2^c)\) via counting is also feasible, but trickier (we encourage you to try and see for yourself why!). Instead, we can leverage the fact that \(P(\cdot| E_2^c)\) is a valid probability function (see Theorem 5.2) to use the complement rule (Proposition 4.1): \[ P(E_1^c | E_2^c) = 1 - P(E_1| E_2^c) = 1 - \frac{n-2}{(n-1)^2}. \]

Incorrect application of complement rule to conditional probabilities

When applying the complement rule (or any other properties) to conditional probabilities we need to ensure that the conditioning event is the same. The following is an incorrect application of the complement rule: \[ P(E_1 | E_2) + P(E_1 | E_2^c) = 1. \]

There’s no way we could apply the complement rule (or any other property) because one probability comes from \(P(\cdot | E_2)\) and the other from \(P(\cdot | E_2^c)\), two entirely different probability functions! In fact, our results from Example 5.2 show explicitly that when \(n > 2\), \[ P(E_1 | E_2) + P(E_1 | E_2^c) = \frac{1}{n-1} + \frac{n-2}{(n-1)^2} = \frac{2n - 3}{(n-1)^2} < 1. \]
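A quick exact computation (the function name is our own) confirms both the algebra and the strict inequality across a range of \(n\):

```python
from fractions import Fraction

def cond_sum(n):
    """P(E1 | E2) + P(E1 | E2^c) from Example 5.2, computed exactly."""
    return Fraction(1, n - 1) + Fraction(n - 2, (n - 1) ** 2)

# For every n > 2 the sum equals (2n - 3)/(n - 1)^2 and falls strictly below 1.
results = {n: cond_sum(n) for n in range(3, 11)}
```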

Example 5.8 (Stanford software engineers with and without computer science degrees) Suppose we randomly select a person from the United States. Given that they’re a Stanford graduate, which of the two events is more likely: (1) the person we select is a software engineer (SWE) or (2) the person we select has a degree in computer science (CS) and is a software engineer?

Many Stanford graduates become software engineers, so the first event seems pretty likely. Many Stanford graduates also hold degrees in computer science (in 2023 it was the most popular major). Because the second event has two qualities that a Stanford graduate is likely to have, you may be fooled into thinking it’s more likely. But still, the second event is a subset of the first (even amongst Stanford graduates, the software engineers with computer science degrees are a subset of the software engineers). Because \(P(\cdot | \text{ Stanford graduate })\) is a valid probability function (see Theorem 5.2), the subset rule (Proposition 4.4) tells us that

\[ P(\text{SWE and CS degree} | \text{Stanford graduate}) \leq P(\text{SWE}| \text{Stanford graduate}). \]

5.4 Interpreting Conditional Probabilities

To close the chapter, we briefly discuss why frequentists and Bayesians alike find the definition of conditional probability given in Definition 5.1 suitable.

5.4.1 Frequentism

Recall that a frequentist defines the probability \(P(A)\) of an event \(A\) to be the limiting frequency at which \(A\) happens over infinitely many repeated experiments. For a frequentist, the natural definition for the probability \(P(A |B)\) of \(A\) given \(B\) should then be limiting frequency of \(A\) happening, but only among trials where we know \(B\) has happened: \[ P(A |B) = \lim_{\text{\# trials} \rightarrow \infty} \frac{\text{\# trials where } A \text{ and } B \text{ happen} }{\text{\# trials where } B \text{ happens}}. \]

With this interpretation we can justify why a frequentist would agree with our definition of conditional probability.

Considering any events \(A\) and \(B\) where \(P(B) > 0\),

\[\begin{align} &P(A | B) & \\ &= \lim_{\text{\# trials} \rightarrow \infty} \frac{\text{\# trials where } A \text{ and } B \text{ happen} }{\text{\# trials where } B \text{ happens}} & \text{(frequentist interpretation)} \\ &= \lim_{\text{\# trials} \rightarrow \infty} \frac{\frac{\text{\# trials where } A \text{ and } B \text{ happen}}{\text{\# trials }}}{\frac{\text{\# trials where } B \text{ happens}}{\text{\# trials }}} & \text{(algebra)}\\ &= \frac{\lim_{\text{\# trials} \rightarrow \infty} \frac{\text{\# trials where } A \text{ and } B \text{ happen}}{\text{\# trials }}}{ \lim_{\text{\# trials} \rightarrow \infty} \frac{\text{\# trials where } B \text{ happens}}{\text{\# trials }}} & \text{(property of limits)}\\ &= \frac{P(A \cap B)}{P(B)} & \text{(frequentist interpretation)} \end{align}\]

Note that we could only apply the limit property because \(P(B) > 0\).

If \(P(B) = 0\), then by the subset rule (see Proposition 4.4), \(P(A \cap B) \leq P(B) = 0\) as well. Therefore the limit that frequentists use to define conditional probability, which appears to approach \(0/0\), is indeterminate (we can’t determine what it is or if it even exists). Events with zero probability, however, have a limiting frequency of zero and (essentially) never happen. As such, frequentists don’t care to define probabilities conditional on them happening.

5.4.2 Bayesianism

Bayesian probabilities are just beliefs about whether an event will happen or not. As such, a Bayesian should want the probability \(P(A|B)\) of \(A\) given \(B\) to be their updated belief regarding \(A\) happening once they’ve found out with certainty that \(B\) has happened.

As an example, suppose we think there’s an \(80\)% chance that our favorite sports team will win tomorrow, meaning we’d wager \(\$1\) on them winning so long as we profit at least \(\$0.25\) if they do. If we find out right before the game that our team’s best player is injured, should we continue taking wagers at these odds? Unsurprisingly, the answer is no. Given this new information, we need to update our beliefs. And if we don’t update them by the proper mechanism, we’ll be susceptible to arbitrage, much like our friend in Example 3.7.

Example 5.9 (Coming soon!)  

Theorem 5.3, which we state without proof, tells us that to avoid arbitrage we must update our beliefs in concordance with the above definition of conditional probability (Definition 5.1).

Theorem 5.3 (Coherence and conditional probability) Consider a Bayesian whose beliefs are given by the probability function \(P(\cdot)\) and an event \(B\) that they assign positive probability \(P(B) > 0\) to. Any Bayesian who, upon observing that \(B\) happened, doesn’t update their beliefs to be given by the conditional probability function \(P(\cdot | B)\) from Theorem 5.2, is susceptible to arbitrage.

In our sports example, Theorem 5.3 tells us that, to avoid arbitrage, we must update our belief that our favorite team will win to \[ P(\text{win} | \text{injury}) = \frac{P(\text{injury and win})}{P(\text{injury})}, \] where \(P(\text{injury})\) was the belief we held that our best player would be injured and \(P(\text{injury and win})\) was the belief we held that our best player would be injured and our team would still win prior to having observed the injury.

This example illustrates how Bayesians are constantly updating their beliefs in concordance with Definition 5.1 as they observe new information. In this view, Bayesians really consider all probabilities conditional probabilities! Even a Bayesian’s “unconditional” beliefs given by \(P(\cdot)\) are implicitly conditional on all the information they’ve ingested up to that point. In our example, our “prior” belief \(P(\text{win})\) that our favorite team would win is implicitly conditional on the games we’ve seen them win/lose, the margins of victory/loss, the quality of the opposing team, the reported injuries up to this point, and so on.

5.5 Exercises

Coming soon