14  Expectations Involving Multiple Random Variables

In Chapter 13, we saw that the relationship between two random variables can be completely described by the joint PMF. However, the joint PMF is usually more information than we need. Can we summarize the relationship between two random variables with a single number? As we will see in this chapter and Chapter 15, \(\text{E}\!\left[ XY \right]\) contains useful information about the relationship between \(X\) and \(Y\). In order to calculate \(\text{E}\!\left[ XY \right]\), we need a general way to evaluate expectations of the form \(\text{E}\!\left[ g(X, Y) \right]\), discussed in the next section.

14.1 2D LotUS

To calculate expectations of the form \(\text{E}\!\left[ g(X, Y) \right]\), we need the following generalization of LotUS (Theorem 11.1) to two variables.

Theorem 14.1 (2D LotUS) Let \(X\) and \(Y\) be random variables with joint PMF \(f_{X, Y}(x,y)\). Then, for a function \(g: \mathbb{R}^2 \to \mathbb{R}\), \[ \text{E}\!\left[ g(X,Y) \right] = \sum_x \sum_y g(x,y) f_{X, Y}(x,y). \tag{14.1}\]

Similar to Theorem 11.1, we calculate the expectation of \(g(X, Y)\) by weighting the possible values of \(g(x, y)\) by the corresponding probabilities, which are now supplied by the joint PMF (because there are two random variables).
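To make Equation 14.1 concrete, here is a minimal computational sketch (not part of the text's examples), assuming the joint PMF is stored as a dictionary that maps \((x, y)\) pairs to probabilities; the helper name `lotus2d` and the toy joint PMF are our own choices.

```python
# A minimal sketch of 2D LotUS (Equation 14.1): weight g(x, y) by the joint PMF.
# The function name `lotus2d` and the dictionary representation are assumptions of this sketch.

def lotus2d(g, joint_pmf):
    """Return E[g(X, Y)] for a joint PMF given as {(x, y): probability}."""
    return sum(g(x, y) * p for (x, y), p in joint_pmf.items())

# Toy example: X and Y each uniform on {0, 1} and independent.
f_xy = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

print(lotus2d(lambda x, y: x + y, f_xy))  # 1.0
print(lotus2d(lambda x, y: x * y, f_xy))  # 0.25
```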

14.2 Expectation of sums

Most traits are polygenic—that is, determined by multiple genes. If \(X\) and \(Y\) represent the numbers of dominant alleles of two genes, and each dominant allele increases the risk of some trait, then \(X + Y\) summarizes a child’s overall risk of the trait. We might want to know their expected risk, \(\text{E}\!\left[ X + Y \right]\). This expectation can be evaluated using 2D LotUS.

Example 14.1 (Expectation of a sum) Let \(X\) and \(Y\) be the two allele counts just described (defined formally in Example 14.5 below). We will evaluate \(\text{E}\!\left[ X + Y \right]\) under the three different models from Chapter 13.

If the two genes are on different chromosomes, then we determined the joint PMF in Example 13.1 to be

| \(y \backslash x\) | \(0\) | \(1\) | \(2\) |
|:---:|:---:|:---:|:---:|
| \(0\) | \(0.0625\) | \(0.125\) | \(0.0625\) |
| \(1\) | \(0.125\) | \(0.25\) | \(0.125\) |
| \(2\) | \(0.0625\) | \(0.125\) | \(0.0625\) |

Letting \(g(x, y) = x + y\) in Theorem 14.1, we have \[ \begin{align} \text{E}\!\left[ X + Y \right] &= \sum_x \sum_y (x + y)\cdot f_{X, Y}(x, y) \\ &= \phantom{.} (0 + 0) \cdot 0.0625 + (0 + 1) \cdot 0.125 + (0 + 2) \cdot 0.0625 \\ &\quad + (1 + 0) \cdot 0.125 + (1 + 1) \cdot 0.25 + (1 + 2) \cdot 0.125 \\ &\quad + (2 + 0) \cdot 0.0625 + (2 + 1) \cdot 0.125 + (2 + 2) \cdot 0.0625 \\ &= 2. \end{align} \]

Now suppose the two genes are on the same chromosome, without recombination. We determined the joint PMF in Example 13.2 to be

| \(y \backslash x\) | \(0\) | \(1\) | \(2\) |
|:---:|:---:|:---:|:---:|
| \(0\) | \(0\) | \(0\) | \(0.25\) |
| \(1\) | \(0\) | \(0.5\) | \(0\) |
| \(2\) | \(0.25\) | \(0\) | \(0\) |

By 2D LotUS, \[ \begin{align} \text{E}\!\left[ X+ Y \right] &= \sum_x \sum_y (x + y) \cdot f_{X, Y}(x, y) \\ &= (2 + 0) \cdot 0.25 + (1 + 1) \cdot 0.5 + (0 + 2) \cdot 0.25 \\ &= 2. \end{align} \]

Finally, suppose that there is recombination. We determined the joint PMF in Example 13.3 to be

| \(y \backslash x\) | \(0\) | \(1\) | \(2\) |
|:---:|:---:|:---:|:---:|
| \(0\) | \(0.0025\) | \(0.045\) | \(0.2025\) |
| \(1\) | \(0.045\) | \(0.41\) | \(0.045\) |
| \(2\) | \(0.2025\) | \(0.045\) | \(0.0025\) |

By 2D LotUS, we have \[ \begin{align} \text{E}\!\left[ X + Y \right] &= \sum_x \sum_y (x+y)\cdot f_{X, Y}(x, y) \\ &= \phantom{.} (0+0) \cdot 0.0025 + (0+1) \cdot 0.045 + (0+2) \cdot 0.2025 \\ &\quad + (1+0) \cdot 0.045 + (1+1) \cdot 0.41 + (1+2) \cdot 0.045 \\ &\quad + (2+0) \cdot 0.2025 + (2+1) \cdot 0.045 + (2+2) \cdot 0.0025 \\ &= 2. \end{align} \]
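As a check on the three calculations above, here is a short sketch (our own, not from the text) that tabulates \(\sum_x \sum_y (x+y)\, f_{X,Y}(x,y)\) for each of the three joint PMFs; the dictionary layout is an assumption of this sketch.

```python
# Check E[X + Y] under the three joint PMFs from Examples 13.1-13.3.
# Each joint PMF is stored as {(x, y): probability}.

different_chromosomes = {
    (x, y): px * py
    for x, px in [(0, 0.25), (1, 0.5), (2, 0.25)]
    for y, py in [(0, 0.25), (1, 0.5), (2, 0.25)]
}
same_chromosome = {(2, 0): 0.25, (1, 1): 0.5, (0, 2): 0.25}
recombination = {
    (0, 0): 0.0025, (1, 0): 0.045, (2, 0): 0.2025,
    (0, 1): 0.045,  (1, 1): 0.41,  (2, 1): 0.045,
    (0, 2): 0.2025, (1, 2): 0.045, (2, 2): 0.0025,
}

for name, pmf in [("different chromosomes", different_chromosomes),
                  ("same chromosome", same_chromosome),
                  ("recombination", recombination)]:
    e_sum = sum((x + y) * p for (x, y), p in pmf.items())
    print(name, round(e_sum, 4))  # 2.0 in every case
```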

Notice that \(\text{E}\!\left[ X + Y \right]\) agrees with \(\text{E}\!\left[ X \right] + \text{E}\!\left[ Y \right] = 2\), no matter what the relationship between \(X\) and \(Y\) is. This is not a coincidence, as the next result shows.

Theorem 14.2 (Linearity of Expectation) Let \(X\) and \(Y\) be random variables. Then, \[ \text{E}\!\left[ X + Y \right] = \text{E}\!\left[ X \right] + \text{E}\!\left[ Y \right]. \]

Proof

Using 2D LotUS with \(g(x,y) = x + y\), we see that \[\begin{align*} \text{E}\!\left[ X + Y \right] &= \sum_x \sum_y (x+y) f_{X, Y}(x,y) \\ &= \sum_x \sum_y x f_{X, Y}(x,y) + \sum_x \sum_y y f_{X, Y}(x,y) \\ &= \sum_x x \sum_y f_{X, Y}(x,y) + \sum_y y \sum_x f_{X, Y}(x,y) \\ &= \sum_x x f_X(x) + \sum_y y f_Y(y) \\ &= \text{E}\!\left[ X \right] + \text{E}\!\left[ Y \right]. \end{align*}\]

Theorem 14.2 is more remarkable than it first appears. It says that \(\text{E}\!\left[ X + Y \right]\), which in principle could depend on the joint distribution of \(X\) and \(Y\), can be calculated using only the distributions of \(X\) and \(Y\) individually. That is, no matter how \(X\) and \(Y\) are related to each other, \(\text{E}\!\left[ X + Y \right]\) will have the same value.

Linearity of expectation is an incredibly powerful tool. If we can break down a complicated random variable into a sum of simpler random variables, then we can use linearity of expectation to calculate its expectation, even if we do not know its PMF.

Example 14.2 (Hire and fire) A manager is hiring an assistant. However, even after she has hired an assistant, she still continues to interview candidates, always hoping for someone better. If she interviews a candidate who is better than her current assistant, she fires the current assistant and hires the new candidate. If she interviews \(n\) candidates in total, how many of those candidates will she hire on average?

Assume that the manager is equally likely to interview the \(n\) candidates in any order. Let \(X\) be the number of candidates hired. Although the PMF of \(X\) is hopelessly complicated, we can express \(X\) as a sum of indicator random variables \[ X = I_1 + I_2 + \dots + I_n, \] where \(I_k\) is \(1\) if the \(k\)th candidate is hired and \(0\) otherwise.

Now, the \(k\)th candidate will be hired if and only if they are the best among the first \(k\) candidates. (If someone else were better, then they or someone even better would currently have the job, and the \(k\)th candidate would not unseat them.) Since any one of the first \(k\) candidates is equally likely to be the best of those \(k\), the probability that the \(k\)th candidate is the best so far is \(\frac{1}{k}\), and \[ \text{E}\!\left[ I_k \right] = P(I_k = 1) = \frac{1}{k}. \]

Therefore, \[ \begin{align} \text{E}\!\left[ X \right] = \sum_{k=1}^n \text{E}\!\left[ I_k \right] &= \sum_{k=1}^n \frac{1}{k} \\ &\approx \int_1^n \frac{1}{x}\,dx \\ &= \ln(n). \end{align} \]

On average, she will hire approximately \(\ln(n)\) candidates.
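A quick simulation can corroborate this. The sketch below is ours, under the assumption that candidate quality can be encoded as a uniformly random permutation of ranks; it compares the simulated average with the exact harmonic sum \(\sum_{k=1}^n \frac{1}{k}\) and with the \(\ln(n)\) approximation.

```python
import math
import random

def hires(n, rng):
    """Simulate one interview order and count how many candidates get hired."""
    ranks = list(range(n))          # higher rank = better candidate
    rng.shuffle(ranks)
    best_so_far = -1
    count = 0
    for r in ranks:
        if r > best_so_far:         # better than the current assistant: hire
            best_so_far = r
            count += 1
    return count

rng = random.Random(0)
n, trials = 100, 20_000
simulated = sum(hires(n, rng) for _ in range(trials)) / trials
harmonic = sum(1 / k for k in range(1, n + 1))
print(simulated, harmonic, math.log(n))
# roughly 5.19 (simulated), 5.187... (exact harmonic sum), 4.605... (ln approximation)
```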

Finally, we use linearity of expectation to provide simple derivations of the binomial and hypergeometric expectations. Compare this next derivation with Proposition 9.2.

Example 14.3 (Binomial expectation using linearity) Let \(X \sim \text{Binomial}(n,p)\). That is, \(X\) is the number of heads in \(n\) tosses of a coin with probability \(p\) of landing heads.

Recall Example 10.6, where we showed that \(X\) could be expressed as \[ X = I_1 + I_2 + \dots + I_n, \] where \(I_1, \dots, I_n\) are independent \(\text{Bernoulli}(p)\) random variables. These Bernoulli random variables are also called indicator random variables because they “indicate” whether an event occurred. In this case, each \(I_k\) “indicates” (i.e., is equal to \(1\)) when the \(k\)th toss is heads.

Now, we can use linearity of expectation to calculate the expectation: \[ \text{E}\!\left[ X \right] = \text{E}\!\left[ I_1 + \cdots + I_n \right] = \text{E}\!\left[ I_1 \right] + \cdots + \text{E}\!\left[ I_n \right] = \underbrace{p + \cdots + p}_{\text{$n$ times}} = np. \]

Not only does linearity simplify the proof, but it also provides insight into why the formula is true.
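For readers who like to verify such identities numerically, here is a small sketch (ours, with arbitrary values of \(n\) and \(p\)) that simulates \(X = I_1 + \cdots + I_n\) directly as a sum of indicator tosses and compares the sample mean with \(np\).

```python
import random

rng = random.Random(0)
n, p, trials = 10, 0.3, 50_000

# X = I_1 + ... + I_n, where each I_k indicates heads on toss k.
def one_binomial():
    return sum(1 if rng.random() < p else 0 for _ in range(n))

mean_x = sum(one_binomial() for _ in range(trials)) / trials
print(mean_x, n * p)   # both close to 3.0
```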

The same strategy works for the hypergeometric distribution. In fact, the hypergeometric distribution has the same expectation as the binomial, precisely because of linearity of expectation, which holds regardless of the relationship between the random variables.

Example 14.4 (Hypergeometric expectation using linearity) Let \(Y \sim \text{Hypergeometric}(n, M, N)\). That is, \(Y\) is the number of defective cans in \(n\) draws without replacement from a lot of \(N\) cans, of which \(M\) are defective.

Again, we can write \(Y\) as a sum of indicator random variables, \[ Y = I_1 + I_2 + \dots + I_n, \] where \(I_k\) is \(1\) if the \(k\)th can sampled is defective. By symmetry, the \(k\)th can sampled is equally likely to be any of the \(N\) cans in the lot, so the (unconditional) probability that it is defective is \(\frac{M}{N}\). Therefore, \(I_k \sim \text{Bernoulli}(p=\frac{M}{N})\) and \(\text{E}\!\left[ I_k \right] = \frac{M}{N}\).

By linearity of expectation: \[ \text{E}\!\left[ Y \right] = \text{E}\!\left[ I_1 + \cdots + I_n \right] = \text{E}\!\left[ I_1 \right] + \cdots + \text{E}\!\left[ I_n \right] = \underbrace{\frac{M}{N} + \cdots + \frac{M}{N}}_{\text{$n$ times}} = n\frac{M}{N}. \] Note that these indicator random variables are not independent because the draws are made without replacement. If \(I_1 = 1\), then \(I_2\) is less likely to equal \(1\) because there is one fewer defective can in the lot.

In Example 14.4, suppose the draws had been made with replacement. Then the indicator random variables would be independent, and \(Y\) would be \(\text{Binomial}(n, p=\frac{M}{N})\). But the expectation would still be \[ \text{E}\!\left[ Y \right] = np = n\frac{M}{N}. \] This reinforces the fact that linearity of expectation does not depend on the relationship between the random variables we are adding.
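The following sketch (ours, with arbitrary values of \(N\), \(M\), and \(n\)) simulates draws from the same lot both without and with replacement; both sample means land near \(n\frac{M}{N}\), even though the indicators are dependent in the first case and independent in the second.

```python
import random

rng = random.Random(0)
N, M, n, trials = 20, 8, 5, 50_000
lot = [1] * M + [0] * (N - M)        # 1 = defective can

# Without replacement (hypergeometric) vs. with replacement (binomial).
without = sum(sum(rng.sample(lot, n)) for _ in range(trials)) / trials
with_repl = sum(sum(rng.choices(lot, k=n)) for _ in range(trials)) / trials

print(without, with_repl, n * M / N)  # all close to 2.0
```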

14.3 Expectation of products

We can also use 2D LotUS to evaluate \(\text{E}\!\left[ XY \right]\). Unlike \(X + Y\), the random variable \(XY\) does not have a simple interpretation. However, as we will see in Chapter 15, its expectation, \(\text{E}\!\left[ XY \right]\), captures useful information about the relationship between \(X\) and \(Y\).

Example 14.5 (Expectation of a product) Recall the random variables from the beginning of Chapter 13. A child is born to parents who are carriers for the recessive alleles of two genes (i.e., their genetic makeups are \(AaBb\)). Let \(X\) be the number of copies of the dominant \(A\) allele that the child inherits and \(Y\) be the number of copies of the dominant \(B\) allele.

We will evaluate \(\text{E}\!\left[ XY \right]\) under the three different models from Chapter 13.

If the two genes are on different chromosomes, then \(X\) and \(Y\) are independent. We determined the joint PMF in Example 13.1 to be

| \(y \backslash x\) | \(0\) | \(1\) | \(2\) |
|:---:|:---:|:---:|:---:|
| \(0\) | \(0.0625\) | \(0.125\) | \(0.0625\) |
| \(1\) | \(0.125\) | \(0.25\) | \(0.125\) |
| \(2\) | \(0.0625\) | \(0.125\) | \(0.0625\) |

Letting \(g(x, y) = xy\) in Theorem 14.1, we have \[ \begin{align} \text{E}\!\left[ XY \right] &= \sum_x \sum_y xy\cdot f_{X, Y}(x, y) \\ &= \phantom{.} (0)(0) \cdot 0.0625 + (0)(1) \cdot 0.125 + (0)(2) \cdot 0.0625 \\ &\quad + (1)(0) \cdot 0.125 + (1)(1) \cdot 0.25 + (1)(2) \cdot 0.125 \\ &\quad + (2)(0) \cdot 0.0625 + (2)(1) \cdot 0.125 + (2)(2) \cdot 0.0625 \\ &= 1. \end{align} \]

Now suppose the two genes are on the same chromosome, without recombination. We determined the joint PMF in Example 13.2 to be

| \(y \backslash x\) | \(0\) | \(1\) | \(2\) |
|:---:|:---:|:---:|:---:|
| \(0\) | \(0\) | \(0\) | \(0.25\) |
| \(1\) | \(0\) | \(0.5\) | \(0\) |
| \(2\) | \(0.25\) | \(0\) | \(0\) |

By 2D LotUS, \[ \begin{align} \text{E}\!\left[ XY \right] &= \sum_x \sum_y xy\cdot f_{X, Y}(x, y) \\ &= (2)(0) \cdot 0.25 + (1)(1) \cdot 0.5 + (0)(2) \cdot 0.25 \\ &= 0.5. \end{align} \]

Finally, suppose that there is recombination. We determined the joint PMF in Example 13.3 to be

| \(y \backslash x\) | \(0\) | \(1\) | \(2\) |
|:---:|:---:|:---:|:---:|
| \(0\) | \(0.0025\) | \(0.045\) | \(0.2025\) |
| \(1\) | \(0.045\) | \(0.41\) | \(0.045\) |
| \(2\) | \(0.2025\) | \(0.045\) | \(0.0025\) |

By 2D LotUS, we have \[ \begin{align} \text{E}\!\left[ XY \right] &= \sum_x \sum_y xy\cdot f_{X, Y}(x, y) \\ &= \phantom{.} (0)(0) \cdot 0.0025 + (0)(1) \cdot 0.045 + (0)(2) \cdot 0.2025 \\ &\quad + (1)(0) \cdot 0.045 + (1)(1) \cdot 0.41 + (1)(2) \cdot 0.045 \\ &\quad + (2)(0) \cdot 0.2025 + (2)(1) \cdot 0.045 + (2)(2) \cdot 0.0025 \\ &= 0.6. \end{align} \]

Even though \(X\) and \(Y\) are \(\textrm{Binomial}(n= 2, p= 1/2)\) in all three cases, the value of \(\text{E}\!\left[ XY \right]\) is different in each case. This is because \(\text{E}\!\left[ XY \right]\) captures information about the relationship between the two variables, not just \(X\) and \(Y\) individually.
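The sketch below (ours, reusing the dictionary representation from the earlier snippets) reproduces the three values of \(\text{E}\!\left[ XY \right]\) and also tabulates \(\text{E}\!\left[ X \right] \text{E}\!\left[ Y \right]\), anticipating the next observation.

```python
# Reproduce E[XY] under the three joint PMFs and compare with E[X] E[Y].
different_chromosomes = {
    (x, y): px * py
    for x, px in [(0, 0.25), (1, 0.5), (2, 0.25)]
    for y, py in [(0, 0.25), (1, 0.5), (2, 0.25)]
}
same_chromosome = {(2, 0): 0.25, (1, 1): 0.5, (0, 2): 0.25}
recombination = {
    (0, 0): 0.0025, (1, 0): 0.045, (2, 0): 0.2025,
    (0, 1): 0.045,  (1, 1): 0.41,  (2, 1): 0.045,
    (0, 2): 0.2025, (1, 2): 0.045, (2, 2): 0.0025,
}

for name, pmf in [("different chromosomes", different_chromosomes),
                  ("same chromosome", same_chromosome),
                  ("recombination", recombination)]:
    e_xy = sum(x * y * p for (x, y), p in pmf.items())
    e_x = sum(x * p for (x, _), p in pmf.items())
    e_y = sum(y * p for (_, y), p in pmf.items())
    print(name, round(e_xy, 4), round(e_x * e_y, 4))
# E[XY] comes out to 1.0, 0.5, 0.6; E[X] E[Y] = 1.0 in all three cases.
```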

In particular, notice that \(\text{E}\!\left[ XY \right] \neq \text{E}\!\left[ X \right] \text{E}\!\left[ Y \right]\) in general. However, in Example 14.5, we saw that they are equal when \(X\) and \(Y\) are independent. This is no accident, as the next theorem shows.

Theorem 14.3 (Expectation of a product of independent random variables) If \(X\) and \(Y\) are independent random variables, then \[ \text{E}\!\left[ XY \right] = \text{E}\!\left[ X \right] \text{E}\!\left[ Y \right]. \] Moreover, for functions \(g\) and \(h\), \[ \text{E}\!\left[ g(X) h(Y) \right] = \text{E}\!\left[ g(X) \right] \text{E}\!\left[ h(Y) \right]. \]

Proof

Using 2D LotUS, we see that \[\begin{align*} \text{E}\!\left[ XY \right] &= \sum_x \sum_y xy f_{X, Y}(x,y) \\ &= \sum_x \sum_y xy f_X(x) f_Y(y) \qquad (\text{$X$ and $Y$ are independent}) \\ &= \left( \sum_x x f_X(x) \right) \left( \sum_y y f_Y(y) \right) \\ &= \text{E}\!\left[ X \right] \text{E}\!\left[ Y \right]. \end{align*}\]

The proof of the second part is similar.
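As a numerical illustration (ours, not from the text), the sketch below checks both parts of Theorem 14.3 for two independent fair dice, using exact rational arithmetic; the functions \(g(x) = x^2\) and \(h(y) = 2^y\) are arbitrary choices.

```python
from fractions import Fraction

# Two independent fair dice: the joint PMF factors as f_X(x) * f_Y(y).
faces = range(1, 7)
p = Fraction(1, 6)

def expect(func):
    """E[func(X, Y)] via 2D LotUS, with f_{X,Y}(x, y) = p * p."""
    return sum(func(x, y) * p * p for x in faces for y in faces)

e_xy = expect(lambda x, y: x * y)
e_x = sum(x * p for x in faces)
e_y = sum(y * p for y in faces)
print(e_xy, e_x * e_y)                  # both 49/4

# Second part: E[g(X) h(Y)] = E[g(X)] E[h(Y)] for g(x) = x**2, h(y) = 2**y.
e_gh = expect(lambda x, y: x**2 * 2**y)
e_g = sum(x**2 * p for x in faces)
e_h = sum(2**y * p for y in faces)
print(e_gh, e_g * e_h)                  # equal
```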

However, unlike linearity of expectation, Theorem 14.3 only holds when the random variables are independent. In general, to evaluate \(\text{E}\!\left[ XY \right]\), we need to use 2D LotUS.

14.4 Exercises

Exercise 14.1 (Using 2D LotUS) Using the joint PMF from Exercise 13.1,

  1. Compute \(\text{E}\!\left[ X+Y \right]\) using 2D LotUS. Is it equal to \(\text{E}\!\left[ X \right] + \text{E}\!\left[ Y \right]\)? Why or why not?
  2. Compute \(\text{E}\!\left[ XY \right]\) using 2D LotUS. Is it equal to \(\text{E}\!\left[ X \right] \text{E}\!\left[ Y \right]\)? Why or why not?

Exercise 14.2 (Expected number of Secret Santa matches, revisited) In Exercise 9.1, you showed that the expected number of friends who draw their own name in Secret Santa is \(1\), when there are \(n = 4\) friends. You calculated this using the PMF of the number of friends who draw their own name, which you derived in Exercise 8.1.

How many friends are expected to draw their own name if there are \(n\) friends? There is no simple expression in general for the PMF of \(X\), the number of friends who draw their own name. Nevertheless, derive \(\text{E}\!\left[ X \right]\) using linearity of expectation.

Exercise 14.3 (Expected number of birthday matches) In Example 2.3, we showed that the probability that at least two people in a room of \(n\) people share a birthday is surprisingly high, even for a moderately sized room, such as \(n = 30\).

  1. Calculate the expected number of days with a shared birthday. (Your expression should be in terms of \(n\), but try plugging in specific values of \(n\) to see if your expected value makes sense in light of Example 2.3.)
  2. Calculate the expected number of pairs of people who share a birthday. (For example, if Aldo, Bri, and Chen all have a birthday on April 20, then that would be three pairs of people who share a birthday: Aldo and Bri, Aldo and Chen, and Bri and Chen.)

Exercise 14.4 (Expectation of the negative binomial) In Proposition 12.2, we showed directly that the expectation of a \(\text{NegativeBinomial}(r, p)\) random variable \(X\) is \(\text{E}\!\left[ X \right] = \frac{r}{p}\).

Express \(X\) as a sum of simpler random variables, and provide a simpler derivation of this result using linearity of expectation. (Hint: You may use the result from Example 9.7.)

Exercise 14.5 (Another proof of inclusion-exclusion) Let \(A_1, A_2, \dots, A_n\) be events. Let \(I_{\bigcup_i A_i}\) be the indicator that at least one of the events occurs, and let \(I_k\) be the indicator of the event \(A_k\).

  1. Express \(I_{\bigcup_i A_i}\) in terms of \(I_1, \dots, I_n\).
  2. Use linearity of expectation to calculate \(\text{E}\!\left[ I_{\bigcup_i A_i} \right]\) and establish the inclusion-exclusion formula Equation 4.6.

Exercise 14.6 (Happy Meals) McDonald’s decides to give a Pokemon toy with each Happy Meal purchased. Each time you purchase a Happy Meal, you are equally likely to get any one of the 6 types of Pokemon toys. What is the expected number of Happy Meals you need to purchase until you “catch them all”?

Hint: Express your random variable as a sum of geometric random variables.

Exercise 14.7 (Cars in traffic) Suppose that \(n\) cars are traveling, each at a different constant speed, down a one-lane road. Assuming that the cars are equally likely to be in any order, what is the expected number of cars that will eventually be stuck behind a slower car?

Exercise 14.8 (Expected number of hash collisions) In Exercise 2.8, you calculated the probability of a collision in a hash table. In this exercise, we will consider the expected number of collisions in a hash table with \(k\) records and \(n\) addresses, assuming \(k \leq n\).

  1. Calculate the expected number of “empty” addresses (i.e., addresses with no records stored).
  2. Calculate the expected number of addresses with collisions.

Exercise 14.9 (Quicksort) Sorting is a fundamental problem in computer science. Here is one recipe for sorting a list, such as \[ [3, 8, 1, 4, 9, 6], \] called quicksort. (A short code sketch of the procedure follows the steps below.)

  1. Pick a random number from the list to serve as a “pivot”. Suppose we pick the number \(4\).
  2. Organize the list into three parts: a list of the numbers less than the pivot, the pivot itself, and a list of the numbers greater than the pivot. For example, if the pivot is \(4\), then the above list becomes \[ [3, 1], \underline{4}, [8, 9, 6]. \]
  3. Now, for each of the lists above, repeat steps 1 and 2. If a list only has one number or is empty, then it is sorted. Applying this to the above list, suppose we choose \(3\) as the pivot for the left list and \(8\) as the pivot for the right list. Then, we end up with the following list, which is sorted. \[ [1], \underline{3}, [], 4, [6], \underline{8}, [9] \]
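Here is a minimal sketch of the procedure just described (our own implementation, assuming the numbers are distinct), with a counter that records one comparison for each number compared against a pivot.

```python
import random

def quicksort(nums, rng, counter):
    """Sort distinct numbers with a random pivot; accumulate comparisons in counter[0]."""
    if len(nums) <= 1:
        return nums                              # a list of length 0 or 1 is already sorted
    pivot = rng.choice(nums)
    counter[0] += len(nums) - 1                  # the pivot is compared once with every other number
    smaller = [x for x in nums if x < pivot]     # numbers less than the pivot
    larger = [x for x in nums if x > pivot]      # numbers greater than the pivot
    return quicksort(smaller, rng, counter) + [pivot] + quicksort(larger, rng, counter)

rng = random.Random(0)
counter = [0]
print(quicksort([3, 8, 1, 4, 9, 6], rng, counter))  # [1, 3, 4, 6, 8, 9]
print(counter[0])                                   # number of comparisons for this run
```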

How long does it take for quicksort to sort a list of \(n\) numbers? One way to measure this is to count the number of comparisons that are made between two numbers.

  1. Let \(X\) be the number of comparisons made by quicksort (when a pivot is chosen randomly). Define indicator random variables \(I_{jk}\), for \(j < k\), in terms of the \(j\)th smallest and \(k\)th smallest numbers, \(x_{(j)}\) and \(x_{(k)}\), such that \[ X = \sum_{j=1}^{n-1} \sum_{k=j+1}^n I_{jk}. \]
  2. Calculate \(\text{E}\!\left[ X \right]\). (Hint: In order for a comparison to be made between two numbers, one of them must be chosen as a pivot when they are in the same list. If any number between them is chosen as a pivot first, then they will end up in different lists and never be compared.)
  3. Come up with a simple approximation for \(\text{E}\!\left[ X \right]\) by approximating the sums by integrals.

Exercise 14.10 (Ratio of die rolls) Suppose you roll a fair die twice. Let \(X\) be the outcome of the first roll and \(Y\) be the outcome of the second roll.

  1. Compute \(\text{E}\!\left[ \displaystyle \frac{X}{Y} \right]\). Is this equal to \(\displaystyle \frac{\text{E}\!\left[ X \right]}{\text{E}\!\left[ Y \right]}\)?
  2. Explain how you could use Theorem 14.3 here to obtain the correct answer.