17  Conditional Expectations

\[ \def\mean{\textcolor{red}{85.2}} \def\prob{\textcolor{blue}{.3}} \]

In Chapter 16, we saw that it is sometimes easier to calculate an expected value \(\text{E}[ X ]\) by first calculating conditional expectations \(\text{E}[ X | N = n ]\). The key trick was the Law of Total Expectation (Theorem 16.1).

In this chapter, we introduce additional notation for conditional expectations that will unlock sophisticated calculations. But before defining this new notation formally, we introduce it using an example.

Example 17.1 (Corned Beef Expectations) In Example 16.3, we used the fact that \(X | N = n\) is \(\text{Binomial}(n, p=\prob)\), so \[ \text{E}[ X | N = n ] = n(\prob).\]

Notice that \(\text{E}[ X | N = n ]\) is a function of \(n\). If we now let \(N\) range over all its possible values, we can regard the conditional expectation as a function of the random variable \(N\).

We will use the notation \[ \text{E}[ X | N ] \overset{\text{def}}{=}N (\prob)\] for this function of \(N\).

Now, let’s define the notation \(\text{E}[ X | N ]\) more formally.

Definition 17.1 (Conditional Expectation Given a Random Variable) Let \(g(n) \overset{\text{def}}{=} \text{E}[ X | N = n ]\) be a function of \(n\) which specifies the expectation conditional on that particular value of \(N\), as in Equation 16.5.

Then, the expectation of \(X\) conditional on the random variable \(N\) is defined as \[\text{E}[ X | N ] \overset{\text{def}}{=}g(N).\]

This notation allows for a compact statement of the Law of Total Expectation.

Theorem 17.1 (Law of Total Expectation) \[ \text{E}[ X ] = \text{E}[ \text{E}[ X | N ] ]. \]

Proof

We already know from Theorem 16.1 that \[ \text{E}[ X ] = \sum_n P(N = n) \text{E}[ X | N = n ], \tag{17.1}\] so all we need to do is show that the right-hand side is equivalent to \(\text{E}[ \text{E}[ X | N ] ]\). To do this, we just need to unpack Definition 17.1.

\[\begin{align*} \text{E}[ \text{E}[ X | N ] ] &= \text{E}[ g(N) ] \\ &= \sum_n g(n) P(N = n) & \text{(LotUS)} \\ &= \sum_n \text{E}[ X | N=n ] P(N = n) & \text{(definition of $g(n)$)} \\ &= \text{E}[ X ] \end{align*}\]

This version of the Law of Total Expectation yields an elegant solution to the corned beef example (Example 17.1), without any complicated algebra.

Example 17.2 (Corned Beef Expectations) We showed in Example 17.1 that \[ \text{E}[ X | N ] = N (\prob). \] Substituting this into the Law of Total Expectation (Theorem 17.1), we obtain: \[ \begin{aligned} \text{E}[ X ] &= \text{E}[ \text{E}[ X | N ] ] \\ &= \text{E}[ N(\prob) ] \\ &= \text{E}[ N ] (\prob) \\ &= \mean (\prob) \\ &= 25.56. \end{aligned} \]
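As a sanity check, here is a minimal simulation sketch of this calculation. It assumes, consistent with the moments used above, that \(N\) is \(\text{Poisson}(\mean)\) and that \(X\) given \(N\) is \(\text{Binomial}(N, \prob)\); the seed and simulation size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 1_000_000

# N ~ Poisson(85.2) customers; X | N ~ Binomial(N, 0.3) corned beef orders.
N = rng.poisson(85.2, size=n_sims)
X = rng.binomial(N, 0.3)

print(X.mean())          # E[X]: close to 25.56
print((0.3 * N).mean())  # E[E[X | N]] = E[N(0.3)]: also close to 25.56
```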

17.1 Properties of Conditional Expectation

Theorem 17.1 is one property of conditional expectations, but there are many others, most of which are analogous to properties of ordinary expectations.

For example, conditional expectations are linear. This fact follows from the ordinary linearity of expectation because the conditional PMF is itself a PMF.

Proposition 17.1 (Linearity of Conditional Expectation) \[ \begin{aligned} \text{E}[ Y_1 + Y_2 | X ] = \text{E}[ Y_1 | X ] + \text{E}[ Y_2 | X ] \end{aligned} \tag{17.2}\]

The next property is analogous to “pulling out constants” from ordinary expectation. Conditional on \(X\), any function \(g(X)\) is a constant, so it can be pulled outside the conditional expectation.

Proposition 17.2 (Pulling Out What’s Given) \[ \begin{aligned} \text{E}[ g(X) Y | X ] = g(X) \text{E}[ Y | X ] \end{aligned} \tag{17.3}\]

Note that both sides of Equation 17.3 are random variables that are functions of \(X\), in line with what we would expect, since a conditional expectation given \(X\) is a function of \(X\).
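Propositions 17.1 and 17.2 can be checked numerically on a toy example. In the sketch below, the joint PMF is made up purely for illustration, and `cond_exp` is a hypothetical helper that computes conditional expectations directly from the definition; Proposition 17.1 is checked with \(Y_1 = X\) and \(Y_2 = Y\), and Proposition 17.2 with \(g(x) = x^2\).

```python
# A made-up joint PMF over (x, y) pairs, chosen only for illustration.
pmf = {(0, 1): 0.10, (0, 2): 0.20, (1, 1): 0.30, (1, 2): 0.25, (2, 2): 0.15}

def cond_exp(h, x):
    """E[h(X, Y) | X = x], computed directly from the conditional PMF."""
    p_x = sum(p for (xx, _), p in pmf.items() if xx == x)
    return sum(h(xx, y) * p / p_x for (xx, y), p in pmf.items() if xx == x)

g = lambda x: x ** 2   # any function of X works here

for x in sorted({xx for xx, _ in pmf}):
    # Proposition 17.1 (with Y1 = X, Y2 = Y): E[X + Y | X = x] = E[X | X = x] + E[Y | X = x]
    lin_lhs = cond_exp(lambda xx, y: xx + y, x)
    lin_rhs = cond_exp(lambda xx, y: xx, x) + cond_exp(lambda xx, y: y, x)
    # Proposition 17.2: E[g(X) Y | X = x] = g(x) E[Y | X = x]
    pull_lhs = cond_exp(lambda xx, y: g(xx) * y, x)
    pull_rhs = g(x) * cond_exp(lambda xx, y: y, x)
    print(x, abs(lin_lhs - lin_rhs) < 1e-12, abs(pull_lhs - pull_rhs) < 1e-12)
```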

There are also properties specific to conditional expectation. The next property concerns the conditional expectation when the two random variables are independent.

Proposition 17.3 (Conditional Expectation of an Independent Random Variable) If \(X\) and \(Y\) are independent random variables, then \[ \text{E}[ Y | X ] = \text{E}[ Y ]. \]

Proof

Let \(g(x) = \text{E}[ Y | X = x ]\). We need to show that \(\text{E}[ Y | X ] = g(X) = \text{E}[ Y ]\). (Note that \(\text{E}[ Y ]\) is a constant.)

To do this, it suffices to show that \(g(x) = \text{E}[ Y ]\) for any \(x\).

\[\begin{align*} g(x) &= \text{E}[ Y | X = x ] \\ &= \sum_y y P(Y = y | X = x) & \text{(definition of conditional expectation)} \\ &= \sum_y y P(Y = y) & \text{($X$ and $Y$ are independent)} \\ &= \text{E}[ Y ] & \text{(definition of expected value)} \end{align*}\]
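Here is a quick simulation sketch of Proposition 17.3; the particular distributions are arbitrary choices. Within each group \(X = x\), the average of \(Y\) should be roughly \(\text{E}[ Y ]\).

```python
import numpy as np

rng = np.random.default_rng(1)

# X and Y independent; the distributions below are arbitrary choices for illustration.
X = rng.integers(1, 7, size=1_000_000)      # a fair six-sided die
Y = rng.exponential(2.0, size=1_000_000)    # E[Y] = 2

# E[Y | X = x] should be approximately E[Y] = 2 for every x.
for x in range(1, 7):
    print(x, round(Y[X == x].mean(), 3))
```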

Now, we apply these properties to prove a beautiful result about the expectation of the sum of a random number of random variables, named for the statistician Abraham Wald.

Example 17.3 (Wald’s Identity) Let \(X_1, X_2, ...\) be i.i.d. random variables with expected value \(\mu\). Let \(N\) be a random variable, independent of \(X_1, X_2, ...\), that only takes on the values \(0, 1, 2, ...\).

Suppose that we sum a random number \(N\) of the random variables \(X_i\). That is, define the random variable \[ S \overset{\text{def}}{=} \sum_{i=1}^N X_i. \] What is its expected value, \(\text{E}[ S ]\)?

The key is to condition on \(N\) so that \(S\) is a sum of a fixed number of random variables.

\[ \begin{aligned} \text{E}[ S | N ] &= \text{E}\Big[\sum_{i=1}^N X_i \Big| N \Big] \\ &= \sum_{i=1}^N \text{E}[ X_i | N ] & \text{(linearity of conditional expectation)} \\ &= \sum_{i=1}^N \underbrace{\text{E}[ X_i ]}_{\mu} & \text{(independence of $N$ and $X_i$)} \\ &= N \mu & \text{($X_1, X_2, ...$ are i.i.d.)} \end{aligned} \] Notice that this conditional expectation is a function of the random variable \(N\).

By the Law of Total Expectation (Theorem 17.1), the expected value is

\[ \text{E}[ S ] = \text{E}[ \text{E}[ S | N ] ] = \text{E}[ N \mu ] = \text{E}[ N ] \mu. \]

Wald’s Identity is intuitive. On average, the sum should be the expected value of each random variable, multiplied by how many random variables we expect to have.

Wald’s Identity is also very useful. In fact, Example 17.2 is a special case of Wald’s Identity. If each \(X_i\) is a \(\text{Bernoulli}(p=\prob)\) random variable, where \(X_i = 1\) indicates that the \(i\)th customer purchased a corned beef sandwich, then \(S\) represents the number of corned beef sandwiches sold. Using Wald’s Identity, we see that the expected number of corned beef sandwiches sold is: \[ \text{E}[ S ] = \text{E}[ N ] \mu = \mean \cdot \prob = 25.56. \]
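Here is a simulation sketch of Wald’s Identity with a different pair of distributions, chosen arbitrarily just to emphasize that the identity only requires the \(X_i\) to be i.i.d. and \(N\) to be independent of them.

```python
import numpy as np

rng = np.random.default_rng(2)
n_sims = 100_000

# N ~ Poisson(10), independent of the X_i; each X_i ~ Exponential with mean mu = 2.
# (Arbitrary choices: Wald's Identity holds for any i.i.d. X_i independent of N.)
N = rng.poisson(10, size=n_sims)
S = np.array([rng.exponential(2.0, size=n).sum() for n in N])

print(S.mean())   # close to E[N] * mu = 10 * 2 = 20
```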

17.2 Law of Total Variance

What about the variance of the number of corned beef sandwiches sold? Because we know that \(X\) follows a \(\text{Poisson}(\mu=\mean \cdot \prob)\) distribution, the answer must be \[ \text{Var}[ X ] = \mu = \mean \cdot \prob = 25.56. \]

But what if we did not already know the marginal distribution of \(X\)? Is there a way to calculate \(\text{Var}[ X ]\) from the conditional distribution \(X | N\)?

The answer is yes, but in order to state the result, we first need to define the conditional variance.

Definition 17.2 (Conditional Variance) The variance of \(X\) conditional on a random variable \(N\) is defined as \[ \text{Var}[ X | N ] \overset{\text{def}}{=}\text{E}[ (X - \text{E}[ X | N ])^2 | N ]. \tag{17.4}\]

There is a shortcut formula for the conditional variance, analogous to the shortcut formula for ordinary variance: \[ \text{Var}[ X | N ] = \text{E}[ X^2 | N ] - \text{E}[ X | N ]^2 \tag{17.5}\]
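One way to derive Equation 17.5 is to expand the square in Equation 17.4 and apply linearity (Proposition 17.1) and pulling out what’s given (Proposition 17.2), with \(N\) in place of \(X\):

\[\begin{align*} \text{Var}[ X | N ] &= \text{E}[ (X - \text{E}[ X | N ])^2 | N ] \\ &= \text{E}[ X^2 - 2 X \text{E}[ X | N ] + \text{E}[ X | N ]^2 | N ] \\ &= \text{E}[ X^2 | N ] - 2 \text{E}[ X | N ]^2 + \text{E}[ X | N ]^2 \\ &= \text{E}[ X^2 | N ] - \text{E}[ X | N ]^2. \end{align*}\]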

All of these formulas are analogous to the formulas for ordinary variance, except that all expectations are conditional on \(N\).

The Law of Total Variance provides a way to calculate \(\text{Var}[ X ]\) from the conditional variance \(\text{Var}[ X|N ]\) and the conditional mean \(\text{E}[ X | N ]\).

Theorem 17.2 (Law of Total Variance) \[ \text{Var}[ X ] = \text{E}[ \text{Var}[ X | N ] ] + \text{Var}[ \text{E}[ X | N ] ]. \]

Proof

By the shortcut formula for conditional variance (Equation 17.5), the first term on the right-hand side is \[ \text{E}[ \text{Var}[ X|N ] ] = \text{E}\big[\text{E}[ X^2 | N ] - \text{E}[ X | N ]^2\big] = \text{E}[ X^2 ] - \text{E}[ \text{E}[ X | N ]^2 ].\]

By the shortcut formula for ordinary variance, applied to the random variable \(Y \overset{\text{def}}{=}\text{E}[ X | N ]\), the second term on the right-hand side is \[ \underbrace{\text{Var}[ \text{E}[ X | N ] ]}_{\text{Var}[ Y ]} = \underbrace{\text{E}[ \text{E}[ X | N ]^2 ]}_{\text{E}[ Y^2 ]} - \underbrace{\text{E}[ \text{E}[ X | N ] ]^2}_{\text{E}[ Y ]^2} = \text{E}[ \text{E}[ X | N ]^2 ] - \text{E}[ X ]^2 \]

Adding the two expressions above, the \(\text{E}[ \text{E}[ X | N ]^2 ]\) terms cancel, and we obtain: \[\begin{align*} \text{E}[ \text{Var}[ X|N ] ] + \text{Var}[ \text{E}[ X | N ] ] &= \big(\text{E}[ X^2 ] - \text{E}[ \text{E}[ X | N ]^2 ]\big) + \big(\text{E}[ \text{E}[ X | N ]^2 ] - \text{E}[ X ]^2\big) \\ &= \text{E}[ X^2 ] - \text{E}[ X ]^2 \\ &= \text{Var}[ X ]. \end{align*}\]

Let’s use the Law of Total Variance to calculate the variance of the number of corned beef sandwiches sold.

Example 17.4 (Corned Beef Variance) Since \(X | N = n\) is \(\text{Binomial}(n, p=\prob)\), the conditional expectation is \[ \text{E}[ X | N ] = N (\prob) \] and the conditional variance is \[ \text{Var}[ X | N ] = N (\prob) (1 - \prob). \]

Now, using the Law of Total Variance, pulling out constants, and recalling that \(N\) is \(\text{Poisson}(\mean)\), so \(\text{E}[ N ] = \text{Var}[ N ] = \mean\), we see that \[ \begin{aligned} \text{Var}[ X ] &= \text{E}[ \text{Var}[ X|N ] ] + \text{Var}[ \text{E}[ X | N ] ] \\ &= \text{E}[ N(\prob)(1 - \prob) ] + \text{Var}[ N(\prob) ] \\ &= \text{E}[ N ](\prob)(1 - \prob) + \text{Var}[ N ](\prob)^2 \\ &= (\mean)(\prob)(1 - \prob) + (\mean)(\prob)^2 \\ &= 25.56, \end{aligned} \] which matches the answer that we got by using the fact that \(X\) is Poisson. However, the Law of Total Variance does not require first finding the distribution of \(X\).
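As a numerical check on this decomposition, here is a simulation sketch, again taking \(N\) to be \(\text{Poisson}(\mean)\) (so that \(\text{E}[ N ] = \text{Var}[ N ] = \mean\)) and \(X\) given \(N\) to be \(\text{Binomial}(N, \prob)\).

```python
import numpy as np

rng = np.random.default_rng(3)
n_sims = 1_000_000

# N ~ Poisson(85.2) customers; X | N ~ Binomial(N, 0.3) corned beef orders.
N = rng.poisson(85.2, size=n_sims)
X = rng.binomial(N, 0.3)

within = (N * 0.3 * 0.7).mean()   # estimates E[Var[X | N]] = E[N](0.3)(0.7)
between = (N * 0.3).var()         # estimates Var[E[X | N]] = Var[N](0.3)^2

print(X.var())            # Var[X]: close to 25.56
print(within + between)   # the decomposition: also close to 25.56
```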

The Law of Total Variance is a decomposition of the variance of \(X\) into two components based on \(N\). If we think of \(N\) as indicating membership in a group, then

  • \(\text{E}[ X | N ]\) represents the mean of each group, and
  • \(\text{Var}[ X | N ]\) represents the variance within each group.

The two components of \(\text{Var}[ X ]\) according to the Law of Total Variance are

  • \(\text{E}[ \text{Var}[ X | N ] ]\), which measures how much variance there is within groups (on average), and
  • \(\text{Var}[ \text{E}[ X | N ] ]\), which measures how much variance there is between the means of the groups.

For this reason, the Law of Total Variance is also called the “variance decomposition formula”. This variance decomposition formula is the basis of statistical methods such as Analysis of Variance (ANOVA), which compare the between-group variance to the within-group variance.

Common Error!

A common mistake is to calculate \(\text{Var}[ X ]\) as

\[ \sum_n P(N = n) \text{Var}[ X | N = n ]. \] This formula is tempting because it “looks like” the Law of Total Probability and the Law of Total Expectation.

But the Law of Total Variance takes a different form! The formula above is just the within-group component of the Law of Total Variance, \[ \text{E}[ \text{Var}[ X | N ] ], \] but the Law of Total Variance also has a between-group component, \[ \text{Var}[ \text{E}[ X | N ] ]. \]
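To see how far off the mistaken formula can be, apply it to the corned beef example (Example 17.4). It gives only the within-group component, \[ \text{E}[ \text{Var}[ X|N ] ] = (\mean)(\prob)(1 - \prob) = 17.892, \] which falls short of the true variance, 25.56, by exactly the between-group component, \[ \text{Var}[ \text{E}[ X | N ] ] = (\mean)(\prob)^2 = 7.668. \]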

17.3 Exercises

Exercise 17.1 A fair, six-sided die is rolled repeatedly until a one comes up. Let \(X\) be the number of sixes that were rolled in the meantime.

  1. What is \(\text{E}[ X ]\)?
  2. What is \(\text{Var}[ X ]\)?