27  Conditional Expectations

In Chapter 17, we defined the notation \[ \text{E}\!\left[ Y \mid X \right] \] and used it to calculate \(\text{E}\!\left[ Y \right]\). The trick was the Law of Total Expectation, which we now prove for continuous random variables.

Theorem 27.1 (Law of total expectation) \[ \text{E}\!\left[ Y \right] = \text{E}\!\left[ \text{E}\!\left[ Y \mid X \right] \right]. \]

Proof

First, \(\text{E}\!\left[ Y\mid X \right] \overset{\text{def}}{=}g(X)\), where \[ g(x) = \text{E}\!\left[ Y\mid X=x \right] = \int_{-\infty}^\infty y f_{Y\mid X}(y\mid x)\,dy. \] Note that to calculate \(\text{E}\!\left[ Y\mid X=x \right]\), we simply integrate \(y\) times the conditional PDF of \(Y\) given \(X\).

Therefore, \[ \begin{align} \text{E}\!\left[ \text{E}\!\left[ Y\mid X \right] \right] &= \text{E}\!\left[ g(X) \right] & \text{(definition of $\text{E}\!\left[ Y\mid X \right]$)} \\ &= \int_{-\infty}^\infty g(x) f_X(x)\,dx & \text{(LotUS)} \\ &= \int_{-\infty}^\infty \left[\int_{-\infty}^\infty y f_{Y\mid X}(y\mid x)\,dy \right] f_X(x)\,dx & \text{(formula for $g(x)$)} \\ &= \int_{-\infty}^\infty \int_{-\infty}^\infty y f_{Y\mid X}(y\mid x) f_X(x) \,dy \,dx & \text{(bring $f_X(x)$ inside inner integral)} \\ &= \int_{-\infty}^\infty \int_{-\infty}^\infty y f_{X, Y}(x, y) \,dy \,dx & \text{(definition of conditional PDF)} \\ &= \text{E}\!\left[ Y \right] & \text{(2D LotUS)}. \end{align} \]
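As an optional sanity check (not part of the proof), the identity is easy to verify by simulation. The sketch below uses Python with NumPy and an assumed joint model, \(X \sim \text{Exponential}(1)\) and \(Y \mid X = x \sim \textrm{Normal}(x, 2^2)\), so that \(\text{E}\!\left[ Y \mid X \right] = X\) and both sides of the identity equal \(1\).

```python
# Monte Carlo sanity check of the Law of Total Expectation (illustration only).
# Assumed model: X ~ Exponential(1) and Y | X = x ~ Normal(x, 2^2),
# so E[Y | X] = X and both sides of the identity equal E[X] = 1.
import numpy as np

rng = np.random.default_rng(0)
n_sims = 1_000_000

x = rng.exponential(scale=1.0, size=n_sims)   # draw X
y = rng.normal(loc=x, scale=2.0)              # draw Y given X = x

print(y.mean())   # direct estimate of E[Y], close to 1
print(x.mean())   # estimate of E[E[Y | X]] = E[X], also close to 1
```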

The Law of Total Expectation can make it easy to calculate expected values.

Example 27.1 (Free throws and expectation) Consider Example 26.3. What is \(\text{E}\!\left[ X \right]\), the expected number of free throws that the randomly chosen person makes?

By the Law of Total Expectation (Theorem 27.1), \[ \text{E}\!\left[ X \right] = \text{E}\!\left[ \text{E}\!\left[ X \mid S \right] \right] = \text{E}\!\left[ nS \right] = n\text{E}\!\left[ S \right] = \frac{n}{2}. \] In the second equality, we used the fact that \(X \mid \{S=s\}\) is \(\text{Binomial}(n, p=s)\), so its (conditional) expectation is \(\text{E}\!\left[ X \mid S=s \right] = ns\). In the last equality, we used \(\text{E}\!\left[ S \right] = \frac{1}{2}\), since \(S\) is \(\text{Uniform}(0, 1)\).

We can check this answer because we actually determined the PMF of \(X\) in Example 26.3. Since it is \[ f_X(x) = \frac{1}{n+1}; \qquad x=0, 1, \dots, n, \] we can use the usual formula for expectation (Definition 9.1): \[ \text{E}\!\left[ X \right] = \sum_{k=0}^n k\frac{1}{n+1} = \frac{n(n+1)}{2} \frac{1}{n+1} = \frac{n}{2},\] which matches the answer above. But the beauty of the Law of Total Expectation is that we did not need to determine the marginal PMF of \(X\) first.
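A short Monte Carlo sketch offers another check without any algebra. It is illustrative only: the choice \(n = 10\), the random seed, and the use of NumPy are assumptions, not part of the example. It simulates the model \(S \sim \text{Uniform}(0, 1)\), \(X \mid S = s \sim \text{Binomial}(n, s)\) and confirms that the sample mean of \(X\) is close to \(n/2\).

```python
# Simulation check of Example 27.1 (illustration only; n = 10 is an arbitrary choice).
# Model: S ~ Uniform(0, 1), then X | S = s ~ Binomial(n, s).
import numpy as np

rng = np.random.default_rng(1)
n, n_sims = 10, 1_000_000

s = rng.uniform(0, 1, size=n_sims)   # each simulated person's skill
x = rng.binomial(n, s)               # free throws made, given that skill

print(x.mean())   # close to n / 2 = 5.0
```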

Example 27.2 (Expected height) In Example 26.2, we saw that the height of a randomly chosen man (in centimeters) follows a \(\textrm{Normal}(\mu= 178.4, \sigma^2= 7.59^2)\) distribution, while the height of a randomly chosen woman (in centimeters) follows a \(\textrm{Normal}(\mu= 164.3, \sigma^2= 7.07^2)\) distribution.

What is the expected height of a randomly chosen person, \(\text{E}\!\left[ H \right]\)? We know that \[ \begin{align} \text{E}\!\left[ H \mid I=0 \right] &= 178.4 & \text{E}\!\left[ H \mid I = 1 \right] &= 164.3, \end{align} \] so by the Law of Total Expectation: \[ \begin{align} \text{E}\!\left[ H \right] = \text{E}\!\left[ \text{E}\!\left[ H|I \right] \right] &= \text{E}\!\left[ H \mid I=0 \right] P(I=0) + \text{E}\!\left[ H \mid I=1 \right] P(I=1) \\ &= 178.4 \cdot .51 + 164.3 \cdot .49 \\ &\approx 171.5. \end{align} \]
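For completeness, here is a one-line check of the arithmetic (illustrative only):

```python
# Arithmetic check of the mixture mean above, using the values from Example 26.2.
print(178.4 * 0.51 + 164.3 * 0.49)   # 171.491, about 171.5
```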

Theorem 27.1 is one property of conditional expectations, but there are many others, identical to the properties in Chapter 17. We restate them here.

Proposition 27.1 (Linearity of conditional expectation) \[ \begin{aligned} \text{E}\!\left[ Y_1 + Y_2 \mid X \right] = \text{E}\!\left[ Y_1 \mid X \right] + \text{E}\!\left[ Y_2 \mid X \right] \end{aligned} \]

Proposition 27.2 (Pulling out what’s given) \[ \begin{aligned} \text{E}\!\left[ g(X) Y \mid X \right] = g(X) \text{E}\!\left[ Y \mid X \right] \end{aligned} \]

Proposition 27.3 (Conditional expectation of an independent random variable) If \(X\) and \(Y\) are independent random variables, then \[ \text{E}\!\left[ Y \mid X \right] = \text{E}\!\left[ Y \right]. \]

Remember that the conditional expectation \(\text{E}\!\left[ Y\mid X \right]\) is a function of \(X\). In fact, it is the function of \(X\) that best predicts \(Y\), in the following sense.

Proposition 27.4 (Conditional expectation and prediction) Let \(X\) and \(Y\) be random variables. Then, the function of \(X\) that best “predicts” \(Y\) in the sense of minimizing \[ \text{E}\!\left[ (Y - g(X))^2 \right] \] is \(g(X) = \text{E}\!\left[ Y\mid X \right]\).

Proof

We will prove a stronger statement, that \(\text{E}\!\left[ Y\mid X \right]\) in fact minimizes \(\text{E}\!\left[ (Y - g(X))^2 \mid X \right]\) (i.e., for every value of \(X\)), so it also minimizes \[ \text{E}\!\left[ (Y - g(X))^2 \right] = \text{E}\!\left[ \text{E}\!\left[ (Y - g(X))^2 \mid X \right] \right]. \]

To do this, we add and subtract \(\text{E}\!\left[ Y \mid X \right]\), then expand the square. \[\begin{align} \text{E}\!\left[ (Y - g(X))^2 \mid X \right] &= \text{E}\big[(\underbrace{Y - \text{E}\!\left[ Y \mid X \right]}_A + \underbrace{\text{E}\!\left[ Y\mid X \right] - g(X)}_B)^2 \mid X\big] \\ &= \text{E}\big[(\underbrace{Y - \text{E}\!\left[ Y \mid X \right]}_A)^2 \mid X\big] + \text{E}\big[(\underbrace{\text{E}\!\left[ Y \mid X \right] - g(X)}_B)^2 \mid X\big] \\ &\qquad + 2 \text{E}\big[(\underbrace{Y - \text{E}\!\left[ Y \mid X \right]}_A)(\underbrace{\text{E}\!\left[ Y\mid X \right] - g(X)}_B) \mid X \big]. \end{align}\]

Next, we show that the cross-term (i.e., the last term) is zero. Because \(B\) is a function of \(X\), we can pull it outside the conditional expectation by Proposition 27.2. Then, we can expand the conditional expectation to see that it is zero: \[\begin{align} \text{E}\big[(Y - \text{E}\!\left[ Y \mid X \right])(\underbrace{\text{E}\!\left[ Y\mid X \right] - g(X)}_B) \mid X \big] &= (\underbrace{\text{E}\!\left[ Y\mid X \right] - g(X)}_B) \text{E}\big[(Y - \text{E}\!\left[ Y \mid X \right]) \mid X \big] \\ &= (\text{E}\!\left[ Y\mid X \right] - g(X)) (\underbrace{\text{E}\!\left[ Y\mid X \right] - \text{E}\!\left[ Y\mid X \right]}_0) \\ &= 0 \end{align}\]

Therefore, we have established the following identity: \[ \text{E}\!\left[ (Y - g(X))^2 \mid X \right] = \text{E}\big[(Y - \text{E}\!\left[ Y \mid X \right])^2 \mid X\big] + \text{E}\big[(\text{E}\!\left[ Y \mid X \right] - g(X))^2 \mid X\big]. \] On the right-hand side, \(g(X)\) only appears in the second term, which is always non-negative. It can be minimized (i.e., made equal to \(0\)) by choosing \(g(X) = \text{E}\!\left[ Y \mid X \right]\).
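To see Proposition 27.4 in action, the sketch below compares the mean squared prediction error of \(\text{E}\!\left[ Y \mid X \right]\) against two other predictors. The joint model (\(X \sim \text{Uniform}(-2, 2)\), \(Y \mid X = x \sim \textrm{Normal}(x^2, 1)\)) and the use of Python/NumPy are assumptions for illustration only; under this model, \(\text{E}\!\left[ Y \mid X \right] = X^2\).

```python
# Simulation sketch of Proposition 27.4 (illustration only).
# Assumed model: X ~ Uniform(-2, 2) and Y | X = x ~ Normal(x^2, 1),
# so the conditional expectation is E[Y | X] = X^2.
import numpy as np

rng = np.random.default_rng(2)
n_sims = 1_000_000

x = rng.uniform(-2, 2, size=n_sims)
y = rng.normal(loc=x**2, scale=1.0)

predictors = {
    "E[Y | X] = X^2": x**2,                       # the conditional expectation
    "g(X) = X": x,                                # a plausible alternative
    "constant E[Y]": np.full(n_sims, y.mean()),   # the best constant predictor
}
for name, pred in predictors.items():
    print(name, np.mean((y - pred) ** 2))
# The first predictor has the smallest mean squared error (about 1,
# the conditional variance); the others are noticeably worse.
```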

Proposition 27.4 has far-reaching consequences, such as the following.

Example 27.3 (Predicting skill) Suppose we want to predict the skill \(S\) of the shooter from Example 26.4, based on the observed data \(X\). One reasonable criterion is to minimize \[ \text{E}\!\left[ (S - g(X))^2 \right]. \]

According to Proposition 27.4, the predictor that minimizes the above criterion is \[ g(X) = \text{E}\!\left[ S | X \right]. \]

Since we know the PDF of \(S|X\), we can calculate this conditional expectation for the observed value of \(X = 12\): \[ \text{E}\!\left[ S | X=12 \right] = \int_{-\infty}^\infty s f_{S|X}(s|12)\,ds = \int_0^1 s \cdot 19 \binom{18}{12} s^{12} (1 - s)^6\,ds = 0.65. \]

Therefore, our “best” prediction of the shooter’s skill is \(0.65\).
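As a numerical check (not part of the example), the integral above can be evaluated with a short script; SciPy's `quad` routine is used here as an assumed tool.

```python
# Numerical check of E[S | X = 12] (illustration only).
# The integrand is s times the conditional PDF f_{S|X}(s | 12) from the example.
from math import comb
from scipy.integrate import quad

integrand = lambda s: s * 19 * comb(18, 12) * s**12 * (1 - s)**6
value, _ = quad(integrand, 0, 1)
print(value)   # 0.65
```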

27.1 Law of Total Variance

There is an analogous result for calculating variance from conditional variances.

Theorem 27.2 (Law of total variance) \[ \text{Var}\!\left[ Y \right] = \text{E}\!\left[ \text{Var}\!\left[ Y \mid X \right] \right] + \text{Var}\!\left[ \text{E}\!\left[ Y \mid X \right] \right]. \]

The statement of Theorem 27.2 relies on Definition 17.2 for conditional variance, which is the same for discrete and continuous random variables. Therefore, the proof of Theorem 27.2 is identical to the proof of Theorem 17.2.

Example 27.4 (Free throws and variance) Consider Example 26.3. What is \(\text{Var}\!\left[ X \right]\), the variance of the number of free throws that the randomly chosen person makes?

Calculating this directly using the PMF of \(X\) that we derived in Example 26.3 is not easy because it involves evaluating the sum of squares \(\text{E}\!\left[ X^2 \right] = \sum_{k=0}^n \frac{k^2}{n+1}\). Instead, we will use the Law of Total Variance (Theorem 27.2).

Since \(X\mid\{S=s\}\) follows a \(\text{Binomial}(n, p=s)\) distribution, we know that \(\text{E}\!\left[ X \mid S=s \right] = ns\) and \(\text{Var}\!\left[ X\mid S=s \right] = ns(1-s)\). By the Law of Total Variance, \[\begin{align} \text{Var}\!\left[ X \right] &= \text{E}\!\left[ \text{Var}\!\left[ X \mid S \right] \right] + \text{Var}\!\left[ \text{E}\!\left[ X \mid S \right] \right] \\ &= \text{E}\!\left[ nS(1 - S) \right] + \text{Var}\!\left[ nS \right] \\ &= n(\text{E}\!\left[ S \right] - \text{E}\!\left[ S^2 \right]) + n^2 \text{Var}\!\left[ S \right] \\ &= n\left(\frac{1}{2} - \frac{1}{3} \right) + n^2 \frac{1}{12} \\ &= \frac{n(n+2)}{12}. \end{align}\]

As a side benefit, we also see that \[\text{E}\!\left[ X^2 \right] = \text{Var}\!\left[ X \right] + \text{E}\!\left[ X \right]^2 = \frac{n(n+2)}{12} + \left(\frac{n}{2}\right)^2 = \frac{n(2n+1)}{6},\] which leads to a formula for the sum of the first \(n\) perfect squares: \[ \sum_{k=0}^n k^2 = \text{E}\!\left[ X^2 \right] (n+1) = \frac{n(n+1)(2n+1)}{6}. \]
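Both results can be checked numerically. The sketch below is illustrative only; the choice \(n = 10\), the random seed, and the use of NumPy are assumptions. It simulates the model, compares the sample variance of \(X\) with \(n(n+2)/12\), and verifies the sum-of-squares formula.

```python
# Simulation check of Example 27.4 (illustration only; n = 10 is an arbitrary choice).
# Model: S ~ Uniform(0, 1), then X | S = s ~ Binomial(n, s).
import numpy as np

rng = np.random.default_rng(3)
n, n_sims = 10, 1_000_000

s = rng.uniform(0, 1, size=n_sims)
x = rng.binomial(n, s)

print(x.var(), n * (n + 2) / 12)   # both close to 10.0

# Sum-of-squares formula from above: both sides equal 385 when n = 10.
print(sum(k**2 for k in range(n + 1)), n * (n + 1) * (2 * n + 1) // 6)
```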

Example 27.5 (SD of heights) Continuing Example 27.2, what is the standard deviation of the height of a randomly chosen person, \(\text{SD}\!\left[ H \right]\)?

We will first calculate \(\text{Var}\!\left[ H \right]\) using the following given information: \[ \begin{align} \text{E}\!\left[ H \mid I=0 \right] &= 178.4 & \text{E}\!\left[ H \mid I = 1 \right] &= 164.3 \\ \text{Var}\!\left[ H \mid I=0 \right] &= 7.59^2 & \text{Var}\!\left[ H \mid I = 1 \right] &= 7.07^2. \end{align} \]

Notice that we can write \[ \text{E}\!\left[ H|I \right] = 164.3 I + 178.4 (1 - I) = 178.4 - 14.1 I, \] where \(I \sim \text{Bernoulli}(p=0.49)\). (This turns out to simplify computations later.)

So by the Law of Total Variance: \[ \begin{align} \text{Var}\!\left[ H \right] &= \text{E}\!\left[ \text{Var}\!\left[ H|I \right] \right] + \text{Var}\!\left[ \text{E}\!\left[ H|I \right] \right] \\ &= \left(\text{Var}\!\left[ H \mid I=0 \right] P(I=0) + \text{Var}\!\left[ H \mid I=1 \right] P(I=1)\right) + \text{Var}\!\left[ 178.4 - 14.1 I \right] \\ &= \left(7.59^2 \cdot .51 + 7.07^2 \cdot .49\right) + 14.1^2 (.49)(.51) \\ &\approx 103.555, \end{align} \]

so \(\text{SD}\!\left[ H \right] \approx \sqrt{103.555} \approx 10.2\) centimeters.

Notice in this example that the overall variance, \(\text{Var}\!\left[ H \right]\), is greater than the variance for either men or women, \(\text{Var}\!\left[ H|I \right]\). This is because \(\text{Var}\!\left[ H \right]\) depends not only on \(\text{Var}\!\left[ H|I \right]\), but also on the variability in \(\text{E}\!\left[ H|I \right]\). Because the means for men and women are quite different, the variability in \(\text{E}\!\left[ H|I \right]\) is large.
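A short simulation confirms these numbers and the observation above (illustrative only; the random seed and the use of NumPy are assumptions): draw \(I \sim \text{Bernoulli}(0.49)\), then draw \(H\) from the corresponding normal distribution.

```python
# Simulation check of Example 27.5 (illustration only).
# Draw I ~ Bernoulli(0.49), then H | I from the corresponding normal distribution.
import numpy as np

rng = np.random.default_rng(4)
n_sims = 1_000_000

i = rng.binomial(1, 0.49, size=n_sims)       # 1 = woman, 0 = man
mean = np.where(i == 1, 164.3, 178.4)
sd = np.where(i == 1, 7.07, 7.59)
h = rng.normal(loc=mean, scale=sd)

print(h.mean())   # close to 171.5
print(h.var())    # close to 103.555
print(h.std())    # close to 10.2, larger than both 7.07 and 7.59
```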