27  Conditional Expectations

In Chapter 17, we defined the notation \[ \text{E}\!\left[ Y | X \right] \] and used it to calculate \(\text{E}\!\left[ Y \right]\). The trick was the Law of Total Expectation, which we now prove for continuous random variables.

Theorem 27.1 (Law of total expectation) \[ \text{E}\!\left[ Y \right] = \text{E}\!\left[ \text{E}\!\left[ Y | X \right] \right]. \]

Proof

First, recall that \(\text{E}\!\left[ Y|X \right] \overset{\text{def}}{=}g(X)\), where \[ g(x) = \text{E}\!\left[ Y|X=x \right] = \int_{-\infty}^\infty y f_{Y|X}(y|x)\,dy. \] Note that to calculate \(\text{E}\!\left[ Y|X=x \right]\), we simply integrate \(y\) times the conditional PDF of \(Y\) given \(X = x\).

Therefore, \[ \begin{align} \text{E}\!\left[ \text{E}\!\left[ Y|X \right] \right] &= \text{E}\!\left[ g(X) \right] & \text{(definition of $\text{E}\!\left[ Y|X \right]$)} \\ &= \int_{-\infty}^\infty g(x) f_X(x)\,dx & \text{(LotUS)} \\ &= \int_{-\infty}^\infty \left[\int_{-\infty}^\infty y f_{Y|X}(y|x)\,dy \right] f_X(x)\,dx & \text{(formula for $g(x)$)} \\ &= \int_{-\infty}^\infty \int_{-\infty}^\infty y f_{Y|X}(y|x) f_X(x) \,dy \,dx & \text{(bring $f_X(x)$ inside inner integral)} \\ &= \int_{-\infty}^\infty \int_{-\infty}^\infty y f_{X, Y}(x, y) \,dy \,dx & \text{(definition of conditional PDF)} \\ &= \text{E}\!\left[ Y \right] & \text{(2D LotUS)}. \end{align} \]
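
To see Theorem 27.1 in action numerically, here is a minimal simulation sketch (not from the text) using a made-up hierarchical model: \(X\) is \(\textrm{Exponential}(\lambda=1)\) and \(Y | X\) is \(\textrm{Normal}(\mu=X, \sigma^2=1)\), so that \(\text{E}\!\left[ Y|X \right] = X\). The sample mean of \(Y\) and the sample mean of \(\text{E}\!\left[ Y|X \right]\) should both be close to \(\text{E}\!\left[ X \right] = 1\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Made-up hierarchical model: X ~ Exponential(1), Y | X ~ Normal(X, 1).
x = rng.exponential(scale=1.0, size=n)
y = rng.normal(loc=x, scale=1.0)

# For this model E[Y | X] = X, so E[E[Y | X]] is estimated by the sample mean of x.
print("estimate of E[Y]:      ", y.mean())   # both should be close to E[X] = 1
print("estimate of E[E[Y|X]]: ", x.mean())
```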

The Law of Total Expectation can make it easy to calculate expected values.

Example 27.1 (Bayes’ billiard balls and expectation) Consider Example 26.2. What is \(\text{E}\!\left[ X \right]\), the expected number of balls to the left of the first ball (wherever it landed)?

By the Law of Total Expectation (Theorem 27.1), \[ \text{E}\!\left[ X \right] = \text{E}\!\left[ \text{E}\!\left[ X | U \right] \right] = \text{E}\!\left[ nU \right] = n\text{E}\!\left[ U \right] = \frac{n}{2}. \] In the second equality, we used the fact that \(X|U\) is \(\text{Binomial}(n, p=U)\), so its (conditional) expectation is \(\text{E}\!\left[ X|U \right] = nU\).

We can check this answer because we actually determined the PMF of \(X\) in Example 26.2. It was \[ f_X(k) = \frac{1}{n+1}; \qquad k=0, 1, \dots, n. \] By the formula for expectation, \[ \text{E}\!\left[ X \right] = \sum_{k=0}^n k\frac{1}{n+1} = \frac{n(n+1)}{2} \frac{1}{n+1} = \frac{n}{2},\] which matches the answer above. But the beauty of the Law of Total Expectation is that we did not need to determine the marginal PMF of \(X\) first.
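
As an additional sanity check, here is a minimal simulation sketch of the Bayes' billiards model (the choices \(n = 10\), the number of repetitions, and the seed are arbitrary): draw \(U\) from a \(\textrm{Uniform}(0, 1)\) distribution, then \(X | U\) from a \(\text{Binomial}(n, p=U)\) distribution, and compare the sample mean of \(X\) to \(n/2\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 1_000_000

# Bayes' billiards: U ~ Uniform(0, 1), then X | U ~ Binomial(n, U).
u = rng.uniform(size=reps)
x = rng.binomial(n, u)

print("simulated E[X]:", x.mean())   # should be close to n / 2 = 5
```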

Theorem 27.1 is just one property of conditional expectations. They satisfy many other properties, which are identical to the properties from Chapter 17 for discrete random variables; we restate those properties here.

Proposition 27.1 (Linearity of conditional expectation) \[ \begin{aligned} \text{E}\!\left[ Y_1 + Y_2 | X \right] = \text{E}\!\left[ Y_1 | X \right] + \text{E}\!\left[ Y_2 | X \right] \end{aligned} \]

Proposition 27.2 (Pulling out what’s given) \[ \begin{aligned} \text{E}\!\left[ g(X) Y | X \right] = g(X) \text{E}\!\left[ Y | X \right] \end{aligned} \]

Proposition 27.3 (Conditional expectation of an independent random variable) If \(X\) and \(Y\) are independent random variables, then \[ \text{E}\!\left[ Y | X \right] = \text{E}\!\left[ Y \right]. \]
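
For example, if \(X\) and \(Y\) are independent, these three properties combine to give \[ \text{E}\!\left[ XY + Y | X \right] = \text{E}\!\left[ XY | X \right] + \text{E}\!\left[ Y | X \right] = X \text{E}\!\left[ Y | X \right] + \text{E}\!\left[ Y | X \right] = (X + 1) \text{E}\!\left[ Y \right], \] which, like any conditional expectation given \(X\), is a function of \(X\).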

Remember that the conditional expectation \(\text{E}\!\left[ Y|X \right]\) is a function of \(X\). In fact, it is the function of \(X\) that best predicts \(Y\), in the following sense.

Proposition 27.4 (Conditional expectation and prediction) Let \(X\) and \(Y\) be random variables. Then, the function of \(X\) that best “predicts” \(Y\) in the sense of minimizing \[ \text{E}\!\left[ (Y - g(X))^2 \right] \] is \(g(X) = \text{E}\!\left[ Y|X \right]\).

Proof

We will prove a stronger statement, that \(\text{E}\!\left[ Y|X \right]\) in fact minimizes \(\text{E}\!\left[ (Y - g(X))^2 | X \right]\) (i.e., for every value of \(X\)), so it also minimizes \[ \text{E}\!\left[ (Y - g(X))^2 \right] = \text{E}\!\left[ \text{E}\!\left[ (Y - g(X))^2 | X \right] \right]. \]

To do this, we add and subtract \(\text{E}\!\left[ Y | X \right]\), then expand the square. \[\begin{align} \text{E}\!\left[ (Y - g(X))^2 | X \right] &= \text{E}\big[(\underbrace{Y - \text{E}\!\left[ Y | X \right]}_A + \underbrace{\text{E}\!\left[ Y|X \right] - g(X)}_B)^2 | X\big] \\ &= \text{E}\big[(\underbrace{Y - \text{E}\!\left[ Y | X \right]}_A)^2 | X\big] + \text{E}\big[(\underbrace{\text{E}\!\left[ Y | X \right] - g(X)}_B)^2 | X\big] \\ &\qquad + 2 \text{E}\big[(\underbrace{Y - \text{E}\!\left[ Y | X \right]}_A)(\underbrace{\text{E}\!\left[ Y|X \right] - g(X)}_B) | X \big]. \end{align}\]

Next, we show that the cross-term (i.e., the last term) is zero. Because \(B\) is a function of \(X\), we can pull it outside the conditional expectation by Proposition 27.2. Then, we can expand the conditional expectation to see that it is zero: \[\begin{align} \text{E}\big[(Y - \text{E}\!\left[ Y | X \right])(\underbrace{\text{E}\!\left[ Y|X \right] - g(X)}_B) | X \big] &= (\underbrace{\text{E}\!\left[ Y|X \right] - g(X)}_B) \text{E}\big[(Y - \text{E}\!\left[ Y | X \right]) | X \big] \\ &= (\text{E}\!\left[ Y|X \right] - g(X)) (\underbrace{\text{E}\!\left[ Y|X \right] - \text{E}\!\left[ Y|X \right]}_0) \\ &= 0 \end{align}\]

Therefore, we have established the following identity: \[ \text{E}\!\left[ (Y - g(X))^2 | X \right] = \text{E}\big[(Y - \text{E}\!\left[ Y | X \right])^2 | X\big] + \text{E}\big[(\text{E}\!\left[ Y | X \right] - g(X))^2 | X\big]. \] On the right-hand side, \(g(X)\) only appears in the second term, which is always non-negative. It can be minimized (i.e., made equal to \(0\)) by choosing \(g(X) = \text{E}\!\left[ Y|X \right]\).
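
Here is a minimal simulation sketch (with a made-up model, not from the text) illustrating Proposition 27.4: when \(X\) is \(\textrm{Uniform}(0, 1)\) and \(Y | X\) is \(\textrm{Normal}(\mu=X^2, \sigma^2=1)\), the predictor \(g(X) = \text{E}\!\left[ Y|X \right] = X^2\) should achieve a smaller mean squared prediction error than other functions of \(X\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Made-up model: X ~ Uniform(0, 1), Y | X ~ Normal(X**2, 1), so E[Y | X] = X**2.
x = rng.uniform(size=n)
y = rng.normal(loc=x**2, scale=1.0)

# Mean squared prediction error E[(Y - g(X))^2] for a few candidate predictors g(X).
candidates = {
    "g(X) = E[Y|X] = X^2": x**2,
    "g(X) = X":            x,
    "g(X) = E[Y]":         np.full(n, y.mean()),
}
for name, g in candidates.items():
    print(f"{name:20s} MSE: {np.mean((y - g)**2):.4f}")   # smallest for E[Y|X]
```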

27.1 Law of Total Variance

There is an analogous result for calculating variance from conditional variances.

Theorem 27.2 (Law of total variance) \[ \text{Var}\!\left[ Y \right] = \text{E}\!\left[ \text{Var}\!\left[ Y | X \right] \right] + \text{Var}\!\left[ \text{E}\!\left[ Y | X \right] \right]. \]

The statement of Theorem 27.2 relies on Definition 17.2 for conditional variance, which is the same for discrete and continuous random variables. Therefore, the proof of Theorem 27.2 is also identical to the proof of Theorem 17.2.
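
As a quick numerical illustration, here is a minimal simulation sketch (with a made-up model, not from the text): if \(X\) is \(\textrm{Exponential}(\lambda=1)\) and \(Y | X\) is Poisson with mean \(X\), then \(\text{E}\!\left[ Y|X \right] = \text{Var}\!\left[ Y|X \right] = X\), so the Law of Total Variance gives \(\text{Var}\!\left[ Y \right] = \text{E}\!\left[ X \right] + \text{Var}\!\left[ X \right] = 2\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Made-up hierarchical model: X ~ Exponential(1), Y | X ~ Poisson(X),
# so E[Y | X] = Var[Y | X] = X and the theorem gives Var[Y] = E[X] + Var[X] = 2.
x = rng.exponential(scale=1.0, size=n)
y = rng.poisson(lam=x)

print("simulated Var[Y]:                    ", y.var())
print("simulated E[Var[Y|X]] + Var[E[Y|X]]: ", x.mean() + x.var())   # both close to 2
```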

Example 27.2 (Expectation and variance of \(T\)) Recall Example 26.5, where we defined the random variable \[ T = \frac{Z}{V}, \] with \(Z\) a \(\textrm{Normal}(\mu= 0, \sigma^2= 1)\) random variable and \(V\) an independent \(\textrm{Exponential}(\lambda=1)\) random variable. \(T\) is said to follow a \(t\)-distribution with \(d=2\) degrees of freedom.

We derived the PDF of this distribution in Example 26.5. The expectation and variance could, of course, be calculated directly from this PDF, but it is much easier to use the Laws of Total Expectation and Total Variance.

We argued in Example 26.5 that \(T | \{ V = v \}\) is a \(\textrm{Normal}(\mu= 0, \sigma^2= \frac{1}{v^2})\), so

  • \(\text{E}\!\left[ T | V \right] = 0\) and
  • \(\text{Var}\!\left[ T | V \right] = 1 / V^2\).

Now, by the Law of Total Expectation, \[ \text{E}\!\left[ T \right] = \text{E}\!\left[ \text{E}\!\left[ T | V \right] \right] = \text{E}\!\left[ 0 \right] = 0 \] and by the Law of Total Variance, \[ \begin{aligned} \text{Var}\!\left[ T \right] &= \text{E}\!\left[ \text{Var}\!\left[ T | V \right] \right] + \text{Var}\!\left[ \text{E}\!\left[ T | V \right] \right] \\ &= \text{E}\!\left[ 1 / V^2 \right] + \text{Var}\!\left[ 0 \right] \\ &= \text{E}\!\left[ 1 / V^2 \right]. \end{aligned} \]

Since \(V\) is standard exponential, we can determine \(\text{E}\!\left[ 1 / V^2 \right]\) using LotUS:

\[ \text{E}\!\left[ 1/V^2 \right] = \int_0^\infty \frac{1}{v^2} e^{-v}\,dv = \infty. \] The integral diverges because the integrand behaves like \(1/v^2\) near \(v = 0\), which is not integrable at \(0\).

Therefore, the \(t\)-distribution with \(d = 2\) degrees of freedom has expectation zero and infinite variance.
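
The following minimal simulation sketch illustrates this conclusion (the sample sizes and seed are arbitrary): the sample variance of \(T = Z/V\) does not settle down as the sample size grows, which is what we expect from a distribution with infinite variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# T = Z / V with Z ~ Normal(0, 1) and V an independent Exponential(1).
for n in [10**4, 10**5, 10**6, 10**7]:
    z = rng.normal(size=n)
    v = rng.exponential(scale=1.0, size=n)
    t = z / v
    print(f"n = {n:>8}   sample variance of T: {t.var():,.1f}")
```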

Example 27.3 (Bayes’ billiard balls and variance) Consider Example 26.2. What is \(\text{Var}\!\left[ X \right]\), the variance of the number of balls to the left of the first ball (wherever it landed)?

Calculating this directly using the PMF of \(X\) that we derived in Example 26.2 is not easy, because it requires evaluating the sum of squares \(\text{E}\!\left[ X^2 \right] = \sum_{k=0}^n \frac{k^2}{n+1}\). Instead, we will use the Law of Total Variance (Theorem 27.2).

Since \(X|U\) follows a \(\text{Binomial}(n, p=U)\) distribution, we know that \(\text{E}\!\left[ X|U \right] = nU\) and \(\text{Var}\!\left[ X|U \right] = nU(1-U)\). By the Law of Total Variance, \[\begin{align} \text{Var}\!\left[ X \right] &= \text{E}\!\left[ \text{Var}\!\left[ X | U \right] \right] + \text{Var}\!\left[ \text{E}\!\left[ X | U \right] \right] \\ &= \text{E}\!\left[ nU(1 - U) \right] + \text{Var}\!\left[ nU \right] \\ &= n(\text{E}\!\left[ U \right] - \text{E}\!\left[ U^2 \right]) + n^2 \text{Var}\!\left[ U \right] \\ &= n\left(\frac{1}{2} - \frac{1}{3} \right) + n^2 \frac{1}{12} \\ &= \frac{n(n+2)}{12}. \end{align}\]

As a side benefit, we also see that \[\text{E}\!\left[ X^2 \right] = \text{Var}\!\left[ X \right] + \text{E}\!\left[ X \right]^2 = \frac{n(n+2)}{12} + \left(\frac{n}{2}\right)^2 = \frac{n(2n+1)}{6},\] which leads to a formula for the sum of the first \(n\) perfect squares: \[ \sum_{k=0}^n k^2 = \text{E}\!\left[ X^2 \right] (n+1) = \frac{n(n+1)(2n+1)}{6}. \]
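
Both formulas can be sanity-checked with a minimal simulation sketch (again with the arbitrary choice \(n = 10\)): simulate the Bayes' billiards model and compare the sample variance and sample second moment of \(X\) to \(\frac{n(n+2)}{12}\) and \(\frac{n(2n+1)}{6}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 1_000_000

# Bayes' billiards again: U ~ Uniform(0, 1), then X | U ~ Binomial(n, U).
u = rng.uniform(size=reps)
x = rng.binomial(n, u)

print("simulated Var[X]:", x.var(),       "   formula n(n+2)/12:", n * (n + 2) / 12)
print("simulated E[X^2]:", np.mean(x**2), "   formula n(2n+1)/6:", n * (2 * n + 1) / 6)
```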