26  From Conditionals to Marginals

In Chapter 16, we considered situations where the distribution of two discrete random variables \(X\) and \(Y\) was specified by

  1. first specifying the PMF of \(X\),
  2. then specifying the conditional PMF of \(Y\) given \(X\).

We saw that the (marginal) PMF of \(Y\) could be calculated using the Law of Total Probability (Theorem 7.1) \[ f_Y(y) = \sum_x f(x, y) = \sum_x f_X(x) f_{Y|X}(y|x). \tag{26.1}\]

We now generalize Equation 26.1 to the situation where \(X\) or \(Y\) may be continuous.

26.1 Laws of Total Probability

As a warm-up, we revisit Example 23.7, where we had two identical light bulbs whose lifetimes (in years) are i.i.d. \(\textrm{Exponential}(\lambda=0.4)\) random variables \(X\) and \(Y\). We calculated \(P(X < Y)\) in two ways, using double integrals and using symmetry. Now, we offer a third way: using the Law of Total Probability.

Example 26.1 (Lifetimes of light bulbs (Law of Total Probability version)) We want to calculate \(P(B) \overset{\text{def}}{=}P(X < Y)\). This event involves two random variables, so calculating its probability would seem to require a double integral.

However, if we condition on the event \(\{ X = x \}\), we obtain a probability that only involves one random variable: \[ P(B | X = x) = P(X < Y | X = x) = P(x < Y). \] In the last step, we substituted the “known” value of \(X\), and since \(X\) and \(Y\) are independent, the distribution of \(Y\) does not change. Now, plugging in the CDF of the exponential distribution, we see that \[ P(B | X = x) = 1 - F_Y(x) = e^{-0.4 x}; \qquad x > 0. \]

Finally, we need to aggregate these conditional probabilities into the overall probability of \(B\). To do this, we use the continuous analog of the Law of Total Probability. \[ P(B) = \int_{-\infty}^\infty P(B | X = x) f_X(x)\,dx. \tag{26.2}\] Equation 26.2 is just like its discrete cousin, except with the PMF replaced by the PDF and the sum replaced by an integral.

Substituting the formulas for \(P(B | X = x)\) and \(f_X(x)\) into Equation 26.2, we obtain \[ P(B) = \int_0^\infty e^{-0.4x} \cdot 0.4 e^{-0.4 x}\,dx = \frac{1}{2}, \] which matches the answer that we obtained using a double integral and symmetry in Example 23.7.
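As a quick sanity check, here is a short Python sketch (assuming NumPy and SciPy are available) that estimates \(P(X < Y)\) by simulation and also evaluates the one-dimensional integral in Equation 26.2 numerically; both results should come out close to \(1/2\).

```python
# Sanity check of Example 26.1 (a sketch, assuming NumPy and SciPy).
import numpy as np
from scipy.integrate import quad

lam = 0.4
rng = np.random.default_rng(0)

# Monte Carlo estimate of P(X < Y) for i.i.d. Exponential(0.4) lifetimes.
x = rng.exponential(scale=1 / lam, size=1_000_000)
y = rng.exponential(scale=1 / lam, size=1_000_000)
print(np.mean(x < y))  # approximately 0.5

# Law of Total Probability: integrate P(B | X = x) * f_X(x) over x > 0.
integrand = lambda t: np.exp(-lam * t) * lam * np.exp(-lam * t)
print(quad(integrand, 0, np.inf)[0])  # 0.5 (up to numerical error)
```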

Example 26.1 reinforces a point we made after Example 23.7. Double integrals are rarely the most intuitive way to solve real problems in probability and statistics. If you think carefully about the structure of a problem, you can usually avoid double integrals (or calculus altogether)!

Proposition 26.1 (Laws of Total Probability) Let \(X\) and \(Y\) be random variables.

  1. If \(X\) is a discrete random variable with PMF \(f_X(x)\), then the PMF or PDF of \(Y\) (depending on whether \(Y\) is discrete or continuous, respectively) is given by \[ f_Y(y) = \sum_x f_X(x) f_{Y|X}(y|x). \tag{26.3}\]
  2. If \(X\) is a continuous random variable with PDF \(f_X(x)\), then the PMF or PDF of \(Y\) (depending on whether \(Y\) is discrete or continuous, respectively) is described by \[ f_Y(y) = \int_{-\infty}^\infty f_X(x) f_{Y|X}(y|x)\,dx. \tag{26.4}\]

Note that \(f_{Y|X}(y|x)\) denotes a conditional PMF or PDF, depending on whether \(Y\) is discrete or continuous, respectively.

We have already seen several applications of Proposition 26.1 when \(X\) and \(Y\) are both discrete in Chapter 16. So we will focus on the other three cases in the examples below.

First, we use the Law of Total Probability to derive a general formula for the PDF of the sum of two independent continuous random variables. This is both an example of the Law of Total Probability and a useful result in its own right.

Proposition 26.2 (PDF of a sum) Let \(X\) and \(Y\) be independent continuous random variables with PDFs \(f_X\) and \(f_Y\), respectively. What is the PDF of their sum \(T = X + Y\)?

We know the distribution of \(X\). We can determine the conditional distribution of \(T\) given \(X\) as follows:

  • The distribution of \(T | \{ X = x\}\) is the distribution of \(X + Y | \{ X = x\}\) by definition.
  • The distribution of \(X + Y | \{ X = x \}\) is the same as the distribution of \(x + Y\). To see this, replace \(X\) by its “known” value \(x\) and then use independence of \(X\) and \(Y\).

The above argument shows that the conditional distribution of \(T | \{ X = x \}\) is the distribution of \(x + Y\), which is a location transform of \(Y\). That is, the conditional PDF of \(T\) given \(X\) is \[ f_{T | X}(t | x) = f_Y(t - x). \]

By Proposition 26.1 (part 2), the (marginal) PDF of \(T\) is \[ f_T(t) = \int_{-\infty}^\infty f_X(x) f_{T|X}(t | x)\,dx = \int_{-\infty}^\infty f_X(x) f_Y(t - x)\,dx. \tag{26.5}\]

This formula for the PDF of \(T = X + Y\) is called the convolution formula.
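To see the convolution formula in action, here is a Python sketch (assuming SciPy) that evaluates Equation 26.5 numerically when \(X\) and \(Y\) are i.i.d. \(\textrm{Exponential}(\lambda=1)\). Carrying out the integral by hand in this case gives \(f_T(t) = t e^{-t}\) for \(t > 0\), so we can compare against that.

```python
# Numerical convolution (a sketch): integrate f_X(x) * f_Y(t - x) over x.
import numpy as np
from scipy.integrate import quad
from scipy.stats import expon

def convolution_pdf(t, f_X, f_Y):
    # Truncate the infinite range to [-50, 50] (plenty for Exponential(1))
    # and tell quad where the integrand has kinks.
    return quad(lambda x: f_X(x) * f_Y(t - x), -50, 50, points=[0.0, t])[0]

for t in [0.5, 1.0, 2.0, 4.0]:
    print(t, convolution_pdf(t, expon.pdf, expon.pdf), t * np.exp(-t))
```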

Example 26.2 (Bayes’s billiards balls) In the same paper that introduced Bayes’ rule, Thomas Bayes (1763) considered a binomial random variable \(X\) where the probability of “heads” \(p\) is unknown. We can model this probability as a random variable \(U\), which is equally likely to be any value between \(0\) and \(1\).

  • \(U \sim \textrm{Uniform}(a= 0, b= 1)\)
  • \(X | \{ U = u \} \sim \text{Binomial}(n, p=u)\)

Since \(U\) is continuous and \(X\) is discrete, we are in case 2 of Proposition 26.1, so we use Equation 26.4 to derive the PMF of \(X\):

\[ \begin{aligned} f_X(k) = P(X = k) &= \int_{-\infty}^\infty f_U(u) f_{X|U}(k|u)\,du \\ &= \int_0^1 \binom{n}{k} u^k (1 - u)^{n-k}\,du & k=0, 1, \dots, n \end{aligned} \tag{26.6}\]

The integrand in Equation 26.6 is a polynomial in \(u\), so for any particular choice of \(n\) and \(k\), we can expand \((1 - u)^{n-k}\) and evaluate the integral. However, it is not obvious how to come up with a general formula.
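Before seeing Bayes’ argument, it is worth evaluating the integral for a few particular cases. The Python sketch below (assuming SciPy) does this for \(n = 6\) and every \(k\); each value comes out to \(1/7 \approx 0.1429\), which hints at the general pattern.

```python
# Evaluate the integral in Equation 26.6 for n = 6 and each k (a sketch).
from math import comb
from scipy.integrate import quad

n = 6
for k in range(n + 1):
    integrand = lambda u, k=k: comb(n, k) * u**k * (1 - u)**(n - k)
    print(k, quad(integrand, 0, 1)[0])  # every value is 1/7 = 0.142857...
```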

Bayes realized that the general formula must be \(f_X(k) = \frac{1}{n+1}\). That is, all values \(k=0, 1, \dots, n\) are equally likely. He argued this by imagining \(n+1\) balls being rolled across a table. (Later writers interpreted these to be billiards balls on a pool table.)

Suppose that each of the \(n+1\) balls is equally likely to land anywhere on the opposite side, independently of the other balls. If the opposite side has length \(1\), then the positions of the balls are i.i.d. \(\textrm{Uniform}(a= 0, b= 1)\) random variables. To map this situation onto the model above,

  • let \(U\) be the position of the first ball, and
  • let \(X\) be the number of the remaining \(n\) balls that lie to the left of the first ball. Conditional on the position of the first ball \(\{ U = u \}\), each ball has a probability \(u\) of being to the left, and the balls are independent by assumption. Therefore, \(X | \{ U = u\}\) is binomial.

This situation is illustrated in Figure 26.1 for \(n=6\).

Figure 26.1: Illustration of Bayes’ billiards balls, with the first ball highlighted in red.

What is the marginal distribution of \(X\)? Since the \(n+1\) positions are i.i.d., the first ball is equally likely to hold any of the \(n+1\) ranks, from leftmost to rightmost. If the first ball is the \((k+1)\)th from the left, then exactly \(k\) of the remaining balls lie to its left, so \(X = k\). Therefore, \[ f_X(k) = \frac{1}{n+1}; \qquad k=0, 1, \dots, n. \]
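This argument is easy to check by simulation. The Python sketch below (assuming NumPy) rolls \(n + 1 = 7\) uniform balls many times, counts how many of the last \(n\) land to the left of the first, and tabulates the results; each value of \(X\) occurs about \(1/7\) of the time.

```python
# Simulate Bayes' billiards balls (a sketch, assuming NumPy).
import numpy as np

rng = np.random.default_rng(0)
n = 6
positions = rng.uniform(size=(100_000, n + 1))  # each row is one set of rolls
first = positions[:, 0]                         # position of the first ball (U)
X = np.sum(positions[:, 1:] < first[:, None], axis=1)

# Empirical PMF of X; each entry should be close to 1/(n+1) = 1/7.
print(np.bincount(X, minlength=n + 1) / len(X))
```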

26.2 Bayes’ Rule

Proposition 26.3 (Bayes’ rule for random variables) Let \(X\) and \(Y\) be random variables. Then,

\[ f_{X | Y}(x|y) = \frac{f_X(x) f_{Y|X}(y|x)}{f_Y(y)}, \tag{26.7}\]

where \(f_X\), \(f_Y\), \(f_{X|Y}\) and \(f_{Y|X}\) are PMFs or PDFs, depending on whether \(X\) and \(Y\) are discrete or continuous.

We will prove this for the case where \(X\) and \(Y\) are both continuous. The case where \(X\) and \(Y\) are both discrete is already covered by Theorem 7.2. The case where \(X\) is discrete and \(Y\) is continuous (or vice versa) is beyond the scope of this book.

We can expand the joint PDF in two ways: \[ f_{X, Y}(x, y) = f_Y(y) f_{X|Y}(x|y) = f_X(x) f_{Y|X}(y|x). \] Dividing both sides by \(f_Y(y)\), we obtain Equation 26.7.

Example 26.3 (Applying Bayes’ rule to Bayes’s billiards balls) Continuing with Example 26.2, suppose we roll \(n+1 = 19\) balls, and \(X = 12\) of the last \(n = 18\) balls end up to the left of the first ball. In light of this information, what can we say about \(U\), the position of the first ball?

Before we knew this information, the position of the first ball was equally likely to be anywhere between \(0\) and \(1\). Now that we know that \(2/3\) of the \(18\) remaining balls ended up to the left of the first ball, it seems more likely that the first ball was closer to \(1\) than to \(0\).

To make this precise, we determine the conditional PDF of \(U\) given \(X\). To do this, we use Proposition 26.3. \[ \begin{align} f_{U|X}(u|x) &= \frac{f_U(u) f_{X|U}(x|u)}{f_X(x)} \\ &= \frac{1 \cdot \binom{n}{x} u^x(1-u)^{n-x}}{\frac{1}{n+1}} \\ &= (n+1) \binom{n}{x} u^x (1 - u)^{n-x} \end{align} \] for \(0 < u < 1\) and \(x = 0, 1, \dots, n\).

Substituting in \(n=18\) and \(x=12\), the conditional distribution of \(U\) is \[ f_{U|X}(u|12) = 19 \binom{18}{12} u^{12} (1 - u)^6; \qquad 0 < u < 1. \]

This PDF is graphed below. As expected, most of the probability is to the right of center.

Figure 26.2: Conditional PDF of \(U\) given \(X = 12\)

We can summarize this distribution by \(\text{E}\!\left[ U|X=12 \right]\), the position where we expect the first ball to be, given this information.

\[ \begin{align} \text{E}\!\left[ U|X=12 \right] &= \int_{-\infty}^\infty u f_{U|X}(u|12)\,du \\ &= \int_0^1 u\cdot 19 \binom{18}{12} u^{12} (1 - u)^6\,du \\ &= 19 \binom{18}{12} \int_0^1 u^{13} (1 - u)^6\,du \end{align} \] To evaluate this integral, we can use Equation 26.6, together with the fact (from Example 26.2) that \(f_X(k) = \frac{1}{n+1}\). Substituting \(n=19\) and \(k=13\), we see that \[ \int_0^1 \binom{19}{13} u^{13} (1 - u)^6\,du = \frac{1}{20}. \]

Therefore: \[ \begin{align} \text{E}\!\left[ U|X=12 \right] &= \frac{19 \binom{18}{12}}{\binom{19}{13}} \int_0^1 \binom{19}{13} u^{13} (1 - u)^6\,du \\ &= \frac{19 \binom{18}{12}}{\binom{19}{13}} \frac{1}{20} \\ &= \frac{13}{20}. \end{align} \]

On average, we expect the first ball to be at \(13/20 = .65\), which is close to but not equal to \(12/18 \approx .667\), the fraction of the remaining balls that ended up to the left of the first ball.
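As a numerical check, the Python sketch below (assuming SciPy) integrates \(u \, f_{U|X}(u|12)\) over \((0, 1)\). Readers who have seen the Beta distribution may also recognize \(f_{U|X}(u|12)\) as the \(\text{Beta}(13, 7)\) PDF, whose mean is likewise \(13/20\).

```python
# Check E[U | X = 12] = 13/20 numerically (a sketch, assuming SciPy).
from math import comb
from scipy.integrate import quad
from scipy.stats import beta

n, x = 18, 12
posterior_pdf = lambda u: (n + 1) * comb(n, x) * u**x * (1 - u)**(n - x)

print(quad(lambda u: u * posterior_pdf(u), 0, 1)[0])  # 0.65
print(beta.mean(a=x + 1, b=n - x + 1))                # Beta(13, 7) mean: also 0.65
```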

26.3 Further Examples

Combined with location-scale transformations, the Law of Total Probability can produce elegant derivations of distributions of random variables.

Example 26.4 (Cauchy distribution) Let \(Z\) and \(W\) be independent \(\textrm{Normal}(\mu= 0, \sigma^2= 1)\) random variables. Then, their ratio \[ X = \frac{Z}{W} \] is said to follow a Cauchy distribution.

What is the PDF of the Cauchy distribution? We can derive this by describing \(X\) conditionally:

  • \(W \sim \textrm{Normal}(\mu= 0, \sigma^2= 1)\)
  • \(X | \{ W = w \} \sim \textrm{Normal}(\mu= 0, \sigma^2= \frac{1}{w^2})\). This is because conditional on \(\{ W = w \}\), \(X = Z / W\) is just a scale transformation of \(Z\), with scaling factor \(\frac{1}{w}\).

Therefore, the PDF of \(X\) is \[ \begin{align} f_X(x) &= \int_{-\infty}^\infty f_W(w) f_{X|W}(x | w)\,dw \\ &= \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} e^{-w^2/2} \frac{|w|}{\sqrt{2\pi}} e^{-x^2w^2/2} \,dw \\ &= 2 \int_{0}^\infty \frac{1}{\sqrt{2\pi}} e^{-w^2/2} \frac{w}{\sqrt{2\pi}} e^{-x^2w^2/2} \,dw & \text{(by symmetry)} \\ &= \frac{1}{\pi} \int_{0}^\infty w e^{-(1 + x^2) w^2 / 2}\,dw \\ &= \frac{1}{\pi} \int_0^\infty e^{-u}\frac{du}{1 + x^2} & (u = (1 + x^2) w^2 / 2) \\ &= \frac{1}{\pi} \frac{1}{1 + x^2}. \end{align} \tag{26.8}\]
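Equation 26.8 is easy to verify numerically. The Python sketch below (assuming SciPy) evaluates the Law of Total Probability integral at a few values of \(x\) and compares it with \(\frac{1}{\pi(1 + x^2)}\).

```python
# Check the Cauchy PDF derivation numerically (a sketch, assuming SciPy).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def cauchy_pdf_via_ltp(x):
    # f_W(w) * f_{X|W}(x|w), where f_{X|W}(x|w) = |w| * phi(x * w)
    integrand = lambda w: norm.pdf(w) * np.abs(w) * norm.pdf(x * w)
    return quad(integrand, -np.inf, np.inf)[0]

for x in [0.0, 1.0, 2.5]:
    print(x, cauchy_pdf_via_ltp(x), 1 / (np.pi * (1 + x**2)))
```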

The PDF of the Cauchy distribution is graphed below in red, with the PDF of the standard normal superimposed as a gray dashed line.

The Cauchy distribution has much heavier “tails” than the normal distribution. This results in some strange properties. For example, although the PDF appears to be “centered” at \(0\), the expectation is undefined. This is because the following integral diverges: \[ \int_0^\infty \frac{x}{\pi (1 + x^2)}\,dx = \int_0^\infty \frac{1}{2\pi (1 + u)}\,du = \frac{1}{2\pi} \ln(1 + u)\Big|_0^\infty = \infty, \] where we have substituted \(u = x^2\). If the expectation existed, it would have to be \[\begin{align} \text{E}\!\left[ X \right] &= \int_{-\infty}^\infty \frac{x}{\pi (1 + x^2)}\,dx \\ &= \int_{0}^\infty \frac{x}{\pi (1 + x^2)}\,dx + \int_{-\infty}^0 \frac{x}{\pi (1 + x^2)}\,dx \\ &= \infty - \infty, \end{align}\] which is not well-defined.
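We can see how strange this is in a simulation. The Python sketch below (assuming NumPy) generates Cauchy draws as \(Z / W\), exactly as in Example 26.4, and compares their running averages with running averages of normal draws; the normal averages settle down near \(0\), while the Cauchy averages keep jumping around.

```python
# Running averages of Cauchy vs. normal draws (a sketch, assuming NumPy).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
cauchy = rng.standard_normal(n) / rng.standard_normal(n)  # Z / W, as in Example 26.4
normal = rng.standard_normal(n)

running_mean = lambda a: np.cumsum(a) / np.arange(1, n + 1)
for m in [1_000, 10_000, 100_000]:
    print(m, running_mean(cauchy)[m - 1], running_mean(normal)[m - 1])
# The normal averages approach 0; the Cauchy averages never stabilize.
```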

The Cauchy distribution is a member of an important family of distributions in statistics. The next example introduces another member of this family.

Example 26.5 (The \(t\)-distribution) Let’s determine the PDF of \[ T = \frac{Z}{\sqrt{V}}, \] where \(Z\) is \(\textrm{Normal}(\mu= 0, \sigma^2= 1)\) as in Example 26.4, but \(V\) is (independent) \(\textrm{Exponential}(\lambda=1)\).

We can describe \(T\) conditionally:

  • \(V \sim \textrm{Exponential}(\lambda=1)\)
  • \(T | \{ V = v \} \sim \textrm{Normal}(\mu= 0, \sigma^2= \frac{1}{v})\). The argument is similar to the one in Example 26.4. Conditional on \(\{ V = v \}\), \(T = Z / \sqrt{V}\) is just a scale transformation of \(Z\), with scaling factor \(\frac{1}{\sqrt{v}}\).

To get a feel for \(T\), let’s simulate 10,000 draws of \(T\).
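Here is one way to do this in Python (a sketch, assuming NumPy); the draws of \(T\) have noticeably heavier tails than the standard normal draws they are built from.

```python
# Simulate 10,000 draws of T = Z / sqrt(V) (a sketch, assuming NumPy).
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal(10_000)
V = rng.exponential(scale=1.0, size=10_000)  # Exponential(lambda = 1)
T = Z / np.sqrt(V)

# Compare tail behavior: T lands beyond +/- 3 far more often than Z does.
print(np.mean(np.abs(T) > 3), np.mean(np.abs(Z) > 3))
```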

To calculate the PDF, we apply Proposition 26.1. Since both \(V\) and \(T\) are continuous, we are in case 2 of Proposition 26.1, so we use Equation 26.4 to derive the PDF of \(T\):

\[ \begin{align} f_T(t) &= \int_{-\infty}^\infty f_V(v) f_{T|V}(t|v)\,dv \\ &= \int_{0}^\infty e^{-v} \cdot \frac{\sqrt{v}}{\sqrt{2\pi}} e^{- v t^2 / 2}\,dv \\ &= \int_{0}^\infty \frac{\sqrt{v}}{\sqrt{2\pi}} e^{- (1 + t^2 / 2) v}\,dv \\ &= \frac{1}{\sqrt{2\pi}} \int_{0}^\infty \sqrt{\frac{u}{1 + t^2/2}} e^{-u} \frac{du}{1 + t^2/2} & (u = (1 + t^2/2) v ) \\ &= \frac{1}{\sqrt{2\pi}} \left( 1 + \frac{t^2}{2} \right)^{-3/2} \int_0^\infty \sqrt{u} e^{-u}\,du \end{align} \] Notice that the integral is just \(\text{E}\!\left[ \sqrt{U} \right]\), where \(U\) is a standard exponential random variable. This integral is not easy to evaluate, but it does not depend on \(t\); it is just a constant. Therefore, the PDF of \(T\), up to a scaling constant \(K\), is \[ \begin{align} f_T(t) &= \frac{1}{K} \left(1 + \frac{t^2}{2}\right)^{-3/2}, \end{align} \tag{26.9}\] where \(K\) makes the PDF integrate to one. (It turns out that \(K = \sqrt{8}\).)
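As a check on Equation 26.9 and the constant \(K = \sqrt{8}\), the Python sketch below (assuming SciPy) compares the formula with SciPy’s built-in \(t\)-distribution with \(2\) degrees of freedom.

```python
# Compare Equation 26.9 (with K = sqrt(8)) to scipy's t distribution, df = 2.
import numpy as np
from scipy.stats import t

f_T = lambda x: (1 / np.sqrt(8)) * (1 + x**2 / 2) ** (-3 / 2)
for x in [0.0, 1.0, 3.0]:
    print(x, f_T(x), t.pdf(x, df=2))  # the two columns agree
```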

Let’s graph this PDF (blue) together with the Cauchy (red) and standard normal (gray) PDFs from Example 26.4.

This PDF is somewhere between the Cauchy and standard normal distributions. In fact, there is an entire family of distributions, \[ f_T(t) = \frac{1}{K_d} \left( 1 + \frac{t^2}{d} \right)^{-(d+1)/2}, \] called the \(t\)-distribution, which is extremely important in statistics.

  • \(d = 1\) corresponds to the Cauchy distribution (Equation 26.8).
  • \(d = 2\) corresponds to Equation 26.9.
  • As \(d \to \infty\), the \(t\)-distribution converges to the standard normal distribution.

The parameter \(d\) is called the degrees of freedom.
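These facts are easy to verify with SciPy’s built-in distributions, as in the sketch below (assuming SciPy): with \(d = 1\) the \(t\) PDF matches the Cauchy PDF exactly, and for large \(d\) it is nearly indistinguishable from the standard normal.

```python
# The t family for a few degrees of freedom (a sketch, assuming SciPy).
import numpy as np
from scipy.stats import t, cauchy, norm

x = np.array([0.0, 1.0, 2.0])
print(t.pdf(x, df=1))    # equals cauchy.pdf(x): d = 1 is the Cauchy
print(cauchy.pdf(x))
print(t.pdf(x, df=200))  # nearly equals norm.pdf(x) for large d
print(norm.pdf(x))
```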