34  Moment Generating Functions

In Chapter 33, we saw how to compute the PMF or PDF of \(X + Y\) when \(X\) and \(Y\) are independent. However, the calculations could be quite involved.

In this chapter, we discuss moment generating functions, which, among other things, will allow us to determine the distribution of a sum of independent random variables without calculating any complicated convolutions.

34.1 Moment Generating Functions

We know that the expected value is not enough to uniquely identify a distribution. For example, there are many distributions with \(\text{E}\!\left[ X \right] = 2\), including

  • \(\textrm{Binomial}(n= 4, p= 0.5)\)
  • \(\textrm{Poisson}(\mu=2)\)
  • \(\textrm{Exponential}(\lambda=1/2)\)
  • \(\textrm{Gamma}(\alpha= 2, \lambda= 1)\)
  • \(\textrm{Normal}(\mu= 2, \sigma^2= 2)\).

But if we also knew that \(\text{E}\!\left[ X^2 \right] = 6\) (or equivalently, \(\text{Var}\!\left[ X \right] = \text{E}\!\left[ X^2 \right] - \text{E}\!\left[ X \right]^2 = 2\)), then that whittles down the possibilities to

  • \(\textrm{Poisson}(\mu=2)\)
  • \(\textrm{Gamma}(\alpha= 2, \lambda= 1)\)
  • \(\textrm{Normal}(\mu= 2, \sigma^2= 2)\).

It turns out that if we knew \(\text{E}\!\left[ X^k \right]\) for every positive integer \(k\), then (under mild conditions) that would uniquely identify the distribution. \(\text{E}\!\left[ X^k \right]\) is called the \(k\)th moment of the distribution. How do we represent all of the moments of a distribution at once? We use a function called the moment generating function.

Definition 34.1 Let \(X\) be a random variable. Then, the moment generating function (MGF) of \(X\) is \[\begin{align*} M_X(t) \overset{\text{def}}{=}\text{E}\!\left[ e^{tX} \right] \end{align*} \tag{34.1}\] and uniquely identifies the distribution of \(X\).

Why does the MGF contain information about all of the moments? If we Taylor expand \(e^{tX}\) and take expectations term by term, we obtain

\[\begin{align*} \text{E}\!\left[ e^{tX} \right] &= \text{E}\!\left[ 1 + tX + \frac{(tX)^2}{2!} + \frac{(tX)^3}{3!} + \dots \right] \\ &= 1 + t \text{E}\!\left[ X \right] + \frac{t^2}{2!} \text{E}\!\left[ X^2 \right] + \frac{t^3}{3!} \text{E}\!\left[ X^3 \right] + \dots. \end{align*} \tag{34.2}\]

Up to the factorials, the moments are simply the coefficients in this Taylor expansion: the coefficient of \(t^k\) is \(\text{E}\!\left[ X^k \right]/k!\). The mathematician Herbert Wilf visualized a generating function as a “clothesline” on which we could “hang” a sequence of numbers, like the moments of a distribution.

Figure 34.1: “A generating function is a clothesline on which we hang up a sequence of numbers for display.” – Herbert Wilf
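To make the clothesline picture concrete, here is a small symbolic sketch. It assumes the sympy library (not part of this text) and a toy random variable \(X\) that is equally likely to be 1, 2, or 3, so that \(M_X(t) = (e^t + e^{2t} + e^{3t})/3\); reading off the Taylor coefficients of \(M_X\) recovers \(\text{E}\!\left[ X^k \right]/k!\).

```python
import sympy as sp

t = sp.symbols('t')

# Toy example: X is equally likely to be 1, 2, or 3,
# so M_X(t) = (e^t + e^{2t} + e^{3t}) / 3.
M = (sp.exp(t) + sp.exp(2 * t) + sp.exp(3 * t)) / 3

# Taylor-expand M_X(t) around t = 0, keeping terms up to t^3.
poly = sp.series(M, t, 0, 4).removeO()

for k in range(4):
    coeff = poly.coeff(t, k)                               # coefficient of t^k
    moment = sum(x**k for x in (1, 2, 3)) / sp.Integer(3)  # E[X^k] computed directly
    print(k, coeff, moment / sp.factorial(k))              # last two columns agree
```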

Because the MGF contains information about all of the moments, it uniquely identifies the distribution. The proof of the following theorem is beyond the scope of this book, but it follows from the fact that the MGF is essentially a Laplace transform of the distribution.

Theorem 34.1 (Uniqueness of MGFs) Let \(X\) and \(Y\) be random variables with respective MGFs \(M_X(t)\) and \(M_Y(t)\). If \(M_X(t) = M_Y(t)\) for all \(t\) in an open interval around \(0\), then \(X\) and \(Y\) have the same distribution.

Let’s compute the MGFs for some named distributions. We begin with a discrete example.

Example 34.1 (MGF of Poisson) Let \(X \sim \text{Poisson}(\mu)\). Then, \[ M_X(t) = \sum_{n=0}^\infty e^{tn} \frac{e^{-\mu} \mu^n}{n!} = e^{-\mu} \sum_{n=0}^\infty \frac{\left(\mu e^t \right)^n}{n!} = e^{-\mu} e^{\mu e^t} = e^{\mu(e^t - 1)}. \]
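As a quick numerical sanity check of this formula (a sketch assuming numpy; the value of \(\mu\), the sample size, and the values of \(t\) are arbitrary choices), we can compare a Monte Carlo estimate of \(\text{E}\!\left[ e^{tX} \right]\) against \(e^{\mu(e^t - 1)}\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0
samples = rng.poisson(mu, size=1_000_000)

for t in (-0.5, 0.1, 0.5):
    empirical = np.mean(np.exp(t * samples))  # Monte Carlo estimate of E[e^{tX}]
    exact = np.exp(mu * (np.exp(t) - 1))      # closed form derived above
    print(f"t = {t:+.1f}: empirical = {empirical:.4f}, exact = {exact:.4f}")
```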

We can also calculate MGFs for continuous distributions.

Example 34.2 (MGF of normal) Let \(X \sim \text{Normal}(\mu, \sigma^2)\). Rather than calculate \(\text{E}\!\left[ e^{tX} \right]\) directly, it is easier to calculate \(\text{E}\!\left[ e^{tZ} \right]\) first, where \(Z\) is standard normal, and then use the fact that \(X = \mu + \sigma Z\).

First, the MGF of a standard normal is \[\begin{align*} M_Z(t) &= \int_{-\infty}^\infty e^{tx} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx \\ &= \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} \exp\left\{ - \frac{x^2-2tx}{2} \right\} \, dx \\ &= \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{(x-t)^2}{2} + \frac{t^2}{2} \right\} \, dx \\ &= e^{t^2/2} \int_{-\infty}^\infty \underbrace{\frac{1}{\sqrt{2\pi}} e^{-(x-t)^2/2}}_{\text{PDF of $\text{Normal}(t, 1)$}} \, dx \\ &= e^{t^2/2}. \end{align*}\]

Since \(X = \mu + \sigma Z\), we can use properties of expectation and the result above to derive the MGF of \(X\): \[\begin{align*} M_X(t) = \text{E}\!\left[ e^{tX} \right] &= \text{E}\!\left[ e^{t(\mu+\sigma Z)} \right] \\ &= \text{E}\!\left[ e^{t\mu} e^{t\sigma Z} \right] \\ &= e^{t \mu} \text{E}\!\left[ e^{(t \sigma) Z} \right] \\ &= e^{t \mu} M_Z(t \sigma) \\ &= e^{t \mu} e^{(t \sigma)^2/2} \\ &= e^{\mu t + \frac{1}{2} \sigma^2 t^2}. \end{align*}\]
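The same two-step argument can be verified symbolically. The following sketch (assuming sympy) evaluates the integral for \(M_Z(t)\) and then builds \(M_X(t) = e^{\mu t} M_Z(\sigma t)\); the final line should print 0.

```python
import sympy as sp

x, t, mu = sp.symbols('x t mu', real=True)
sigma = sp.symbols('sigma', positive=True)

# MGF of a standard normal: integrate e^{tx} against the standard normal PDF.
phi = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)
M_Z = sp.integrate(sp.exp(t * x) * phi, (x, -sp.oo, sp.oo))
print(sp.simplify(M_Z))  # expect exp(t**2/2)

# M_X(t) = e^{mu t} M_Z(sigma t), as in the derivation above.
M_X = sp.exp(mu * t) * M_Z.subs(t, sigma * t)
print(sp.simplify(M_X - sp.exp(mu * t + sigma**2 * t**2 / 2)))  # expect 0
```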

34.2 Generating Moments

We know that the moment generating function contains information about all of the moments. But how do we recover individual moments from the MGF? The name “moment generating function” suggests that it should be possible.

Proposition 34.1 Let \(X\) have MGF \(M_X(t)\). Then, the \(k\)th moment of \(X\) is \[ \text{E}\!\left[ X^k \right] = M_X^{(k)}(0). \tag{34.3}\]

Proof

If we differentiate \(M_X(t)\), we obtain \[ M_X'(t) = \frac{d}{dt} \text{E}\!\left[ e^{tX} \right] = \text{E}\!\left[ \frac{d}{dt} e^{tX} \right] = \text{E}\!\left[ X e^{tX} \right], \] and evaluating at \(t=0\) yields \[ M_X'(0) = \text{E}\!\left[ X e^{0X} \right] = \text{E}\!\left[ X \right]. \]

Taking another derivative, we see that \[ M_X''(t) = \frac{d}{dt} \text{E}\!\left[ X e^{tX} \right] = \text{E}\!\left[ \frac{d}{dt} X e^{tX} \right] = \text{E}\!\left[ X^2 e^{tX} \right], \] and evaluating at \(t=0\) again, we see that \[ M_X''(0) = \text{E}\!\left[ X^2 e^{0X} \right] = \text{E}\!\left[ X^2 \right]. \]

The general result can be proved by induction.

Another way to see this result is to compare Equation 34.2 with the general Taylor expansion of \(M_X(t)\) around \(t = 0\): the coefficient of \(t^k\) is \(\text{E}\!\left[ X^k \right]/k!\) by Equation 34.2 and \(M_X^{(k)}(0)/k!\) in general, so \(\text{E}\!\left[ X^k \right] = M_X^{(k)}(0)\).

We can use Proposition 34.1 to provide alternative derivations of familiar expectations and variances.

Example 34.3 (Mean and variance of Poisson) Let \(X \sim \text{Poisson}(\mu)\). In Example 34.1, we saw \[ M_X(t) = \exp \left\{ \mu (e^t - 1) \right\}. \] Differentiating once, we get \[ M_X'(t) = \mu e^t \exp \left\{ \mu (e^t - 1) \right\}, \] and differentiating again, we get \[ M_X''(t) = \mu e^t \exp \left\{ \mu (e^t - 1) \right\} + \mu^2 e^{2t} \exp \left\{ \mu (e^t - 1) \right\}. \] Thus, \(\text{E}\!\left[ X \right] = M_X'(0) = \mu\), and \(\text{E}\!\left[ X^2 \right] = M_X ''(0) = \mu + \mu^2\). Hence, \[ \text{Var}\!\left[ X \right] = (\mu + \mu^2) - \mu^2 = \mu. \]

Example 34.4 (Mean and variance of normal) Let \(X \sim \text{Normal}(\mu, \sigma^2)\). In Example 34.2, we saw \[ M_X(t) = e^{\mu t + \frac{1}{2} \sigma^2 t^2}. \] Differentiating, we get \[ M_X'(t) = (\mu + \sigma^2 t) e^{\mu t + \frac{1}{2} \sigma^2 t^2} \] and \[ M_X''(t) = (\mu + \sigma^2 t)^2 e^{\mu t + \frac{1}{2} \sigma^2 t^2} + \sigma^2 e^{\mu t + \frac{1}{2} \sigma^2 t^2}. \] Hence, \(\text{E}\!\left[ X \right] = \mu\) and \(\text{E}\!\left[ X^2 \right] = \mu^2 + \sigma^2\), and thus, \(\text{Var}\!\left[ X \right] = \sigma^2\).
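Both calculations can be delegated to a computer algebra system. Here is a brief sketch (assuming sympy) that applies Proposition 34.1 to the two MGFs derived earlier:

```python
import sympy as sp

t = sp.symbols('t', real=True)
mu = sp.symbols('mu', positive=True)
sigma = sp.symbols('sigma', positive=True)

def moment(mgf, k):
    """k-th moment via Proposition 34.1: the k-th derivative of the MGF at t = 0."""
    return sp.simplify(sp.diff(mgf, t, k).subs(t, 0))

M_pois = sp.exp(mu * (sp.exp(t) - 1))          # Poisson(mu) MGF (Example 34.1)
M_norm = sp.exp(mu * t + sigma**2 * t**2 / 2)  # Normal(mu, sigma^2) MGF (Example 34.2)

print(moment(M_pois, 1), moment(M_pois, 2))    # mu and mu + mu^2
print(moment(M_norm, 1), moment(M_norm, 2))    # mu and mu^2 + sigma^2
```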

As these examples show, generating moments from the MGF is often messy, and there are usually easier ways to calculate moments. Fortunately, there are more practical uses of the MGF, as we discuss next.

34.3 MGF of a Sum

As promised at the beginning of this chapter, the MGF will allow us to determine the distribution of the sum of independent random variables without doing convolutions. This is because the MGF of a sum is easy to calculate.

Proposition 34.2 Let \(X\) and \(Y\) be independent random variables with respective MGFs \(M_X(t)\) and \(M_Y(t)\). Then, \[ M_{X+Y}(t) = M_X(t) M_Y(t). \]

Proof

\[\begin{align*} M_{X+Y}(t) &= \text{E}\!\left[ e^{t(X+Y)} \right] \\ &= \text{E}\!\left[ e^{tX} e^{tY} \right] \\ &= \text{E}\!\left[ e^{tX} \right] \text{E}\!\left[ e^{tY} \right] & \text{($X$ and $Y$ are independent)} \\ &= M_X(t) M_Y(t). \end{align*}\]

Together with the fact that the MGF uniquely determines the distribution (Theorem 34.1), this result allows us to determine the distributions of sums with very little calculation.

Example 34.5 (Sum of (independent) Poissons with the MGF) Let \(X \sim \text{Poisson}(\mu_1)\) and \(Y \sim \text{Poisson}(\mu_2)\) be independent random variables. In Example 33.2, we used convolutions to show that \(X + Y \sim \text{Poisson}(\mu_1 + \mu_2)\).

Here is another approach that uses MGFs. From Example 34.1, we know that \[ M_{X + Y}(t) = M_X(t) M_Y(t) = e^{\mu_1(e^t - 1)} e^{\mu_2(e^t - 1)}. \]

By grouping terms, we see that \[ M_{X + Y}(t) = e^{(\mu_1 + \mu_2)(e^t - 1)}, \] which we recognize as the MGF of a \(\text{Poisson}(\mu_1 + \mu_2)\) random variable. Therefore, \(X + Y\) must be \(\text{Poisson}(\mu_1 + \mu_2)\) by Theorem 34.1.
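This conclusion is easy to corroborate by simulation. In the sketch below (assuming numpy and scipy; the parameter values are arbitrary), the empirical PMF of a simulated \(X + Y\) matches the \(\text{Poisson}(\mu_1 + \mu_2)\) PMF:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu1, mu2, n = 1.5, 2.5, 1_000_000

total = rng.poisson(mu1, n) + rng.poisson(mu2, n)  # simulated values of X + Y

for k in range(8):
    empirical = np.mean(total == k)                # empirical P(X + Y = k)
    exact = stats.poisson(mu1 + mu2).pmf(k)        # Poisson(mu1 + mu2) PMF
    print(f"k = {k}: empirical = {empirical:.4f}, exact = {exact:.4f}")
```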

Example 34.6 (Sum of (independent) normals with the MGF) Let \(X \sim \text{Normal}(\mu_1, \sigma_1^2)\) and \(Y \sim \text{Normal}(\mu_2, \sigma_2^2)\) be independent. In Example 33.3, we determined the distribution of \(X + Y\) in the special case where \(\mu_1 = \mu_2 = 0\) and \(\sigma_1^2 = \sigma_2^2 = 1\), and it was already quite a complex calculation. However, MGFs will allow us to dispatch the general case with ease.

\[\begin{align*} M_{X+Y}(t) &= M_X(t) M_Y(t) = e^{\mu_1 t + \frac{1}{2} \sigma_1^2 t^2} e^{\mu_2 t + \frac{1}{2} \sigma_2^2 t^2} \\ &= e^{(\mu_1 + \mu_2) t + \frac{1}{2} (\sigma_1^2 + \sigma_2^2) t^2}, \end{align*} \] which we recognize as the MGF of the \(\text{Normal}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)\) distribution. Hence, \[ X + Y \sim \text{Normal}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2) \] by Theorem 34.1.
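As with the Poisson case, we can spot-check this by simulation (assuming numpy and scipy; the parameters are arbitrary): a Kolmogorov–Smirnov test should not reject the hypothesis that simulated values of \(X + Y\) come from \(\text{Normal}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu1, sigma1 = 1.0, 2.0
mu2, sigma2 = -3.0, 0.5
n = 100_000

total = rng.normal(mu1, sigma1, n) + rng.normal(mu2, sigma2, n)  # X + Y

# Compare against Normal(mu1 + mu2, sigma1^2 + sigma2^2).
sd = np.sqrt(sigma1**2 + sigma2**2)
print(stats.kstest(total, "norm", args=(mu1 + mu2, sd)))  # expect a large p-value
```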

Although MGFs can be a powerful tool, they are not a replacement for convolutions. In the examples above, we were only able to determine the distribution of \(X + Y\) because we recognized its MGF. In general, \(X + Y\) may not follow a familiar distribution, and the approach above only yields its MGF—not its PMF or PDF, which is usually more useful.

Another problem with MGFs is that they do not always exist. For example, the Cauchy distribution from Example 26.4 does not even have a first moment, so it does not have an MGF. Fortunately, there is a way around this problem. Instead of the MGF, we consider the characteristic function \[ \varphi_X(t) \overset{\text{def}}{=}\text{E}\!\left[ e^{itX} \right], \] where \(i \overset{\text{def}}{=}\sqrt{-1}\) is the imaginary unit. Like the MGF, the characteristic function turns convolutions into products, \[ \varphi_{X + Y}(t) = \varphi_X(t) \varphi_Y(t), \] and uniquely identifies the distribution. (This is because the characteristic function is essentially the Fourier transform of the distribution.) But unlike the MGF, it always exists, including for distributions that do not have moments, like the Cauchy distribution.

Example 34.7 (Sample mean of Cauchy random variables) What happens if we average i.i.d. Cauchy random variables \(X_1, \dots, X_n\)? The answer may be surprising!

To compute the characteristic function for a Cauchy random variable \(X_1\), we would need to evaluate \[ \varphi_{X_1}(t) = \int_{-\infty}^\infty e^{itx} \frac{1}{\pi(1 + x^2)} \,dx, \] which requires techniques from complex analysis and is beyond the scope of this book. But the characteristic function can be shown to be \[ \varphi_{X_1}(t) = e^{-|t|}. \] (Yes, the characteristic function in this case is real-valued, although in general it is complex-valued.)

Now, if we observe i.i.d. random variables \(X_1, \dots, X_n\) from a Cauchy distribution, then the characteristic function of the sample mean \(\bar X\) is \[ \varphi_{\bar X}(t) = \text{E}\!\left[ e^{i (t/n) \sum_{i=1}^n X_i} \right] = \text{E}\!\left[ e^{i (t/n) X_1} \right]^n = \varphi_{X_1}(t/n)^n = \left(e^{-|t/n|}\right)^n = e^{-|t|}, \] which is again the characteristic function of a Cauchy random variable. Therefore, no matter how many i.i.d. Cauchy random variables we average, \(\bar X\) will again follow a Cauchy distribution. This may appear to violate the Law of Large Numbers (Theorem 32.1), which says that \(\bar X\) should converge to a constant as \(n\to\infty\), but the Law of Large Numbers assumes that \(\text{E}\!\left[ X_1 \right] = \mu\), whereas \(\text{E}\!\left[ X_1 \right]\) is undefined for the Cauchy distribution.
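A short simulation makes this vivid (a sketch assuming numpy and scipy; the number of replications is arbitrary): the interquartile range of the sample mean of \(n\) standard Cauchy draws does not shrink as \(n\) grows, and it matches the quartiles \(\pm 1\) of a single standard Cauchy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps = 10_000  # number of simulated sample means for each n

for n in (1, 10, 1000):
    # Each row is an i.i.d. Cauchy sample of size n; take the mean of each row.
    xbar = rng.standard_cauchy((reps, n)).mean(axis=1)
    q25, q75 = np.percentile(xbar, [25, 75])
    print(f"n = {n:4d}: quartiles of the sample mean are ({q25:+.2f}, {q75:+.2f})")

print(stats.cauchy.ppf([0.25, 0.75]))  # quartiles of a single standard Cauchy: [-1, 1]
```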

34.4 Chernoff Bounds

Moment generating functions arise in many areas of probability and statistics. Here is yet another place where MGFs make an unexpected appearance.

Theorem 34.2 (Chernoff bounds) For any random variable \(X\), any constant \(a\), and any \(t > 0\),

\[ P(X \geq a) \leq \frac{M_X(t)}{e^{ta}}. \]

In particular, we can choose \(t\) to make the upper bound as small as possible:

\[ P(X \geq a) \leq \inf_{t > 0} \frac{M_X(t)}{e^{ta}}. \]

Proof

Since \(t>0\), \(X \geq a\) is equivalent to \(e^{tX} \geq e^{ta}\). By Markov’s inequality, \[ P(X \geq a) = P(e^{tX} \geq e^{ta}) \leq \frac{\text{E}\!\left[ e^{tX} \right]}{e^{ta}} = \frac{M_X(t)}{e^{ta}}. \tag{34.4}\]

Although Chernoff bounds are an immediate consequence of Markov’s inequality, they often give much tighter upper bounds, as the next example shows.

Example 34.8 (Comparing Chernoff bounds with Markov’s inequality) Let \(X\) be a \(\textrm{Poisson}(\mu=5)\) random variable. What can we say about \(P(X \geq 20)\)?

Since \(X\) is non-negative, we can apply Markov’s inequality, which gives a bound of \[ P(X \geq 20) \leq \frac{\text{E}\!\left[ X \right]}{20} = 0.25, \] which is technically true but gives little sense of just how small this probability actually is.

On the other hand, the Chernoff bound for the same probability is \[ P(X \geq 20) \leq \inf_{t>0} \frac{M_X(t)}{e^{20t}} = \inf_{t>0} \frac{e^{5(e^t - 1)}}{e^{20t}}, \] where we use the MGF for the Poisson distribution derived in Example 34.1.

We can minimize this quantity over all \(t > 0\) by setting its derivative equal to \(0\): \[ (5e^t - 20) e^{5(e^t - 1) - 20t} = 0, \] which is solved by \(t = \ln(4)\). Substituting this back into the Chernoff bound above, we obtain \[ P(X \geq 20) \leq \frac{e^{5(e^{\ln(4)} - 1)}}{e^{20 \ln(4)}} \approx 0.000003. \]

The Chernoff bound correctly captures how improbable it is to observe such a large value of \(X\). (In fact, the actual probability is about \(0.00000008\).)
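The entire comparison can be reproduced numerically. The sketch below (assuming numpy and scipy; `minimize_scalar` is just one convenient way to search over \(t > 0\), and it works on the logarithm of the bound for numerical stability) reports the Markov bound, the optimized Chernoff bound, and the exact tail probability:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

mu, a = 5.0, 20

def log_chernoff(t):
    # log of M_X(t) / e^{ta} for X ~ Poisson(mu): mu(e^t - 1) - ta
    return mu * (np.exp(t) - 1) - t * a

res = minimize_scalar(log_chernoff, bounds=(1e-6, 10), method="bounded")

print("Markov bound:    ", mu / a)                        # E[X] / 20 = 0.25
print("Optimal t:       ", res.x, " (ln 4 =", np.log(4), ")")
print("Chernoff bound:  ", np.exp(res.fun))               # about 0.000003
print("Exact P(X >= 20):", stats.poisson(mu).sf(a - 1))   # about 0.00000008
```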

Intuitively, the Chernoff bound is better because it uses information about all of the moments (through the MGF), whereas Markov’s inequality only uses information about the first moment. Markov’s inequality needs to hold for any distribution with \(\text{E}\!\left[ X \right] = 5\), while the Chernoff bound takes advantage of the fact that \(X\) is specifically Poisson.

34.5 Exercises

Exercise 34.1 (MGF of the Bernoulli and binomial distributions)  

  1. Calculate the MGF of the \(\text{Bernoulli}(p)\) distribution.
  2. Use your answer to part 1 and Proposition 34.2 to determine the MGF of the \(\text{Binomial}(n, p)\) distribution with minimal calculation.