34  Moment Generating Functions

In Chapter 33, we saw how to compute the PMF or PDF of \(X + Y\) when \(X\) and \(Y\) are independent. However, the calculations could be quite involved.

In this chapter, we discuss moment generating functions, which, among other things, will allow us to determine the distribution of a sum of independent random variables without calculating any complicated convolutions.

34.1 Moment Generating Functions

We know that the expected value is not enough to uniquely identify a distribution. For example, there are many distributions with \(\text{E}\!\left[ X \right] = 2\), including

  • \(\textrm{Binomial}(n= 4, p= 0.5)\)
  • \(\textrm{Poisson}(\mu=2)\)
  • \(\textrm{Exponential}(\lambda=1/2)\)
  • \(\textrm{Gamma}(\alpha= 2, \lambda= 1)\)
  • \(\textrm{Normal}(\mu= 2, \sigma^2= 2)\).

But if we also knew that \(\text{E}\!\left[ X^2 \right] = 6\) (or equivalently, \(\text{Var}\!\left[ X \right] = \text{E}\!\left[ X^2 \right] - \text{E}\!\left[ X \right]^2 = 2\)), then that whittles down the possibilities to

  • \(\textrm{Poisson}(\mu=2)\)
  • \(\textrm{Gamma}(\alpha= 2, \lambda= 1)\)
  • \(\textrm{Normal}(\mu= 2, \sigma^2= 2)\).

It turns out that if we knew \(\text{E}\!\left[ X^k \right]\) for every positive integer \(k\), then (under mild conditions) that would uniquely identify the distribution. \(\text{E}\!\left[ X^k \right]\) is called the \(k\)th moment of the distribution. How do we represent all of the moments of a distribution at once? We use a function called the moment generating function.

Definition 34.1 Let \(X\) be a random variable. Then, the moment generating function (MGF) of \(X\) is \[\begin{align*} M_X(t) \overset{\text{def}}{=}\text{E}\!\left[ e^{tX} \right] \end{align*} \tag{34.1}\] and uniquely identifies the distribution of \(X\).

Why does the MGF contain information about all of the moments? If we Taylor expand \(e^{tX}\) and take expectations term by term, we obtain

\[\begin{align*} \text{E}\!\left[ e^{tX} \right] &= \text{E}\!\left[ 1 + tX + \frac{(tX)^2}{2!} + \frac{(tX)^3}{3!} + \dots \right] \\ &= 1 + t \text{E}\!\left[ X \right] + \frac{t^2}{2!} \text{E}\!\left[ X^2 \right] + \frac{t^3}{3!} \text{E}\!\left[ X^3 \right] + \dots. \end{align*} \tag{34.2}\]

Up to the factorials, the moments are simply the coefficients in this Taylor expansion: the coefficient of \(t^k\) is \(\text{E}\!\left[ X^k \right]/k!\). The mathematician Herbert Wilf visualized a generating function as a “clothesline” on which we could “hang” a sequence of numbers, like the moments of a distribution.

Figure 34.1: “A generating function is a clothesline on which we hang up a sequence of numbers for display.” – Herbert Wilf
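To make the clothesline picture concrete, here is a small symbolic sketch. It assumes the sympy library (not part of this text) and a toy random variable \(X\) that is equally likely to be 1, 2, or 3, so that \(M_X(t) = (e^t + e^{2t} + e^{3t})/3\); reading off the Taylor coefficients of \(M_X\) recovers \(\text{E}\!\left[ X^k \right]/k!\).

```python
import sympy as sp

t = sp.symbols('t')

# Toy example: X is equally likely to be 1, 2, or 3,
# so M_X(t) = (e^t + e^{2t} + e^{3t}) / 3.
M = (sp.exp(t) + sp.exp(2 * t) + sp.exp(3 * t)) / 3

# Taylor-expand M_X(t) around t = 0, keeping terms up to t^3.
poly = sp.series(M, t, 0, 4).removeO()

for k in range(4):
    coeff = poly.coeff(t, k)                               # coefficient of t^k
    moment = sum(x**k for x in (1, 2, 3)) / sp.Integer(3)  # E[X^k] computed directly
    print(k, coeff, moment / sp.factorial(k))              # last two columns agree
```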

Because the MGF contains information about all of the moments, it uniquely identifies the distribution. The proof of the following theorem is beyond the scope of this book, but it follows from the fact that the MGF is essentially a Laplace transform of the distribution.

Theorem 34.1 (Uniqueness of MGFs) Let \(X\) and \(Y\) be random variables with respective MGFs \(M_X(t)\) and \(M_Y(t)\). If \(M_X(t) = M_Y(t)\) for all \(t\) in an open interval around \(0\), then \(X\) and \(Y\) have the same distribution.

Let’s compute the MGFs for some named distributions. We begin with a discrete example.

Example 34.1 (MGF of Poisson) Let \(X \sim \text{Poisson}(\mu)\). Then, \[ M_X(t) = \sum_{n=0}^\infty e^{tn} \frac{e^{-\mu} \mu^n}{n!} = e^{-\mu} \sum_{n=0}^\infty \frac{\left(\mu e^t \right)^n}{n!} = e^{-\mu} e^{\mu e^t} = e^{\mu(e^t - 1)}. \]
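As a quick numerical sanity check of this formula (a sketch assuming numpy; the value of \(\mu\), the sample size, and the values of \(t\) are arbitrary choices), we can compare a Monte Carlo estimate of \(\text{E}\!\left[ e^{tX} \right]\) against \(e^{\mu(e^t - 1)}\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0
samples = rng.poisson(mu, size=1_000_000)

for t in (-0.5, 0.1, 0.5):
    empirical = np.mean(np.exp(t * samples))  # Monte Carlo estimate of E[e^{tX}]
    exact = np.exp(mu * (np.exp(t) - 1))      # closed form derived above
    print(f"t = {t:+.1f}: empirical = {empirical:.4f}, exact = {exact:.4f}")
```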

We can also calculate MGFs for continuous distributions.

Example 34.2 (MGF of normal) Let \(X \sim \text{Normal}(\mu, \sigma^2)\). Rather than calculate \(\text{E}\!\left[ e^{tX} \right]\) directly, it is easier to calculate \(\text{E}\!\left[ e^{tZ} \right]\) first, where \(Z\) is standard normal, and then use the fact that \(X = \mu + \sigma Z\).

First, the MGF of a standard normal is \[\begin{align*} M_Z(t) &= \int_{-\infty}^\infty e^{tx} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx \\ &= \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} \exp\left\{ - \frac{x^2-2tx}{2} \right\} \, dx \\ &= \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{(x-t)^2}{2} + \frac{t^2}{2} \right\} \, dx \\ &= e^{t^2/2} \int_{-\infty}^\infty \underbrace{\frac{1}{\sqrt{2\pi}} e^{-(x-t)^2/2}}_{\text{PDF of $\text{Normal}(t, 1)$}} \, dx \\ &= e^{t^2/2}. \end{align*}\]

Since \(X = \mu + \sigma Z\), we can use properties of expectation and the result above to derive the MGF of \(X\): \[\begin{align*} M_X(t) = \text{E}\!\left[ e^{tX} \right] &= \text{E}\!\left[ e^{t(\mu+\sigma Z)} \right] \\ &= \text{E}\!\left[ e^{t\mu} e^{t\sigma Z} \right] \\ &= e^{t \mu} \text{E}\!\left[ e^{(t \sigma) Z} \right] \\ &= e^{t \mu} M_Z(t \sigma) \\ &= e^{t \mu} e^{(t \sigma)^2/2} \\ &= e^{\mu t + \frac{1}{2} \sigma^2 t^2}. \end{align*}\]
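The same two-step argument can be verified symbolically. The following sketch (assuming sympy) evaluates the integral for \(M_Z(t)\) and then builds \(M_X(t) = e^{\mu t} M_Z(\sigma t)\); the final line should print 0.

```python
import sympy as sp

x, t, mu = sp.symbols('x t mu', real=True)
sigma = sp.symbols('sigma', positive=True)

# MGF of a standard normal: integrate e^{tx} against the standard normal PDF.
phi = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)
M_Z = sp.integrate(sp.exp(t * x) * phi, (x, -sp.oo, sp.oo))
print(sp.simplify(M_Z))  # expect exp(t**2/2)

# M_X(t) = e^{mu t} M_Z(sigma t), as in the derivation above.
M_X = sp.exp(mu * t) * M_Z.subs(t, sigma * t)
print(sp.simplify(M_X - sp.exp(mu * t + sigma**2 * t**2 / 2)))  # expect 0
```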

34.2 Generating Moments

We know that the moment generating function contains information about all of the moments. But how do we recover individual moments from the MGF? The name “moment generating function” suggests that it should be possible.

Proposition 34.1 Let \(X\) have MGF \(M_X(t)\). Then, the \(k\)th moment of \(X\) is \[ \text{E}\!\left[ X^k \right] = M_X^{(k)}(0). \tag{34.3}\]

Proof

If we differentiate \(M_X(t)\), we obtain \[ M_X'(t) = \frac{d}{dt} \text{E}\!\left[ e^{tX} \right] = \text{E}\!\left[ \frac{d}{dt} e^{tX} \right] = \text{E}\!\left[ X e^{tX} \right], \] and evaluating at \(t=0\) yields \[ M_X'(0) = \text{E}\!\left[ X e^{0X} \right] = \text{E}\!\left[ X \right]. \]

Taking another derivative, we see that \[ M_X''(t) = \frac{d}{dt} \text{E}\!\left[ X e^{tX} \right] = \text{E}\!\left[ \frac{d}{dt} X e^{tX} \right] = \text{E}\!\left[ X^2 e^{tX} \right], \] and evaluating at \(t=0\) again, we see that \[ M_X''(0) = \text{E}\!\left[ X^2 e^{0X} \right] = \text{E}\!\left[ X^2 \right]. \]

The general result can be proved by induction.

Another way to see this result is to compare Equation 34.2 with the general Taylor expansion of \(M_X(t)\) around \(t = 0\): the coefficient of \(t^k\) is \(\text{E}\!\left[ X^k \right]/k!\) by Equation 34.2 and \(M_X^{(k)}(0)/k!\) in general, so \(\text{E}\!\left[ X^k \right] = M_X^{(k)}(0)\).

We can use Proposition 34.1 to provide alternative derivations of familiar expectations and variances.

Example 34.3 (Mean and variance of Poisson) Let \(X \sim \text{Poisson}(\mu)\). In Example 34.1, we saw \[ M_X(t) = \exp \left\{ \mu (e^t - 1) \right\}. \] Differentiating once, we get \[ M_X'(t) = \mu e^t \exp \left\{ \mu (e^t - 1) \right\}, \] and differentiating again, we get \[ M_X''(t) = \mu e^t \exp \left\{ \mu (e^t - 1) \right\} + \mu^2 e^{2t} \exp \left\{ \mu (e^t - 1) \right\}. \] Thus, \(\text{E}\!\left[ X \right] = M_X'(0) = \mu\), and \(\text{E}\!\left[ X^2 \right] = M_X ''(0) = \mu + \mu^2\). Hence, \[ \text{Var}\!\left[ X \right] = (\mu + \mu^2) - \mu^2 = \mu. \]

Example 34.4 (Mean and variance of normal) Let \(X \sim \text{Normal}(\mu, \sigma^2)\). In Example 34.2, we saw \[ M_X(t) = e^{\mu t + \frac{1}{2} \sigma^2 t^2}. \] Differentiating, we get \[ M_X'(t) = (\mu + \sigma^2 t) e^{\mu t + \frac{1}{2} \sigma^2 t^2} \] and \[ M_X''(t) = (\mu + \sigma^2 t)^2 e^{\mu t + \frac{1}{2} \sigma^2 t^2} + \sigma^2 e^{\mu t + \frac{1}{2} \sigma^2 t^2}. \] Hence, \(\text{E}\!\left[ X \right] = \mu\) and \(\text{E}\!\left[ X^2 \right] = \mu^2 + \sigma^2\), and thus, \(\text{Var}\!\left[ X \right] = \sigma^2\).
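Both calculations can be delegated to a computer algebra system. Here is a brief sketch (assuming sympy) that applies Proposition 34.1 to the two MGFs derived earlier:

```python
import sympy as sp

t = sp.symbols('t', real=True)
mu = sp.symbols('mu', positive=True)
sigma = sp.symbols('sigma', positive=True)

def moment(mgf, k):
    """k-th moment via Proposition 34.1: the k-th derivative of the MGF at t = 0."""
    return sp.simplify(sp.diff(mgf, t, k).subs(t, 0))

M_pois = sp.exp(mu * (sp.exp(t) - 1))          # Poisson(mu) MGF (Example 34.1)
M_norm = sp.exp(mu * t + sigma**2 * t**2 / 2)  # Normal(mu, sigma^2) MGF (Example 34.2)

print(moment(M_pois, 1), moment(M_pois, 2))    # mu and mu + mu^2
print(moment(M_norm, 1), moment(M_norm, 2))    # mu and mu^2 + sigma^2
```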

As these examples show, generating moments from the MGF is often messy, and there are usually easier ways to calculate moments. Fortunately, there are more practical uses of the MGF, as we discuss next.

34.3 MGF of a Sum

As promised at the beginning of this chapter, the MGF will allow us to determine the distribution of the sum of independent random variables without doing convolutions. This is because the MGF of a sum is easy to calculate.

Proposition 34.2 Let \(X\) and \(Y\) be independent random variables with respective MGFs \(M_X(t)\) and \(M_Y(t)\). Then, \[ M_{X+Y}(t) = M_X(t) M_Y(t). \]

Proof

\[\begin{align*} M_{X+Y}(t) &= \text{E}\!\left[ e^{t(X+Y)} \right] \\ &= \text{E}\!\left[ e^{tX} e^{tY} \right] \\ &= \text{E}\!\left[ e^{tX} \right] \text{E}\!\left[ e^{tY} \right] & \text{($X$ and $Y$ are independent)} \\ &= M_X(t) M_Y(t). \end{align*}\]

Together with the fact that the MGF uniquely determines the distribution (Theorem 34.1), this result allows us to determine the distributions of sums with very little calculation.

Example 34.5 (Sum of (independent) Poissons with the MGF) Let \(X \sim \text{Poisson}(\mu_1)\) and \(Y \sim \text{Poisson}(\mu_2)\) be independent random variables. In Example 33.2, we used convolutions to show that \(X + Y \sim \text{Poisson}(\mu_1 + \mu_2)\).

Here is another approach that uses MGFs. From Example 34.1, we know that \[ M_{X + Y}(t) = M_X(t) M_Y(t) = e^{\mu_1(e^t - 1)} e^{\mu_2(e^t - 1)}. \]

By grouping terms, we see that \[ M_{X + Y}(t) = e^{(\mu_1 + \mu_2)(e^t - 1)}, \] which we recognize as the MGF of a \(\text{Poisson}(\mu_1 + \mu_2)\) random variable. Therefore, \(X + Y\) must be \(\text{Poisson}(\mu_1 + \mu_2)\) by Theorem 34.1.
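This conclusion is easy to corroborate by simulation. In the sketch below (assuming numpy and scipy; the parameter values are arbitrary), the empirical PMF of a simulated \(X + Y\) matches the \(\text{Poisson}(\mu_1 + \mu_2)\) PMF:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu1, mu2, n = 1.5, 2.5, 1_000_000

total = rng.poisson(mu1, n) + rng.poisson(mu2, n)  # simulated values of X + Y

for k in range(8):
    empirical = np.mean(total == k)                # empirical P(X + Y = k)
    exact = stats.poisson(mu1 + mu2).pmf(k)        # Poisson(mu1 + mu2) PMF
    print(f"k = {k}: empirical = {empirical:.4f}, exact = {exact:.4f}")
```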

Example 34.6 (Sum of (independent) normals with the MGF) Let \(X \sim \text{Normal}(\mu_1, \sigma_1^2)\) and \(Y \sim \text{Normal}(\mu_2, \sigma_2^2)\) be independent. In Example 33.3, we determined the distribution of \(X + Y\) in the special case where \(\mu_1 = \mu_2 = 0\) and \(\sigma_1^2 = \sigma_2^2 = 1\), and it was already quite a complex calculation. However, MGFs will allow us to dispatch the general case with ease.

\[\begin{align*} M_{X+Y}(t) &= M_X(t) M_Y(t) = e^{\mu_1 t + \frac{1}{2} \sigma_1^2 t^2} e^{\mu_2 t + \frac{1}{2} \sigma_2^2 t^2} \\ &= e^{(\mu_1 + \mu_2) t + \frac{1}{2} (\sigma_1^2 + \sigma_2^2) t^2}, \end{align*} \] which we recognize as the MGF of the \(\text{Normal}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)\) distribution. Hence, \[ X + Y \sim \text{Normal}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2) \] by Theorem 34.1.
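As with the Poisson case, we can spot-check this by simulation (assuming numpy and scipy; the parameters are arbitrary): a Kolmogorov–Smirnov test should not reject the hypothesis that simulated values of \(X + Y\) come from \(\text{Normal}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu1, sigma1 = 1.0, 2.0
mu2, sigma2 = -3.0, 0.5
n = 100_000

total = rng.normal(mu1, sigma1, n) + rng.normal(mu2, sigma2, n)  # X + Y

# Compare against Normal(mu1 + mu2, sigma1^2 + sigma2^2).
sd = np.sqrt(sigma1**2 + sigma2**2)
print(stats.kstest(total, "norm", args=(mu1 + mu2, sd)))  # expect a large p-value
```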

Although MGFs can be a powerful tool, they are not a replacement for convolutions. In the examples above, we were only able to determine the distribution of \(X + Y\) because we recognized its MGF. In general, \(X + Y\) may not follow a familiar distribution, and the approach above only yields its MGF—not its PMF or PDF, which is usually more useful.

Another problem with MGFs is that they do not always exist. For example, the Cauchy distribution from Example 26.4 does not even have a first moment, so it does not have an MGF. Fortunately, there is a way around this problem. Instead of the MGF, we consider the characteristic function \[ \varphi_X(t) \overset{\text{def}}{=}\text{E}\!\left[ e^{itX} \right], \] where \(i \overset{\text{def}}{=}\sqrt{-1}\) is the imaginary unit. Like the MGF, the characteristic function turns convolutions into products, \[ \varphi_{X + Y}(t) = \varphi_X(t) \varphi_Y(t), \] and uniquely identifies the distribution. (This is because the characteristic function is essentially the Fourier transform of the distribution.) But unlike the MGF, it always exists, including for distributions that do not have moments, like the Cauchy distribution.

Example 34.7 (Sample mean of Cauchy random variables) What happens if we average i.i.d. Cauchy random variables \(X_1, \dots, X_n\)? The answer may be surprising!

To compute the characteristic function for a Cauchy random variable \(X_1\), we would need to evaluate \[ \varphi_{X_1}(t) = \int_{-\infty}^\infty e^{itx} \frac{1}{\pi(1 + x^2)} \,dx, \] which requires techniques from complex analysis and is beyond the scope of this book. But the characteristic function can be shown to be \[ \varphi_{X_1}(t) = e^{-|t|}. \] (Yes, the characteristic function in this case is real-valued, although in general it is complex-valued.)

Now, if we observe i.i.d. random variables \(X_1, \dots, X_n\) from a Cauchy distribution, then the characteristic function of the sample mean \(\bar X\) is \[ \varphi_{\bar X}(t) = \text{E}\!\left[ e^{i (t/n) \sum_{i=1}^n X_i} \right] = \text{E}\!\left[ e^{i (t/n) X_1} \right]^n = \varphi_{X_1}(t/n)^n = \left(e^{-|t/n|}\right)^n = e^{-|t|}, \] which is again the characteristic function of a Cauchy random variable. Therefore, no matter how many i.i.d. Cauchy random variables we average, \(\bar X\) will again follow a Cauchy distribution. This may appear to violate the Law of Large Numbers (Theorem 32.1), which says that \(\bar X\) should converge to a constant as \(n\to\infty\), but the Law of Large Numbers assumes that \(\text{E}\!\left[ X_1 \right] = \mu\), whereas \(\text{E}\!\left[ X_1 \right]\) is undefined for the Cauchy distribution.
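A short simulation makes this vivid (a sketch assuming numpy and scipy; the number of replications is arbitrary): the interquartile range of the sample mean of \(n\) standard Cauchy draws does not shrink as \(n\) grows, and it matches the quartiles \(\pm 1\) of a single standard Cauchy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps = 10_000  # number of simulated sample means for each n

for n in (1, 10, 1000):
    # Each row is an i.i.d. Cauchy sample of size n; take the mean of each row.
    xbar = rng.standard_cauchy((reps, n)).mean(axis=1)
    q25, q75 = np.percentile(xbar, [25, 75])
    print(f"n = {n:4d}: quartiles of the sample mean are ({q25:+.2f}, {q75:+.2f})")

print(stats.cauchy.ppf([0.25, 0.75]))  # quartiles of a single standard Cauchy: [-1, 1]
```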

34.4 Chernoff Bounds

Moment generating functions arise in many areas of probability and statistics. Here is yet another place where MGFs make an unexpected appearance.

Theorem 34.2 (Chernoff bounds) For any random variable \(X\), any constant \(a\), and any \(t > 0\),

\[ P(X \geq a) \leq \frac{M_X(t)}{e^{ta}}. \]

In particular, we can choose \(t\) to make the upper bound as small as possible:

\[ P(X \geq a) \leq \inf_{t > 0} \frac{M_X(t)}{e^{ta}}. \]

Proof

Since \(t>0\), \(X \geq a\) is equivalent to \(e^{tX} \geq e^{ta}\). By Markov’s inequality, \[ P(X \geq a) = P(e^{tX} \geq e^{ta}) \leq \frac{\text{E}\!\left[ e^{tX} \right]}{e^{ta}} = \frac{M_X(t)}{e^{ta}}. \tag{34.4}\]

Although Chernoff bounds are an immediate consequence of Markov’s inequality, they often give much tighter upper bounds, as the next example shows.

Example 34.8 (Comparing Chernoff bounds with Markov’s inequality) Let \(X\) be a \(\textrm{Poisson}(\mu=5)\) random variable. What can we say about \(P(X \geq 20)\)?

Since \(X\) is non-negative, we can apply Markov’s inequality, which gives a bound of \[ P(X \geq 20) \leq \frac{\text{E}\!\left[ X \right]}{20} = 0.25, \] which is technically true but gives little sense of just how small this probability actually is.

On the other hand, the Chernoff bound for the same probability is \[ P(X \geq 20) \leq \inf_{t>0} \frac{M_X(t)}{e^{20t}} = \inf_{t>0} \frac{e^{5(e^t - 1)}}{e^{20t}}, \] where we use the MGF for the Poisson distribution derived in Example 34.1.

We can minimize this quantity over all \(t > 0\) by setting its derivative equal to \(0\): \[ (5e^t - 20) e^{5(e^t - 1) - 20t} = 0, \] which is solved by \(t = \ln(4)\). Substituting this back into the Chernoff bound above, we obtain \[ P(X \geq 20) \leq \frac{e^{5(e^{\ln(4)} - 1)}}{e^{20 \ln(4)}} \approx 0.000003. \]

The Chernoff bound correctly captures how improbable it is to observe such a large value of \(X\). (In fact, the actual probability is about \(0.00000008\).)
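The entire comparison can be reproduced numerically. The sketch below (assuming numpy and scipy; `minimize_scalar` is just one convenient way to search over \(t > 0\), and it works on the logarithm of the bound for numerical stability) reports the Markov bound, the optimized Chernoff bound, and the exact tail probability:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

mu, a = 5.0, 20

def log_chernoff(t):
    # log of M_X(t) / e^{ta} for X ~ Poisson(mu): mu(e^t - 1) - ta
    return mu * (np.exp(t) - 1) - t * a

res = minimize_scalar(log_chernoff, bounds=(1e-6, 10), method="bounded")

print("Markov bound:    ", mu / a)                        # E[X] / 20 = 0.25
print("Optimal t:       ", res.x, " (ln 4 =", np.log(4), ")")
print("Chernoff bound:  ", np.exp(res.fun))               # about 0.000003
print("Exact P(X >= 20):", stats.poisson(mu).sf(a - 1))   # about 0.00000008
```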

Intuitively, the Chernoff bound is better because it uses information about all of the moments (through the MGF), whereas Markov’s inequality only uses information about the first moment. Markov’s inequality needs to hold for any distribution with \(\text{E}\!\left[ X \right] = 5\), while the Chernoff bound takes advantage of the fact that \(X\) is specifically Poisson.

34.5 Exercises

Exercise 34.1 (MGF of the Bernoulli and binomial distributions)  

  1. Calculate the MGF of the \(\text{Bernoulli}(p)\) distribution.
  2. Use your answer to part 1 and Proposition 34.2 to determine the MGF of the \(\text{Binomial}(n, p)\) distribution with minimal calculation.