35  Asymptotic Distributions

In Chapter 33 and Chapter 34, we learned some strategies for determining the exact distribution of a sum or a mean of i.i.d. random variables, but we also saw their limitations. In many situations, it is not feasible to determine the exact distribution, and the best we can do is to approximate the distribution. In general, these approximations will be valid when the sample size \(n\) is large.

In probability and statistics, the study of how random variables behave as \(n \to \infty\) is known as asymptotics. Asymptotics allows us to answer many questions that would otherwise be intractable. Because \(n\) is large in many modern applications, the asymptotic approximation is often very close to the exact answer.

This chapter lays the groundwork for asymptotics, defining precisely what it means for one distribution to approximate another when \(n\) is large. We will apply this theory to sums and means of i.i.d. random variables in Chapter 36, when we discuss the Central Limit Theorem.

35.1 Convergence in Distribution

Suppose we have random variables \(Y_1, Y_2, \dots\) with CDFs \(F_1(y), F_2(y), \dots\), respectively. (We work with the CDF instead of the PMF or PDF because CDFs are defined for both discrete and continuous random variables.) What does it mean to say that the distribution of \(Y_n\) can be approximated by \(F(y)\) as \(n\) gets large? The next definition provides an answer.

Definition 35.1 (Convergence in distribution) Let \(Y_1, Y_2, \dots\) be a sequence of random variables with CDFs \(F_1(y), F_2(y), \dots\). Then, we say that \(Y_n\) converges in distribution to the distribution \(F(y)\) if \[ F_n(y) \to F(y) \quad \text{as } n \to \infty \] for all \(y\) at which \(F(y)\) is continuous.

We write \(Y_n \stackrel{d}{\to} \text{distribution}\). If there is a random variable \(Y\) with that distribution, then we may also write \(Y_n \stackrel{d}{\to} Y\).

Let’s apply Definition 35.1 to the case where \(Y_n = \bar X_n\), the sample mean of \(n\) i.i.d. random variables \(X_1, \dots, X_n\). In Theorem 28.2, we showed that \(\bar X_n\) converges in probability to \(\mu\), where \(\mu \overset{\text{def}}{=}\text{E}\!\left[ X_i \right]\). Now, we will show that \(\bar X_n\) also converges in distribution to \(\mu\).

Theorem 35.1 (Law of Large Numbers (in distribution)) Let \(X_1, X_2, \dots\) be i.i.d. random variables with \(\mu \overset{\text{def}}{=}\text{E}\!\left[ X_i \right]\) and finite variance. Then,

\[ \bar{X}_n \stackrel{d}{\to} \mu. \tag{35.1}\]

Proof

Observe that \(\mu\) is a constant “random variable”, so its CDF is given by \[ F(y) = \begin{cases} 1 & y \geq \mu \\ 0 & y < \mu \end{cases}. \tag{35.2}\] We need to show that the CDF of \(\bar X_n\) converges to Equation 35.2 for all \(y \neq \mu\) (because \(F(y)\) is continuous everywhere except \(y = \mu\)).

Now, consider the CDF of \(\bar{X}_n\): \[ F_n(y) \overset{\text{def}}{=}P(\bar{X}_n \leq y). \] To evaluate this probability, recall from Theorem 28.2 that \(\bar X_n\) converges in probability to \(\mu\), so \[ P(|\bar{X}_n - \mu| \geq \varepsilon) \to 0 \tag{35.3}\] for any \(\varepsilon > 0\). Notice that we can break this probability into two: \[ P(|\bar{X}_n - \mu| \geq \varepsilon) = P(\bar{X}_n - \mu \geq \varepsilon) + P(\bar{X}_n - \mu \leq -\varepsilon), \] so each term individually must also converge to \(0\) as \(n\to\infty\).

Now, consider what happens in each case:

  • If \(y < \mu\), then \[ F_n(y) = P(\bar{X}_n - \mu \leq y - \mu) \to 0, \] where we applied the second term above with \(\varepsilon = \mu - y > 0\).
  • If \(y > \mu\), then \[ \begin{align} F_n(y) &= P(\bar{X}_n - \mu \leq y - \mu) \\ &= 1 - P(\bar{X}_n - \mu > y - \mu) \\ &\to 1, \end{align} \] where we applied the first term above with \(\varepsilon = y - \mu > 0\).

This matches the CDF of the constant \(\mu\) (Equation 35.2) for all \(y \neq \mu\).
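To see this convergence concretely, suppose for illustration that the \(X_i\) are i.i.d. \(\text{Exponential}(1)\), so that \(\mu = 1\) and the CDF of \(\bar X_n\) is available exactly (the sum is Gamma distributed). The R sketch below plots \(F_n(y)\) for several values of \(n\); the exponential model and the values of \(n\) are our choices, purely for illustration.

```r
# CDF of the sample mean of n i.i.d. Exponential(1) random variables:
# sum(X_i) ~ Gamma(n, rate = 1), so P(Xbar_n <= y) = pgamma(n * y, shape = n, rate = 1).
y <- seq(0, 2, length.out = 500)
plot(y, pgamma(5 * y, shape = 5, rate = 1), type = "l", lty = 3,
     xlab = "y", ylab = expression(F[n](y)))
lines(y, pgamma(50 * y, shape = 50, rate = 1), lty = 2)
lines(y, pgamma(500 * y, shape = 500, rate = 1), lty = 1)
abline(v = 1, col = "gray")  # the CDFs approach a step function at mu = 1
legend("topleft", legend = c("n = 5", "n = 50", "n = 500"), lty = 3:1)
```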

We now have two statements of the Law of Large Numbers: Theorem 28.2 says that \(\bar{X}_n\) converges in probability to \(\mu\), while Theorem 35.1 says that \(\bar{X}_n\) converges in distribution to \(\mu\). What is the difference? In general, the two modes of convergence are distinct. However, in the case where the limit is a constant, convergence in probability and convergence in distribution are equivalent.

Theorem 35.2 (Relationship between convergence in probability and convergence in distribution) Let \(Y_1, Y_2, \dots\) be a sequence of random variables, and let \(c\) be a constant. Then, \(Y_n \stackrel{p}{\to} c\) if and only if \(Y_n \stackrel{d}{\to} c\).

Proof

First, suppose \(Y_n \stackrel{p}{\to} c\). The proof of Theorem 35.1 used nothing about \(\bar{X}_n\) other than the fact that it converges in probability to \(\mu\), so the same argument shows that \(Y_n \stackrel{d}{\to} c\).

Conversely, suppose \(Y_n \stackrel{d}{\to} c\), so that the CDFs \(F_n(y)\) converge to \(1\) for \(y > c\) and to \(0\) for \(y < c\). Then, for any \(\varepsilon > 0\), \[ \begin{align} P(|Y_n - c| > \varepsilon) &= P(Y_n > c + \varepsilon) + P(Y_n < c - \varepsilon) \\ &\leq 1 - F_n(c + \varepsilon) + F_n(c - \varepsilon) \\ &\to 1 - 1 + 0 \\ &= 0. \end{align} \]

Because convergence in probability and convergence in distribution are equivalent when the limit is a constant, Theorem 28.2 and Theorem 35.1 are one and the same.

The Law of Large Numbers provides assurance that the sample mean \(\bar X_n\) is a reasonable estimator of \(\mu\). Although we saw in Example 32.4 that \(\bar X_n\) may not necessarily be the estimator with the lowest MSE, \(\bar X_n\) will approach \(\mu\) as we collect more data. This property is known as consistency.

Definition 35.2 (Consistency of an estimator) Let \(\hat\theta_1, \hat\theta_2, \dots\) be a sequence of estimators for a parameter \(\theta\). We say that \(\hat\theta_n\) is a consistent estimator of \(\theta\) if \[ \hat\theta_n \stackrel{p}{\to} \theta. \]
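For example, the R sketch below uses simulation to check that \(\bar X_n\) is a consistent estimator of \(\mu\), by estimating \(P(|\bar X_n - \mu| > \varepsilon)\) for increasing \(n\). The \(\text{Exponential}(1)\) model (so \(\mu = 1\)), the value \(\varepsilon = 0.1\), and the sample sizes are our choices for illustration.

```r
# Monte Carlo estimate of P(|Xbar_n - mu| > eps) for X_i ~ Exponential(1), where mu = 1
set.seed(1)
eps <- 0.1
for (n in c(10, 100, 1000)) {
  xbar <- replicate(10000, mean(rexp(n, rate = 1)))
  cat("n =", n, " estimated P(|Xbar_n - 1| >", eps, ") =",
      mean(abs(xbar - 1) > eps), "\n")
}
```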

35.2 Convergence in Distribution with MGFs

In the examples so far, the sequence of random variables \(Y_1, Y_2, \dots\) has converged in distribution to a constant. Convergence in distribution is more interesting when the limiting distribution is not degenerate.

For example, the code below shows the PMF of a \(\textrm{Poisson}(\mu=n)\) random variable \(X_n\). Try increasing \(n\)—what appears to be the limiting distribution?
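A minimal R sketch of such a plot is shown below, with \(n = 50\); change \(n\) and re-run to explore.

```r
# PMF of a Poisson(mu = n) random variable; try increasing n
n <- 50
x <- 0:(2 * n)
plot(x, dpois(x, lambda = n), type = "h",
     xlab = "x", ylab = expression(P(X[n] == x)),
     main = paste0("Poisson(", n, ") PMF"))
```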

Although \(X_n\) is discrete for all \(n\), this sequence of random variables appears to “converge” to a normal distribution, which is continuous! However, more work is needed to make this statement precise. Notice that the center of the distribution is drifting towards \(\infty\) as \(n\) increases. This is because the mean \(\text{E}\!\left[ X_n \right] = n\) is increasing. Notice also that the spread of the distribution increases as \(n\) increases. This is because the variance \(\text{Var}\!\left[ X_n \right] = n\) is also increasing. Clearly, \(X_n\) diverges as \(n\to\infty\).

In order to make the convergence statement precise, we standardize the random variables, \[ Y_n \overset{\text{def}}{=}\frac{X_n - \text{E}\!\left[ X_n \right]}{\sqrt{\text{Var}\!\left[ X_n \right]}} = \frac{X_n - n}{\sqrt{n}}, \] so that each \(Y_n\) has mean \(0\) and variance \(1\). Now, it is plausible that the sequence \(Y_n\) converges in distribution. We will show that \(Y_n \stackrel{d}{\to} \text{Normal}(0, 1)\).
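The R sketch below makes this plausible by plotting the (suitably rescaled) PMF of \(Y_n\) against the standard normal PDF; the value \(n = 50\) is our choice for illustration.

```r
# PMF of the standardized Poisson Y_n = (X_n - n) / sqrt(n), compared to the N(0, 1) PDF
n <- 50
x <- 0:(2 * n)
y <- (x - n) / sqrt(n)
# The support of Y_n has spacing 1 / sqrt(n), so multiply the PMF by sqrt(n)
# to put it on the same scale as a density.
plot(y, sqrt(n) * dpois(x, lambda = n), type = "h",
     xlab = "y", ylab = "rescaled PMF / density")
curve(dnorm(x), add = TRUE, col = "red")
```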

It is virtually impossible to show this directly using Definition 35.1. This is because there is no simple expression for the CDF of the Poisson distribution: \[ F_n(y) \overset{\text{def}}{=}P(Y_n \leq y) = P(X_n \leq y \sqrt{n} + n) = \sum_{x=0}^{\lfloor y \sqrt{n} + n \rfloor} e^{-n} \frac{n^x}{x!}, \] so it is hopeless to find the limit of this expression as \(n\to\infty\).

However, recall from Chapter 34 that distributions can also be uniquely specified by their MGFs. The limit of the MGF is usually easier to find. The following result guarantees that if the MGF has a limit, then this limit is the MGF of the limiting distribution.

Theorem 35.3 (Lévy-Curtiss continuity theorem) Let \(Y_1, Y_2, \dots\) be a sequence of random variables with MGFs \(M_{Y_1}(t), M_{Y_2}(t), \dots\), respectively. If there exists a function \(M(t)\) such that \[ M_{Y_n}(t) \to M(t) \] for all \(t\) in an open interval containing \(0\), then \(Y_n\) converges in distribution to the distribution with MGF \(M(t)\).

Theorem 35.3 was first proved by Paul Lévy for characteristic functions (Equation 34.4) and extended by John H. Curtiss (1942) to moment generating functions. Curtiss (1909-1977) was an American mathematician and an early advocate for the adoption of computers. He was one of the founders of the Association for Computing Machinery (ACM), which is the largest professional society for computer science today.

We now apply Theorem 35.3 to show that the \(\textrm{Poisson}(\mu=n)\) distribution converges to a normal distribution as \(n\to\infty\).

Example 35.1 (Normal approximation to the Poisson) Let \(S_n\) be a \(\textrm{Poisson}(\mu=n)\) random variable. We will find the limit of \(M_{Y_n}(t)\), where \(Y_n = \frac{S_n - n}{\sqrt{n}}\).

\[ \begin{align} M_{Y_n}(t) &\overset{\text{def}}{=}\text{E}\!\left[ e^{t Y_n} \right] \\ &= \text{E}\!\left[ e^{t \frac{S_n - n}{\sqrt{n}}} \right] \\ &= \text{E}\!\left[ e^{\frac{t}{\sqrt{n}} S_n} \right] e^{- t \sqrt{n}} \\ &= M_{S_n}\Big(\frac{t}{\sqrt{n}}\Big) e^{-t\sqrt{n}} \\ &= e^{n(e^{\frac{t}{\sqrt{n}}} - 1 - \frac{t}{\sqrt{n}})}. \end{align} \] In the last step, we substituted the MGF of the \(\textrm{Poisson}(\mu=n)\) distribution, \(M_{S_n}(t) = e^{n(e^t - 1)}\), and combined the exponents.

Now, we take the limit as \(n\to\infty\). To make the algebra easier, we take logs and reparametrize in terms of \(x \overset{\text{def}}{=}\frac{t}{\sqrt{n}}\); note that for fixed \(t\), \(x \to 0\) as \(n \to \infty\): \[ \log M_{Y_n}(t) = n \left( e^{\frac{t}{\sqrt{n}}} - 1 - \frac{t}{\sqrt{n}} \right) = \frac{t^2}{x^2} \left( e^x - 1 - x \right). \] Now we take the limit as \(x \to 0\): \[ \begin{align} \lim_{n\to\infty} \log M_{Y_n}(t) &= t^2 \lim_{x \to 0} \frac{e^x - 1 - x}{x^2} & \text{(indeterminate form $0/0$)} \\ &= t^2 \lim_{x \to 0} \frac{e^x - 1}{ 2x } & \text{(L'Hôpital's rule, indeterminate form $0/0$)} \\ &= t^2 \lim_{x \to 0} \frac{ e^x }{ 2 } & \text{(L'Hôpital's rule again)} \\ &= \frac{t^2}{2}. \end{align} \]

Now, undoing the log transformation, this implies that \[ M_{Y_n}(t) \to e^{\frac{t^2}{2}}. \] From Example 34.2, we recognize this limit as the MGF of the \(\text{Normal}(0, 1)\) distribution. Therefore, by Theorem 35.3, \(Y_n \stackrel{d}{\to} \text{Normal}(0, 1)\).
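As a numerical sanity check, we can evaluate the exact MGF \(M_{Y_n}(t) = e^{n(e^{t/\sqrt{n}} - 1 - t/\sqrt{n})}\) for increasing \(n\) and compare it to \(e^{t^2/2}\). The values of \(t\) and \(n\) in the R sketch below are our choices for illustration.

```r
# Exact MGF of Y_n = (S_n - n) / sqrt(n), where S_n ~ Poisson(n)
mgf_Yn <- function(t, n) exp(n * (exp(t / sqrt(n)) - 1 - t / sqrt(n)))
t <- 1
for (n in c(10, 100, 10000)) {
  cat("n =", n, " M_Yn(1) =", mgf_Yn(t, n), "\n")
}
cat("limiting MGF exp(t^2 / 2) =", exp(t^2 / 2), "\n")
```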

The upshot of Example 35.1 is that we can use the normal distribution to approximate probabilities for a Poisson distribution. We say that the Poisson distribution is asymptotically normal.

Example 35.2 (Approximating a Poisson probability) The number of students \(S\) who enroll in a course is a \(\textrm{Poisson}(\mu=70)\) random variable. The department has decided that if the enrollment is 80 or more, the course will be divided into two lectures, whereas if fewer than 80 students enroll, there will be one lecture. What is the probability that there will be two lectures?

We can calculate this probability exactly using the Poisson distribution, although the infinite sum is tedious to evaluate by hand: \[ P(S \geq 80) = \sum_{x=80}^\infty e^{-70} \frac{70^x}{x!}. \]

However, we can use Example 35.1, which says \[ \frac{S - 70}{\sqrt{70}} \stackrel{\cdot}{\sim} \text{Normal}(0, 1), \] or equivalently, \[ S \stackrel{\cdot}{\sim} \textrm{Normal}(\mu= 70, \sigma^2= 70). \] (The dot over \(\sim\) indicates that this is only an approximate or asymptotic distribution.)

The code below plots the normal PDF over the Poisson PMF. The approximation is very good!
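A minimal R version of such a plot:

```r
# Poisson(70) PMF with the Normal(mu = 70, sigma^2 = 70) PDF overlaid
x <- 30:110
plot(x, dpois(x, lambda = 70), type = "h",
     xlab = "x", ylab = "probability / density")
curve(dnorm(x, mean = 70, sd = sqrt(70)), add = TRUE, col = "red")
```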

Now, we use the approximation to calculate the probability: \[ \begin{align} P(S \geq 80) &= P\left(\frac{S - 70}{\sqrt{70}} \geq \frac{80 - 70}{\sqrt{70}} \right) \\ &\approx P\left(Z \geq \sqrt{\frac{10}{7}}\right) \\ &= 1 - \Phi\left( \sqrt{\frac{10}{7}} \right), \end{align} \] where \(Z\) is a standard normal random variable.

The code below compares how close the approximate probability is to the exact answer.
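A minimal R sketch of this comparison, using ppois for the exact probability and pnorm for the approximation:

```r
# Exact probability: P(S >= 80) = 1 - P(S <= 79)
exact <- 1 - ppois(79, lambda = 70)
# Normal approximation from above: 1 - Phi(sqrt(10/7))
normal_approx <- 1 - pnorm(sqrt(10 / 7))
c(exact = exact, normal_approx = normal_approx)
```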

In the above example, it was not difficult to use R to obtain an exact probability (or at least one that is very close). Why settle for an approximation? Certainly the normal approximation was necessary in the age before computers, and it is still useful today for proving theoretical results. We will soon encounter problems where an approximation is the only feasible answer.

Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise. — John Tukey

35.3 Exercises

Exercise 35.1 (Consistency of the normal variance MLE when mean is known) Let \(X_1, \dots, X_n\) be i.i.d. \(\text{Normal}(\mu, \sigma^2)\), where \(\mu\) is known (but \(\sigma^2\) is not). Is the MLE that you derived in Exercise 31.4 consistent for \(\sigma^2\)?

Exercise 35.2 (Consistency of the uniform MLE) Is the MLE that you derived in Exercise 30.4 consistent for \(\theta\)?

Hint: You can obtain an explicit expression for \(P(|\hat\theta_n - \theta| > \varepsilon)\).

Exercise 35.3 (Poisson approximation to the binomial via MGFs) In Theorem 12.1, we showed that the Poisson distribution was an approximation to the binomial distribution when \(n\) is large and \(p\) is small.

Let \(X_n \sim \text{Binomial}(n, p=\frac{\mu}{n})\). Use MGFs to find the limiting distribution as \(n \to\infty\).

Exercise 35.4 (Asymptotics for the geometric distribution) Let \(X_n \sim \text{Geometric}(p=\frac{1}{n})\). Find the limiting distribution of \(\frac{1}{n} X_n\) as \(n \to \infty\).