In Chapter 33 and Chapter 34, we learned some strategies for determining the exact distribution of a sum or a mean of i.i.d. random variables, but we also saw their limitations. In many situations, it is not feasible to determine the exact distribution, and the best we can do is to approximate the distribution. In general, these approximations will be valid when the sample size \(n\) is large.
In probability and statistics, the study of how random variables behave as \(n \to \infty\) is known as asymptotics. Asymptotics allow us to answer many questions that otherwise would be intractable. Because \(n\) is large in many modern applications, the asymptotic approximation is often very close to the exact answer.
This chapter lays the groundwork for asymptotics, defining precisely what it means for one distribution to approximate another when \(n\) is large. We will apply this theory to sums and means of i.i.d. random variables in Chapter 36, when we discuss the Central Limit Theorem.
35.1 Convergence in Distribution
Suppose we have random variables \(Y_1, Y_2, \dots\) with CDFs \(F_1(y), F_2(y), \dots\), respectively. (We work with the CDF instead of the PMF or PDF because CDFs are defined for both discrete and continuous random variables.) What does it mean to say that the distribution of \(Y_n\) can be approximated by \(F(y)\) as \(n\) gets large? The next definition provides an answer.
Definition 35.1 (Convergence in distribution) Let \(Y_1, Y_2, \dots\) be a sequence of random variables with CDFs \(F_1(y), F_2(y), \dots\). Then, we say that \(Y_n\) converges in distribution to the distribution \(F(y)\) if \[
F_n(y) \to F(y) \quad \text{as } n \to \infty
\] for all \(y\) at which \(F(y)\) is continuous.
We write \(Y_n \stackrel{d}{\to} [\text{distribution}]\). If there is a random variable \(Y\) with that distribution, then we may also write \(Y_n \stackrel{d}{\to} Y\).
Let’s apply Definition 35.1 to the case where \(Y_n = \bar X_n\), the sample mean of \(n\) i.i.d. random variables \(X_1, \dots, X_n\). In Theorem 28.2, we showed that \(\bar X_n\) converges in probability to \(\mu\), where \(\mu \overset{\text{def}}{=}\text{E}\!\left[ X_i \right]\). Now, we will show that \(\bar X_n\) also converges in distribution to \(\mu\).
Theorem 35.1 (Law of Large Numbers (in distribution)) Let \(X_1, X_2, \dots\) be i.i.d. random variables with \(\mu \overset{\text{def}}{=}\text{E}\!\left[ X_i \right]\) and finite variance. Then,
\[
\bar{X}_n \stackrel{d}{\to} \mu.
\tag{35.1}\]
Proof
Observe that \(\mu\) is a constant “random variable”, so its CDF is given by \[
F(y) = \begin{cases} 1 & y \geq \mu \\ 0 & y < \mu \end{cases}.
\tag{35.2}\] We need to show that the CDF of \(\bar X_n\) converges to Equation 35.2 for all \(y \neq \mu\) (because \(F(y)\) is continuous everywhere except \(y = \mu\)).
Now, consider the CDF of \(\bar{X}_n\): \[
F_n(y) \overset{\text{def}}{=}P(\bar{X}_n \leq y).
\] To evaluate the limit of this probability as \(n\to\infty\), recall from Theorem 28.2 that \(\bar X_n\) converges in probability to \(\mu\), so \[
P(|\bar{X}_n - \mu| \geq \varepsilon) \to 0
\] for any \(\varepsilon > 0\).
For \(y < \mu\), taking \(\varepsilon = \mu - y\) shows that \(F_n(y) = P(\bar{X}_n \leq y) \leq P(|\bar{X}_n - \mu| \geq \mu - y) \to 0\). For \(y > \mu\), taking \(\varepsilon = y - \mu\) shows that \(1 - F_n(y) = P(\bar{X}_n > y) \leq P(|\bar{X}_n - \mu| \geq y - \mu) \to 0\), so \(F_n(y) \to 1\). Therefore, \(F_n(y)\) converges to Equation 35.2 for all \(y \neq \mu\).
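Although it is not part of the proof, we can visualize this convergence by simulating sample means and plotting their empirical CDFs for increasing \(n\). The sketch below uses Exponential(1) draws, so that \(\mu = 1\); the choice of distribution and the plotting details are arbitrary.

```r
# Illustration (not part of the proof): the empirical CDF of the sample mean
# of n Exponential(1) draws (so mu = 1) approaches a step function at mu.
set.seed(1)
par(mfrow = c(1, 3))
for (n in c(10, 100, 1000)) {
  xbar <- replicate(2000, mean(rexp(n, rate = 1)))
  plot(ecdf(xbar), xlim = c(0, 2), main = paste("n =", n),
       xlab = "y", ylab = expression(F[n](y)))
}
```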
We now have two statements of the Law of Large Numbers: Theorem 28.2 says that \(\bar{X}_n\) converges in probability to \(\mu\), while Theorem 35.1 says that \(\bar{X}_n\) converges in distribution to \(\mu\). What is the difference? In general, the two modes of convergence are distinct. However, in the case where the limit is a constant, convergence in probability and convergence in distribution are equivalent.
Theorem 35.2 (Relationship between convergence in probability and convergence in distribution) Let \(Y_1, Y_2, \dots\) be a sequence of random variables. Then, \(Y_n \stackrel{p}{\to} c\) if and only if \(Y_n \stackrel{d}{\to} c\) for some constant \(c\).
Proof
First, suppose \(Y_n \stackrel{p}{\to} c\). The proof of Theorem 35.1 used nothing about \(\bar{X}_n\) except the fact that it converges in probability to \(\mu\), so the same argument shows that \(Y_n \stackrel{d}{\to} c\).
Conversely, suppose \(Y_n \stackrel{d}{\to} c\), so the CDFs \(F_n(y)\) converge to \(1\) if \(y > c\) and \(0\) if \(y < c\). Then, \[
\begin{align}
P(|Y_n - c| > \varepsilon) &= P(Y_n > c + \varepsilon) + P(Y_n < c - \varepsilon) \\
&\leq 1 - F_n(c + \varepsilon) + F_n(c - \varepsilon) \\
&\to 1 - 1 + 0 \\
&= 0.
\end{align}
\]
Because convergence in probability and convergence in distribution are equivalent when the limit is a constant, Theorem 28.2 and Theorem 35.1 are one and the same.
The Law of Large Numbers provides reassurance that the sample mean \(\bar X_n\) is a reasonable estimator of \(\mu\). Although we saw in Example 32.4 that \(\bar X_n\) may not always be the “best” estimator, \(\bar X_n\) will approach \(\mu\) as we collect more data. This property is known as consistency.
Definition 35.2 (Consistency of an estimator) Let \(\hat\theta_1, \hat\theta_2, \dots\) be a sequence of estimators for a parameter \(\theta\). We say that \(\hat\theta_n\) is a consistent estimator of \(\theta\) if \[
\hat\theta_n \stackrel{p}{\to} \theta.
\]
35.2 Convergence in Distribution with MGFs
In the examples so far, the sequence of random variables \(Y_1, Y_2, \dots\) has converged in distribution to a constant. Convergence in distribution is more interesting when the limiting distribution is not degenerate.
For example, the code below shows the PMF of a \(\text{Binomial}(n, p)\) random variable \(X_n\). Try increasing \(n\)—what appears to be the limiting distribution?
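A minimal R sketch of such a plot (the value \(p = 0.5\) and the plotting details are illustrative choices) is:

```r
# Sketch: plot the Binomial(n, p) PMF for a given n and p; try increasing n.
n <- 50
p <- 0.5

x <- 0:n
plot(x, dbinom(x, size = n, prob = p), type = "h",
     xlab = "x", ylab = expression(P(X[n] == x)),
     main = paste0("Binomial(", n, ", ", p, ") PMF"))
```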
Although \(X_n\) is discrete for all \(n\), this sequence of random variables appears to “converge” to a normal distribution, which is continuous! However, more work is needed to make this statement precise. Notice that the center of the distribution is drifting towards \(\infty\) as \(n\) increases. This is because the mean \(\text{E}\!\left[ X_n \right] = np\) is increasing. Notice also that the spread of the distribution increases as \(n\) increases. This is because the variance \(\text{Var}\!\left[ X_n \right] = np(1-p)\) is also increasing. Clearly, \(X_n\) diverges as \(n\to\infty\).
In order to make the convergence statement precise, we standardize the random variables, \[
Z_n \overset{\text{def}}{=}\frac{X_n - \text{E}\!\left[ X_n \right]}{\sqrt{\text{Var}\!\left[ X_n \right]}} = \frac{X_n - np}{\sqrt{np(1-p)}},
\] so that each \(Z_n\) has mean \(0\) and variance \(1\). Now, it is plausible that the sequence \(\{ Z_n \}\) converges in distribution. We can show that \(Z_n \stackrel{d}{\to} \textrm{Normal}(\mu= 0, \sigma^2= 1)\), or equivalently that \[
Y_n \overset{\text{def}}{=}\frac{X_n - np}{\sqrt{n}} \stackrel{d}{\to} \textrm{Normal}(\mu= 0, \sigma^2= p(1 - p)).
\]
It is virtually impossible to show this directly using Definition 35.1. This is because there is no simple expression for the CDF of the binomial distribution: \[
\begin{align}
F_n(y) \overset{\text{def}}{=}P(Y_n \leq y) &= P\left(X_n \leq y \sqrt{n} + np\right) \\
&= \sum_{x=0}^{\lfloor y \sqrt{n} + np \rfloor} {n \choose x} p^x (1 - p)^{n-x},
\end{align}
\] so it is hopeless to find the limit of this expression as \(n\to\infty\).
However, recall from Chapter 34 that distributions can also be uniquely specified by their MGFs. It is often easier to take the limit of the MGF. The following result guarantees that if the MGF has a limit, then this limit is the MGF of the limiting distribution.
Theorem 35.3 (Lévy-Curtiss continuity theorem) Let \(Y_1, Y_2, \dots\) be a sequence of random variables with MGFs \(M_{Y_1}(t), M_{Y_2}(t), \dots\), respectively. If there exists a function \(M(t)\) such that \[
M_{Y_n}(t) \to M(t)
\] for all \(t\) in an open interval containing \(0\), then \(Y_n\) converges in distribution to the distribution with MGF \(M(t)\).
Theorem 35.3 was first proved by Paul Lévy for characteristic functions (Equation 34.5) and extended by John H. Curtiss (1942) to moment generating functions. Curtiss (1909-1977) was an American mathematician and an early advocate for the adoption of computers. He was one of the founders of the Association for Computing Machinery (ACM), which is the largest professional society for computer science today.
We now apply Theorem 35.3 to show that the \(\text{Binomial}(n, p)\) distribution converges to a normal distribution as \(n\to\infty\).
Example 35.1 (Normal approximation to the binomial) Let \(X_n\) be a \(\text{Binomial}(n, p)\) random variable. We will find the limit of \(M_{Y_n}(t)\), where \(Y_n = \frac{X_n - np}{\sqrt{n}}\).
\[
\begin{align}
M_{Y_n}(t) &\overset{\text{def}}{=}\text{E}\!\left[ e^{t Y_n} \right] \\
&= \text{E}\!\left[ e^{t \frac{X_n - np}{\sqrt{n}}} \right] \\
&= \text{E}\!\left[ e^{\frac{t}{\sqrt{n}} X_n} \right] e^{-np \frac{t}{\sqrt{n}}} \\
&= M_{X_n}\left(\frac{t}{\sqrt{n}}\right) e^{-np \frac{t}{\sqrt{n}}} \\
&= \left(1-p + pe^{\frac{t}{\sqrt{n}}}\right)^n e^{-np \frac{t}{\sqrt{n}}},
\end{align}
\] where we used the MGF of the binomial distribution from Exercise 34.1.
Now, we take the limit as \(n\to\infty\). To make the algebra easier, we take logs and reparametrize in terms of \(\delta \overset{\text{def}}{=}\frac{t}{\sqrt{n}}\): \[
\begin{align}
\log M_{Y_n}(t) &= n \log\left(1-p + pe^{\frac{t}{\sqrt{n}}}\right) - np \frac{t}{\sqrt{n}} \\
&= \frac{t^2}{\delta^2} \left( \log(1 - p + pe^{\delta}) - p\delta \right)
\end{align}
\] Now we take the limit as \(\delta \to 0\) (which corresponds to \(n \to \infty\)) using L’Hôpital’s rule (Theorem B.5). \[
\begin{align}
\lim_{\delta\to 0} \log M_{Y_n}(t) &= t^2 \lim_{\delta\to 0} \frac{\log(1 - p + pe^{\delta}) - p\delta}{\delta^2} \\
&= t^2 \lim_{\delta\to 0} \frac{\frac{pe^{\delta}}{1 - p + pe^{\delta}} - p}{2\delta} & \text{(L'H\^{o}pital's rule)} \\
&= t^2 \lim_{\delta\to 0} \frac{pe^{\delta} - p(1 - p + pe^{\delta})}{2\delta} \underbrace{\lim_{\delta \to 0} \frac{1}{1 - p + pe^{\delta}}}_{1} & \text{(properties of limits)} \\
&= t^2 \lim_{\delta\to 0} \frac{pe^{\delta} - p^2 e^{\delta}}{2}& \text{(L'H\^{o}pital again)} \\
&= \frac{t^2}{2} p(1 - p)
\end{align}
\]
Now, undoing the log transformation, this implies that \[
M_{Y_n}(t) \to e^{\frac{t^2}{2} p(1 - p)}.
\] From Example 34.2, we recognize this limit as the MGF of the \(\textrm{Normal}(\mu= 0, \sigma^2= p(1-p))\) distribution. Therefore, by Theorem 35.3, \(Y_n \stackrel{d}{\to} \textrm{Normal}(\mu= 0, \sigma^2= p(1-p))\).
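As a numerical sanity check (not part of the derivation), we can evaluate \(M_{Y_n}(t)\) for increasing \(n\) and compare it with the limiting MGF. The sketch below uses the arbitrary choices \(p = 0.18\) and \(t = 1\).

```r
# Compare M_{Y_n}(t) = (1 - p + p e^{t/sqrt(n)})^n e^{-np t/sqrt(n)}
# with the limiting MGF exp(t^2 p (1 - p) / 2). The values of p and t are arbitrary.
p <- 0.18
t <- 1

mgf_Yn <- function(n) {
  (1 - p + p * exp(t / sqrt(n)))^n * exp(-n * p * t / sqrt(n))
}

sapply(c(10, 100, 1000, 10000), mgf_Yn)  # approaches the limit below
exp(t^2 * p * (1 - p) / 2)               # limiting MGF at t
```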
The upshot of Example 35.1 is that we can use the normal distribution to approximate probabilities for a binomial distribution. We say that the binomial distribution is asymptotically normal.
Example 35.2 (Approximating a binomial probability) In Example 29.3, we considered a skew die with a probability \(p = 0.18\) of landing on a six. If the die is rolled 25 times, what is the probability that a six comes up 7 times or more?
The number of times \(X\) that the die lands on a six is a \(\textrm{Binomial}(n= 25, p= 0.18)\) random variable. Assuming that \(n = 25\) is large enough for the approximation in Example 35.1 to be accurate, we know that \[
\frac{X - 25(0.18)}{\sqrt{25(0.18)(1 - 0.18)}} \stackrel{\cdot}{\sim} \text{Normal}(0, 1),
\] or equivalently, \[
X \stackrel{\cdot}{\sim} \textrm{Normal}(\mu= \underbrace{25(0.18)}_{4.5}, \sigma^2= \underbrace{25(0.18)(1 - 0.18)}_{3.69}).
\] (The dot over \(\sim\) indicates that this is only an approximate or asymptotic distribution.)
The code below plots the normal PDF over the binomial PMF. Even though the binomial PMF is quite skewed, the normal curve still seems to fit well!
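A minimal R sketch of such a plot (the plotting details are our own choices) is:

```r
# Overlay the Normal(mean = 4.5, var = 3.69) PDF on the Binomial(25, 0.18) PMF.
n <- 25
p <- 0.18

x <- 0:n
plot(x, dbinom(x, size = n, prob = p), type = "h",
     xlab = "number of sixes", ylab = "probability / density")
curve(dnorm(x, mean = n * p, sd = sqrt(n * p * (1 - p))),
      add = TRUE, col = "red")
```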
Now, we use the approximation to calculate the probability: \[
\begin{align}
P(X \geq 7) &= P\left(\frac{X - 4.5}{\sqrt{3.69}} \geq \frac{7 - 4.5}{\sqrt{3.69}} \right) \\
&\approx P\left(Z \geq \frac{7 - 4.5}{\sqrt{3.69}}\right) \\
&= 1 - \Phi\left( \frac{7 - 4.5}{\sqrt{3.69}} \right),
\end{align}
\] where \(Z\) is a standard normal random variable.
The code below compares how close the approximate probability is to the exact answer.
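A minimal R sketch of that comparison:

```r
# Exact probability: P(X >= 7) = 1 - P(X <= 6) for X ~ Binomial(25, 0.18)
1 - pbinom(6, size = 25, prob = 0.18)

# Normal approximation (no continuity correction)
1 - pnorm((7 - 4.5) / sqrt(3.69))
```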
You might notice that the approximation is quite far off. This is because we are approximating a discrete distribution (binomial) by a continuous one (normal). For a continuous distribution, there is no difference between \(P(X \geq 7)\) and \(P(X > 7)\), even though these probabilities are different for a discrete distribution. To ensure that the possibility of \(7\) sixes is included in the continuous approximation, we calculate \(P(X \geq 6.5)\), which is the same for the binomial distribution but different for the normal approximation: \[
\begin{align}
P(X \geq 6.5) &= P\left(\frac{X - 4.5}{\sqrt{3.69}} \geq \frac{6.5 - 4.5}{\sqrt{3.69}} \right) \\
&\approx P\left(Z \geq \frac{6.5 - 4.5}{\sqrt{3.69}}\right) \\
&= 1 - \Phi\left( \frac{6.5 - 4.5}{\sqrt{3.69}} \right),
\end{align}
\]
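A sketch of the corrected calculation, alongside the exact probability:

```r
# Normal approximation with continuity correction: use 6.5 instead of 7
1 - pnorm((6.5 - 4.5) / sqrt(3.69))

# Exact binomial probability for comparison
1 - pbinom(6, size = 25, prob = 0.18)
```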
This small continuity correction makes the approximation accurate to three decimal places!
In the preceding example, we were able to use R to obtain an exact probability (up to floating point precision). Why would anyone settle for an approximation when the exact probability is so easy to obtain? Admittedly, the normal approximation was more important in the age before computers. However, it is still useful today, as shown in the following example.
Example 35.3 (Determining sample size) In Example 35.2, we saw that with \(n = 25\) rolls of a skew die with \(p = 0.18\), the probability that more than \(1/4\) of the rolls are sixes (i.e., \(7\) or more) is about \(0.149\). How many times would we need to roll the die so that this probability is less than \(0.01\)?
Answering this question exactly would require writing code to try different values of \(n\) and compute the corresponding binomial probabilities. It is easier to use the normal approximation. If the die is rolled \(n\) times, then the number of sixes \(X\) is a \(\textrm{Binomial}(n, p= 0.18)\) random variable, so by Example 35.1, \[
X \stackrel{\cdot}{\sim} \textrm{Normal}(\mu= 0.18n, \sigma^2= \underbrace{0.18(1 - 0.18)}_{0.1476} n),
\] and therefore \[
P(X > n/4) \approx P\left(Z > \frac{n/4 - 0.18n}{\sqrt{0.1476 n}}\right) = 1 - \Phi\left( \frac{0.07 n}{\sqrt{0.1476 n}} \right).
\] Now, we set this expression less than \(0.01\) and solve for \(n\). \[
\begin{align}
P(X > n/4) \approx 1 - \Phi\left( \frac{0.07 n}{\sqrt{0.1476 n}} \right) &< 0.01 \\
\frac{0.07 n}{\sqrt{0.1476 n}} &> \Phi^{-1}(0.99) \\
n &> \left(\frac{\Phi^{-1}(0.99) \sqrt{0.1476}}{0.07}\right)^2.
\end{align}
\]
We calculate this in R.
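A minimal sketch of that calculation, using `qnorm` for \(\Phi^{-1}\):

```r
# Smallest n satisfying n > (qnorm(0.99) * sqrt(0.1476) / 0.07)^2
n_min <- (qnorm(0.99) * sqrt(0.1476) / 0.07)^2
n_min
ceiling(n_min)
```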
We round up to \(n = 164\) and check that the (exact) probability of rolling more than \(1/4\) sixes is less than \(0.01\), as desired.
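A sketch of that check (with \(n = 164\), rolling more than \(1/4\) sixes means \(X > 41\)):

```r
# Exact probability of rolling more than 1/4 sixes in n = 164 rolls
n <- 164
1 - pbinom(n / 4, size = n, prob = 0.18)  # P(X > 41)
```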
35.3 Exercises
Exercise 35.1 (Consistency of the normal variance MLE when mean is known) Let \(X_1, \dots, X_n\) be i.i.d. \(\text{Normal}(\mu, \sigma^2)\), where \(\mu\) is known (but \(\sigma^2\) is not). Is the MLE that you derived in Exercise 31.5 consistent for \(\sigma^2\)?
Exercise 35.2 (Consistency of the uniform MLE) Is the MLE that you derived in Exercise 30.5 consistent for \(\theta\)?
Hint: You can obtain an explicit expression for \(P(|\hat\theta_n - \theta| > \epsilon)\).
Exercise 35.3 (MSE and Consistency) Let \(\hat\theta_n\) be an estimator for \(\theta\) such that \[
\text{MSE} = \text{E}\!\left[ (\hat\theta_n - \theta)^2 \right] \to 0
\] as \(n\to\infty\).
Show that \(\hat\theta_n\) is consistent for \(\theta\).
Exercise 35.4 (Asymptotic distribution of the minimum of uniforms) Let \(U_1, \dots, U_n\) be i.i.d. \(\textrm{Uniform}(a= 0, b= 1)\) random variables.
Determine the CDF of \[
Y_n \overset{\text{def}}{=}n \min(U_1, \dots, U_n).
\] To what named distribution does \(Y_n\) converge in distribution?
Exercise 35.5 (Poisson approximation to the binomial via MGFs) In Theorem 12.1, we showed that the Poisson distribution was an approximation to the binomial distribution when \(n\) is large and \(p\) is small.
Let \(X_n \sim \text{Binomial}(n, p=\frac{\mu}{n})\). Use MGFs to identify the limiting distribution as \(n \to\infty\).
Exercise 35.6 (Proof of Law of Large Numbers via MGFs) Let \(X_1, \dots, X_n\) be i.i.d. random variables with MGF \(M(t)\). Use MGFs and Theorem 35.3 to show that \(\bar{X}_n \stackrel{d}{\to} \mu\), where \(\mu \overset{\text{def}}{=}\text{E}\!\left[ X_i \right]\).
Exercise 35.7 (Asymptotics for the geometric distribution) Let \(X \sim \text{Geometric}(p=\frac{1}{n})\). Identify the limiting distribution of \(\frac{1}{n} X\) as \(n \to \infty\).
Exercise 35.8 (Normal approximation to the Poisson) Let \(X_n \sim \text{Poisson}(n)\). Show that \[
\frac{X_n - n}{\sqrt{n}} \stackrel{d}{\to} \text{Normal}(0, 1).
\]
Exercise 35.9 (Class sections) The number of students \(S\) who enroll in a course is a \(\textrm{Poisson}(\mu=105)\) random variable. The department has decided that if the enrollment exceeds 120, then the course will be divided into two sections (otherwise, there will only be one section). What is the probability that there will be two sections?
Use R to answer this question exactly.
Use Exercise 35.8 to come up with an approximate answer.
How do the two probabilities compare?
Curtiss, John H. 1942. “A Note on the Theory of Moment Generating Functions.” The Annals of Mathematical Statistics 13 (4): 430–33.