In the previous part, we learned about estimation theory; in particular, we learned how to find the MLE (maximum likelihood estimator) of an unknown parameter. Additionally, we learned how to compute bias and variance of an estimator; we concluded that the MSE (mean squared error) of an estimator tells us how good it is.
When computing MLEs, we ran across \(\bar{X}\) quite often; in this part, we will be inspecting the various properties of the sample sum and the sample mean.
32.1 Recap
In Proposition 30.1, we saw that, if \(X_1, \dots, X_n\) are i.i.d. with \(\text{E}\!\left[ X_1 \right] = \mu\), then \(\bar{X}\) is an unbiased estimator of \(\mu\); i.e., \[
\text{E}\!\left[ \bar{X} \right] = \mu.
\] We can similarly compute the variance of \(\bar{X}\) quite easily.
Proposition 32.1 (Variance of \(\bar{X}\)) Let \(X_1, \dots, X_n\) be i.i.d. with \(\text{Var}\!\left[ X_1 \right] = \sigma^2\). Then, \[
\text{Var}\!\left[ \bar{X} \right] = \frac{\sigma^2}{n}.
\]
Proof
Since \(X_1, \dots, X_n\) are independent, we see that \[\begin{align*}
\text{Var}\!\left[ \bar{X} \right] &= \text{Var}\!\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] \\
&= \frac{1}{n^2} \text{Var}\!\left[ \sum_{i=1}^n X_i \right] \\
&\stackrel{\text{ind}}{=} \frac{1}{n^2} \sum_{i=1}^n \text{Var}\!\left[ X_i \right] \\
&= \frac{1}{n^2} \sum_{i=1}^n \sigma^2 \\
&= \frac{1}{n^2} \cdot n \sigma^2 \\
&= \frac{\sigma^2}{n}.
\end{align*}\]
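Proposition 32.1 can be sanity-checked by simulation. The following is a minimal sketch, not part of the original text: the normal population, the variance `sigma2 = 4`, the number of repetitions `N`, and the seed are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

sigma2 = 4.0   # population variance (assumed for illustration)
N = 20_000     # number of simulated samples per sample size (assumed)

def var_of_sample_mean(n):
    """Empirical variance of X-bar over N simulated samples of size n."""
    samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(N, n))
    return samples.mean(axis=1).var()

results = {n: var_of_sample_mean(n) for n in (5, 20, 80)}
for n, v in results.items():
    print(f"n = {n:3d}: simulated Var = {v:.4f}, sigma^2 / n = {sigma2 / n:.4f}")
```

Up to simulation noise, the two printed columns should agree, and quadrupling \(n\) should cut the variance by a factor of about four.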
32.2 Distribution/consistency of \(\bar{X}\)
We know that, for \(X_1, \dots, X_n\) i.i.d. with \(\text{E}\!\left[ X_1 \right] = \mu\) and \(\text{Var}\!\left[ X_1 \right] = \sigma^2\), \[
\text{E}\!\left[ \bar{X} \right] = \mu \qquad \text{and} \qquad \text{Var}\!\left[ \bar{X} \right] = \frac{\sigma^2}{n}.
\] What does the distribution of \(\bar{X}\) look like? We can consider the following two examples.
Example 32.1 (Sample mean of exponential) Let \(X_1, \dots, X_n\) be i.i.d. \(\text{Exponential}(\lambda)\). Then, the MLE for the mean \(\displaystyle \mu = \frac{1}{\lambda}\) is \(\hat{\mu}_{\text{MLE}} = \bar{X}\). We run \(N\) simulations where we take \(n\) samples \(X_1, \dots, X_n\); we then plot the results.
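A simulation along these lines might look like the sketch below. It assumes a population mean of 5 (so \(\lambda = 0.2\)), \(N = 2000\) simulations, and NumPy as the tool; none of these choices is prescribed by the example. Instead of plotting, it prints the average and the standard deviation of the simulated sample means.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed

mu = 5.0     # assumed population mean, so lambda = 1 / mu = 0.2
N = 2_000    # number of simulations (assumed)

def simulate_sample_means(n):
    """Return N sample means, each computed from n Exponential draws."""
    # NumPy parameterizes the exponential by its scale 1 / lambda (the mean).
    draws = rng.exponential(scale=mu, size=(N, n))
    return draws.mean(axis=1)

summary = {}
for n in (10, 100, 1000):
    means = simulate_sample_means(n)
    summary[n] = (means.mean(), means.std())
    print(f"n = {n:5d}: average of X-bar = {summary[n][0]:.3f}, "
          f"std of X-bar = {summary[n][1]:.3f}")
```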
Try changing the value of \(n\) to 10, 100, 1000, and 10000.
Example 32.2 (Sample mean of Poisson) Let \(X_1, \dots, X_n\) be i.i.d. \(\text{Poisson}(\lambda)\). Then, the MLE for the mean \(\lambda\) is \(\hat{\lambda}_{\text{MLE}} = \bar{X}\). We run \(N\) simulations where we take \(n\) samples \(X_1, \dots, X_n\); we then plot the results.
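One way to look at the shape of the simulated sample means without a plotting library is a crude text histogram. The sketch below assumes \(\lambda = 5\), \(N = 2000\), and \(n = 100\); these values, the seed, and the choice of 12 bins are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed

lam = 5.0    # assumed Poisson mean
N = 2_000    # number of simulations (assumed)
n = 100      # sample size

# Each row is one simulated sample of size n; each row mean is one X-bar.
means = rng.poisson(lam=lam, size=(N, n)).mean(axis=1)

# Bin the sample means and print one bar per bin, labeled by the bin midpoint.
counts, edges = np.histogram(means, bins=12)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    bar = "#" * int(round(60 * count / counts.max()))
    print(f"{(left + right) / 2:5.2f} | {bar}")
```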
Try changing the value of \(n\) to 10, 100, 1000, and 10000.
Two distributions make for a small sample, but in both examples, as we increase \(n\), we observe that
the distribution of the sample mean seems to take on the shape of a bell curve; and
the distribution of the sample mean tends to concentrate around 5, the population mean in both examples.
We will focus on the second point in this section; we will come back to the first point in Chapter 36.
How can we mathematically write “the sample mean tends to concentrate around the population mean”? One way is to say that it is extremely unlikely for the distance between \(\bar{X}\) and \(\mu\) to be large; i.e., \[
P( \lvert \bar{X} - \mu \rvert > \varepsilon ) \approx 0
\] as \(n\) gets larger, for any \(\varepsilon > 0\). In fact, we say that \(\bar{X}\) converges to \(\mu\) in probability if \[
P( \lvert \bar{X} - \mu \rvert > \varepsilon ) \to 0
\] as \(n \to \infty\), for any \(\varepsilon > 0\). We denote this as \(\bar{X} \stackrel{p}{\to} \mu\).
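The probability in this definition can be estimated by Monte Carlo. The sketch below is an illustration under assumptions that are not in the text: a Poisson population with mean 5, tolerance \(\varepsilon = 0.25\), and 4000 repetitions per sample size.

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # arbitrary seed

mu = 5.0      # assumed population mean (Poisson with lambda = 5)
eps = 0.25    # tolerance epsilon (assumed)
N = 4_000     # Monte Carlo repetitions per sample size (assumed)

def tail_probability(n):
    """Estimate P(|X-bar - mu| > eps) from N simulated samples of size n."""
    means = rng.poisson(lam=mu, size=(N, n)).mean(axis=1)
    return np.mean(np.abs(means - mu) > eps)

estimates = {n: tail_probability(n) for n in (10, 100, 1000)}
for n, p in estimates.items():
    print(f"n = {n:5d}: estimated P(|X-bar - mu| > {eps}) = {p:.4f}")
```

The estimated probabilities should shrink toward 0 as \(n\) grows, which is what convergence in probability describes.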
It turns out the Weak Law of Large Numbers states that this is true for any underlying distribution of \(X_1, \dots, X_n\), as long as they are i.i.d. with a finite mean \(\mu\) (the version proved in this chapter also assumes a finite variance). In statistics, we say that \(\bar{X}\) is a consistent estimator of \(\mu\).
32.3 Markov’s and Chebyshev’s inequalities
Before proving the Weak Law of Large Numbers, we need two inequalities.
Proposition 32.2 (Markov’s inequality) If \(X\) is a nonnegative random variable, then \[
P(X \geq a) \leq \frac{\text{E}\!\left[ X \right]}{a}
\] for any \(a > 0\).
Proof
For \(a > 0\), let \[
I_{X \geq a} = \begin{cases} 1, & X \geq a \\ 0, & X < a \end{cases}.
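\] In other words, \(I_{X \geq a}\) is the indicator of whether \(X \geq a\) or not. If \(X \geq a\), then \(\displaystyle \frac{X}{a} \geq 1 = I_{X \geq a}\); otherwise, \(\displaystyle \frac{X}{a} \geq 0 = I_{X \geq a}\) because \(X\) is nonnegative. In either case, \(\displaystyle I_{X \geq a} \leq \frac{X}{a}\), and so, \[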
\text{E}\!\left[ I_{X \geq a} \right] \leq \text{E}\!\left[ \frac{X}{a} \right] = \frac{\text{E}\!\left[ X \right]}{a}.
\] However, \(\text{E}\!\left[ I_{X \geq a} \right] = P(X \geq a)\), and the result follows.
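To get a feel for how tight Markov's inequality is, one can compare the bound with an exact tail probability. The sketch below assumes \(X \sim \text{Exponential}(\lambda)\) with \(\lambda = 0.2\) (an illustrative choice), for which \(P(X \geq a) = e^{-\lambda a}\) and \(\text{E}[X] = 1/\lambda\).

```python
import math

rate = 0.2           # assumed rate lambda, so E[X] = 1 / rate = 5
mean = 1.0 / rate

def exact_tail(a):
    """P(X >= a) for X ~ Exponential(rate), from the survival function."""
    return math.exp(-rate * a)

def markov_bound(a):
    """Markov's upper bound E[X] / a."""
    return mean / a

for a in (2.5, 5.0, 10.0, 20.0):
    print(f"a = {a:5.1f}: P(X >= a) = {exact_tail(a):.4f}, "
          f"Markov bound = {markov_bound(a):.4f}")
```

The bound always holds, but it is only informative when \(a > \text{E}[X]\); for smaller \(a\), it exceeds 1.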
Chebyshev’s inequality follows immediately from Markov’s inequality.
Proposition 32.3 (Chebyshev’s inequality) If \(X\) is a random variable with mean \(\mu\) and variance \(\sigma^2\), then \[
P(\lvert X - \mu \rvert \geq k) \leq \frac{\sigma^2}{k^2}
\] for any \(k > 0\).
Proof
Let \(k > 0\) be arbitrary. Since \((X - \mu)^2\) is a nonnegative random variable, we can use Proposition 32.2 to get \[
P( (X - \mu)^2 \geq k^2 ) \leq \frac{\text{E}\!\left[ (X - \mu)^2 \right]}{k^2}.
\] Note that the events \(\left\{ (X - \mu)^2 \geq k^2 \right\}\) and \(\left\{ \lvert X - \mu \rvert \geq k \right\}\) are equivalent. Also noting that \(\text{E}\!\left[ (X - \mu)^2 \right] = \sigma^2\) yields the desired result \[
P( \lvert X - \mu \rvert \geq k ) \leq \frac{\sigma^2}{k^2}.
\]
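As a small illustration of Chebyshev's inequality, consider a fair six-sided die (an example chosen here, not taken from the text). The sketch below uses exact rational arithmetic to compare \(P(\lvert X - \mu \rvert \geq k)\) with the bound \(\sigma^2 / k^2\) for a few assumed values of \(k\).

```python
from fractions import Fraction

outcomes = range(1, 7)   # faces of a fair die
p = Fraction(1, 6)       # probability of each face

mu = sum(p * x for x in outcomes)                # mean of one roll
var = sum(p * (x - mu) ** 2 for x in outcomes)   # variance of one roll

def exact_prob(k):
    """P(|X - mu| >= k), computed by summing over the six faces."""
    return sum(p for x in outcomes if abs(x - mu) >= k)

def chebyshev_bound(k):
    """Chebyshev's upper bound sigma^2 / k^2."""
    return var / k ** 2

for k in (Fraction(3, 2), Fraction(2), Fraction(5, 2)):
    print(f"k = {k}: exact = {exact_prob(k)}, bound = {chebyshev_bound(k)}")
```

In each case the exact probability sits below the bound, sometimes by a wide margin; Chebyshev's inequality trades sharpness for generality.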
32.4 Weak Law of Large Numbers
We are finally ready to prove the main result of this chapter!
Theorem 32.1 (Weak Law of Large Numbers) Let \(X_1, \dots, X_n\) be i.i.d. with \(\text{E}\!\left[ X_1 \right] = \mu\) and \(\text{Var}\!\left[ X_1 \right] = \sigma^2 < \infty\). Then, \[
P( \lvert \bar{X} - \mu \rvert > \varepsilon) \to 0 \qquad \text{as $n \to \infty$}
\] for any \(\varepsilon > 0\). In other words, \(\bar{X}\) converges to \(\mu\) in probability.
Proof
Let \(\varepsilon > 0\) be arbitrary. Since \(\text{E}\!\left[ \bar{X} \right] = \mu\) and \(\text{Var}\!\left[ \bar{X} \right] = \sigma^2/n\), \[
P( \lvert \bar{X} - \mu \rvert > \varepsilon ) \leq \frac{\sigma^2/n}{\varepsilon^2} = \frac{1}{n} \cdot \frac{\sigma^2}{\varepsilon^2}
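\] by Proposition 32.3 applied to \(\bar{X}\) with \(k = \varepsilon\), since the event \(\left\{ \lvert \bar{X} - \mu \rvert > \varepsilon \right\}\) is contained in \(\left\{ \lvert \bar{X} - \mu \rvert \geq \varepsilon \right\}\).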
Hence, it follows that \[
P( \lvert \bar{X} - \mu \rvert > \varepsilon ) \to 0
\] as \(n \to \infty\).
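The bound in the proof is also quantitative. As a worked illustration under assumed values (not taken from the text), suppose \(X_1, \dots, X_n\) are i.i.d. \(\text{Poisson}(5)\), so that \(\mu = 5\) and \(\sigma^2 = 5\), and take \(\varepsilon = 0.5\). Then, \[
P( \lvert \bar{X} - 5 \rvert > 0.5 ) \leq \frac{1}{n} \cdot \frac{5}{(0.5)^2} = \frac{20}{n},
\] so a sample of size \(n \geq 400\) is enough to make this probability at most \(0.05\). The actual probability is typically much smaller than the bound, since Chebyshev's inequality uses only the mean and the variance.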
One consequence of the Weak Law of Large Numbers is the consistency of \(\bar{X}\): the larger the sample size, the more likely the sample mean is to be close to the population mean.