We have seen that when we have \(X_1, \dots, X_n\) i.i.d. from any distribution, a general estimator for \(\mu = \text{E}\!\left[ X_1 \right]\) is \(\bar X\). Although not always optimal, \(\bar X\) is unbiased and consistent for \(\mu\).
Is there a similar estimator for \(\sigma^2 = \text{Var}\!\left[ X_1 \right]\)? We explore this question in this chapter.
A Preliminary Estimator
Since \(\text{Var}\!\left[ X_1 \right] \overset{\text{def}}{=}\text{E}\!\left[ (X_1 - \mu)^2 \right]\), one idea is to replace all of the expectations by their sample averages. That is, we plug in \(\bar X\) for \(\mu\) and estimate the variance as
\[
\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2.
\tag{38.1}\] Is this a good estimator? The following lemma will help.
Lemma 38.1 (A useful decomposition) Let \(a_1, \dots, a_n\) be any real numbers, and define \(\bar a \overset{\text{def}}{=}\frac{1}{n}\sum_{i=1}^n a_i\). Then, for any real number \(c\),
\[
\frac{1}{n} \sum_{i=1}^n (a_i - c)^2 = \frac{1}{n} \sum_{i=1}^n (a_i - \bar a)^2 + (\bar a - c)^2.
\tag{38.2}\]
Equation 38.2 can be proven using algebra, but here is a slick way that uses Theorem 31.1.
Define \(\hat\theta\) to be a random variable that takes on the values \(\{ a_1, \dots, a_n \}\), each with probability \(1/n\). Then \(\text{E}\!\left[ \hat\theta \right] = \bar a\) and \(\text{Var}\!\left[ \hat\theta \right] = \frac{1}{n} \sum_{i=1}^n (a_i - \bar a)^2\).
Now, by the bias-variance decomposition, we know that \[\begin{align}
\frac{1}{n} \sum_{i=1}^n (a_i - c)^2 &= \text{E}\!\left[ (\hat\theta - c)^2 \right] \\
&= \text{Var}\!\left[ \hat\theta \right] + (\text{E}\!\left[ \hat\theta \right] - c)^2 \\
&= \frac{1}{n} \sum_{i=1}^n (a_i - \bar a)^2 + (\bar a - c)^2,
\end{align}\] as we wanted to show.
An insightful consequence of Lemma 38.1 is that the mean \(\bar a\) is the value of \(c\) that is “closest” to the values \(a_1, \dots, a_n\). That is, \(\bar a\) minimizes the sum of squared deviations \[\frac{1}{n} \sum_{i=1}^n (a_i - c)^2,\] since if we look at the right-hand side of Equation 38.2, the first term does not depend on \(c\), and the second term is minimized by choosing \(c = \bar a\).
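If you want to check Equation 38.2 numerically, here is a minimal NumPy sketch (the particular numbers, the value of \(c\), and the random seed are arbitrary choices for illustration, not part of the result):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=10)   # arbitrary numbers a_1, ..., a_n
c = 2.5                   # any real number c

lhs = np.mean((a - c) ** 2)
rhs = np.mean((a - a.mean()) ** 2) + (a.mean() - c) ** 2

print(np.isclose(lhs, rhs))   # True: both sides of Equation 38.2 agree
```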
Now we apply Lemma 38.1 to the study of the variance estimator \(\hat\sigma^2\) (Equation 38.1). If we let \(a_i = X_i\) and \(c = \mu\), then Lemma 38.1 says
\[
\frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 = \underbrace{\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2}_{\hat\sigma^2} + (\bar X - \mu)^2.
\] The first term on the right-hand side is the variance estimator. So we can express the estimator alternatively as \[ \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 - (\bar X - \mu)^2. \tag{38.3}\] In this form, it is easy to assess the consistency and bias of \(\hat\sigma^2\).
Example 38.1 (Consistency and bias of \(\hat\sigma^2\)) The first term on the right-hand side of Equation 38.3 converges in probability to \(\text{E}\!\left[ (X_i - \mu)^2 \right] = \text{Var}\!\left[ X_i \right] = \sigma^2\) by the Law of Large Numbers. The second term converges in probability to \(0\), since \(\bar X \stackrel{p}{\to} \mu\) by the Law of Large Numbers and the function \(x \mapsto (x - \mu)^2\) is continuous. Therefore, \(\hat\sigma^2\) converges in probability to \(\sigma^2\), and \(\hat\sigma^2\) is consistent.
On the other hand, it is biased. To show this, we calculate its expectation: \[\begin{align}
\text{E}\!\left[ \hat\sigma^2 \right] &= \text{E}\!\left[ \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \right] - \text{E}\!\left[ (\bar X - \mu)^2 \right] \\
&= \text{E}\!\left[ (X_i - \mu)^2 \right] - \text{E}\!\left[ (\bar X - \mu)^2 \right] \\
&= \text{Var}\!\left[ X_i \right] - \text{Var}\!\left[ \bar X \right] \\
&= \sigma^2 - \frac{\sigma^2}{n} \\
&= \frac{n-1}{n} \sigma^2,
\end{align}\] which is close to, but not quite equal to, \(\sigma^2\).
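The bias factor \(\frac{n-1}{n}\) shows up clearly in simulation. Here is a minimal sketch; the choice of an Exponential(1) population (which has \(\sigma^2 = 1\)), the sample size, and the number of replications are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5                                     # small sample size, so the bias is visible

# 100,000 samples of size n from an Exponential(1) population (sigma^2 = 1).
x = rng.exponential(scale=1.0, size=(100_000, n))

# Plug-in estimator for each sample: average squared deviation, dividing by n.
sigma2_hat = np.var(x, axis=1)

print(sigma2_hat.mean())   # close to (n-1)/n * sigma^2 = 0.8, not 1.0
```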
The Sample Variance
Looking at Example 38.1, the bias of \(\hat\sigma^2\) is easy to correct. We simply multiply by \(\frac{n}{n-1}\):
\[ \text{E}\!\left[ \frac{n}{n-1} \hat\sigma^2 \right] = \frac{n}{n-1} \text{E}\!\left[ \hat\sigma^2 \right] = \frac{n}{n-1} \frac{n-1}{n}\sigma^2 = \sigma^2. \]
This unbiased estimator for \(\sigma^2\) is called the sample variance.
Definition 38.1 (Sample variance) The sample variance, denoted \(S^2\), is defined as \[
S^2 = \frac{n}{n-1} \hat\sigma^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2.
\tag{38.4}\]
The only difference is that we divide by \(n-1\) instead of \(n\), but this is enough to make the estimator unbiased while preserving consistency.
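In NumPy, both estimators are a single call to np.var: the default ddof=0 divides by \(n\) and gives \(\hat\sigma^2\), while ddof=1 divides by \(n-1\) and gives the sample variance \(S^2\). A minimal sketch with made-up data:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # made-up sample, n = 8

sigma2_hat = np.var(x)           # divides by n     -> 4.0
s2 = np.var(x, ddof=1)           # divides by n - 1 -> 32/7 (about 4.571)

print(sigma2_hat, s2)
print(np.isclose(s2, len(x) / (len(x) - 1) * sigma2_hat))   # True: S^2 = n/(n-1) * sigma2_hat
```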
Example 38.2 (Consistency and bias of \(S^2\)) We constructed \(S^2\) so that it would be unbiased for \(\sigma^2\).
As for consistency, since \(\hat\sigma^2\stackrel{p}{\to} \sigma^2\) (from Example 38.1) and \[ S^2 = \frac{n}{n-1} \hat\sigma^2, \] we have that \[ S^2 \stackrel{p}{\to} 1 \cdot \sigma^2 = \sigma^2, \] since \(\frac{n}{n-1} \to 1\). So \(S^2\) is also consistent for \(\sigma^2\).
Example 38.2 reminds us that consistency is quite a common property. In general, there will be many consistent estimators, not all of them good.
Comparing the MSEs
It might seem that Example 38.2 is the end of the story; we should always use \(S^2\) because it is both unbiased and consistent, whereas \(\hat\sigma^2\) is only consistent. However, we saw in Example 31.4 that a biased estimator can sometimes have a lower MSE.
Example 38.3 (MSEs of variance estimators) The sample variance \(S^2\) is unbiased, so its MSE is simply its variance: \[ \text{MSE}\!\left[ S^2 \right] = \text{Var}\!\left[ S^2 \right]. \]
On the other hand, \(\hat\sigma^2 = \frac{n-1}{n} S^2\) is biased, so its MSE is \[\begin{align}
\text{MSE}\!\left[ \hat\sigma^2 \right] &= (\text{E}\!\left[ \hat\sigma^2 \right] - \sigma^2)^2 + \text{Var}\!\left[ \hat\sigma^2 \right] \\
&= \left(\frac{n-1}{n} \sigma^2 - \sigma^2 \right)^2 + \text{Var}\!\left[ \frac{n-1}{n} S^2 \right] \\
&= \frac{\sigma^4}{n^2} + \left(\frac{n-1}{n} \right)^2 \text{Var}\!\left[ S^2 \right] \\
&= \frac{\sigma^4}{n^2} + \left(\frac{n-1}{n} \right)^2 \text{MSE}\!\left[ S^2 \right].
\end{align}\]
The factor of \(\left(\frac{n-1}{n} \right)^2\) shrinks the MSE, but it is counteracted by the addition of \(\frac{\sigma^4}{n^2}\). Depending on the exact distribution of \(X_1, \dots, X_n\), either \(S^2\) or \(\hat\sigma^2\) could have the lower MSE.
For example, we will show later in Example 45.3 that if \(X_1, \dots, X_n\) are i.i.d. normal, \[ \text{Var}\!\left[ S^2 \right] = \frac{2\sigma^4}{n - 1}. \] In that case, \[\begin{align}
\text{MSE}\!\left[ \hat\sigma^2 \right] &= \frac{\sigma^4}{n^2} + \left(\frac{n-1}{n} \right)^2 \frac{2\sigma^4}{n - 1} \\
&= \frac{2\sigma^4}{n} - \frac{\sigma^4}{n^2},
\end{align}\] which is less than \(\text{MSE}\!\left[ S^2 \right] = \frac{2\sigma^4}{n - 1}\), so it is possible for the biased estimator to have the lower MSE.
Estimating the Standard Deviation
We often want to know the standard deviation \(\sigma\) instead of the variance \(\sigma^2\). Since \(\sigma = \sqrt{\sigma^2}\), a natural estimator for \(\sigma\) is \(S = \sqrt{S^2}\). What are the properties of \(S\), the sample standard deviation?
Example 38.4 (Consistency and bias of \(S\)) Since \(g(x) = \sqrt{x}\) is a continuous function, we know by the Continuous Mapping Theorem (Theorem 37.2) that \[ S = g(S^2) \stackrel{p}{\to} g(\sigma^2) = \sigma, \] so \(S\) is consistent for the standard deviation.
However, it is biased. Since \(g(x) = \sqrt{x}\) is strictly concave, \(-g(x)\) is strictly convex, so by Jensen’s inequality, \[ \text{E}\!\left[ -g(S^2) \right] > -g(\text{E}\!\left[ S^2 \right]), \] or in other words, \[ \text{E}\!\left[ \sqrt{S^2} \right] < \sqrt{\text{E}\!\left[ S^2 \right]} \] (assuming \(\sigma^2 > 0\), so that \(S^2\) is not constant). But the left-hand side is \(\text{E}\!\left[ S \right]\) and the right-hand side is \(\sqrt{\text{E}\!\left[ S^2 \right]} = \sqrt{\sigma^2} = \sigma\), so \(S\) has negative bias and tends to underestimate \(\sigma\).
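The negative bias of \(S\) is visible in a small simulation. This is a minimal sketch using standard normal data with \(\sigma = 1\) and \(n = 5\); the seed and number of replications are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 5, 1.0

x = rng.normal(scale=sigma, size=(200_000, n))
s = np.std(x, axis=1, ddof=1)    # sample standard deviation S for each sample

print(s.mean())   # about 0.94, noticeably below sigma = 1
```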