38  Estimating Variances

We have seen that when we have \(X_1, \dots, X_n\) i.i.d. from any distribution, a general estimator for \(\mu = \text{E}\!\left[ X_1 \right]\) is \(\bar X\). Although not always optimal, \(\bar X\) is unbiased and consistent for \(\mu\).

Is there a similar estimator for \(\sigma^2 = \text{Var}\!\left[ X_1 \right]\)? We explore this question in this chapter.

38.1 A Preliminary Estimator

Since \(\text{Var}\!\left[ X_1 \right] \overset{\text{def}}{=}\text{E}\!\left[ (X_1 - \mu)^2 \right]\), one idea is to replace all of the expectations by their sample averages. That is, we plug in \(\bar X\) for \(\mu\) and estimate the variance as

\[ \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2. \tag{38.1}\] Is this a good estimator? The following lemma will help.

Lemma 38.1 (A useful decomposition) Let \(a_1, \dots, a_n\) be any real numbers, and define \(\bar a \overset{\text{def}}{=}\frac{1}{n}\sum_{i=1}^n a_i\). Then, for any real number \(c\),

\[ \frac{1}{n} \sum_{i=1}^n (a_i - c)^2 = \frac{1}{n} \sum_{i=1}^n (a_i - \bar a)^2 + (\bar a - c)^2. \tag{38.2}\]

Proof

Equation 38.2 can be proven using algebra, but here is a slick way that uses Theorem 31.1.

Define \(\hat\theta\) to be a random variable that takes on the values \(\{ a_1, \dots, a_n \}\), each with probability \(1/n\). Then \(\text{E}\!\left[ \hat\theta \right] = \bar a\) and \(\text{Var}\!\left[ \hat\theta \right] = \frac{1}{n} \sum_{i=1}^n (a_i - \bar a)^2\).

Now, by the bias-variance decomposition, we know that \[\begin{align} \frac{1}{n} \sum_{i=1}^n (a_i - c)^2 &= \text{E}\!\left[ (\hat\theta - c)^2 \right] \\ &= \text{Var}\!\left[ \hat\theta \right] + (\text{E}\!\left[ \hat\theta \right] - c)^2 \\ &= \frac{1}{n} \sum_{i=1}^n (a_i - \bar a)^2 + (\bar a - c)^2, \end{align}\] as we wanted to show.

An insightful consequence of Lemma 38.1 is that the mean \(\bar a\) is the value of \(c\) that is “closest” to the values \(a_1, \dots, a_n\). That is, \(\bar a\) minimizes the sum of squared deviations \[\frac{1}{n} \sum_{i=1}^n (a_i - c)^2,\] since if we look at the right-hand side of Equation 38.2, the first term does not depend on \(c\), and the second term is minimized by choosing \(c = \bar a\).
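Lemma 38.1 is also easy to check numerically. The sketch below (in Python, with arbitrary numbers chosen purely for illustration) evaluates both sides of Equation 38.2 for a few values of \(c\); the two sides agree, and the common value is smallest at \(c = \bar a\).

```python
import numpy as np

# Arbitrary numbers chosen purely for illustration.
a = np.array([1.0, 4.0, 4.0, 7.0, 9.0])
a_bar = a.mean()

def mean_sq_dev(c):
    """(1/n) * sum of (a_i - c)^2."""
    return np.mean((a - c) ** 2)

# Equation 38.2: the two sides agree for every c,
# and the left-hand side is smallest when c = a_bar.
for c in [0.0, a_bar, 10.0]:
    lhs = mean_sq_dev(c)
    rhs = mean_sq_dev(a_bar) + (a_bar - c) ** 2
    print(f"c = {c}: lhs = {lhs:.4f}, rhs = {rhs:.4f}")
```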

Now we apply Lemma 38.1 to the study of the variance estimator \(\hat\sigma^2\) (Equation 38.1). If we let \(a_i = X_i\) and \(c = \mu\), then Lemma 38.1 says

\[ \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 = \underbrace{\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2}_{\hat\sigma^2} + (\bar X - \mu)^2. \] The first term on the right-hand side is the variance estimator. So we can express the estimator alternatively as \[ \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 - (\bar X - \mu)^2. \tag{38.3}\] In this form, it is easy to assess the consistency and bias of \(\hat\sigma^2\).

Example 38.1 (Consistency and bias of \(\hat\sigma^2\)) The first term on the right-hand side of Equation 38.3 converges in probability to \(\text{E}\!\left[ (X_i - \mu)^2 \right] = \text{Var}\!\left[ X_i \right] = \sigma^2\) by the Law of Large Numbers. The second term converges in probability to \(0\), since \(\bar X \stackrel{p}{\to} \mu\) by the Law of Large Numbers, and therefore \((\bar X - \mu)^2 \stackrel{p}{\to} 0\) by the Continuous Mapping Theorem. Therefore, \(\hat\sigma^2\) converges in probability to \(\sigma^2\), and \(\hat\sigma^2\) is consistent.

On the other hand, it is biased. To show this, we calculate its expectation: \[\begin{align} \text{E}\!\left[ \hat\sigma^2 \right] &= \text{E}\!\left[ \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \right] - \text{E}\!\left[ (\bar X - \mu)^2 \right] \\ &= \text{E}\!\left[ (X_i - \mu)^2 \right] - \text{E}\!\left[ (\bar X - \mu)^2 \right] \\ &= \text{Var}\!\left[ X_i \right] - \text{Var}\!\left[ \bar X \right] \\ &= \sigma^2 - \frac{\sigma^2}{n} \\ &= \frac{n-1}{n} \sigma^2, \end{align}\] which is close to, but not quite equal to, \(\sigma^2\).
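This bias is visible in simulation. Here is a minimal sketch (assuming, purely for illustration, normal data with \(n = 10\) and \(\sigma^2 = 4\)) that averages \(\hat\sigma^2\) over many samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 10, 4.0, 100_000    # illustrative choices

# Each row is one i.i.d. sample of size n.
samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

# np.var divides by n by default, so this is sigma^2-hat from Equation 38.1.
sigma2_hat = np.var(samples, axis=1)

print(sigma2_hat.mean())        # close to (n-1)/n * sigma2 = 3.6, not 4.0
print((n - 1) / n * sigma2)     # 3.6
```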

38.2 The Sample Variance

Looking at Example 38.1, the bias of \(\hat\sigma^2\) is easy to correct. We simply multiply by \(\frac{n}{n-1}\):

\[ \text{E}\!\left[ \frac{n}{n-1} \hat\sigma^2 \right] = \frac{n}{n-1} \text{E}\!\left[ \hat\sigma^2 \right] = \frac{n}{n-1} \frac{n-1}{n}\sigma^2 = \sigma^2. \]

This unbiased estimator for \(\sigma^2\) is called the sample variance.

Definition 38.1 (Sample variance) The sample variance, denoted \(S^2\), is defined as \[ S^2 = \frac{n}{n-1} \hat\sigma^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2. \tag{38.4}\]

The only difference is that we divide by \(n-1\) instead of \(n\), but this is enough to make the estimator unbiased while preserving consistency.
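In NumPy, this divisor is controlled by the `ddof` argument: `np.var(x)` divides by \(n\) (giving \(\hat\sigma^2\)), while `np.var(x, ddof=1)` divides by \(n - 1\) (giving \(S^2\)). A minimal sketch, using an arbitrary simulated sample for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)                            # arbitrary sample for illustration

s2 = np.sum((x - x.mean()) ** 2) / (len(x) - 1)    # S^2 from Equation 38.4
print(s2)
print(np.var(x, ddof=1))                           # same value: divisor is n - ddof = n - 1
print(np.var(x))                                   # sigma^2-hat: divisor is n
```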

Example 38.2 (Consistency and bias of \(S^2\)) We constructed \(S^2\) so that it would be unbiased for \(\sigma^2\).

As for consistency, since \(\hat\sigma^2\stackrel{p}{\to} \sigma^2\) (from Example 38.1) and \[ S^2 = \frac{n}{n-1} \hat\sigma^2, \] we have that \[ S^2 \stackrel{p}{\to} 1 \cdot \sigma^2 = \sigma^2, \] since \(\frac{n}{n-1} \to 1\). So \(S^2\) is also consistent for \(\sigma^2\).
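The convergence is easy to see in a quick simulation. The sketch below (with an arbitrarily chosen true variance of \(4\)) computes \(\hat\sigma^2\) and \(S^2\) on single samples of increasing size; both settle down near \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0                                  # illustrative true variance

for n in [10, 100, 10_000, 1_000_000]:
    x = rng.normal(0.0, np.sqrt(sigma2), size=n)
    sigma2_hat = np.var(x)                    # divides by n
    s2 = np.var(x, ddof=1)                    # divides by n - 1
    print(n, round(sigma2_hat, 3), round(s2, 3))
```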

Example 38.2 reminds us that consistency is quite a common property. In general, there will be many consistent estimators, not all of them good.

38.3 Comparing the MSEs

It might seem that Example 38.2 is the end of the story; we should always use \(S^2\) because it is both unbiased and consistent, whereas \(\hat\sigma^2\) is only consistent. However, we saw in Example 31.4 that a biased estimator can sometimes have a lower MSE.

Example 38.3 (MSEs of variance estimators) The sample variance \(S^2\) is unbiased, so its MSE is simply its variance: \[ \text{MSE}\!\left[ S^2 \right] = \text{Var}\!\left[ S^2 \right]. \]

On the other hand, \(\hat\sigma^2\) is biased, so its MSE is \[\begin{align} \text{MSE}\!\left[ \hat\sigma^2 \right] &= (\text{E}\!\left[ \hat\sigma^2 \right] - \sigma^2)^2 + \text{Var}\!\left[ \hat\sigma^2 \right] \\ &= \left(\frac{n-1}{n} \sigma^2 - \sigma^2 \right)^2 + \text{Var}\!\left[ \frac{n-1}{n} S^2 \right] \\ &= \frac{\sigma^4}{n^2} + \left(\frac{n-1}{n} \right)^2 \text{Var}\!\left[ S^2 \right] \\ &= \frac{\sigma^4}{n^2} + \left(\frac{n-1}{n} \right)^2 \text{MSE}\!\left[ S^2 \right]. \end{align}\]

The factor of \(\left(\frac{n-1}{n} \right)^2\) shrinks the MSE, but it is counteracted by the addition of \(\frac{\sigma^4}{n^2}\). Depending on the exact distribution of \(X_1, \dots, X_n\), either \(S^2\) or \(\hat\sigma^2\) could have the lower MSE.

For example, we will show later in Example 45.3 that if \(X_1, \dots, X_n\) are i.i.d. normal, \[ \text{Var}\!\left[ S^2 \right] = \frac{2\sigma^4}{n - 1}. \] In that case, \[\begin{align} \text{MSE}\!\left[ \hat\sigma^2 \right] &= \frac{\sigma^4}{n^2} + \left(\frac{n-1}{n} \right)^2 \frac{2\sigma^4}{n - 1} \\ &= \frac{2\sigma^4}{n} - \frac{\sigma^4}{n^2}, \end{align}\] which is less than \(\text{MSE}\!\left[ S^2 \right]\), so it is possible for the biased estimator to have the lower MSE.
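We can check these formulas by simulation. The sketch below (assuming normal data with \(n = 10\) and \(\sigma^2 = 4\), so that \(\text{MSE}\!\left[ S^2 \right] = 2\sigma^4/(n-1) \approx 3.56\) and \(\text{MSE}\!\left[ \hat\sigma^2 \right] = 2\sigma^4/n - \sigma^4/n^2 = 3.04\)) estimates both MSEs empirically.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 10, 4.0, 200_000            # illustrative choices

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2 = np.var(samples, axis=1, ddof=1)          # S^2 (unbiased)
sigma2_hat = np.var(samples, axis=1)          # sigma^2-hat (biased)

print(np.mean((s2 - sigma2) ** 2))            # ~ 2*sigma2**2/(n-1) = 3.56
print(np.mean((sigma2_hat - sigma2) ** 2))    # ~ 2*sigma2**2/n - sigma2**2/n**2 = 3.04
```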

38.4 Estimating the Standard Deviation

We often want to know the standard deviation \(\sigma\) instead of the variance \(\sigma^2\). Since \(\sigma = \sqrt{\sigma^2}\), a natural estimator for \(\sigma\) is \(S = \sqrt{S^2}\). What are the properties of \(S\), the sample standard deviation?

Example 38.4 (Consistency and bias of \(S\)) Since \(g(x) = \sqrt{x}\) is a continuous function, we know by the Continuous Mapping Theorem (Theorem 37.2) that \[ S = g(S^2) \stackrel{p}{\to} g(\sigma^2) = \sigma, \] so \(S\) is consistent for the standard deviation.

However, it is biased. Since \(g(x) = \sqrt{x}\) is a concave function, \(-g(x)\) is a convex function. Because \(g(x)\) is not linear (and \(S^2\) is not constant), Jensen’s inequality is strict: \[ \text{E}\!\left[ -g(S^2) \right] > -g(\text{E}\!\left[ S^2 \right]), \] or in other words, \[ \text{E}\!\left[ \sqrt{S^2} \right] < \sqrt{\text{E}\!\left[ S^2 \right]}. \] But the left-hand side is \(\text{E}\!\left[ S \right]\), and the right-hand side is \(\sqrt{\text{E}\!\left[ S^2 \right]} = \sqrt{\sigma^2} = \sigma\) because \(S^2\) is unbiased. So \(S\) has negative bias and tends to underestimate \(\sigma\).
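A small simulation makes the underestimation visible. The sketch below (assuming normal data with \(n = 5\) and \(\sigma = 2\); the small sample size makes the bias easier to see) averages \(S\) over many samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, reps = 5, 2.0, 200_000              # illustrative choices

samples = rng.normal(0.0, sigma, size=(reps, n))
s = np.std(samples, axis=1, ddof=1)           # sample standard deviation S

print(s.mean())                               # noticeably below sigma = 2.0
```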