31  Comparing Estimators: Variance and Mean Squared Error

In the last chapter we left off with two unbiased estimators for the German tank problem, not knowing which one to use. In this chapter, we learn about mean squared error, a criterion that helps us determine which of two estimators is better. As we will see, mean squared error depends not only on an estimator's bias but also on its variance.

31.1 Variance of an Estimator

In the German tank problem we observed \(n\) samples \(X_1, \dots, X_n\) taken without replacement from the set \(\{1, \dots, N\}\) and wanted to estimate \(N\). We started off with the maximum likelihood estimate \(\hat{N}_{\textrm{MLE}}\), which was biased and did not appear to perform too well. In response, we came up with two unbiased estimators of \(N\):

  1. \(\hat{N}_{\textrm{MLE}+}=\frac{n+1}{n}\max(X_1, \dots, X_n)-1\)

  2. \(\hat{N}_{avg} = 2\bar{X} - 1\)

To get an idea of which of these two estimators is better, we can run simulations and see which estimator tends to end up closer to the parameter \(N\).

Example 31.1 (Comparing unbiased German tank estimators via simulation) Fixing \(N=270\) and \(n=10\), we run \(B=1000\) simulations where we take \(n\) samples \(X_1, \dots, X_n\) without replacement from \(\{1, \dots, N\}\) and compute \(\hat{N}_{\textrm{MLE}+}\) and \(\hat{N}_{avg}\) from above. We then plot the results.
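A minimal version of this simulation in Python might look like the following sketch (NumPy and Matplotlib are assumed; the random seed and variable names are our own choices).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
N, n, B = 270, 10, 1000

n_mle_plus = np.empty(B)
n_avg = np.empty(B)
for b in range(B):
    # draw n serial numbers without replacement from {1, ..., N}
    x = rng.choice(np.arange(1, N + 1), size=n, replace=False)
    n_mle_plus[b] = (n + 1) / n * x.max() - 1
    n_avg[b] = 2 * x.mean() - 1

# compare the two sampling distributions
plt.hist(n_mle_plus, bins=30, alpha=0.5, label=r"$\hat{N}_{\mathrm{MLE}+}$")
plt.hist(n_avg, bins=30, alpha=0.5, label=r"$\hat{N}_{avg}$")
plt.axvline(N, color="black", linestyle="--", label="true $N$")
plt.legend()
plt.show()
```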

The plot suggests that \(\hat{N}_{\textrm{MLE}+}\) provides a better estimate of \(N\) than \(\hat{N}_{avg}\). Roughly speaking, both estimators’ distributions are centered around the parameter \(N\) (although the distribution of \(\hat{N}_{\textrm{MLE}+}\) is not symmetric). This should come as no surprise, as both estimators are unbiased. The distribution of \(\hat{N}_{avg}\), however, is much more spread out. This means that, if we repeatedly draw \(n\) samples without replacement from \(\{1, \dots, N\}\) and compute our two estimators, \(\hat{N}_{avg}\) will be farther from \(N\) more often than \(\hat{N}_{\textrm{MLE}+}\).

In Example 31.1, both \(\hat{N}_{\textrm{MLE}+}\) and \(\hat{N}_{avg}\) have distributions that are centered in the right place, but \(\hat{N}_{\textrm{MLE}+}\) seems to be less variable and therefore a better estimate. Naturally, we can measure the variability of an estimator by its variance (recall Definition 11.1 and Definition 21.1).

Definition 31.1 (Variance of an estimator) The variance of an estimator \(\hat\theta\) for estimating a parameter \(\theta\) is \(\text{Var}\!\left[ \hat{\theta} \right].\)

Figure 31.1 illustrates how variance, along with bias, contributes to estimation error. If an estimator is biased, then its distribution is essentially centered in the wrong place. On average, a biased estimator takes a value that is different from the parameter it is trying to estimate. If an estimator is highly variable, then its value will fluctuate significantly when we re-draw the data. Thus, even if the estimator is unbiased and its distribution is centered in the right place, the particular dataset we drew may just happen to result in an estimate that is far from the parameter. Ideally, a good estimator should have both low bias and low variance.

Figure 31.1: Illustration of what the distribution of an estimator \(\hat{\theta}\) for \(\theta\) that has low/high bias and low/high variance may look like.

Example 31.2 explicitly computes the variance of our unbiased estimators \(\hat{N}_{\textrm{MLE}+}\) and \(\hat N_{avg}\) from the German tank problem. The estimator \(\hat{N}_{\textrm{MLE}+}\) indeed has lower variance, as our simulations suggested.

Example 31.2 (Variance of German tank estimators) Again suppose we take \(n\) samples \(X_1, \dots, X_n\) without replacement from \(\{1, \dots, N\}\) and consider the estimators \(\hat{N}_{\textrm{MLE}+}\) and \(\hat N_{avg}\) from above. Computing the variance of these estimators is a bit involved, so we omit the derivations here. Both variances, however, can be written in closed form.

\[ \text{Var}\!\left[ \hat{N}_{\textrm{MLE}+} \right] = \frac{1}{n} \cdot \frac{(N - n)(N + 1)}{n+2} \]


\[ \text{Var}\!\left[ \hat{N}_{avg} \right] = \frac{1}{n} \cdot \frac{(N - n)(N + 1)}{3} \]


Comparing the two variances, their ratio is \(\text{Var}\!\left[ \hat{N}_{avg} \right] / \text{Var}\!\left[ \hat{N}_{\textrm{MLE}+} \right] = (n+2)/3\). The two variances are therefore equal when \(n=1\), and the variance of \(\hat N_{\textrm{MLE}+}\) is strictly lower when \(n > 1\). In fact, it can be substantially lower when \(n\) is large.
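As a sanity check, the closed-form variances can be compared against simulated ones. Below is a short Python sketch of our own, reusing the simulation setup from Example 31.1; the helper variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, B = 270, 10, 100_000

n_mle_plus = np.empty(B)
n_avg = np.empty(B)
for b in range(B):
    x = rng.choice(np.arange(1, N + 1), size=n, replace=False)
    n_mle_plus[b] = (n + 1) / n * x.max() - 1
    n_avg[b] = 2 * x.mean() - 1

# closed-form variances from above
var_mle_plus = (N - n) * (N + 1) / (n * (n + 2))
var_avg = (N - n) * (N + 1) / (3 * n)

print(n_mle_plus.var(), var_mle_plus)  # simulated vs. closed-form variance of N_MLE+
print(n_avg.var(), var_avg)            # simulated vs. closed-form variance of N_avg
```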

Of the two unbiased estimators, \(\hat{N}_{\textrm{MLE}+}\) has lower variance, making it seem like the more favorable choice. But if our goal were to come up with low-variance estimators, should we ever have discarded our earlier biased maximum likelihood estimate \(\hat{N}_{\textrm{MLE}}\)? A quick computation shows that \(\hat{N}_{\textrm{MLE}}\) has strictly lower variance than \(\hat{N}_{\textrm{MLE}+}\): \[\begin{align*} \text{Var}\!\left[ \hat{N}_{\textrm{MLE}+} \right] &= \text{Var}\!\left[ \frac{n+1}{n}\max(X_1, \dots, X_n)-1 \right] \\ &= \left(\frac{n+1}{n}\right)^2 \text{Var}\!\left[ \max(X_1, \dots, X_n) \right] \\ &> \text{Var}\!\left[ \max(X_1, \dots, X_n) \right] \\ &= \text{Var}\!\left[ \hat{N}_{\textrm{MLE}} \right]. \end{align*}\] Yet, in simulations, \(\hat{N}_{\textrm{MLE}+}\) appears to significantly outperform \(\hat{N}_{\textrm{MLE}}\). How can we argue that the decrease in bias that we get from \(\hat{N}_{\textrm{MLE}+}\) is worth the increase in variance?

31.2 Mean Squared Error and the Bias-Variance Trade-off

Mean squared error, the subject of our next definition, is a natural criterion for comparing two estimators. It allows us to formalize how bias and variance both contribute to estimation error.

Definition 31.2 (Mean squared error) The mean squared error of an estimator \(\hat\theta\) for estimating a parameter \(\theta\), given by \[ \text{E}\!\left[ (\hat{\theta} - \theta)^2 \right],\] is the expected squared distance between \(\hat{\theta}\) and \(\theta\).

In essence, the MSE of \(\hat{\theta}\) quantifies how far \(\hat{\theta}\) is from \(\theta\) on average. Naturally, this makes it intuitive to label the estimator that minimizes MSE as the “best” one. Depending on the context, however, criteria other than MSE may be more appropriate for evaluating an estimator’s performance. For example, if your primary concern is ensuring that the estimator \(\hat{\theta}\) is often within \(\pm \delta\) of the parameter \(\theta\) (but you are not concerned with precisely how close it is), you may instead look for the estimator \(\hat{\theta}\) that minimizes \(P(|\hat{\theta} - \theta| > \delta)\). Although there are many criteria we can use to evaluate estimators, MSE remains the most popular, in large part because it is easy to interpret and compute.

Theorem 31.1 tells us that the MSE of an estimator is determined by exactly two things: its bias and its variance. Since an estimator’s bias and variance are often straightforward to calculate, Theorem 31.1 provides a convenient way to compute an estimator’s MSE.

Theorem 31.1 (The bias-variance trade-off) The mean squared error of an estimator \(\hat\theta\) of \(\theta\) is equal to the estimator’s bias squared plus the estimator’s variance:

\[ \text{E}\!\left[ (\hat{\theta} - \theta)^2 \right] = \underbrace{(\text{E}\!\left[ \hat{\theta} \right] - \theta)^2}_{\text{Bias}^2} + \underbrace{\text{Var}\!\left[ \hat{\theta} \right]}_{\text{Variance}}. \]

Proof

\[\begin{align*} &\text{E}\!\left[ (\hat{\theta} - \theta)^2 \right] & \\ &= \text{E}\!\left[ ((\hat{\theta} - \text{E}\!\left[ \hat{\theta} \right]) + (\text{E}\!\left[ \hat{\theta} \right] - \theta))^2 \right] & \text{(add and subtract $\text{E}\!\left[ \hat{\theta} \right] $)}\\ &= \text{E}\!\left[ (\hat{\theta} - \text{E}\!\left[ \hat{\theta} \right])^2 + 2(\hat{\theta} - \text{E}\!\left[ \hat{\theta} \right])(\text{E}\!\left[ \hat{\theta} \right] - \theta) + (\text{E}\!\left[ \hat{\theta} \right] - \theta)^2 \right] & \text{(expand the square)} \\ &= \text{E}\!\left[ (\hat{\theta} - \text{E}\!\left[ \hat{\theta} \right])^2 \right] + 2 \text{E}\!\left[ (\hat{\theta} - \text{E}\!\left[ \hat{\theta} \right])\underbrace{(\text{E}\!\left[ \hat{\theta} \right] - \theta)}_{\text{constant}} \right] + \text{E}\!\left[ (\text{E}\!\left[ \hat{\theta} \right] - \theta)^2 \right] & \text{(linearity of expectation)} \\ &= \text{E}\!\left[ (\hat{\theta} - \text{E}\!\left[ \hat{\theta} \right])^2 \right] + 2 (\text{E}\!\left[ \hat{\theta} \right] - \theta) \text{E}\!\left[ \hat{\theta} - \text{E}\!\left[ \hat{\theta} \right] \right] + \text{E}\!\left[ (\text{E}\!\left[ \hat{\theta} \right] - \theta)^2 \right] & \text{(linearity of expectation)} \\ &= \text{E}\!\left[ (\hat{\theta} - \text{E}\!\left[ \hat{\theta} \right])^2 \right] + 2 (\text{E}\!\left[ \hat{\theta} \right] - \theta) \underbrace{(\text{E}\!\left[ \hat{\theta} \right] - \text{E}\!\left[ \hat{\theta} \right]) }_{=0} + \text{E}\!\left[ (\text{E}\!\left[ \hat{\theta} \right] - \theta)^2 \right] & \text{(linearity of expectation)} \\ &= \text{Var}\!\left[ \hat{\theta} \right] + (\text{E}\!\left[ \hat{\theta} \right] - \theta)^2 & \text{(definition of variance)} \end{align*}\]

Theorem 31.1 formalizes the intuition we’ve been building over the last two chapters. It clearly illustrates why we aim to design estimators with both low bias and low variance: if either the bias or the variance of an estimator is large, then the estimator’s MSE will be large as well. Note that if an estimator is unbiased, then its MSE is exactly equal to its variance. Thus, among unbiased estimators, the best estimator is the one with minimum variance.
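To make the decomposition concrete, here is a small Python check of our own (not from the text) that estimates the MSE, squared bias, and variance of the biased estimator \(\hat{N}_{\textrm{MLE}}\) by simulation and confirms that the first equals the sum of the latter two.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, B = 270, 10, 100_000

# sampling distribution of the plain MLE (the sample maximum)
n_mle = np.array([
    rng.choice(np.arange(1, N + 1), size=n, replace=False).max()
    for _ in range(B)
])

mse = np.mean((n_mle - N) ** 2)
bias_sq = (n_mle.mean() - N) ** 2
variance = n_mle.var()

print(mse, bias_sq + variance)  # identical up to floating-point error
```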

For reasons we began to see in the previous section, the relationship between bias and variance is often referred to as the bias-variance trade-off. Not infrequently, decreasing bias comes at the cost of increasing variance, or vice versa. This is exactly where we left off with the German tank problem. Is the reduced bias of \(\hat{N}_{\textrm{MLE}+}\) worth the increased variance? We investigate this question in Example 31.3.

Example 31.3 (Comparing German tank estimators via MSE) Again suppose we take \(n\) samples \(X_1, \dots, X_n\) without replacement from \(\{1, \dots, N\}\) and consider the estimators \(\hat{N}_{\textrm{MLE}}\) (Equation 30.3), \(\hat{N}_{\textrm{MLE}+}\), and \(\hat{N}_{avg}\) of \(N\) from above. Which of these three estimators has the lowest MSE? To simplify the analysis, we will assume that \(N \geq 5\) and that we draw \(n \geq 5\) samples.

We know from Chapter 30 that both \(\hat{N}_{\textrm{MLE}+}\) and \(\hat{N}_{avg}\) are unbiased. By Theorem 31.1, the MSE of each of these estimators is exactly its variance. Since we already learned in Example 31.2 that \(\hat{N}_{\textrm{MLE}+}\) has strictly lower variance than \(\hat{N}_{avg}\), we know that \(\hat{N}_{\textrm{MLE}+}\) has strictly lower MSE.

Considering \(\hat{N}_{\textrm{MLE}}\), we already computed in Example 30.1 that the bias of \(\hat{N}_{\textrm{MLE}}\) is \((n-N)/(n+1)\), and we noted earlier that \[ \text{Var}\!\left[ \hat{N}_{\textrm{MLE}} \right] = \left(\frac{n}{n+1}\right)^2 \text{Var}\!\left[ \hat{N}_{\textrm{MLE}+} \right]. \] Applying Theorem 31.1, we can compute the MSE of \(\hat{N}_{\textrm{MLE}}\) in a way that allows us to directly compare it to that of \(\hat{N}_{\textrm{MLE}+}\):

\[\begin{align*} \text{E}\!\left[ (\hat{N}_{\textrm{MLE}} - N)^2 \right] &= \left( \frac{N-n}{n + 1}\right)^2 + \left( \frac{n}{n+1}\right)^2 \text{Var}\!\left[ \hat{N}_{\textrm{MLE}+} \right] \\ \end{align*}\]

Thus \(\hat{N}_{\textrm{MLE}+}\) results in a lower MSE exactly when \[ \text{E}\!\left[ (\hat{N}_{\textrm{MLE}} - N)^2 \right] - \text{E}\!\left[ (\hat{N}_{\textrm{MLE}+} - N)^2 \right] = \left( \frac{N-n}{n + 1}\right)^2 + \left( \left(\frac{n}{n+1}\right)^2 - 1\right)\text{Var}\!\left[ \hat{N}_{\textrm{MLE}+} \right] \] is positive.

Checking when this condition holds requires quite a bit of algebra, but it turns out to be true whenever \(N > n + 2\). That means that unless we sample the whole population, all but one of the population, or all but two of the population, \(\hat{N}_{\textrm{MLE}+}\) will have smaller MSE than \(\hat{N}_{\textrm{MLE}}\).
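The algebra can also be spot-checked numerically. The short Python sketch below (our own check, using the closed-form expressions above) evaluates both MSEs over a grid of \(N\) and \(n \geq 5\) and verifies that \(\hat{N}_{\textrm{MLE}+}\) has lower MSE exactly when \(N > n + 2\).

```python
import numpy as np

def mse_mle_plus(N, n):
    # unbiased, so MSE = variance
    return (N - n) * (N + 1) / (n * (n + 2))

def mse_mle(N, n):
    bias_sq = ((N - n) / (n + 1)) ** 2
    var = (n / (n + 1)) ** 2 * mse_mle_plus(N, n)
    return bias_sq + var

for n in range(5, 51):
    for N in range(n, 2000):
        plus_wins = mse_mle_plus(N, n) < mse_mle(N, n)
        assert plus_wins == (N > n + 2)
print("condition N > n + 2 verified on the grid")
```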

In closing, we mention that it makes sense that \(\hat{N}_{\textrm{MLE}}\) cannot be beaten when \(n\) is very close to \(N\). If \(n = N\) exactly, then \(\hat{N}_{\textrm{MLE}}\) will be exactly \(N\). Similarly, if \(n\) is very close to \(N\), then it is highly likely that our largest sample is indeed \(N\), or not far from it, and \(\hat{N}_{\textrm{MLE}}\) will be a very good estimator of \(N\).

In the German tank problem, our de-biased estimator had a lower MSE than our original estimator in almost every setting. It is not always the case, however, that de-biasing an estimator results in a new estimator that typically performs better than the original. As Example 31.4 illustrates, de-biasing an estimator can sometimes result in a big increase in variance relative to the decrease in bias, so much so that the new estimator always has higher MSE than the original.

Example 31.4 (A better but biased estimator) Suppose we observe \(n\) i.i.d samples \(X_1, \dots, X_n\) that each have an \(\text{Exponential}(\lambda)\) distribution (Definition 22.2) and we want to estimate the mean \(\mu = 1/\lambda\) of this distribution. Because \(\text{E}\!\left[ X_1 \right] = \mu\), a natural estimator is given by the sample mean (Equation 30.5) \(\hat{\mu} = \bar{X}\). By Proposition 30.1, this estimator is unbiased. By Theorem 31.1, its MSE is therefore given by its variance. Because the \(X_i\) are independent, the variance of their sum is the sum of their variances, and we can compute the MSE to be \[ \text{Var}\!\left[ \bar{X} \right] = \text{Var}\!\left[ \frac{1}{n}\sum_{i=1}^n X_i \right] = \frac{1}{n^2}\text{Var}\!\left[ \sum_{i=1}^n X_i \right] = \frac{1}{n^2}\sum_{i=1}^n \text{Var}\!\left[ X_i \right] = \frac{1}{n^2} n \text{Var}\!\left[ X_1 \right] = \frac{\text{Var}\!\left[ X_1 \right]}{n} = \frac{\mu^2}{n}. \] In comparison, consider the biased estimator \[ \hat{\mu}_{\text{biased}} = \frac{n}{n+1} \hat{\mu}, \] which is just our unbiased estimator \(\hat{\mu}\) scaled by \(n/(n+1)\). We can compute this estimator’s squared bias

\[\begin{align*} (\text{E}\!\left[ \hat{\mu}_{\text{biased}} \right] - \mu)^2 &= (\text{E}\!\left[ \frac{n}{n+1} \hat{\mu} \right] - \mu)^2 & \\ &= (\frac{n}{n+1} \text{E}\!\left[ \hat{\mu} \right] - \mu)^2 & \text{(linearity of expectation)}\\ &= (\frac{n}{n+1} \mu - \mu)^2 & \text{($\hat{\mu}$ is unbiased)} \\ &= \frac{1}{(n+1)^2} \mu^2 & \text{(factor and simplify)} \\ \\ \end{align*}\]

and its variance \[ \text{Var}\!\left[ \hat{\mu}_{\text{biased}} \right] = \text{Var}\!\left[ \frac{n}{n+1} \hat{\mu} \right] = \left(\frac{n}{n+1}\right)^2 \text{Var}\!\left[ \hat{\mu} \right] = \frac{n}{(n+1)^2}\mu^2 \] Putting the two computations together, Theorem 31.1 tells us that the MSE of \(\hat{\mu}_{\text{biased}}\) is \[ (\text{E}\!\left[ \frac{n}{n+1} \hat{\mu} \right] - \mu)^2 + \text{Var}\!\left[ \hat{\mu}_{\text{biased}} \right] = \left( \frac{1}{(n+1)^2} + \frac{n}{(n+1)^2} \right) \mu^2 = \frac{\mu^2}{n+1} \]

Surprisingly, the biased estimator \(\hat{\mu}_{\text{biased}}\) always has strictly lower MSE than the unbiased estimator \(\hat{\mu}\)! In other words, if we took the estimator \(\hat{\mu}_{\text{biased}}\) and de-biased it, we’d end up with an estimator that always has higher MSE.
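As a quick empirical check (our own sketch; the rate parameter, sample size, and number of repetitions are arbitrary choices), we can simulate both estimators and compare their empirical MSEs to \(\mu^2/n\) and \(\mu^2/(n+1)\).

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n, B = 0.5, 10, 200_000
mu = 1 / lam  # true mean, here 2

x = rng.exponential(scale=mu, size=(B, n))
mu_hat = x.mean(axis=1)                # unbiased sample mean
mu_hat_biased = n / (n + 1) * mu_hat   # shrunken, biased estimator

print(np.mean((mu_hat - mu) ** 2),        mu**2 / n)        # both roughly 0.4
print(np.mean((mu_hat_biased - mu) ** 2), mu**2 / (n + 1))  # both roughly 0.364
```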

In Example 31.4 we computed the variance of the sample mean when we had i.i.d samples from an exponential distribution. The identical computation applies whenever we have i.i.d samples from any distribution, and Proposition 31.1 uses this to derive the MSE of the sample mean more generally.

Proposition 31.1 (Standard error and MSE of the sample mean) Let \(X_1, \dots, X_n\) be i.i.d random variables from any distribution with finite variance \(\sigma^2 = \text{Var}\!\left[ X_1 \right]\). Then the variance of the sample mean (Equation 30.5) is \[\text{Var}\!\left[ \bar{X} \right] = \frac{\sigma^2}{n},\]

which is also the MSE of the sample mean when it is used as an estimator of \(\mu = \text{E}\!\left[ X_1 \right]\).

Proof

An identical computation to that in Example 31.4 implies that \(\text{Var}\!\left[ \bar{X} \right] = \sigma^2/n\). The fact that \(\sigma^2/n\) is the MSE of the sample mean as an estimator of \(\mu\) then follows directly from Proposition 30.1 and Theorem 31.1.
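For instance (a small illustration of our own), Poisson samples exhibit the same \(\sigma^2/n\) behavior, even though they look nothing like the exponential samples of Example 31.4.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, n, B = 3.0, 25, 100_000   # Poisson(3): mean 3, variance 3

xbar = rng.poisson(lam, size=(B, n)).mean(axis=1)
print(xbar.var(), lam / n)                  # both roughly 0.12
print(np.mean((xbar - lam) ** 2), lam / n)  # MSE matches too, since xbar is unbiased
```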

Proposition 31.1 illustrates the standard approach for coming up with estimators with less variance: gather more data. Unlike with bias, where we could try and tweak our original estimator to reduce the bias (as we did in Example 30.2), it is usually difficult to come up with reasonable modifications to an estimator that reduce its variance. Instead, estimators most commonly have low variance when they make use of a large number of samples. This is best illustrated by Example 31.5, where we consider different ways of estimating the mean from \(n\) i.i.d normal samples.

Example 31.5 (Estimating the normal mean with \(n\) samples) Suppose we observe \(n\) i.i.d samples \(X_1, \dots, X_n\) that have a normal distribution (Definition 22.4) with unknown mean \(\mu\) and variance \(\sigma^2\). Your three friends each propose an estimator \(\hat{\mu}\) for \(\mu\). Which of these three estimators should you use?

  1. The constant estimator: Your first friend has a strong hunch that the true mean is \(\mu=0\), so they tell you to ignore the data and use the constant estimator \(\hat{\mu}_1=0\).

  2. The one-sample estimator: Your second friend tells you to just use the first sample, and estimate \(\mu\) with \(\hat{\mu}_2 = X_1\). Although you’re skeptical of discarding most of the data, your friend assures you that this estimator is unbiased and will do a good job.

  3. The sample mean: Your third friend suggests using the sample mean (Equation 30.5) \(\hat{\mu}_3 = \bar{X}\) as an estimate for \(\mu\).

To get a feel for how the three estimators perform, we run some simulations where we fix \(n=20\), \(\sigma^2=1\), and \(\mu=2\).
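A minimal Python sketch of such a simulation (our own; the number of repetitions \(B = 1000\) is an arbitrary choice) follows.

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma, mu, B = 20, 1.0, 2.0, 1000

x = rng.normal(loc=mu, scale=sigma, size=(B, n))
mu1 = np.zeros(B)        # constant estimator: always 0
mu2 = x[:, 0]            # one-sample estimator: just X_1
mu3 = x.mean(axis=1)     # sample mean

for name, est in [("constant", mu1), ("one-sample", mu2), ("sample mean", mu3)]:
    print(name, "empirical MSE:", np.mean((est - mu) ** 2))
```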

As we vary \(n\), \(\sigma^2\), and \(\mu\), a few things become apparent. First, despite having no variance, the first estimator \(\hat{\mu}_1\) can have a huge amount of bias when \(\mu\) is far from zero. It is of course unbeatable when \(\mu=0\), but since we have no way of knowing what \(\mu\) is a priori, the potential for huge bias makes it a bad estimator. Second, although the one-sample estimator \(\hat{\mu}_2\) is unbiased, it is much more variable than the also-unbiased estimator \(\hat{\mu}_3\). By using all the samples, \(\hat{\mu}_3\) achieves a lower-variance estimate that appears to have the lowest MSE of the three.

We can formalize these observations by computing the MSE of each estimate.

  1. MSE of \(\hat{\mu}_1\): Since \(\hat{\mu}_1 = 0\) is a constant, its expectation is itself and its variance is zero. Therefore Theorem 31.1 tells us that the squared bias of \(\hat{\mu}_1\), given by \((0 - \mu)^2 = \mu^2\), is exactly its MSE.

  2. MSE of \(\hat{\mu}_2\): Since \(\text{E}\!\left[ X_1 \right] = \mu\), the estimator \(\hat{\mu}_2 = X_1\) is indeed unbiased. Theorem 31.1 tells us that its variance \(\text{Var}\!\left[ X_1 \right] = \sigma^2\) is therefore its MSE.

  3. MSE of \(\hat{\mu}_3\): Proposition 31.1 tells us that the MSE of \(\hat{\mu}_3 = \bar{X}\) is \(\sigma^2/n\).

The MSE computations back up the results of our simulations and illustrate that \(\hat{\mu}_3\) has a very desirable property that \(\hat{\mu}_1\) and \(\hat{\mu}_{2}\) lack. Regardless of the value of the parameter \(\mu\) we are estimating, the MSE of \(\hat{\mu}_3\) shrinks to zero as the number of samples \(n\) we observe increases. The same cannot be said of \(\hat{\mu}_1\) and \(\hat{\mu}_{2}\).

Example 31.5 clearly illustrates why we get better estimates with more data. Sure, we can often construct an unbiased estimator from just \(n=1\) sample. But we can construct an unbiased estimator with much lower variance if we have many samples.