48  Confidence Intervals

In this book so far, we have focused mostly on point estimation, in which we report a single number as our estimate of a parameter of interest, ignoring the uncertainty in that estimate. In this chapter, we will discuss how to report a range of plausible values for a parameter, capturing the uncertainty in our estimate. This range of values is called a confidence interval and has a fundamental connection with the hypothesis tests from Chapter 47.

48.1 The \(z\)-interval

Recall Example 47.1, where the quality engineers take daily samples of \(n=5\) ball bearings from a production line. One day, the diameters of the ball bearings (in mm) are measured to be \[ X_1 = 10.06, X_2 = 10.07, X_3 = 9.98, X_4 = 10.02, X_5 = 10.09. \] It is assumed that \(X_1, \dots, X_5\) are i.i.d. \(\text{Normal}(\mu, \sigma^2 = 0.03^2)\).

In Example 47.1, we tested whether the production line was producing ball bearings that met the specification of \(\mu = 10\) mm. Alternatively, we can estimate \(\mu\), the mean diameter of ball bearings produced by the line.

We know from Example 30.3 that the maximum likelihood estimate of \(\mu\) is \[ \bar X = 10.044. \]

This is a point estimate of \(\mu\). We can quantify the uncertainty by reporting a 95% confidence interval.

We know that \[ Z = \frac{\bar X - \mu}{\sqrt{\sigma^2 / n}} \tag{48.1}\] is standard normal, so by definition, \[ P(\Phi^{-1}(0.025) < Z < \Phi^{-1}(0.975)) = 0.95. \] (Note that \(\Phi^{-1}(0.975) = -\Phi^{-1}(0.025) \approx 1.96 \approx 2\).)
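
For reference, these standard normal quantiles can be computed numerically. A minimal sketch using SciPy, where `norm.ppf` is the inverse CDF \(\Phi^{-1}\):

```python
from scipy.stats import norm

# The quantiles that bracket the middle 95% of the standard normal distribution.
print(norm.ppf(0.975))  # approximately  1.96
print(norm.ppf(0.025))  # approximately -1.96
```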

Substituting Equation 48.1 for \(Z\), we can rearrange the inequalities so that \(\mu\) is in the middle: \[ \begin{align} &P\left(\Phi^{-1}(0.025) \leq \frac{\bar X - \mu}{\sqrt{\sigma^2 / n}} \leq \Phi^{-1}(0.975)\right) \\ &= P\left(\Phi^{-1}(0.025) \sqrt{\frac{\sigma^2}{n}} \leq \bar X - \mu \leq \Phi^{-1}(0.975)\sqrt{\frac{\sigma^2}{n}} \right) \\ &= P\left(\bar X - \Phi^{-1}(0.025)\sqrt{\frac{\sigma^2}{n}} \geq \mu \geq \bar X - \Phi^{-1}(0.975)\sqrt{\frac{\sigma^2}{n}}\right) \\ \end{align} \]

That is, the random interval \[ \left[\bar X - \Phi^{-1}(0.975) \sqrt{\frac{\sigma^2}{n}}, \bar X - \Phi^{-1}(0.025) \sqrt{\frac{\sigma^2}{n}}\right] \] has a 95% probability of containing \(\mu\). This is called a 95% confidence interval.

For the ball bearings data, a 95% confidence interval is \[ \left[10.044 - 1.96 \sqrt{\frac{0.03^2}{5}}, 10.044 + 1.96 \sqrt{\frac{0.03^2}{5}}\right] = [10.0177, 10.0703]. \tag{48.2}\] We say that we are 95% confident that \(\mu\) is between \(10.0177\) mm and \(10.0703\) mm. We cannot know for certain whether \(\mu\) is in this interval or not, but we know that an interval constructed in this way will contain \(\mu\) 95% of the time. We can illustrate this via simulation.
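
One way to carry out such a simulation is sketched below (assuming a hypothetical true mean of \(\mu = 10.02\) mm, matching the discussion that follows): repeatedly draw samples of \(n = 5\) diameters, construct the 95% interval from each sample, and record how often the interval contains \(\mu\).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)          # arbitrary seed for reproducibility

mu, sigma, n = 10.02, 0.03, 5           # hypothetical true mean; known sigma; sample size
z = norm.ppf(0.975)                     # approximately 1.96

num_sims = 10000
covered = 0
for _ in range(num_sims):
    x_bar = rng.normal(mu, sigma, size=n).mean()
    margin = z * np.sqrt(sigma**2 / n)
    covered += (x_bar - margin <= mu <= x_bar + margin)

print(covered / num_sims)               # close to 0.95
```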

About 95% of these intervals contain \(\mu = 10.02\) in this simulation. In practice, we only ever observe one of these intervals, and we have no way of knowing whether \(\mu\) is in that interval or not. However, we hope that our interval is one of the 95% of intervals that do contain \(\mu\), rather than one of the 5% that do not.

The next proposition summarizes the results of this section, generalizing them to confidence levels other than 95%.

Proposition 48.1 (\(z\)-interval for a normal mean) Let \(X_1, \dots, X_n\) be i.i.d. \(\text{Normal}(\mu, \sigma^2)\) random variables, where \(\sigma^2\) is known.

A \((1-\alpha)\) confidence interval for a mean \(\mu\) is \[ \left[\bar X - \Phi^{-1}(1 - \alpha/2) \sqrt{\frac{\sigma^2}{n}}, \bar X - \Phi^{-1}(\alpha/2) \sqrt{\frac{\sigma^2}{n}} \right] = \bar X \pm \Phi^{-1}(1 - \alpha/2) \sqrt{\frac{\sigma^2}{n}}. \]

In the rest of this chapter, we will focus on the case where \(\alpha = 0.05\) (95% confidence intervals), although generalizing the results to other values of \(\alpha\) is straightforward.
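
As an illustration, a small helper function along these lines (a sketch, applied to the ball-bearing diameters from Section 48.1) computes the \(z\)-interval at any confidence level:

```python
import numpy as np
from scipy.stats import norm

def z_interval(x, sigma, alpha=0.05):
    """(1 - alpha) z-interval for the mean when the variance sigma**2 is known."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    margin = norm.ppf(1 - alpha / 2) * np.sqrt(sigma**2 / n)
    return x.mean() - margin, x.mean() + margin

diameters = [10.06, 10.07, 9.98, 10.02, 10.09]
print(z_interval(diameters, sigma=0.03))              # 95% interval, approx (10.018, 10.070)
print(z_interval(diameters, sigma=0.03, alpha=0.01))  # 99% interval is wider
```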

48.2 Duality of confidence intervals and hypothesis tests

The 95% confidence interval that we constructed in Equation 48.2 did not contain \(\mu = 10.00\), which agrees with our decision in Example 47.1 to reject the null hypothesis that \(\mu = 10.00\). This is no coincidence. A 95% confidence interval will contain \(\mu_0\) if and only if the null hypothesis \(H_0: \mu = \mu_0\) cannot be rejected (i.e., the \(p\)-value is above 5%).

Proposition 48.2 (Duality of confidence intervals and hypothesis tests) Suppose that a hypothesis test of \(H_0: \theta = \theta_0\) rejects when \[ \vec X \in R(\theta_0) \] for some rejection region \(R(\theta_0)\), defined so that the probability of rejecting is 5% when the null hypothesis is true: \[ P_{\theta_0}(\vec X \in R(\theta_0)) = 0.05. \] (In other words, the probability is calculated assuming that \(\vec X\) comes from a distribution where \(\theta = \theta_0\).)

Then, the random set (not necessarily an interval) \[ C(\vec X) \overset{\text{def}}{=}\{ \theta: \vec X \notin R(\theta) \} \] is a 95% confidence set for \(\theta\).

Proof

For every \(\theta_0\), we have \[ \begin{align} P_{\theta_0}(\theta_0 \in C(\vec X)) &= P_{\theta_0}(\vec X \notin R(\theta_0)) \\ &= 1 - P_{\theta_0}(\vec X \in R(\theta_0)) \\ &= 0.95. \end{align} \]

Proposition 48.2 says that we can “invert” a hypothesis test to obtain a confidence interval and vice versa. If we invert the \(z\)-test from Section 47.1, then we obtain the \(z\)-interval above.

Example 48.1 (Duality of the \(z\)-test and the \(z\)-interval) The \(z\)-test from Section 47.1 rejects the null hypothesis \(H_0: \mu = \mu_0\) when \[ |Z| = \left| \frac{\bar X - \mu_0}{\sqrt{\sigma^2 / n}} \right| > \Phi^{-1}(0.975). \]

In other words, this test does not reject when \[ \Phi^{-1}(0.025) \leq \frac{\bar X - \mu_0}{\sqrt{\sigma^2 / n}} \leq \Phi^{-1}(0.975). \]

If we solve for the values of \(\mu_0\) for which the test does not reject, we obtain \[ \bar X - \Phi^{-1}(0.025)\sqrt{\frac{\sigma^2}{n}} \geq \mu_0 \geq \bar X - \Phi^{-1}(0.975)\sqrt{\frac{\sigma^2}{n}}, \] which is precisely the \(z\)-interval from Section 48.1.
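
This duality can also be checked numerically: scanning over candidate values of \(\mu_0\) and keeping those that the \(z\)-test does not reject recovers the interval in Equation 48.2. A sketch, assuming the ball-bearing summary statistics from Section 48.1:

```python
import numpy as np
from scipy.stats import norm

x_bar, sigma, n = 10.044, 0.03, 5
z975 = norm.ppf(0.975)

# Scan a fine grid of candidate null values mu_0 and keep those that the
# z-test does not reject at the 5% level.
mu_grid = np.linspace(9.95, 10.15, 200001)
z_stats = (x_bar - mu_grid) / np.sqrt(sigma**2 / n)
not_rejected = mu_grid[np.abs(z_stats) <= z975]

print(not_rejected.min(), not_rejected.max())  # approximately 10.0177 and 10.0703
```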

48.3 The \(t\)-interval

In the examples so far, we assumed the variance \(\sigma^2\) was known. What if it is not known?

In Section 47.2, we saw that for hypothesis testing, we can perform a \(t\)-test instead of a \(z\)-test. In the \(t\)-test, we replace \(\sigma^2\) by the sample variance \(S^2\). This introduces additional uncertainty, so instead of comparing to a standard normal distribution, we compare to a \(t\)-distribution.

We can use Proposition 48.2 to invert the \(t\)-test to obtain a confidence interval when \(\sigma^2\) is unknown. The result is called a \(t\)-interval.

Proposition 48.3 (\(t\)-interval for a normal mean) Let \(X_1, \dots, X_n\) be i.i.d. \(\text{Normal}(\mu, \sigma^2)\) random variables, where \(\sigma^2\) is unknown.

A 95% confidence interval for a mean \(\mu\) is \[ \left[\bar X - F_{t_{n-1}}^{-1}(0.975) \sqrt{\frac{S^2}{n}}, \bar X - F_{t_{n-1}}^{-1}(0.025) \sqrt{\frac{S^2}{n}} \right] = \bar X \pm F_{t_{n-1}}^{-1}(0.975) \sqrt{\frac{S^2}{n}}, \] where \(F_{t_{n-1}}\) is the CDF of a \(t\)-distribution with \(n-1\) degrees of freedom and \(S^2\) is the sample variance (Definition 38.1).

Proof

The \(t\)-test from Section 47.2 rejects the null hypothesis \(H_0: \mu = \mu_0\) when \[ |T| = \left| \frac{\bar X - \mu_0}{\sqrt{S^2 / n}} \right| > F_{t_{n-1}}^{-1}(0.975). \] That is, the \(t\)-test compares the \(t\)-statistic to a \(t\)-distribution.

In other words, this test does not reject when \[ F_{t_{n-1}}^{-1}(0.025) \leq \frac{\bar X - \mu_0}{\sqrt{S^2 / n}} \leq F_{t_{n-1}}^{-1}(0.975). \]

If we solve for the values of \(\mu_0\) for which the test does not reject, we obtain \[ \bar X - F_{t_{n-1}}^{-1}(0.025)\sqrt{\frac{S^2}{n}} \geq \mu_0 \geq \bar X - F_{t_{n-1}}^{-1}(0.975)\sqrt{\frac{S^2}{n}}, \] which corresponds to the interval above.

Armed with the \(t\)-interval, we can calculate a 95% confidence interval for the average human body temperature.

Example 48.2 (Confidence interval for the average human body temperature) In Example 47.2, we saw that the average human body temperature is not \(98.6^\circ\) F, but what is it? We can use the data to form a 95% confidence interval.
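
A sketch of the computation, where `temps` is a placeholder for an array holding the temperature measurements from Example 47.2 (the data itself is not reproduced here):

```python
import numpy as np
from scipy.stats import t

def t_interval(x, alpha=0.05):
    """(1 - alpha) t-interval for the mean when the variance is unknown."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s2 = x.var(ddof=1)                                         # sample variance S^2
    margin = t.ppf(1 - alpha / 2, df=n - 1) * np.sqrt(s2 / n)
    return x.mean() - margin, x.mean() + margin

# For Example 48.2, the function would be applied to the body temperature
# measurements (not reproduced here):
# print(t_interval(temps))

# As a runnable check, here is the t-interval for the ball-bearing diameters,
# now treating the variance as unknown:
print(t_interval([10.06, 10.07, 9.98, 10.02, 10.09]))
```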

48.4 Asymptotic confidence intervals

If \(X_1, \dots, X_n\) are not i.i.d. normal, then it is not easy to obtain an interval with exactly 95% probability of covering \(\mu\). However, we can still obtain an interval with approximate 95% coverage when \(n\) is large, thanks to the Central Limit Theorem.

Proposition 48.4 (Wald intervals) Let \(X_1, \dots, X_n\) be i.i.d. random variables with mean \(\mu\). Then, \[ Z \overset{\text{def}}{=}\frac{\bar X - \mu}{\sqrt{\sigma^2 / n}} \] is asymptotically standard normal by the Central Limit Theorem (Theorem 36.1). Therefore, the interval \[ \bar X \pm \Phi^{-1}(0.975) \sqrt{\frac{\sigma^2}{n}} \] has coverage \[ \begin{align} &P\left(\bar X - \Phi^{-1}(0.025)\sqrt{\frac{\sigma^2}{n}} \geq \mu \geq \bar X - \Phi^{-1}(0.975)\sqrt{\frac{\sigma^2}{n}}\right)\\ &= P(\Phi^{-1}(0.025) \leq Z \leq \Phi^{-1}(0.975)) \\ &\approx 0.95. \end{align} \]

If \(\sigma^2\) is unknown, then we can replace it with any consistent estimate \(\hat\sigma^2\) (such as the sample variance \(S^2\)), in which case \[ Z' \overset{\text{def}}{=}\frac{\bar X - \mu}{\sqrt{\hat\sigma^2 / n}} = \frac{\frac{\bar X - \mu}{\sqrt{\sigma^2 / n}}}{\sqrt{\hat\sigma^2 / \sigma^2}} \overset{d}{\to} \frac{Z}{1} \] by Slutsky’s theorem (see Theorem 37.3). Therefore, the interval \[ \bar X \pm \Phi^{-1}(0.975) \sqrt{\frac{\hat\sigma^2}{n}} \] also has approximate 95% coverage.

These asymptotic confidence intervals are known as Wald intervals. We now return to the skew die example that we encountered at the beginning of our exploration of statistical inference in Chapter 29 and construct a Wald interval for the probability \(p\) of rolling a six.

Example 48.3 (Wald interval for a binomial proportion) In Example 29.4, we rolled a skew die \(n = 25\) times and observed that six came up exactly \(X = 7\) times. Can we use this information to come up with a confidence interval for \(p\), the probability of rolling a six? We already know that the MLE of \(p\) is \(\hat p = \frac{7}{25} = 0.28\).

If we express \(X\) as a sum of Bernoulli random variables \(I_1 + \dots + I_n\), then we see that \[ \hat p = \frac{X}{n} = \frac{I_1 + \dots + I_n}{n} = \bar I \] and \(\mu = \text{E}\!\left[ I_1 \right] = p\). Therefore, we can use Proposition 48.4 to construct an approximate confidence interval for \(p\).

Since we do not know \(\sigma^2 = \text{Var}\!\left[ I_1 \right] = p(1 - p)\), we will estimate it by \(\hat\sigma^2 = \hat p(1 - \hat p)\), which is consistent for \(\sigma^2\) by the continuous mapping theorem (Theorem 37.2). (This turns out to be equivalent to the sample variance \(S^2\) of the Bernoulli random variables.)

Now, Proposition 48.4 says that the interval \[ \begin{align} \hat p \pm \Phi^{-1}(0.975) \sqrt{\frac{\hat p (1 - \hat p)}{n}} &\approx 0.28 \pm 1.96 \sqrt{\frac{0.28 (1 - 0.28)}{25}} \\ &\approx [0.104, 0.456] \end{align} \] is an approximate 95% confidence interval for \(p\).

The Wald interval is only guaranteed to have 95% coverage as \(n\to\infty\). How good is the interval for \(n=25\)? We can simulate 10000 realizations of \(X\) and see how often the interval covers a hypothetical true value of \(p\).
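
A sketch of such a simulation, using the hypothetical true value \(p = 0.12\) discussed below:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
z = norm.ppf(0.975)

def wald_interval(x, n):
    """Approximate 95% Wald interval for a binomial proportion."""
    p_hat = x / n
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

print(wald_interval(7, 25))  # the interval from Example 48.3, approx (0.104, 0.456)

# Estimate the coverage when the true probability is p = 0.12 and n = 25.
p, n, num_sims = 0.12, 25, 10000
xs = rng.binomial(n, p, size=num_sims)
lower, upper = wald_interval(xs, n)           # vectorized over the simulated counts
print(np.mean((lower <= p) & (p <= upper)))   # noticeably below 0.95 (about 0.81)
```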

Shockingly, a 95% Wald interval only covers \(p = 0.12\) about 81% of the time. The coverage is better for other values of \(p\), but this example illustrates the dangers of relying on asymptotic results.

We can obtain an interval with better coverage by returning to first principles (Proposition 48.2). We start by deriving a test of \(H_0: p = p_0\) and then inverting this test to obtain a confidence interval.

Example 48.4 (Wilson interval for a binomial proportion) An asymptotic test of \(H_0: p = p_0\) does not reject when \[ \Phi^{-1}(0.025) \leq \underbrace{\frac{\hat p - p_0}{\sqrt{p_0(1-p_0) / n}}}_{Z} \leq \Phi^{-1}(0.975). \] Notice that \(p_0\) appears in both the numerator and denominator of \(Z\) because we are using the exact mean and variance of \(\hat p\) under the null hypothesis.

Solving for \(p_0\) requires some algebra. First, we square both sides to obtain \[ \frac{(\hat p - p_0)^2}{p_0(1-p_0) / n} \leq \Phi^{-1}(0.975)^2. \]

This can be rearranged into the quadratic inequality \[ \begin{align} (\hat p - p_0)^2 - \frac{\Phi^{-1}(0.975)^2}{n} p_0(1-p_0) &\leq 0 \\ \left(1 + \frac{\Phi^{-1}(0.975)^2}{n}\right)p_0^2 - \left(2\hat p + \frac{\Phi^{-1}(0.975)^2}{n}\right) p_0 + \hat p^2 &\leq 0. \end{align} \]

The quadratic inequality is satisfied when \(p_0\) lies between the roots of the quadratic polynomial. Therefore, the test does not reject exactly when \(p_0\) lies in the interval with endpoints \[ \begin{align} & \frac{1}{1 + \frac{\Phi^{-1}(0.975)^2}{n}} \left( \left(\hat p + \frac{\Phi^{-1}(0.975)^2}{2n}\right) \pm \sqrt{\left(\hat p + \frac{\Phi^{-1}(0.975)^2}{2n} \right)^2 - \left( 1 + \frac{\Phi^{-1}(0.975)^2}{n} \right) \hat p^2} \right) \\ &= \frac{1}{1 + \frac{\Phi^{-1}(0.975)^2}{n}} \left( \left(\hat p + \frac{\Phi^{-1}(0.975)^2}{2n}\right) \pm \Phi^{-1}(0.975) \sqrt{\frac{\hat p (1 - \hat p)}{n} + \frac{\Phi^{-1}(0.975)^2}{4n^2}} \right). \end{align} \]

This is called the Wilson interval for the binomial proportion. Comparing it with the Wald interval from Example 48.3, we see that:

  • the Wilson interval is scaled by \(\frac{1}{1 + \frac{\Phi^{-1}(0.975)^2}{n}}\),
  • the Wilson interval is centered around \(\hat p + \frac{\Phi^{-1}(0.975)^2}{2n}\) instead of \(\hat p\), and
  • the Wilson interval estimates \(\text{Var}\!\left[ \hat p \right]\) by \(\frac{\hat p (1 - \hat p)}{n} + \frac{\Phi^{-1}(0.975)^2}{4n^2}\) instead of \(\frac{\hat p (1 - \hat p)}{n}\).

Note that all of these adjustments become negligible as \(n\to\infty\), which makes sense because the Wald interval has asymptotic 95% coverage. Nevertheless, these small adjustments are enough to improve the coverage for finite \(n\), as the simulation below demonstrates.
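
Here is a sketch of that comparison, implementing the Wilson interval formula derived above and repeating the same coverage simulation used for the Wald interval:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
z = norm.ppf(0.975)

def wilson_interval(x, n):
    """Approximate 95% Wilson interval for a binomial proportion."""
    p_hat = x / n
    center = p_hat + z**2 / (2 * n)
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    scale = 1 + z**2 / n
    return (center - margin) / scale, (center + margin) / scale

print(wilson_interval(7, 25))  # Wilson interval for the skew die data

# Estimate the coverage when the true probability is p = 0.12 and n = 25.
p, n, num_sims = 0.12, 25, 10000
xs = rng.binomial(n, p, size=num_sims)
lower, upper = wilson_interval(xs, n)
print(np.mean((lower <= p) & (p <= upper)))  # much closer to 0.95 than the Wald interval
```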

Whereas the coverage of the Wald interval was 81%, the coverage of the Wilson interval is very close to 95%!

The Wilson interval is still asymptotic; it relies on the Central Limit Theorem and is only guaranteed to have 95% coverage as \(n \to \infty\). However, because it was derived by inverting a hypothesis test that used the exact variance \(\sigma^2 = p_0 (1 - p_0)\), instead of an estimate, it tends to perform much better than the Wald interval for smaller values of \(n\).