
46 Hypothesis Testing
In Chapter 28, we discussed how statistics can be regarded as the inverse of probability. Whereas probability quantifies the chances of observing data from a given model, statistics allows us to infer properties of the model from the observed data. So far, we have focused on only one mode of statistical inference: estimation. In this chapter, we will examine another mode of statistical inference: hypothesis testing.
For example, consider a company that manufactures engine ball bearings, such as the one shown in Figure 46.1. The diameter of the inner ring must be manufactured to a high degree of precision to ensure that a shaft or axle fits into the ball bearing.
One model that the company manufactures is designed to have an inner diameter of exactly \(10.00\text{mm}\). However, the manufacturing process is not perfectly precise, so under normal manufacturing conditions, the actual inner diameters of the ball bearings are normally distributed around \(10.00\text{mm}\), with a standard deviation of \(0.03\text{mm}\). That is, the inner diameters \(X_i\) are i.i.d. \(\textrm{Normal}(\mu= 10.00, \sigma^2= 0.03^2)\).
The quality control department at the company measures 5 ball bearings each day to ensure that the inner diameter remains within specifications. One day, the quality control department measures \[ X_1 = 10.06, X_2 = 10.07, X_3 = 9.98, X_4 = 10.02, X_5 = 10.09. \]

If \(\mu\) represents the actual target inner diameter of the ball bearings that day, we know that we can estimate \(\mu\) by the maximum likelihood estimator \[ \bar X = 10.044. \]
However, the company is less interested in estimating \(\mu\) than in determining whether \(\mu\) has deviated from the target value of \(10.00\text{mm}\), in which case the manufacturing process may need to be adjusted. This is a classic application of hypothesis testing.
46.1 The \(z\)-test
The goal of hypothesis testing is to determine whether the data provides evidence for or against a null hypothesis. In the ball bearing example, the null hypothesis is that the process is “in control”; that is, the true inner diameter of the ball bearings is \(10.00\text{mm}\). We denote this null hypothesis by: \[ H_0 : \mu = 10.00 \]
However, we observed that the average inner diameter of 5 ball bearings manufactured on that day was \(\bar X = 10.044\text{mm}\). Although this differs from the target diameter of \(10.00\text{mm}\), this may simply have been an unlucky batch rather than indicating a problem in the manufacturing process. Can we chalk up this difference to chance, or does this data provide evidence against the null hypothesis?
To answer this question, we quantify how likely it is to observe a sample mean at least as far from \(10.00\) as \(\bar X = 10.044\), assuming that in fact \(\mu = 10.00\). This probability is called the p-value. If the p-value is very small, then the observed difference is unlikely to have arisen by chance, and we reject the null hypothesis.
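Here is a minimal sketch of this calculation in R, using the five measurements above:

```r
# Ball bearing example: two-sided p-value under H0: mu = 10.00
x <- c(10.06, 10.07, 9.98, 10.02, 10.09)
xbar <- mean(x)                          # 10.044
z <- (xbar - 10.00) / sqrt(0.03^2 / 5)   # test statistic, about 3.28
p_value <- 2 * pnorm(-abs(z))            # about 0.001
```

The p-value here is about 0.001, so a sample mean this far from \(10.00\text{mm}\) would be very unusual if the process were truly centered at \(10.00\text{mm}\).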
How low does the p-value need to be before we reject the null hypothesis? Typically, statisticians use a threshold of 5%. If the p-value is below 5%, then they reject the null hypothesis; otherwise, they do not reject it.
Finally, we summarize the general procedure for the \(z\)-test.
- Assume \(X_1, \dots, X_n\) are i.i.d. \(\text{Normal}(\mu, \sigma^2)\), where \(\sigma^2\) is known, and we want to test the null hypothesis \[ H_0: \mu = \mu_0. \]
- Calculate the sample mean \(\bar X\) of the observed data. Call this value \(\bar x\).
- Calculate the test statistic \[ z = \frac{\bar x - \mu_0}{\sqrt{\sigma^2 / n}}. \]
- Calculate the p-value \[ p = P(|Z| > |z|), \] where \(Z\) is a standard normal random variable.
- If \(p < .05\), then we reject the null hypothesis \(H_0\). Otherwise, we do not reject it.
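As a sketch, these steps can be wrapped into a small R function (the name z_test is ours, not part of base R):

```r
# A sketch of the z-test: known-variance test of H0: mu = mu0
z_test <- function(x, mu0, sigma2) {
  n <- length(x)
  z <- (mean(x) - mu0) / sqrt(sigma2 / n)   # test statistic
  p <- 2 * pnorm(-abs(z))                   # two-sided p-value P(|Z| > |z|)
  list(z = z, p_value = p)
}

# Reproduces the ball bearing calculation from earlier in the chapter
z_test(c(10.06, 10.07, 9.98, 10.02, 10.09), mu0 = 10.00, sigma2 = 0.03^2)
```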
46.2 The \(t\)-test
There is one major limitation of the \(z\)-test: it requires knowing the population variance \(\sigma^2\). While this may be reasonable in quality control applications, where the population variance is known from years of manufacturing data, it is not the case for most applications.
For example, Mackowiak, Wasserman, and Levine (1992) wanted to evaluate whether the average human body temperature really is \(98.6^\circ\text{F}\), a number that originated with the 19th century physician Carl Reinhold August Wunderlich. To evaluate Wunderlich’s reference point, Mackowiak, Wasserman, and Levine (1992) measured the body temperatures (in \({}^\circ\text{F}\)) of \(n=130\) patients; their data is reproduced below.
Mackowiak, Wasserman, and Levine (1992) were interested in testing the null hypothesis \[ H_0: \mu = 98.6. \]
Human body temperatures are approximately normally distributed, as the above histogram shows. However, we do not know the population variance \(\sigma^2\) of human body temperatures. Therefore, we cannot use the test statistic \[ Z = \frac{\bar{X} - \mu_0}{\sqrt{\sigma^2/n}} \tag{46.1}\] to test this hypothesis.
One natural idea is to replace \(\sigma^2\) with an estimate, such as the sample variance \(S^2\) (Definition 38.1). That is, we can instead use the test statistic \[ T = \frac{\bar X - \mu_0}{\sqrt{S^2 / n}}. \tag{46.2}\]
However, because of the additional variability from \(S^2\), Equation 46.2 will not follow a standard normal distribution. Instead, it follows a new distribution, which we derive now.
First, we will rewrite Equation 46.2 in a more abstract and general form: \[ T = \frac{\frac{\bar X - \mu}{\sqrt{\sigma^2/n}}}{\sqrt{\frac{S^2}{\sigma^2}}} = \frac{Z}{\sqrt{W / (n-1)}}. \tag{46.3}\] The numerator is simply a standard normal random variable \(Z\), while the denominator is the square root of a \(\chi^2_{n-1}\) random variable \(W\) divided by \((n-1)\). This follows from Theorem 45.3, which says that \[ (n-1)\frac{S^2}{\sigma^2} \sim \chi^2_{n-1}. \] Furthermore, by Theorem 44.2, we know that \(\bar X\) and \(S^2\) are independent, so \(Z\) (which only depends on \(\bar X\)) and \(W\) (which only depends on \(S^2\)) are also independent.
The upshot of this discussion is that to find the distribution of Equation 46.2, we need to find the distribution of a random variable of the form \(T = Z / \sqrt{W / k}\), where \(Z\) is a standard normal random variable and \(W \sim \chi^2_k\) is independent of \(Z\); this is called the \(t\) distribution with \(k\) degrees of freedom (Definition 46.1).
Notice that the statistic in Equation 46.2 corresponds to Definition 46.1 with \(k = n - 1\). However, the \(t\) distribution arises in other contexts besides this one; see Exercise 45.2 for an example.
Next, we derive the PDF of the \(t\) distribution with \(k\) degrees of freedom, which turns out to be \[ f(x) = \frac{\Gamma\!\left(\frac{k+1}{2}\right)}{\sqrt{k\pi}\,\Gamma\!\left(\frac{k}{2}\right)} \left(1 + \frac{x^2}{k}\right)^{-(k+1)/2}, \qquad -\infty < x < \infty. \tag{46.5}\]
The PDF Equation 46.5 is graphed in Figure 46.2 for different degrees of freedom \(k\). Notice that, like the standard normal distribution, the \(t\) distribution is symmetric around zero. However, it has “heavier” tails—that is, more probability in the tails.

In practice, we rarely need the formula for the PDF (Equation 46.5), since we can simply use functions like pt(..., df) in R to evaluate probabilities under a \(t\) distribution. However, the formula for the PDF allows us to describe theoretically just how much heavier the tails of the \(t\) distribution are. \[
\begin{align}
f(x) \propto \left(1 + \frac{x^2}{k} \right)^{-(k+1) / 2} &\sim \frac{k^{(k+1)/2}}{x^{k+1}} & \text{(as $x \to \infty$).}
\end{align}
\] Ignoring the numerator (which is just a constant), the decay is polynomial (\(1 / x^{k+1}\)), which is much slower than the super-exponential decay (\(e^{-x^2/2}\)) of the normal distribution.
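For example, we can compare two-sided tail probabilities beyond 3 directly in R; the heavier tails of the \(t\) distribution are apparent even at a moderate number of degrees of freedom:

```r
2 * pt(-3, df = 4)   # P(|T_4| > 3), about 0.04
2 * pnorm(-3)        # P(|Z| > 3),   about 0.003
```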
We can use the \(t\) distribution to carry out the hypothesis test for the mean body temperature example.
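Here is a minimal sketch in R, assuming the 130 temperatures are stored in a numeric vector that we call temps (the name is ours):

```r
# t-test of H0: mu = 98.6, assuming the measurements are in a vector `temps`
n <- length(temps)                                     # 130
t_stat <- (mean(temps) - 98.6) / sqrt(var(temps) / n)  # test statistic
p_value <- 2 * pt(-abs(t_stat), df = n - 1)            # two-sided p-value

# Equivalently, R's built-in function carries out the same test:
t.test(temps, mu = 98.6)
```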
If we had instead used the standard normal distribution to calculate the p-value, we would have obtained a smaller probability.
This makes sense because, as Figure 46.2 shows, the \(t\) distribution has heavier tails than the standard normal distribution. One way to see why is to compare the original test statistics \(Z\) (Equation 46.1) and \(T\) (Equation 46.2); extra variability is introduced when \(\sigma^2\) is replaced by \(S^2\).
Figure 46.2 illustrates another phenomenon: the \(t\) distribution approaches the standard normal distribution as the degrees of freedom \(k\) increases. Again, the test statistic provides some insight here; as \(n \to\infty\), \(S^2\) is very close to \(\sigma^2\), so there is essentially no difference between using \(S^2\) or \(\sigma^2\) when \(n\) is large. The next proposition formalizes this intuition.
Proposition 46.2 states that the \(t_k\) distribution converges to the standard normal distribution as \(k \to \infty\). The practical consequence is that there is little difference between using the \(z\)-test or the \(t\)-test when \(n\) is large.
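For example, the 97.5% quantile of the \(t_k\) distribution approaches the corresponding standard normal quantile as \(k\) grows:

```r
qt(0.975, df = c(5, 30, 1000))   # approximately 2.57, 2.04, 1.96
qnorm(0.975)                     # approximately 1.96
```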
Finally, we summarize the general procedure for the \(t\)-test.
- Assume \(X_1, \dots, X_n\) are i.i.d. \(\text{Normal}(\mu, \sigma^2)\), where \(\sigma^2\) is unknown, and we want to test the null hypothesis \[ H_0: \mu = \mu_0. \]
- Calculate the sample mean \(\bar X\) and sample variance \(S^2\) of the observed data. Call these values \(\bar x\) and \(s^2\).
- Calculate the test statistic \[ t = \frac{\bar x - \mu_0}{\sqrt{s^2 / n}}. \]
- Calculate the p-value \[ p = P(|T_{n-1}| > |t|), \] where \(T_{n-1}\) is a \(t\)-distributed random variable with \((n-1)\) degrees of freedom.
- If \(p < .05\), then we reject the null hypothesis \(H_0\). Otherwise, we do not reject it.
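As with the \(z\)-test, these steps can be collected into a small R function. This is only a sketch (the name t_test_mean is ours); R's built-in t.test() performs the same computation.

```r
# A sketch of the t-test: unknown-variance test of H0: mu = mu0
t_test_mean <- function(x, mu0) {
  n <- length(x)
  t <- (mean(x) - mu0) / sqrt(var(x) / n)   # test statistic
  p <- 2 * pt(-abs(t), df = n - 1)          # two-sided p-value P(|T_{n-1}| > |t|)
  list(t = t, p_value = p)
}
```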
46.3 The \(\chi^2\)-test
Here is a hypothesis test with a different flavor. Suppose we observe a multinomial random vector \(\vec X\), and we want to test whether a particular vector \(\vec p\) of probabilities is plausible.
For example, suppose we roll a die \(n = 100\) times and count how many times each face comes up: \[ \vec X = (15, 13, 26, 12, 16, 18). \] Is this evidence that the die is loaded? That is, is a fair die with \[ \vec p = (\frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}) \] a plausible model for this data?
Here is one possible test statistic. We can compare the observed count \(X_j\) to the expected count \(n p_j\), normalizing by the expected count (since the difference will naturally be larger for categories with a higher expected count), and then aggregate these differences across the categories. That is, \[ W = \sum_{j=1}^k \frac{(X_j - np_j)^2}{np_j}. \tag{46.6}\] Large values of \(W\) would imply that \(\vec p\) is not a good model for the data.
To determine how large \(W\) needs to be to reject the null hypothesis, we need to know the distribution of \(W\). The exact distribution is impractical to derive, but with the help of the Multivariate Central Limit Theorem (Theorem 44.3), it can be shown that under the null hypothesis, \(W\) is approximately \(\chi^2_{k-1}\) when \(n\) is large (Proposition 46.3).
Now, let’s apply Proposition 46.3 to the dice data. We calculate the test statistic and compare it to the \(\chi^2_5\) distribution, as predicted by theory.
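Here is a sketch of that calculation in R:

```r
# Observed counts and expected counts under the fair-die null hypothesis
observed <- c(15, 13, 26, 12, 16, 18)
expected <- 100 * rep(1 / 6, 6)

w <- sum((observed - expected)^2 / expected)   # test statistic, about 7.64
p_value <- 1 - pchisq(w, df = 5)               # about 0.177
```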
The p-value is \(.177\), which is not small enough to reject the null hypothesis that this is a fair die.
Finally, we conclude with a simulation study. Let’s compare the simulated distribution of \(W\) under the null hypothesis with the \(\chi^2_5\) distribution predicted by Proposition 46.3.
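A sketch of such a simulation in R (the number of replications, 10,000, is our choice):

```r
# Simulate the null distribution of W for a fair die rolled n = 100 times
p0 <- rep(1 / 6, 6)
sim_W <- replicate(10000, {
  counts <- as.vector(rmultinom(1, size = 100, prob = p0))
  sum((counts - 100 * p0)^2 / (100 * p0))
})

# Overlay the chi-squared density with 5 degrees of freedom on the simulated values
hist(sim_W, breaks = 50, freq = FALSE)
curve(dchisq(x, df = 5), add = TRUE)
```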
The approximation is very accurate!
46.4 Exercises
Exercise 46.1 When simulating basketball games, some people use a Gaussian random number generator (a random sample from \(\text{Normal}(0,1)\)) to assign a “heat index” to each player. For example, if Steph Curry’s field goal probability is known to be \(0.471\) with standard deviation \(0.105\), and the random number generator outputs \(0.453\) for Steph, then for that day, Steph will have a \(+0.453\) SD performance; i.e., his shooting percentage that day will be \[ 0.471 + 0.453 \cdot 0.105 = 0.519. \] This updated percentage will be used for Steph in simulating games that day.
One day, you are simulating games involving \(232\) players, and the \(232\) numbers generated have mean \(0.31\). Is there enough evidence to suspect that our Gaussian random number generator is not sampling from a mean \(0\) distribution?
Exercise 46.2 What if we repeat the same process as in Exercise 46.1, but instead of using a Gaussian random number generator, we sample from \(\text{Uniform}(-2,2)\) to assign heat indices to the players?
Suppose we generate \(232\) numbers, and they have mean \(0.31\). If \(X_1, \dots, X_{232}\) represent a random sample of size \(232\) from \(\text{Uniform}(-2,2)\), we cannot directly use the \(z\)-test, since we are not sampling from a normal distribution. However, it turns out that we can still carry out an approximate test by using the Central Limit Theorem.
- What is the approximate distribution of \(\bar{X}\), according to the Central Limit Theorem?
- Use the result to determine the p-value of the observed outcome. Is there enough evidence to suspect that our uniform random number generator is not sampling from a mean \(0\) distribution?
Exercise 46.3 (Hypothesis test for a binomial proportion) Suppose we observe \(X \sim \text{Binomial}(n, p)\) and want to test the null hypothesis \[ H_0: p = p_0. \]
Explain in detail how you would carry out this hypothesis test using a test statistic of the form \[ z = \frac{X - \fbox{???}}{\fbox{???}}. \]
What is the (approximate) distribution of this test statistic? What conditions would need to be satisfied for this test to be valid?
Exercise 46.4 Stout beers are flavored with the flowers of the plant Humulus lupulus, widely known as hops. The trademark bitter flavor comes from resins, a semisolid substance from the glands of the hops.
A particular brewery wants the proportion of soft resins to be 8.5%. In a batch of \(12\) samples, the brewmaster measures the following proportions of soft resins (in %): \[ 7.8, 8.5, 8.6, 8.2, 8.0, 8.4, 8.2, 7.7, 8.2, 8.2, 8.1, 8.0. \] Is there enough evidence to conclude that the machine is producing beer with a proportion of soft resins not equal to 8.5%?
Exercise 46.5 (Alternative derivation of the \(t\) distribution) Let \(Z \sim \text{Normal}(0, 1)\) and \(W \sim \chi^2_{n-1}\) be independent. Derive the joint distribution of \(T = \frac{Z}{\sqrt{W / (n-1)}}\) and \(U = W / (n-1)\).
Use this joint distribution to derive the PDF of \(T\).
Exercise 46.6 (Hypothesis test for a difference of means) Let \(X_1, \dots, X_m\) be i.i.d. \(\text{Normal}(\mu_1, \sigma^2)\) and \(Y_1, \dots, Y_n\) be i.i.d. \(\text{Normal}(\mu_2, \sigma^2)\).
Under the null hypothesis that \(\mu_1 = \mu_2\), we are in the situation of Exercise 45.2. Using the results from that exercise, derive the distribution of the test statistic
\[ T = \frac{\bar X - \bar Y}{\sqrt{\frac{S_{\text{pooled}}^2}{m} + \frac{S_{\text{pooled}}^2}{n}}}. \]