46  Properties of a Normal Sample

In statistics, we typically model data \(Y_1, \dots, Y_n\) as independent random variables. One of the most common models is \[ Y_i \sim \text{Normal}(\mu_i, \sigma^2), \tag{46.1}\] or equivalently, in random vector notation as \[ \vec Y \sim \textrm{MVN}(\vec \mu, \sigma^2 I). \]

Note that we do not assume that \(Y_1, \dots, Y_n\) are identically distributed; we allow their means to be different. However, we do assume that they all have the same variance. This assumption, called homoskedasticity (from Greek for “equal scatter”), is essential to simplifying calculations, as we will see.

Here is one situation where we might model data as independent normal, but with different means.

Example 46.1 (Linear regression) Let \(Y_1, \dots, Y_n\) represent the heights of boys ages 5 to 15. We know that children grow taller over time, so if we also know their ages \(x_1, \dots, x_n\), we might model their heights as independent random variables \[ Y_i \sim \text{Normal}(\alpha + \beta x_i, \sigma^2), \] where the mean \(\mu_i = \alpha + \beta x_i\) is a linear function of age. In other words, it might be reasonable to assume that the heights of the boys are independent normal with the same variance, but with means that increase with age.

If we let \(\vec x \overset{\text{def}}{=}(x_1, \dots, x_n)\), then we can write \(Y_1, \dots, Y_n\) in random vector notation as \[ \vec Y \sim \textrm{MVN}(\vec\mu = \vec 1 \alpha + \vec x \beta, \Sigma = \sigma^2 I). \]

To estimate \(\alpha\), \(\beta\), and \(\sigma\), we can maximize the (log-)likelihood \[ \ell_{\vec Y}(\alpha, \beta, \sigma^2) = -\frac{n}{2} \log (2\pi \sigma^2) - \frac{1}{2} (\vec Y - (\underbrace{\vec 1 \alpha + \vec x \beta}_{\vec\mu}))^\intercal (\underbrace{\sigma^2 I}_{\Sigma})^{-1} (\vec Y - (\underbrace{\vec 1 \alpha + \vec x \beta}_{\vec\mu})). \tag{46.2}\]

A bit of algebra shows that the MLEs of \(\alpha\) and \(\beta\) can be obtained by solving \[ (\hat\alpha, \hat\beta) = \underset{\alpha,\beta}{\arg\min}\ \left\| \vec Y - \begin{bmatrix} \vec 1 & \vec x \end{bmatrix} \begin{pmatrix}\alpha \\ \beta \end{pmatrix} \right\|^2, \] which is a least-squares problem. Once \(\hat\alpha\) and \(\hat\beta\) are found, we set \[ \hat{\vec\mu} = \vec 1 \hat\alpha + \vec x \hat\beta \] and plug that into Equation 46.2 to obtain the MLE of \(\sigma^2\): \[ \hat\sigma^2 = \underset{\sigma^2}{\arg\max}\ \left( -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \left\| \vec{Y} - \hat{\vec{\mu}} \right\|^2 \right). \] Taking the derivative and setting it to 0 shows \[ \hat\sigma^2 = \frac{1}{n} \Big|\Big|\vec Y - \hat{\vec\mu}\Big|\Big|^2. \]
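These formulas translate directly into code. Below is a minimal numerical sketch (in Python, using NumPy) that simulates data from the model and computes \(\hat\alpha\), \(\hat\beta\), and \(\hat\sigma^2\); the particular ages, coefficients, noise level, and sample size are hypothetical choices for illustration.

```python
# Minimal sketch of Example 46.1: simulate heights and compute the MLEs.
# The ages, coefficients, noise level, and sample size are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(5, 15, size=n)                           # ages of the boys
alpha_true, beta_true, sigma_true = 80.0, 6.0, 7.0
Y = rng.normal(alpha_true + beta_true * x, sigma_true)   # heights (in cm, say)

# Design matrix [1  x]; the MLEs of (alpha, beta) solve a least-squares problem.
X = np.column_stack([np.ones(n), x])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, Y, rcond=None)

mu_hat = X @ np.array([alpha_hat, beta_hat])             # fitted means
sigma2_hat = np.sum((Y - mu_hat) ** 2) / n               # MLE of sigma^2 (divides by n)
print(alpha_hat, beta_hat, sigma2_hat)
```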

In this chapter, we will develop tools for analyzing the sampling distribution of the estimators \(\hat{\vec\mu}\) and \(\hat\sigma^2\).

46.1 Projection and Independence

In this section, we will generalize the result from Theorem 45.2. There, we showed that if \(\vec X\) is a vector of i.i.d. \(\text{Normal}(\mu, \sigma^2)\) random variables, then \[ \vec 1 \bar X = P_{\vec{1}} \vec X, \] where \(\displaystyle P_{\vec{1}} \overset{\text{def}}{=}\frac{\vec{1} \vec{1}^\intercal}{n}\), is independent of \[ \vec X - \vec 1 \bar X = (I - P_{\vec{1}})\vec X. \] That result turns out to be true more generally for any (orthogonal) projection matrix \(P\), as well as for independent normal random variables that need not have the same mean (only the same variance).

Definition 46.1 (Orthogonal projection matrix) Let \(P\) be an \(n \times n\) matrix. \(P\) is an orthogonal projection matrix if \(P^2 = P\) and \(P^\intercal = P\).

In particular, \(P\) is the orthogonal projection onto \(C(P) \subseteq \mathbb{R}^n\), where \(C(P)\) denotes the column space of \(P\).

Intuitively, \(P^2 = P\) because the vector is already in the subspace after the first projection, so any subsequent projections do nothing. This property is called idempotence (from Latin for “same power”).

Coming back to \(P_{\vec{1}}\), it has the following properties, which the short numerical sketch after this list verifies:

  • \(P_{\vec{1}} \vec{1} = \vec{1}\)
  • \(P_{\vec{1}} \vec{v} = \vec{0}\) for any \(\vec{v} \in C(\vec{1})^\perp\); i.e., \(\vec{v} \perp \vec{1}\).
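Here is that check, a small sketch using NumPy; the dimension \(n = 5\) and the test vector are arbitrary choices.

```python
# Check the defining and special properties of P_1 = 1 1^T / n numerically.
# The dimension and the test vector are arbitrary choices.
import numpy as np

n = 5
ones = np.ones((n, 1))
P1 = ones @ ones.T / n                      # P_1 = 1 1^T / n

v = np.array([1.0, -1.0, 2.0, 0.0, -2.0])   # entries sum to 0, so v is orthogonal to 1

print(np.allclose(P1 @ P1, P1))             # idempotent: P^2 = P
print(np.allclose(P1.T, P1))                # symmetric:  P^T = P
print(np.allclose(P1 @ ones, ones))         # P_1 1 = 1
print(np.allclose(P1 @ v, 0))               # P_1 v = 0 for v orthogonal to 1
```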

Now, we will show that Theorem 45.2 also holds for independent normal random variables that do not necessarily have the same mean, as well as for arbitrary (orthogonal) projection matrices.

Theorem 46.1 (Independence of projection and residual) As in Equation 46.1, let \[\vec Y = (Y_1, \dots, Y_n) \sim \textrm{MVN}(\vec\mu, \sigma^2 I),\] and let \(P\) be any (orthogonal) projection matrix.

Then, the projection \(P\vec Y\) and the vector of residuals \((I - P)\vec Y\) are independent.

Proof

First, note that \[ \begin{pmatrix} P \vec{Y} \\ (I - P)\vec{Y} \end{pmatrix} = \begin{bmatrix} P \\ I - P \end{bmatrix} \vec Y, \] is multivariate normal by Proposition 45.3. Therefore, if we can show that their cross-covariance is zero, then they must be independent by Theorem 45.1.

\[ \begin{align} \text{Cov}\!\left[ P\vec Y, (I - P)\vec Y \right] &= P \underbrace{\text{Var}\!\left[ \vec Y \right]}_{\sigma^2 I} (I - P)^\intercal \\ &= \sigma^2 P(I - P) \\ &= \sigma^2 (P - P^2) \\ &= 0_{n \times n}. \end{align} \]

Notice that the assumption of homoskedasticity was critical to the proof of Theorem 46.1. If \(\text{Var}\!\left[ \vec Y \right]\) were not proportional to the identity matrix \(I\), then we would not have been able to combine \(P\) and \((I - P)\) to form the zero matrix \(0_{n \times n}\).
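To see Theorem 46.1 in action, here is a short Monte Carlo sketch; the particular subspace, mean vector, variance, and number of replications are arbitrary illustrative choices.

```python
# Simulation sketch of Theorem 46.1: the cross-covariance between P Y and
# (I - P) Y is zero (up to Monte Carlo error). All numerical settings are
# arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
n, sigma, reps = 4, 2.0, 100_000
mu = np.array([1.0, -3.0, 0.5, 2.0])            # the means need not be equal

# Orthogonal projection onto the span of two fixed vectors (any subspace works).
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
P = A @ np.linalg.inv(A.T @ A) @ A.T

Y = rng.normal(mu, sigma, size=(reps, n))       # each row is one draw of Y
proj = Y @ P.T                                  # rows of P Y
resid = Y @ (np.eye(n) - P).T                   # rows of (I - P) Y

# Sample cross-covariance between the first coordinates of the two pieces:
print(np.cov(proj[:, 0], resid[:, 0])[0, 1])    # close to 0
```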

Next, we apply Theorem 46.1 to Example 46.1.

Example 46.2 (Independence in linear regression) In Example 46.1, we noted that the MLEs of \(\alpha\) and \(\beta\) could be obtained by solving a least-squares problem: \[ (\hat\alpha, \hat\beta) = \underset{\alpha,\beta}{\arg\min}\ \left\| \vec Y - \begin{bmatrix} \vec 1 & \vec x \end{bmatrix} \begin{pmatrix}\alpha \\ \beta \end{pmatrix} \right\|^2 = \underset{\alpha,\beta}{\arg\min}\ \left\| \vec{Y} - \hat{\vec{\mu}} \right\|^2. \] The solution is obtained by projecting \(\vec Y\) onto \(\text{span}(\{ \vec 1, \vec x \})\). That is, \[ \hat{\vec \mu} = \underbrace{\begin{bmatrix} \vec 1 & \vec x \end{bmatrix}}_X \begin{pmatrix}\hat\alpha \\ \hat\beta \end{pmatrix} = P_X \vec Y, \] where \(P_X \overset{\text{def}}{=}X (X^\intercal X)^{-1} X^\intercal\) is the projection matrix onto \(C(X)\). Notice that \(P_X \vec Y\) is a vector representing the “normal” height for each child based on their age.
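As a sanity check, the sketch below confirms numerically that the least-squares fit coincides with \(P_X \vec Y\); the simulated ages, heights, and coefficients are hypothetical.

```python
# Check numerically that the least-squares fit equals P_X Y.
# The simulated ages, heights, and coefficients are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(5, 15, size=n)
Y = rng.normal(80 + 6 * x, 7)

X = np.column_stack([np.ones(n), x])
P_X = X @ np.linalg.inv(X.T @ X) @ X.T          # projection onto C(X)

coef, *_ = np.linalg.lstsq(X, Y, rcond=None)    # (alpha_hat, beta_hat)
print(np.allclose(X @ coef, P_X @ Y))           # True: mu_hat = P_X Y
```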

We also saw that the MLE of \(\sigma^2\) was \[ \hat\sigma^2 = \frac{1}{n} \left\| \vec Y - \hat{\vec\mu} \right\|^2, \] which can be written as \[ \frac{1}{n} \left\| (I - P_X)\vec Y \right\|^2. \] \((I - P_X)\vec Y\) is a vector representing how much each child differs from the norm for their age, and \(\hat\sigma^2\) is the average of these squared differences.

By Theorem 46.1, we know that \(P_X\vec Y\) and \((I - P_X)\vec Y\) are independent. Since \(\hat{\vec\mu} = P_X\vec Y\), while \(\hat\sigma^2\) is a function only of \((I - P_X)\vec Y\), they must be independent.

Finally, we make a simple but important observation. If \(P\) is a projection matrix, then so is \(I - P\): it is symmetric, and \[ (I - P)^2 = I - 2P + P^2 = I - 2P + P = I - P. \] In particular, \(P\) is the matrix of orthogonal projection onto \(C(P)\), and \(I - P\) is the matrix of orthogonal projection onto \(C(P)^\perp\), the orthogonal complement of \(C(P)\).

46.2 Length and the Chi-Square Distribution

In this section, we will focus on the case where the data \(Z_1, \dots, Z_n\) are i.i.d. standard normal. That is, \[ \vec Z \overset{\text{def}}{=}(Z_1, \dots, Z_n) \sim \textrm{MVN}(\vec 0, I). \tag{46.3}\] We will discuss how the results here extend to the more general case (Equation 46.1) at the end.

First, we determine the distribution of the squared length of such a random vector: \[ ||\vec Z||^2 = Z_1^2 + Z_2^2 + \dots + Z_n^2. \] In Section 39.3, we took this to be the definition of the chi-square distribution with \(n\) degrees of freedom, or \(\chi^2_n\), which we saw was equivalent to the \(\textrm{Gamma}(\alpha= \frac{n}{2}, \lambda= \frac{1}{2})\) distribution.
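A quick simulation confirms this equivalence; the dimension and number of replications below are arbitrary, and note that SciPy parameterizes the gamma distribution by its shape and scale \(1/\lambda\).

```python
# Check that ||Z||^2 matches the chi-square / Gamma(n/2, 1/2) distribution.
# SciPy's gamma uses shape a and scale = 1/lambda, so scale = 2 here.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 7, 100_000
Z = rng.standard_normal(size=(reps, n))
sq_len = np.sum(Z ** 2, axis=1)                  # ||Z||^2 for each replication

print(sq_len.mean(), stats.chi2(n).mean())       # both near n
# KS test against Gamma(n/2, lambda = 1/2); the p-value is typically not small.
print(stats.kstest(sq_len, stats.gamma(a=n / 2, scale=2).cdf))
```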

Now, suppose we multiply \(\vec Z\) by a diagonal matrix \(D\) of the following form: \[ D = \begin{bmatrix} I_{k \times k} \\ & 0_{(n-k) \times (n-k)} \end{bmatrix}. \tag{46.4}\] Then, \(D\) zeroes out the elements of \(\vec Z\) after the first \(k\) so \[ || D \vec Z ||^2 = Z_1^2 + \dots + Z_k^2 \sim \chi_k^2. \tag{46.5}\] This seemingly trivial observation is the basis for a more profound result. It turns out that if \(P\) is any projection matrix onto a subspace of dimension \(k\), then \[ ||P\vec Z||^2 \sim \chi^2_k. \] The argument proceeds as follows:

  1. We can rotate the axes so that in the new coordinate system, \(P\) is a diagonal matrix of the form Equation 46.4.
  2. The distribution of \(\vec Z\) does not change under rotation; it is still Equation 46.3 in this new coordinate system.
  3. In this new coordinate system, the distribution is clearly \(\chi_k^2\) by Equation 46.5.

We make this argument precise in Theorem 46.2. We start by establishing the second point, that the distribution of \(\vec Z\) does not change under rotation.

Lemma 46.1 (Rotating a multivariate normal) Let \(Q\) be any orthogonal matrix. That is, the columns of \(Q\) are an orthonormal basis of \(\mathbb{R}^n\) so that \(Q^\intercal Q = I\). Then, \(\vec Z' \overset{\text{def}}{=}Q\vec Z\) has the same distribution as \(\vec Z\). That is, \(\vec Z'\) is also a vector of i.i.d. standard normals.

Proof

The distribution of \(\vec Z' \overset{\text{def}}{=}Q\vec Z\) is also multivariate normal by Proposition 45.3. By Proposition 44.1, its mean vector is \[ \text{E}\!\left[ \vec Z' \right] = \text{E}\!\left[ Q\vec Z \right] = Q\text{E}\!\left[ \vec Z \right] = Q \vec 0 = \vec 0 \] and its covariance matrix is \[ \text{Var}\!\left[ \vec Z' \right] = \text{Var}\!\left[ Q\vec Z \right] = Q\underbrace{\text{Var}\!\left[ \vec Z \right]}_{I} Q^\intercal = QQ^\intercal = I, \] which together completely characterize a multivariate normal distribution. So \(\vec Z'\) has the same distribution as \(\vec Z\).

Here is the geometric intuition for this result. The PDF of \(\vec Z\) is \[ f_{\vec Z}(\vec z) = \frac{1}{(2\pi)^{n/2}} e^{-\frac{1}{2} \sum_{i=1}^n z_i^2}, \] which depends only on the distance from the origin, so it is spherically symmetric. On the other hand, the orthogonal matrix \(Q\) is simply a rotation about the origin. So multiplying by \(Q\) may change the value of \(\vec Z\), but it does not change its distribution.

Figure 46.1: Distribution of an i.i.d. standard normal vector in 2 dimensions. Each contour line represents points of equal density. Rotating a point around the origin does not change the density.
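The sketch below illustrates Lemma 46.1 by Monte Carlo; the rotation is a random orthogonal matrix obtained from a QR decomposition, which is an arbitrary choice.

```python
# Simulation sketch of Lemma 46.1: Q Z has the same distribution as Z.
# Q is a random orthogonal matrix (an arbitrary choice of rotation).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps = 3, 100_000
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthogonal: Q^T Q = I

Z = rng.standard_normal(size=(reps, n))
Zp = Z @ Q.T                                       # rows of Z' = Q Z

print(np.round(np.cov(Zp, rowvar=False), 2))       # approximately the identity
print(stats.kstest(Zp[:, 0], stats.norm.cdf))      # first coordinate ~ Normal(0, 1)
```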

Now we are ready to establish the main theorem, which is a special case of a more general result proved by the Scottish-American statistician Bill Cochran. Despite the theoretical nature of this result, Cochran was in fact the consummate applied statistician; for example, he served on the scientific advisory committee for the U.S. Surgeon General’s 1964 report that cigarette smoking causes lung cancer. Cochran chaired the Department of Biostatistics at Johns Hopkins before moving to Harvard, where he helped found the Department of Statistics.

William Cochran (1909-1980)

Theorem 46.2 (Cochran (1934)) Let \(P\) be a rank-\(k\) projection matrix; that is, it projects vectors onto a subspace of dimension \(k\). Then, \[ || P \vec Z ||^2 \sim \chi_k^2. \]

Proof

Since \(P\) projects onto a subspace of dimension \(k\), we can let \[\{ \vec q_1, \dots, \vec q_k \}\] be an orthonormal basis of this subspace, which exists by the Gram-Schmidt algorithm. Note that all of these vectors are eigenvectors of \(P\) with eigenvalue \(1\) because they are already in the subspace, so \[ P\vec q_j = \vec q_j; \qquad j=1, \dots, k. \]

We can extend this to an orthonormal basis of \(\mathbb{R}^n\). Applying Gram-Schmidt to \(\left\{ \vec{q_1}, \dots, \vec{q_k}, \vec{e_1}, \dots, \vec{e_n} \right\}\) yields \[ \left\{ \vec{q_1}, \dots, \vec{q_k}, \vec{q_{k+1}}, \dots, \vec{q_n} \right\}. \] Note that \(\left\{ \vec{q_{k+1}}, \dots, \vec{q_n} \right\}\) is an orthonormal basis of \(C(P)^\perp\). Hence,

\[ P\vec q_j = \vec 0; \qquad j=k+1, \dots, n. \]

Therefore, \(PQ = QD\), where \(Q\) is an orthogonal matrix consisting of the eigenvectors \[ Q \overset{\text{def}}{=}\begin{bmatrix} \vert & & \vert & \vert & & \vert \\ \vec q_1 & \dots & \vec q_k & \vec q_{k+1} & \dots & \vec q_n \\ \vert & & \vert & \vert & & \vert \end{bmatrix}, \] and \(D\) is a diagonal matrix consisting of the eigenvalues \[ D = \begin{bmatrix} I_{k \times k} \\ & 0_{(n-k) \times (n-k)} \end{bmatrix}. \] Equivalently, we can write \(P = QDQ^\intercal\), since \(Q^{-1} = Q^\intercal\).

Substituting this diagonalization into the above, we see that \[ || P\vec Z ||^2 = || (QDQ^\intercal)\vec Z ||^2 = \vec Z^\intercal (Q D \underbrace{Q^\intercal) (Q}_{I} D Q^\intercal) \vec Z = || D Q^\intercal \vec Z ||^2. \] But by Lemma 46.1, we know that \(Q^\intercal \vec Z\) has the same distribution as \(\vec Z\), so \(|| D Q^\intercal \vec Z||^2\) must have the same distribution as \(||D \vec Z ||^2 = Z_1^2 + \dots + Z_k^2\), which is \(\chi^2_k\).
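Here is a Monte Carlo sketch of the theorem; the subspace, dimensions, and number of replications are arbitrary illustrative choices.

```python
# Monte Carlo sketch of Theorem 46.2: ||P Z||^2 ~ chi^2_k when rank(P) = k.
# The subspace spanned by the columns of A is an arbitrary illustrative choice.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, k, reps = 6, 2, 100_000
A = rng.standard_normal((n, k))                    # a random k-dimensional subspace
P = A @ np.linalg.inv(A.T @ A) @ A.T               # rank-k projection matrix

Z = rng.standard_normal(size=(reps, n))
sq_len = np.sum((Z @ P.T) ** 2, axis=1)            # ||P Z||^2 for each draw

print(sq_len.mean(), sq_len.var())                 # near k and 2k
print(stats.kstest(sq_len, stats.chi2(k).cdf))     # consistent with chi^2_k
```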

Although it is usually unreasonable to model data as i.i.d. standard normal, data can often be reduced to standard normal random variables. The next result illustrates one such situation.

Theorem 46.3 (Distribution of the sample variance for normal data) Let \(X_1, \dots, X_n\) be i.i.d. \(\text{Normal}(\mu, \sigma^2)\), and let \(S^2\) be the sample variance. Then,

\[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} \]

Proof

First, we observe that the random variable in question can be written as \[\begin{align} \frac{(n-1) S^2}{\sigma^2} &= \frac{\sum_{i=1}^n (X_i - \bar X)^2}{\sigma^2} \\ &= \sum_{i=1}^n \left(\frac{X_i - \mu}{\sigma} - \frac{\bar X - \mu}{\sigma}\right)^2 \\ &= \sum_{i=1}^n (Z_i - \bar Z)^2, \end{align}\] where \(Z_1, \dots, Z_n\) are i.i.d. standard normal.

Let \(\vec Z \overset{\text{def}}{=}(Z_1, \dots, Z_n)\). Then, we can write \[ \sum_{i=1}^n (Z_i - \bar Z)^2 = || \vec Z - \vec 1 \bar Z ||^2 = ||(I - P_{\vec{1}}) \vec Z||^2. \tag{46.6}\]

Since \(P_{\vec{1}}\) projects onto a \(1\)-dimensional subspace, \((I - P_{\vec{1}})\) projects onto an \((n-1)\)-dimensional subspace. By Theorem 46.2, \(\left\| (I - P_{\vec{1}})\vec Z \right\|^2\) follows a \(\chi_{n-1}^2\) distribution.
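A short simulation sketch of Theorem 46.3 follows; the values of \(\mu\), \(\sigma\), \(n\), and the number of replications are arbitrary.

```python
# Simulation sketch of Theorem 46.3; mu, sigma, n, and reps are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, mu, sigma, reps = 10, 3.0, 2.0, 100_000
X = rng.normal(mu, sigma, size=(reps, n))
S2 = X.var(axis=1, ddof=1)                         # sample variance of each row

stat = (n - 1) * S2 / sigma ** 2
print(stats.kstest(stat, stats.chi2(n - 1).cdf))   # consistent with chi^2_{n-1}
```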

One consequence of Theorem 46.3 is that we can easily determine the bias and variance of \(S^2\) for normal data.

Example 46.3 (Bias and variance of the sample variance) In Chapter 38, we showed that \(S^2\) is unbiased for the variance \(\sigma^2\) for i.i.d. random variables from any distribution. We can verify this for i.i.d. normal random variables in particular using Theorem 46.3: \[ \text{E}\!\left[ S^2 \right] = \frac{\sigma^2}{n-1} \underbrace{\text{E}\!\left[ \chi_{n-1}^2 \right]}_{n-1} = \sigma^2, \] where we have abused notation by letting \(\chi_{n-1}^2\) denote a random variable with that distribution. The expectation of the \(\chi_{n-1}^2\) distribution follows from the fact that it is another name for the \(\textrm{Gamma}(\alpha= \frac{n-1}{2}, \lambda= \frac{1}{2})\) distribution, whose expectation we know to be \(\alpha/\lambda\).

Similarly, \[ \text{Var}\!\left[ S^2 \right] = \left(\frac{\sigma^2}{n-1}\right)^2 \underbrace{\text{Var}\!\left[ \chi_{n-1}^2 \right]}_{2(n-1)} = \frac{2\sigma^4}{n-1} \] because the variance of the gamma distribution is \(\alpha/\lambda^2\). However, this formula is only valid when the data are normal; there is no simple formula for the variance of \(S^2\) in general.
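Both formulas are easy to check by simulation; the sketch below mirrors the previous one, with the same arbitrary settings.

```python
# Check the bias and variance formulas for S^2 under normal data.
# The settings mirror the previous sketch and are arbitrary.
import numpy as np

rng = np.random.default_rng(7)
n, mu, sigma, reps = 10, 3.0, 2.0, 200_000
S2 = rng.normal(mu, sigma, size=(reps, n)).var(axis=1, ddof=1)

print(S2.mean(), sigma ** 2)                # E[S^2]   should be sigma^2
print(S2.var(), 2 * sigma ** 4 / (n - 1))   # Var[S^2] should be 2 sigma^4 / (n - 1)
```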

Finally, we will examine the sampling distribution of \(\hat\sigma^2\) from Example 46.1. Even though we showed in Example 46.2 that \[ \hat\sigma^2 = \frac{1}{n}|| (I - P_X)\vec Y ||^2, \] we cannot apply Theorem 46.2 directly because \(\vec Y\) does not have mean \(\vec 0\). However, this is only a minor setback, as we see next.

Corollary 46.1 (Alternative form of Cochran’s theorem) Let \(P\) be a rank-\(k\) projection matrix; that is, it projects vectors onto a subspace of dimension \(k\). Suppose \[ \vec W \sim \textrm{MVN}(\vec 0, P). \] Then, \[ || \vec W ||^2 \sim \chi^2_k. \]

Proof

Let \(\vec Z\) be defined as in Equation 46.3. Then, \(\vec W\) has the same distribution as \(P\vec Z\), since \[ P\vec Z \sim \textrm{MVN}(\vec\mu=\vec 0, \Sigma=\underbrace{PP^\intercal}_P), \] where we used the following properties of projection matrices to simplify the covariance matrix: \(P^\intercal = P\) and \(P^2 = P\). Therefore, \(||\vec W||^2\) has the same distribution as \(||P\vec Z||^2\), which is \(\chi^2_k\) by Theorem 46.2.

In other words, Corollary 46.1 says that the vector being projected need not have mean \(\vec 0\), as long as its projection has mean \(\vec 0\). As a result, Corollary 46.1 is applicable to more situations than Theorem 46.2, including \(\hat\sigma^2\) from Example 46.1.

Example 46.4 (Distribution of \(\hat\sigma^2\) in linear regression) Continuing Example 46.2, we first determine the distribution of \((I - P_X)\vec Y\). Since \[ \vec Y \sim \textrm{MVN}(\vec\mu = \vec 1 \alpha + \vec x \beta, \Sigma = \sigma^2 I), \] we have by Proposition 45.3 and Proposition 44.1 that \[ (I - P_X)\vec Y \sim \textrm{MVN}\big(\vec\mu = \underbrace{(I - P_X)(\vec 1 \alpha + \vec x \beta)}_{\vec 0}, \Sigma = \underbrace{(I - P_X) (\sigma^2 I) (I - P_X)^\intercal}_{\sigma^2 (I - P_X)}\big). \] To simplify the expressions, we used the fact that \(\vec 1 \alpha + \vec x \beta\) is already in the column span of \(X\), as well as properties of projection matrices.

Therefore, we see that \((I - P_X)\vec Y\) can be written as \(\sigma \vec W\), where \[ \vec W \sim \textrm{MVN}(\vec\mu=\vec 0, \Sigma=I - P_X), \] and since \(I - P_X\) is a projection matrix of rank \(n-2\), Corollary 46.1 applies to \(\vec W\).

Now, the MLE of \(\sigma^2\) can be written as \[ \hat\sigma^2 = \frac{1}{n} ||(I - P_X)\vec Y||^2 = \frac{\sigma^2}{n} ||\vec W||^2, \] so applying Corollary 46.1 to \(\vec W\) yields \[ \frac{n \hat\sigma^2}{\sigma^2} \sim \chi^2_{n-2}. \tag{46.7}\] In other words, \(\hat\sigma^2 \sim \textrm{Gamma}(\alpha= \frac{n-2}{2}, \lambda= \frac{n}{2\sigma^2})\).

Here is one useful application of Equation 46.7. Since we know the distribution, we can evaluate its expectation \[ \text{E}\!\left[ \hat\sigma^2 \right] = \frac{n-2}{n} \sigma^2 \tag{46.8}\] to see that the MLE is biased for \(\sigma^2\). However, Equation 46.8 also suggests a fix. We can obtain an unbiased estimator of the variance in linear regression by simply rescaling: \[ \hat\sigma^2_{\text{unbiased}} = \frac{n}{n-2} \hat\sigma^2 = \frac{1}{n-2} ||\vec Y - \hat{\vec\mu} ||^2. \]

This is the estimator of variance that is preferred in linear regression, and we now know its sampling distribution: \[ \frac{(n-2) \hat\sigma_{\text{unbiased}}^2}{\sigma^2} \sim \chi^2_{n-2}. \]
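A simulation sketch of these regression results follows; the ages, coefficients, noise level, and sample size are hypothetical choices.

```python
# Simulation sketch of Example 46.4: n * sigma2_hat / sigma^2 ~ chi^2_{n-2},
# and rescaling by n / (n - 2) removes the bias. All settings are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, alpha, beta, sigma, reps = 20, 80.0, 6.0, 7.0, 50_000
x = rng.uniform(5, 15, size=n)                      # fixed ages
X = np.column_stack([np.ones(n), x])
P_X = X @ np.linalg.inv(X.T @ X) @ X.T
mu = X @ np.array([alpha, beta])

Y = rng.normal(mu, sigma, size=(reps, n))           # each row is one data set
resid = Y @ (np.eye(n) - P_X).T                     # rows of (I - P_X) Y
sigma2_hat = np.sum(resid ** 2, axis=1) / n         # MLE for each data set

print(stats.kstest(n * sigma2_hat / sigma ** 2, stats.chi2(n - 2).cdf))
print((n / (n - 2) * sigma2_hat).mean(), sigma ** 2)   # unbiased version near sigma^2
```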

Notice that in Example 46.4, we were able to apply Corollary 46.1 to \((I - P_X)\vec Y\) only because it had mean \(\vec 0\). We cannot apply Corollary 46.1 to the complementary projection \(\hat{\vec\mu} = P_X \vec Y\), and in fact, the distribution of \(||\hat{\vec\mu}||^2\) has no simple description.

46.3 Exercises

Exercise 46.1 (Alternative proof of Theorem 46.3) We will prove Theorem 46.3 another way using moment generating functions. Let \(X_1, \dots, X_n\) be i.i.d. \(\text{Normal}(\mu, \sigma^2)\) and let \[ S^2 = \frac{1}{n-1} \sum_{i=1}^n \left( X_i - \bar{X} \right)^2. \]

  1. First, let \(Z_1, \dots, Z_n\) be i.i.d. \(\text{Normal}(0,1)\). Show that \[ \sum_{i=1}^n Z_i^2 = \sum_{i=1}^n \left( Z_i - \bar{Z} \right)^2 + n \bar{Z}^2. \]
  2. Next, write down the MGF of both sides of the above result to determine the MGF of \[ \sum_{i=1}^n \left( Z_i - \bar{Z} \right)^2. \] It may be helpful to recall that \(\bar{Z}\) is independent of \(\sum \left( Z_i - \bar{Z}\right)^2\). Use the MGF to determine its distribution.
  3. Finally, use location-scale transformations \(X_i = \mu + \sigma Z_i\) to conclude that \[ \frac{(n-1) S^2}{\sigma^2} \sim \chi_{n-1}^2. \]

Exercise 46.2 (Difference of means) Let \(X_1, \dots, X_m, Y_1, \dots, Y_n\) be i.i.d. \(\text{Normal}(\mu, \sigma^2)\).

  1. What is the distribution of \(\bar X - \bar Y\)?
  2. Let \(S_X^2\) be the sample variance of \(X_1, \dots, X_m\), and \(S_Y^2\) be the sample variance of \(Y_1, \dots, Y_n\). To estimate \(\sigma^2\), we could use either \(S_X^2\) or \(S_Y^2\), or better yet, we could use the pooled estimate of variance: \[ S_{\text{pooled}}^2 \overset{\text{def}}{=}\frac{(m-1) S_X^2 + (n-1) S_Y^2}{m+n-2}. \tag{46.9}\] Show that \(S_{\text{pooled}}^2\) is an unbiased estimator of \(\sigma^2\) and determine the exact distribution of \((m+n-2)\frac{S_{\text{pooled}}^2}{\sigma^2}\).
  3. Argue that \(S_{\text{pooled}}^2\) is independent of \(\bar X - \bar Y\).

Exercise 46.3 (Analysis of variance) Suppose the data are divided into \(K\) groups, with group \(i\) consisting of \(n_i\) i.i.d. observations \[ X_{i1}, X_{i2}, \dots, X_{i n_i} \sim \text{Normal}(\mu, \sigma^2). \]

We have assumed that every group has the same mean and variance, but what if we were not sure? One way to test this assumption is to compare the following:

  • the between-group sum of squares \(B = \sum_{i=1}^K n_i (\bar X_i - \bar X)^2\), where \(\bar X_i\) is the mean of group \(i\) and \(\bar X\) is the overall mean
  • the within-group sum of squares \(W = \sum_{i=1}^K \sum_{j=1}^{n_i} (X_{ij} - \bar X_i)^2\).

If \(B\) is large relative to \(W\), then the groups likely have different means. This is the idea behind a statistical technique called analysis of variance (or ANOVA, for short). In this exercise, you will derive some of the mathematics that underlie ANOVA.

Let \(\vec X = (X_{11}, \dots, X_{1 n_1}, \dots, X_{K1}, \dots, X_{K n_K})\) be the vector of all observations.

  1. Express \(B = || P_1 \vec X ||^2\) and \(W = || P_2 \vec X ||^2\) for appropriate projection matrices \(P_1\) and \(P_2\).
  2. Use the representation in (a) to derive the distributions of \(B\) and \(W\).
  3. Show that \(B\) and \(W\) are independent.