44  Multivariate Normal

In this chapter, we extend the normal distribution to random vectors. We already defined a bivariate normal distribution in Example 42.4. Now we will generalize this to any number of random variables.

Definition 44.1 (Multivariate normal distribution) A random vector \(\vec X = (X_1, \dots, X_k)\) is said to have a multivariate normal distribution if it can be expressed as \[ \vec X = \underset{k \times m}{A} \vec Z + \vec \mu, \tag{44.1}\] where \(\vec Z = (Z_1, \dots, Z_m)\) is a random vector of i.i.d. standard normal random variables, \(A\) is a \(k \times m\) matrix, and \(\vec\mu\) is a constant vector in \(\mathbb{R}^k\).

In other words, the definition of the multivariate normal distribution is exactly analogous to the definition of the general normal distribution in Definition 22.4.

The mean vector and covariance matrix are easily determined from Equation 44.1.

Proposition 44.1 (Multivariate normal expectation and variance) \[\begin{align} \text{E}\!\left[ \vec X \right] &= \vec\mu & \text{Var}\!\left[ \vec X \right] &= AA^\intercal \end{align}\]

Proof

Since \(Z_1, \dots, Z_m\) are i.i.d. standard normal, it follows that \[ \text{E}\!\left[ \vec Z \right] = \vec{0} \qquad \text{and} \qquad \text{Var}\!\left[ \vec Z \right] = I_{m \times m}. \] Hence, \[ \text{E}\!\left[ \vec X \right] = A \text{E}\!\left[ \vec Z \right] + \vec{\mu} = \vec{\mu} \] and \[ \text{Var}\!\left[ \vec X \right] = A \text{Var}\!\left[ \vec Z \right] A^\intercal = AA^\intercal \] by Proposition 43.1.
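To see Proposition 44.1 in action, here is a minimal simulation sketch in Python (NumPy); the particular matrix \(A\) and vector \(\vec\mu\) below are arbitrary illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: A is k x m (here 2 x 3) and mu is in R^k.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0]])
mu = np.array([1.0, -2.0])

# Simulate X = A Z + mu for many i.i.d. standard normal vectors Z.
n_sims = 200_000
Z = rng.standard_normal((3, n_sims))   # each column is one draw of Z
X = A @ Z + mu[:, None]                # each column is one draw of X

print(X.mean(axis=1))   # approximately mu
print(np.cov(X))        # approximately A A^T
print(A @ A.T)
```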

A multivariate normal random vector does not necessarily have a PDF. But it does when \(A\) is invertible, as we show next.

Proposition 44.2 (PDF of the multivariate normal distribution) Let \(\vec X\) be defined as in Equation 44.1, where \(A\) is a \(k \times k\) invertible matrix. Then, the PDF of \(\vec X\) is \[ f_{\vec X}(\vec x) = \frac{1}{(2\pi)^{k/2} |\det(A)|} e^{-\frac{1}{2} (\vec x - \vec\mu)^\intercal (AA^\intercal)^{-1} (\vec x - \vec\mu)}. \]

If we let \(\Sigma \overset{\text{def}}{=}AA^\intercal\) be the covariance matrix of \(\vec X\), then \[ f_{\vec X}(\vec x) = \frac{1}{(2\pi)^{k/2} \det(\Sigma)^{1/2}} e^{-\frac{1}{2} (\vec x - \vec\mu)^\intercal \Sigma^{-1} (\vec x - \vec\mu)}. \tag{44.2}\]

Note that this PDF is only well-defined when the covariance matrix \(\Sigma\) is invertible.

Proof

First, notice that the PDF of \(\vec Z\) is \[ f_{\vec Z}(\vec z) = \prod_{j=1}^k \frac{1}{\sqrt{2\pi}} e^{-z_j^2/2} = \frac{1}{(2\pi)^{k/2}} e^{-\frac{1}{2} \vec z^\intercal \vec z}. \]

Since \(\vec X\) is a one-to-one transformation of \(\vec Z\), we can determine its PDF using Jacobians, as discussed in Chapter 42. Notice that \(\vec Z = A^{-1}(\vec X - \vec\mu)\).

\[\begin{align} f_{\vec X}(\vec x) &= f_{\vec Z}(\vec z) \left|\det \left[\frac{\partial \vec z}{\partial \vec x} \right]\right| \\ &= f_{\vec Z}\big(A^{-1}(\vec x - \vec\mu)\big) \left|\det \left[\frac{\partial \vec z}{\partial \vec x} \right]\right| \\ &= \frac{1}{(2\pi)^{k/2}} e^{-\frac{1}{2} (A^{-1}(\vec x - \vec\mu))^\intercal (A^{-1}(\vec x - \vec\mu))} \left|\frac{1}{\det A}\right| \\ &= \frac{1}{(2\pi)^{k/2} |\det A|} e^{-\frac{1}{2} (\vec x - \vec\mu)^\intercal (AA^\intercal)^{-1} (\vec x - \vec\mu)}. \end{align}\]

We can obtain the final form (Equation 44.2) by substituting \(\Sigma\) for \(AA^\intercal\) and noting that \[ \det\Sigma = \det(AA^\intercal) = \det(A)\det(A^\intercal) = \det(A)^2. \]
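As a numerical sanity check on Proposition 44.2, the sketch below evaluates Equation 44.2 directly and compares it against SciPy's built-in multivariate normal PDF; the parameter values are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    # Evaluate Equation 44.2 directly.
    k = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** (k / 2) * np.sqrt(np.linalg.det(Sigma)))

# Hypothetical parameters.
mu = np.array([1.0, -2.0])
Sigma = np.array([[4.0, 1.0],
                  [1.0, 3.0]])
x = np.array([0.5, -1.0])

print(mvn_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should agree
```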

From Proposition 44.2, we see that the multivariate normal distribution can be parametrized by the mean vector \(\vec\mu\) and covariance matrix \(\Sigma\). We write \[ \vec X \sim \text{MultivariateNormal}(\vec\mu, \Sigma). \]

Check for yourself that Equation 44.2 reduces to the normal PDF when \(k=1\), with \(\sigma^2\) playing the role of \(\Sigma\).
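If you want a quick numerical spot check of this reduction, the following sketch compares the \(k = 1\) case of Equation 44.2 (via SciPy) against the univariate normal PDF; the numbers are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu, sigma = 1.5, 2.0   # hypothetical values
x = 0.7

print(multivariate_normal(mean=[mu], cov=[[sigma**2]]).pdf([x]))  # Equation 44.2 with k = 1
print(norm(loc=mu, scale=sigma).pdf(x))                           # univariate normal PDF
```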

Although \(A\) (and by extension, \(\Sigma\)) needs to be invertible in order for a multivariate normal to have a PDF, we will not assume that \(A\) is invertible in general.

44.1 Linear Combinations

Proposition 44.3 (Linear combinations of the multivariate normal) If \(\vec X = (X_1, \dots, X_k)\) is a multivariate normal random vector, then for any \(j \times k\) matrix \(C\) and \(j\)-vector \(\vec d\), \[ \vec Y = C \vec X + \vec d \] is also multivariate normal.

Proof

By definition, \(\vec X = A \vec Z + \vec \mu\) for some matrix \(A\) and vector \(\vec\mu\). We see that \[ \vec Y = C (A \vec Z + \vec \mu) + \vec d = (CA) \vec Z + (C\vec\mu + \vec d) \] is also of this form, so it is also multivariate normal.
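Here is a minimal simulation sketch of Proposition 44.3, checking that \(\vec Y = C\vec X + \vec d\) has mean \(C\vec\mu + \vec d\) and covariance \(C\Sigma C^\intercal\); the parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for X ~ MultivariateNormal(mu, Sigma).
mu = np.array([1.0, 0.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
C = np.array([[1.0, -1.0, 0.0],
              [0.0,  2.0, 1.0]])   # j = 2, k = 3
d = np.array([5.0, -5.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows are draws of X
Y = X @ C.T + d                                        # rows are draws of Y = C X + d

print(Y.mean(axis=0), C @ mu + d)     # mean vector: C mu + d
print(np.cov(Y, rowvar=False))        # covariance: C Sigma C^T
print(C @ Sigma @ C.T)
```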

Proposition 44.3 has many statistical applications. For example, it can be used to study the distribution of \(X_i - \bar X\).

Example 44.1 (Distribution of the residual) Let \(X_1, \dots, X_n\) be i.i.d. \(\text{Normal}(\mu, \sigma^2)\). Let \(\bar X\) be the sample mean. We will determine the distribution of the vector \[ \begin{pmatrix} X_1 - \bar X \\ X_2 - \bar X \\ \vdots \\ X_n - \bar X \end{pmatrix}, \] which is called the vector of residuals.

If we let \(\vec X \overset{\text{def}}{=}(X_1, \dots, X_n)\), then we can write the vector of residuals as \[ \vec X - \vec 1 \bar X = \vec X - \frac{\vec 1 \vec 1^\intercal}{n} \vec X = (I - P) \vec X, \] where \(P \overset{\text{def}}{=}\frac{\vec 1 \vec 1^\intercal}{n}\). As we will see in Section 45.1, \(P\) is a special kind of matrix called a projection matrix.

By Proposition 44.3, \((I - P) \vec X\) is multivariate normal, with mean vector \[ \text{E}\!\left[ (I - P)\vec X \right] = (I - P) \text{E}\!\left[ \vec X \right] = (I - P) \vec 1 \mu = \vec 0 \] and covariance matrix \[ \text{Var}\!\left[ (I - P)\vec X \right] = (I - P) \underbrace{\text{Var}\!\left[ \vec X \right]}_{\sigma^2 I} (I - P)^\intercal = \sigma^2 \underbrace{(I - P)(I - P)^\intercal}_{I - P} = \sigma^2 (I - P). \] The last step above is straightforward to verify by the definition of \(P\): \[ \begin{align} (I - P)(I - P)^\intercal &= \left(I - \frac{\vec 1 \vec 1^\intercal}{n} \right) \left(I - \frac{\vec 1 \vec 1^\intercal}{n} \right) \\ &= I - 2 \frac{\vec 1 \vec 1^\intercal}{n} + \frac{\vec 1 \overbrace{\vec 1^\intercal \vec 1}^n \vec 1^\intercal}{n^2} \\ &= I - \frac{\vec 1 \vec 1^\intercal}{n} \\ &= I - P, \end{align} \] but it is a more general property of all projection matrices \(P\).

Therefore, the vector of residuals has the following distribution: \[ \begin{pmatrix} X_1 - \bar X \\ X_2 - \bar X \\ \vdots \\ X_n - \bar X \end{pmatrix} \sim \text{MultivariateNormal}\left( \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \sigma^2\begin{bmatrix} 1 - \frac{1}{n} & -\frac{1}{n} & \dots & -\frac{1}{n} \\ -\frac{1}{n} & 1 - \frac{1}{n} & \dots & -\frac{1}{n} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{n} & -\frac{1}{n} & \dots & 1 - \frac{1}{n} \end{bmatrix} \right). \] In particular, although the \(X_i\)s are i.i.d., subtracting \(\bar X\) from all of them makes them correlated.
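A short simulation, under hypothetical values of \(n\), \(\mu\), and \(\sigma\), confirms that the sample covariance of the residual vector is close to \(\sigma^2(I - P)\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma = 5, 10.0, 2.0   # hypothetical values

X = rng.normal(mu, sigma, size=(200_000, n))    # each row is X_1, ..., X_n
residuals = X - X.mean(axis=1, keepdims=True)   # each row is X_i - X-bar

P = np.ones((n, n)) / n
print(np.cov(residuals, rowvar=False))          # approximately sigma^2 (I - P)
print(sigma**2 * (np.eye(n) - P))
```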

Here are some other properties of the multivariate normal distribution that follow from Proposition 44.3.

Corollary 44.1 (Subvectors of a multivariate normal are also multivariate normal) Let \(\vec X = (X_1, \dots, X_k)\) be \(\text{MultivariateNormal}(\vec\mu, \Sigma)\) and \(\vec X_m \overset{\text{def}}{=}(X_1, \dots, X_m)\) be a subvector of \(\vec X\) for \(m < k\). Then,

\[ \vec X_m \sim \text{MultivariateNormal}(\vec\mu_m, \Sigma_{m}), \]

where \(\vec\mu_m \overset{\text{def}}{=}(\mu_1, \dots, \mu_m)\) is a subvector of \(\vec\mu\) and \(\Sigma_{m} \overset{\text{def}}{=}\begin{pmatrix} \Sigma_{11} & \dots & \Sigma_{1m} \\ \vdots & \ddots & \vdots \\ \Sigma_{m1} & \dots & \Sigma_{mm} \end{pmatrix}\) is a submatrix of \(\Sigma\).

Although we assumed that the subvector \(\vec X_m\) was the first \(m\) elements of \(\vec X\), the same result holds for any subset of elements from \(\vec X\).

Proof

Choose \(C\) to be the \(m \times k\) matrix of the following form: \[ C = \begin{bmatrix} 1 & 0 & \dots & 0 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 & 0 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 & 0 & \dots & 0 \end{bmatrix}. \]

It is easy to see that \(\vec X_m = C\vec X\) by “picking out” the elements of \(\vec X\). Therefore, by Proposition 44.3, \(\vec X_m\) is also multivariate normal.

To determine the parameters of this multivariate normal, we turn to Proposition 43.1.
The mean vector of \(\vec X_m\) will be \(\vec \mu_m = C\vec\mu\) and its covariance matrix will be \(\Sigma_m = C\Sigma C^\intercal\). It is straightforward to check that these matrix operations simply “pick out” the corresponding elements from \(\vec\mu\) and \(\Sigma\).
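The sketch below illustrates Corollary 44.1 by simulation: the first two coordinates of a three-dimensional multivariate normal have the corresponding sub-mean vector and submatrix as their parameters. The parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for a 3-dimensional multivariate normal.
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.1],
                  [0.5, 1.0, 0.4],
                  [0.1, 0.4, 1.5]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)

# The subvector (X_1, X_2) should be MultivariateNormal(mu[:2], Sigma[:2, :2]).
X2 = X[:, :2]
print(X2.mean(axis=0), mu[:2])
print(np.cov(X2, rowvar=False))
print(Sigma[:2, :2])
```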

One important case of Proposition 44.3 is when \(C\) is a row vector, in which case we are taking linear combinations of the components of a multivariate normal.

Corollary 44.2 (Linear combination of the components of a multivariate normal) Let \(\vec X\) be a multivariate normal random vector. Then, for any linear combination of the components of \(\vec X\), the resulting random variable is normally distributed. That is, for any vector \(\vec c\), \[\vec c^\intercal \vec X = c_1 X_1 + \dots + c_k X_k \tag{44.3}\] follows a normal distribution. (Its expectation, \(\vec c^\intercal \text{E}\!\left[ \vec X \right]\), and variance, \(\vec c^\intercal \text{Var}\!\left[ \vec X \right] \vec c\), follow from Proposition 43.1.)

Many interesting probabilities can be expressed in terms of Equation 44.3, as the sketch below illustrates.
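Here is a minimal sketch (with hypothetical parameters) that computes \(P(\vec c^\intercal \vec X > a)\) two ways: by simulation and by applying Corollary 44.2 with mean \(\vec c^\intercal \vec\mu\) and variance \(\vec c^\intercal \Sigma \vec c\).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical parameters.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
c = np.array([3.0, -2.0])
a = 5.0

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ c                                                    # c . X for each draw

print((Y > a).mean())                                        # simulated P(c . X > a)
print(norm(loc=c @ mu, scale=np.sqrt(c @ Sigma @ c)).sf(a))  # Corollary 44.2
```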

44.2 Moment Generating Functions and Independence

Proposition 44.4 (Moment generating function of the multivariate normal) If \(\vec X \sim \text{MultivariateNormal}(\vec\mu, \Sigma)\), then the MGF is \[ M_{\vec X}(\vec t) = e^{\vec t \cdot \vec\mu + \frac{1}{2} \vec t^\intercal\Sigma \vec t}. \]

Proof

By definition, \[ M_{\vec X}(\vec t) \overset{\text{def}}{=}\text{E}\!\left[ e^{\vec t \cdot \vec X} \right]. \] By Corollary 44.2, \(Y = \vec t \cdot \vec X\) follows a normal distribution, with mean \[ \text{E}\!\left[ Y \right] = \vec t \cdot \vec \mu \] and variance \[\begin{align} \text{Var}\!\left[ Y \right] &= \text{Cov}\!\left[ \vec t \cdot \vec X, \vec t \cdot \vec X \right] \\ &= \text{Cov}\!\left[ \sum_{i=1}^k t_i X_i, \sum_{j=1}^k t_j X_j \right] \\ &= \sum_{i=1}^k \sum_{j=1}^k t_i t_j \underbrace{\text{Cov}\!\left[ X_i, X_j \right]}_{\Sigma_{ij}} \\ &= \vec t^\intercal \Sigma \vec t. \end{align}\]

To evaluate \(\text{E}\!\left[ e^Y \right]\), the simplest way is to evaluate the MGF of this normal distribution (Example 34.2) at \(1\), since \(\text{E}\!\left[ e^Y \right] = M_Y(1)\). Therefore, \[ M_{\vec X}(\vec t) = \text{E}\!\left[ e^Y \right] = e^{\vec t \cdot \vec \mu + \frac{1}{2} \vec t^\intercal \Sigma \vec t}. \]
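As a check on Proposition 44.4, the sketch below compares a Monte Carlo estimate of \(\text{E}\!\left[ e^{\vec t \cdot \vec X} \right]\) with the closed-form MGF; the parameters and the vector \(\vec t\) are arbitrary (and \(\vec t\) is kept small so the Monte Carlo average is stable).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters.
mu = np.array([0.5, -0.5])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
t = np.array([0.2, -0.1])

X = rng.multivariate_normal(mu, Sigma, size=1_000_000)
print(np.exp(X @ t).mean())                    # Monte Carlo estimate of E[e^{t . X}]
print(np.exp(t @ mu + 0.5 * t @ Sigma @ t))    # Proposition 44.4
```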

Because the MGF uniquely identifies a distribution, it can be used to prove remarkable results like the following. We know that if two random variables are independent, then their covariance is zero. In general, the converse is not true; see Example 15.2 for an example of two random variables that have covariance zero but are not independent. However, if two random variables within a multivariate normal have covariance zero, then they are independent.

Theorem 44.1 Let \(\vec Y = (\vec X_1, \vec X_2)\) be a multivariate normal random vector. Then, \(\vec X_1\) and \(\vec X_2\) are independent if and only if \(\text{Cov}\!\left[ \vec X_1, \vec X_2 \right] = 0\).

Proof

The “only if” direction is true for all random vectors; see Proposition 25.2 and Definition 43.2. What is unique to the multivariate normal distribution is the “if” direction.

First, we partition the mean vector and covariance matrix, just as we partitioned \(\vec Y\) into \(\vec X_1\) and \(\vec X_2\). \[\begin{align} \vec\mu &= \text{E}\!\left[ \vec Y \right] = \begin{pmatrix} \text{E}\!\left[ \vec X_1 \right] \\ \text{E}\!\left[ \vec X_2 \right] \end{pmatrix} = \begin{pmatrix} \vec{\mu_1} \\ \vec{\mu_2} \end{pmatrix} \\ \Sigma &= \text{Var}\!\left[ \vec Y \right] = \begin{bmatrix} \text{Var}\!\left[ \vec X_1 \right] & \text{Cov}\!\left[ \vec X_1, \vec X_2 \right] \\ \text{Cov}\!\left[ \vec X_1, \vec X_2 \right]^\intercal & \text{Var}\!\left[ \vec X_2 \right]\end{bmatrix} = \begin{bmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{bmatrix}, \end{align}\] where \(\Sigma_{ii} \overset{\text{def}}{=}\text{Var}\!\left[ \vec X_i \right]\).

Next, let’s figure out what the MGF of \(\vec Y\) would be if \(\vec X_1\) and \(\vec X_2\) were independent: \[\begin{align} M_{\vec Y}(\vec t) &= \text{E}\!\left[ e^{\vec t \cdot \vec Y} \right] \\ &= \text{E}\!\left[ e^{\vec t_1 \cdot \vec X_1 + \vec t_2 \cdot \vec X_2} \right] & (\vec t = (\vec t_1, \vec t_2)) \\ &= \text{E}\!\left[ e^{\vec t_1 \cdot \vec X_1} \right] \text{E}\!\left[ e^{\vec t_2 \cdot \vec X_2} \right] & \text{(by independence)} \\ &= e^{\vec t_1 \cdot \vec \mu_1 + \frac{1}{2} \vec t_1^\intercal \Sigma_{11} \vec t_1} e^{\vec t_2 \cdot \vec \mu_2 + \frac{1}{2} \vec t_2^\intercal \Sigma_{22} \vec t_2} & \text{(multivariate normal MGF)}. \end{align}\]

Now, let’s determine the MGF of \(\vec Y\) when the covariance is zero. We will see that it matches the MGF above, implying that \(\vec X_1\) and \(\vec X_2\) are independent. Recalling that \[\begin{align} M_{\vec Y}(\vec t) &= e^{\vec t \cdot \vec\mu + \frac{1}{2} \vec t^\intercal \Sigma \vec t}, \end{align}\] the exponent is \[\begin{align} \vec t \cdot \vec\mu + \frac{1}{2} \vec t^\intercal \Sigma \vec t &= \begin{pmatrix} \vec t_1 \\ \vec t_2 \end{pmatrix} \cdot \begin{pmatrix} \vec \mu_1 \\ \vec \mu_2 \end{pmatrix} + \frac{1}{2} \begin{pmatrix} \vec t_1 & \vec t_2 \end{pmatrix} \begin{bmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{bmatrix} \begin{pmatrix} \vec t_1 \\ \vec t_2 \end{pmatrix} \\ &= \vec t_1 \cdot \vec\mu_1 + \vec t_2 \cdot \vec\mu_2 + \frac{1}{2} \vec t_1^\intercal \Sigma_{11} \vec t_1 + \frac{1}{2} \vec t_2^\intercal \Sigma_{22} \vec t_2. \end{align}\]

Substituting this back into the MGF above and regrouping terms, we obtain \[\begin{align} M_{\vec Y}(\vec t) &= e^{\vec t_1 \cdot \vec\mu_1 + \frac{1}{2} \vec t_1^\intercal \Sigma_{11} \vec t_1} e^{ \vec t_2 \cdot \vec\mu_2 + \frac{1}{2} \vec t_2^\intercal \Sigma_{22} \vec t_2}, \end{align}\] which matches the MGF that we calculated above under the assumption of independence. Since the MGF uniquely determines the distribution, \(\vec X_1\) and \(\vec X_2\) must be independent when \(\text{Cov}\!\left[ \vec X_1, \vec X_2 \right] = 0\).

Theorem 44.1 is an incredibly powerful result. It says that we can check independence within a multivariate normal vector by simply calculating a covariance, which is usually much easier.
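Here is a small simulation sketch of Theorem 44.1: for jointly normal \(X_1\) and \(X_2\) with zero covariance, joint probabilities factor into products of marginal probabilities. The parameters and thresholds are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Jointly normal X1, X2 with zero covariance (hypothetical parameters).
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.0],
                  [0.0, 3.0]])

X = rng.multivariate_normal(mu, Sigma, size=1_000_000)
X1, X2 = X[:, 0], X[:, 1]

# Under independence, P(X1 > a, X2 > b) = P(X1 > a) P(X2 > b).
a, b = 1.0, 2.0
print(((X1 > a) & (X2 > b)).mean())
print((X1 > a).mean() * (X2 > b).mean())   # approximately equal
```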

Here is one fundamental and surprising application. If \(X_1, \dots, X_n\) are i.i.d. normal, then the sample mean \(\bar X\) and the sample variance \(S^2\) are actually independent.

Theorem 44.2 (Independence of the sample mean and variance for normal data) If \(X_1, \dots, X_n\) are i.i.d. \(\text{Normal}(\mu, \sigma^2)\) random variables, then \(\bar X\) and \(S^2\) are independent.

Proof

First, notice that a vector \(\vec X = (X_1, \dots, X_n)\) of i.i.d. normals is multivariate normal because it can be written as \[ \vec X = \sigma \vec Z + \mu \vec 1, \] where \(\vec Z\) is a vector of i.i.d. standard normal random variables and \(\vec 1\) is a vector of ones.

Next, we will consider the \((n+1)\)-dimensional random vector \[ \begin{pmatrix} \bar X \\ X_1 - \bar X \\ X_2 - \bar X \\ \vdots \\ X_n - \bar X \end{pmatrix} = \begin{bmatrix} \rule{2em}{0.5pt}& \frac{\vec 1^\intercal}{n} & \rule{2em}{0.5pt}\\ \\ & I - \frac{\vec 1 \vec 1^\intercal}{n} \\ \\ \\ \end{bmatrix} \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix}, \] which is also multivariate normal by Proposition 44.3.

Note that \(\bar X\) is the first element of this vector, and \(S^2\) is a function of the remaining elements, since the sample variance depends on the \(X_i\)s only through the residuals \(X_i - \bar X\). If we can show that the cross-covariance between the first element and the remaining elements is zero, then they must be independent by Theorem 44.1.

\[ \begin{align} \text{Cov}\!\left[ \begin{pmatrix} X_1 - \bar X \\ X_2 - \bar X \\ \vdots \\ X_n - \bar X \end{pmatrix}, \bar X \right] &= \text{Cov}\!\left[ \left(I - \frac{\vec 1 \vec 1^\intercal}{n}\right) \vec X, \frac{\vec 1^\intercal}{n} \vec X \right]\\ &= \left(I - \frac{\vec 1 \vec 1^\intercal}{n}\right) \underbrace{\text{Var}\!\left[ \vec X \right]}_{\sigma^2 I} \frac{\vec 1}{n} \\ &= \sigma^2 \left(\frac{\vec 1}{n}- \frac{\vec 1 \vec 1^\intercal}{n} \frac{\vec 1}{n} \right) \\ &= \vec 0. \end{align} \]

In the second equality above, we used properties of covariance for random vectors (Proposition 43.3).
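A simulation sketch (with hypothetical parameters) makes Theorem 44.2 tangible: across many samples of normal data, the correlation between \(\bar X\) and \(S^2\) is essentially zero, whereas for skewed data it is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma = 10, 3.0, 2.0   # hypothetical values
reps = 100_000

X = rng.normal(mu, sigma, size=(reps, n))
xbar = X.mean(axis=1)
s2 = X.var(axis=1, ddof=1)
print(np.corrcoef(xbar, s2)[0, 1])   # near 0 for normal data

# Contrast: exponential data, where the sample mean and variance are dependent.
E = rng.exponential(scale=1.0, size=(reps, n))
print(np.corrcoef(E.mean(axis=1), E.var(axis=1, ddof=1))[0, 1])   # clearly positive
```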

44.3 Multivariate Central Limit Theorem

The multivariate normal distribution is important in statistics for the same reason the normal distribution is. Just as the normal distribution arises as the sum of i.i.d. random variables, the multivariate normal distribution arises as the sum of i.i.d. random vectors.

Theorem 44.3 (Multivariate Central Limit Theorem) Let \(\vec X_1, \dots, \vec X_n\) be i.i.d. random vectors with \(\text{E}\!\left[ \vec X_1 \right] = \vec\mu\) and \(\text{Var}\!\left[ \vec X_1 \right] = \Sigma\). Then, \[ \frac{1}{\sqrt{n}} \left(\sum_{i=1}^n \vec X_i - n\vec \mu\right) \overset{d}{\to} \text{MultivariateNormal}(\vec 0, \Sigma), \tag{44.4}\] or equivalently, \[ \sqrt{n} \left(\vec{\bar X} - \vec\mu\right) \overset{d}{\to} \text{MultivariateNormal}(\vec 0, \Sigma). \tag{44.5}\]

We saw in the proof of Proposition 43.8 that a multinomial random vector \(\vec X\) can be expressed as a sum of i.i.d. multinoullis. Therefore, Theorem 44.3 applies to the multinomial distribution.

Example 44.2 (Multivariate normal approximation to the multinomial) Let \(\vec X \sim \text{Multinomial}(n, \vec p)\). What can we say about the approximate distribution of \(\vec X\) when \(n\) is large?

We showed in Proposition 43.8 that \(\vec X = \vec I_1 + \cdots + \vec I_n\), where \(\vec I_1, \dots, \vec I_n\) are i.i.d. with \[\begin{align} \text{E}\!\left[ \vec I_1 \right] &= \vec p \\ \text{Var}\!\left[ \vec I_1 \right] &= \text{diag}(\vec p) - \vec p \vec p^\intercal. \end{align}\]

Therefore, by Theorem 44.3, \[ \frac{1}{\sqrt{n}} (\vec X - n\vec p) \overset{d}{\to} \text{MultivariateNormal}(\vec 0, \text{diag}(\vec p) - \vec p \vec p^\intercal). \]

In other words, we can approximate \(\vec X\) as \[ \text{MultivariateNormal}\big(n\vec p, n(\text{diag}(\vec p) - \vec p \vec p^\intercal)\big). \]
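The following sketch checks this approximation empirically: the sample mean and covariance of many multinomial draws are compared against \(n\vec p\) and \(n(\text{diag}(\vec p) - \vec p \vec p^\intercal)\). The values of \(n\) and \(\vec p\) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters.
n = 1000
p = np.array([0.2, 0.3, 0.5])

X = rng.multinomial(n, p, size=100_000)   # rows are multinomial draws

print(X.mean(axis=0), n * p)               # mean: n p
print(np.cov(X, rowvar=False))             # covariance: n (diag(p) - p p^T)
print(n * (np.diag(p) - np.outer(p, p)))
```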

44.4 Exercises