44  Multivariate Normal

In this chapter, we extend the normal distribution to random vectors. We already encountered the bivariate normal distribution in Example 42.4, which is the two-dimensional generalization of the normal distribution. Now, we generalize the normal distribution to \(k\) dimensions.

44.1 Definition and Properties

Definition 44.1 (Multivariate normal distribution) A random vector \(\vec X = (X_1, \dots, X_k)\) is said to have a multivariate normal distribution if it can be expressed as \[ \vec X = \underset{k \times m}{A} \vec Z + \vec \mu, \tag{44.1}\] where \(\vec Z = (Z_1, \dots, Z_m)\) is a random vector of i.i.d. standard normal random variables, \(A\) is a \(k \times m\) matrix, and \(\vec\mu\) is a constant vector in \(\mathbb{R}^k\).

In other words, the definition of the multivariate normal distribution is exactly analogous to the definition of the general normal distribution in Definition 22.4.

The mean vector and covariance matrix are easily determined from Equation 44.1.

Proposition 44.1 (Multivariate normal expectation and variance) \[\begin{align} \text{E}\!\left[ \vec X \right] &= \vec\mu & \text{Var}\!\left[ \vec X \right] &= AA^\intercal \end{align}\]

Proof

Since \(Z_1, \dots, Z_m\) are i.i.d. standard normal, it follows that \[ \text{E}\!\left[ \vec Z \right] = \vec{0} \qquad \text{and} \qquad \text{Var}\!\left[ \vec Z \right] = I_{m \times m}. \] Hence, \[ \text{E}\!\left[ \vec X \right] = A \text{E}\!\left[ \vec Z \right] + \vec{\mu} = \vec{\mu} \] and \[ \text{Var}\!\left[ \vec X \right] = A \text{Var}\!\left[ \vec Z \right] A^\intercal = AA^\intercal \] by Proposition 43.1.
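
To make Proposition 44.1 concrete, here is a minimal simulation sketch in Python (assuming NumPy is available; the particular \(A\) and \(\vec\mu\) are arbitrary choices for illustration) that estimates the mean vector and covariance matrix of \(\vec X = A\vec Z + \vec\mu\) and compares them to \(\vec\mu\) and \(AA^\intercal\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative choices of A (k x m) and mu (k-vector).
A = np.array([[1.0, 0.5, 0.0],
              [0.3, 2.0, 1.0]])          # k = 2, m = 3
mu = np.array([1.0, -2.0])

# Simulate many draws of X = A Z + mu, where Z has i.i.d. standard normal entries.
n_sims = 100_000
Z = rng.standard_normal((3, n_sims))      # each column is one draw of Z
X = A @ Z + mu[:, None]                   # each column is one draw of X

print("sample mean:      ", X.mean(axis=1))       # should be close to mu
print("sample covariance:\n", np.cov(X))          # should be close to A @ A.T
print("A A^T:\n", A @ A.T)
```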

A multivariate normal random vector does not necessarily have a PDF. But it does when \(A\) is invertible, as we show next.

Proposition 44.2 (PDF of the multivariate normal distribution) Let \(\vec X\) be defined as in Equation 44.1, where \(A\) is a \(k \times k\) invertible matrix. Then, the PDF of \(\vec X\) is \[ f_{\vec X}(\vec x) = \frac{1}{(2\pi)^{k/2} |\det(A)|} e^{-\frac{1}{2} (\vec x - \vec\mu)^\intercal (AA^\intercal)^{-1} (\vec x - \vec\mu)}. \]

If we let \(\Sigma \overset{\text{def}}{=}AA^\intercal\) be the covariance matrix of \(\vec X\), then \[ f_{\vec X}(\vec x) = \frac{1}{(2\pi)^{k/2} \det(\Sigma)^{1/2}} e^{-\frac{1}{2} (\vec x - \vec\mu)^\intercal \Sigma^{-1} (\vec x - \vec\mu)}. \tag{44.2}\]

Note that this PDF is only well-defined when the covariance matrix \(\Sigma\) is invertible.

Proof

First, notice that the PDF of \(\vec Z\) is \[ f_{\vec Z}(\vec z) = \prod_{j=1}^k \frac{1}{\sqrt{2\pi}} e^{-z_j^2/2} = \frac{1}{(2\pi)^{k/2}} e^{-\frac{1}{2} \vec z^\intercal \vec z}. \]

Since \(\vec X\) is a one-to-one transformation of \(\vec Z\), we can determine its PDF using Jacobians, as discussed in Chapter 42. Notice that \(\vec Z = A^{-1}(\vec X - \vec\mu)\).

\[\begin{align} f_{\vec X}(\vec x) &= f_{\vec Z}(\vec z) \left|\det \left[\frac{\partial \vec z}{\partial \vec x} \right]\right| \\ &= f_{\vec Z}\big(A^{-1}(\vec x - \vec\mu)\big) \left|\det \left[\frac{\partial \vec z}{\partial \vec x} \right]\right| \\ &= \frac{1}{(2\pi)^{k/2}} e^{-\frac{1}{2}(A^{-1}(\vec x - \vec\mu))^\intercal (A^{-1}(\vec x - \vec\mu))} \left|\frac{1}{\det A}\right| \\ &= \frac{1}{(2\pi)^{k/2} |\det A|} e^{-\frac{1}{2}(\vec x - \vec\mu)^\intercal (AA^\intercal)^{-1} (\vec x - \vec\mu)}. \end{align}\]

We can obtain the final form (Equation 44.2) by substituting \(\Sigma\) for \(AA^\intercal\) and noting that \[ \det\Sigma = \det(AA^\intercal) = \det(A)\det(A^\intercal) = \det(A)^2. \]

Proposition 44.2 suggests that the multivariate normal distribution can be parametrized by the mean vector \(\vec\mu\) and covariance matrix \(\Sigma\). We write \[ \vec X \sim \textrm{MVN}(\vec\mu, \Sigma). \]
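
As a quick numerical check of Equation 44.2, the sketch below (assuming SciPy is available; the particular \(A\) and \(\vec\mu\) are arbitrary) evaluates the formula directly and compares it against scipy.stats.multivariate_normal.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary invertible A and mean vector, purely for illustration.
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
mu = np.array([1.0, -1.0])
Sigma = A @ A.T                # covariance matrix of X = A Z + mu

def mvn_pdf(x, mu, Sigma):
    """Evaluate the PDF in Equation 44.2 directly."""
    k = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)           # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (k / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

x = np.array([0.5, 0.3])
print(mvn_pdf(x, mu, Sigma))                             # formula from Equation 44.2
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))    # SciPy; should agree
```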

Check for yourself that Equation 44.2 reduces to the normal PDF when \(k=1\), with \(\sigma^2\) playing the role of \(\Sigma\).

Although \(A\) (and by extension, \(\Sigma\)) must be invertible for a multivariate normal to have a PDF, we do not assume that \(A\) is invertible in general.

44.2 Linear Combinations

Proposition 44.3 (Linear combinations of the multivariate normal) If \(\vec X = (X_1, \dots, X_k)\) is a multivariate normal random vector, then for any \(j \times k\) matrix \(C\) and \(j\)-vector \(\vec d\), \[ \vec Y = C \vec X + \vec d \] is also multivariate normal.

Proof

By definition, \(\vec X = A \vec Z + \vec \mu\) for some matrix \(A\) and vector \(\vec\mu\). We see that \[ \vec Y = C (A \vec Z + \vec \mu) + \vec d = (CA) \vec Z + (C\vec\mu + \vec d) \] can also be expressed in such a form, with the matrix \(CA\) and the vector \(C\vec\mu + \vec d\), so it is also multivariate normal.

Proposition 44.3 has many applications to statistics. For example, it can be used to study the distribution of \(X_i - \bar X\), a quantity known as the residual.

Example 44.1 (Distribution of the residual) Let \(X_1, \dots, X_n\) be i.i.d. \(\text{Normal}(\mu, \sigma^2)\). Let \(\bar X\) be the sample mean. We will determine the distribution of the vector \[ \begin{pmatrix} X_1 - \bar X \\ X_2 - \bar X \\ \vdots \\ X_n - \bar X \end{pmatrix}, \] which is called the vector of residuals.

If we let \(\vec X \overset{\text{def}}{=}(X_1, \dots, X_n)\), then we can write the vector of residuals as \[ \vec X - \vec 1 \bar X = \vec X - \frac{\vec 1 \vec 1^\intercal}{n} \vec X = (I - P) \vec X, \] where \(P \overset{\text{def}}{=}\frac{\vec 1 \vec 1^\intercal}{n}\). As we will see in Section 45.1, \(P\) is a special kind of matrix called a projection matrix.

By Proposition 44.3, \((I - P) \vec X\) is multivariate normal, with mean vector \[ \text{E}\!\left[ (I - P)\vec X \right] = (I - P) \text{E}\!\left[ \vec X \right] = (I - P) \vec 1 \mu = \vec 0 \] and covariance matrix \[ \text{Var}\!\left[ (I - P)\vec X \right] = (I - P) \underbrace{\text{Var}\!\left[ \vec X \right]}_{\sigma^2 I} (I - P)^\intercal = \sigma^2 \underbrace{(I - P)(I - P)^\intercal}_{I - P} = \sigma^2 (I - P). \] The last step above can be verified directly from the definition of \(P\): \[ \begin{align} (I - P)(I - P)^\intercal &= \left(I - \frac{\vec 1 \vec 1^\intercal}{n} \right) \left(I - \frac{\vec 1 \vec 1^\intercal}{n} \right) \\ &= I - 2 \frac{\vec 1 \vec 1^\intercal}{n} + \frac{\vec 1 \overbrace{\vec 1^\intercal \vec 1}^n \vec 1^\intercal}{n^2} \\ &= I - \frac{\vec 1 \vec 1^\intercal}{n} \\ &= I - P, \end{align} \] but it is a more general property of all projection matrices \(P\).

Therefore, the vector of residuals has the following distribution: \[ \begin{pmatrix} X_1 - \bar X \\ X_2 - \bar X \\ \vdots \\ X_n - \bar X \end{pmatrix} \sim \textrm{MVN}\left( \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \sigma^2\begin{bmatrix} 1 - \frac{1}{n} & -\frac{1}{n} & \dots & -\frac{1}{n} \\ -\frac{1}{n} & 1 - \frac{1}{n} & \dots & -\frac{1}{n} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{n} & -\frac{1}{n} & \dots & 1 - \frac{1}{n} \end{bmatrix} \right). \] In particular, although the \(X_i\)s are i.i.d., subtracting \(\bar X\) from all of them makes them negatively correlated.
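
Here is a minimal simulation sketch (NumPy; the values of \(n\), \(\mu\), and \(\sigma\) are illustrative) that estimates the covariance matrix of the residual vector and compares it to \(\sigma^2(I - P)\).

```python
import numpy as np

rng = np.random.default_rng(0)

n, mu, sigma = 5, 3.0, 2.0                     # illustrative values
n_sims = 200_000

X = rng.normal(mu, sigma, size=(n_sims, n))    # each row: X_1, ..., X_n
residuals = X - X.mean(axis=1, keepdims=True)  # each row: X_i - X_bar

P = np.ones((n, n)) / n                        # projection matrix 1 1^T / n
print("sample covariance of residuals:\n", np.cov(residuals, rowvar=False))
print("sigma^2 (I - P):\n", sigma**2 * (np.eye(n) - P))
```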

Here are some other properties of the multivariate normal distribution that follow from Proposition 44.3.

Corollary 44.1 (Subvectors of a multivariate normal are also multivariate normal) Let \(\vec X = (X_1, \dots, X_k)\) be \(\textrm{MVN}(\vec\mu, \Sigma)\) and \(\vec X_m \overset{\text{def}}{=}(X_1, \dots, X_m)\) be a subvector of \(\vec X\) for \(m < k\). Then,

\[ \vec X_m \sim \textrm{MVN}(\vec\mu_m, \Sigma_{m}), \]

where \(\vec\mu_m \overset{\text{def}}{=}(\mu_1, \dots, \mu_m)\) is a subvector of \(\vec\mu\) and \(\Sigma_{m} \overset{\text{def}}{=}\begin{pmatrix} \Sigma_{11} & \dots & \Sigma_{1m} \\ \vdots & \ddots & \vdots \\ \Sigma_{m1} & \dots & \Sigma_{mm} \end{pmatrix}\) is a submatrix of \(\Sigma\).

Although we assumed that the subvector \(\vec X_m\) was the first \(m\) elements of \(\vec X\), the same result holds for any subset of elements from \(\vec X\).

Proof

Choose \(C\) to be the \(m \times k\) matrix of the following form: \[ C = \begin{bmatrix} 1 & 0 & \dots & 0 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 & 0 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 & 0 & \dots & 0 \end{bmatrix}. \]

It is easy to see that \(\vec X_m = C\vec X\), since \(C\) simply “picks out” the first \(m\) elements of \(\vec X\). Therefore, by Proposition 44.3, \(\vec X_m\) is also multivariate normal.

To determine the parameters of this multivariate normal, we cite Proposition 43.1. The mean vector of \(\vec X_m\) will be \(\vec \mu_m = C\vec\mu\), and its covariance matrix will be \(\Sigma_m = C\Sigma C^\intercal\). It is straightforward to check that these matrix operations simply “pick out” the corresponding elements from \(\vec\mu\) and \(\Sigma\).

One important case of Proposition 44.3 is when \(C\) is a row vector, in which case we are taking linear combinations of the components of a multivariate normal.

Corollary 44.2 (Linear combination of the components of a multivariate normal) Let \(\vec X\) be a multivariate normal random vector. Then, any linear combination of the components of \(\vec X\) is normally distributed. That is, for any vector \(\vec c\), \[\vec c^\intercal \vec X = c_1 X_1 + \dots + c_k X_k \tag{44.3}\] follows a normal distribution. (If \(\vec X\) has mean vector \(\vec\mu\) and covariance matrix \(\Sigma\), then the expectation \(\vec c \cdot \vec\mu\) and variance \(\vec c^\intercal \Sigma \vec c\) of this linear combination follow from Proposition 43.1.)

Many interesting probabilities can be expressed in terms of Equation 44.3, which reduces them to calculations with a single normal distribution.
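
For instance, the probability that a linear combination \(\vec c \cdot \vec X\) exceeds a threshold reduces to a single normal CDF evaluation. The sketch below (NumPy/SciPy; the parameters \(\vec\mu\), \(\Sigma\), \(\vec c\), and the threshold \(a\) are arbitrary choices for illustration) compares this calculation to a Monte Carlo estimate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Arbitrary parameters for illustration.
mu = np.array([1.0, 0.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
c = np.array([1.0, -2.0, 1.0])
a = 2.0

# By Corollary 44.2, c . X ~ Normal(c . mu, c^T Sigma c).
mean_Y = c @ mu
var_Y = c @ Sigma @ c
exact = 1 - norm.cdf(a, loc=mean_Y, scale=np.sqrt(var_Y))

# Monte Carlo check.
X = rng.multivariate_normal(mu, Sigma, size=200_000)
mc = np.mean(X @ c > a)

print(exact, mc)   # should be close
```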

44.3 Moment Generating Function and Independence

Proposition 44.4 (Moment generating function of the multivariate normal) If \(\vec X \sim \textrm{MVN}(\vec\mu, \Sigma)\), then the MGF is \[ M_{\vec X}(\vec t) = e^{\vec t \cdot \vec\mu + \frac{1}{2} \vec t^\intercal\Sigma \vec t}. \]

Proof

By definition, \[ M_{\vec X}(\vec t) \overset{\text{def}}{=}\text{E}\!\left[ e^{\vec t \cdot \vec X} \right]. \] By Corollary 44.2, \(Y = \vec t \cdot \vec X\) follows a normal distribution, with mean \[ \text{E}\!\left[ Y \right] = \vec t \cdot \vec \mu \] and variance \[\begin{align} \text{Var}\!\left[ Y \right] &= \text{Cov}\!\left[ \vec t \cdot \vec X, \vec t \cdot \vec X \right] \\ &= \text{Cov}\!\left[ \sum_{i=1}^k t_i X_i, \sum_{j=1}^k t_j X_j \right] \\ &= \sum_{i=1}^k \sum_{j=1}^k t_i t_j \underbrace{\text{Cov}\!\left[ X_i, X_j \right]}_{\Sigma_{ij}} \\ &= \vec t^\intercal \Sigma \vec t. \end{align}\]

To evaluate \(\text{E}\!\left[ e^Y \right]\), the simplest way is to plug \(1\) into the MGF of the normal distribution (Example 34.2), which gives \(\text{E}\!\left[ e^Y \right] = M_Y(1) = e^{\text{E}\left[ Y \right] + \frac{1}{2}\text{Var}\left[ Y \right]}\). Therefore, \[ M_{\vec X}(\vec t) = \text{E}\!\left[ e^Y \right] = e^{\vec t \cdot \vec \mu + \frac{1}{2} \vec t^\intercal \Sigma \vec t}. \]
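
As a sanity check on Proposition 44.4, the following sketch (NumPy, with arbitrary illustrative \(\vec\mu\), \(\Sigma\), and \(\vec t\)) compares a Monte Carlo estimate of \(\text{E}\!\left[ e^{\vec t \cdot \vec X} \right]\) with the closed-form MGF.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary parameters for illustration.
mu = np.array([0.5, -0.5])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])
t = np.array([0.3, -0.2])

# Closed form: exp(t . mu + (1/2) t^T Sigma t).
mgf_formula = np.exp(t @ mu + 0.5 * t @ Sigma @ t)

# Monte Carlo estimate of E[exp(t . X)].
X = rng.multivariate_normal(mu, Sigma, size=500_000)
mgf_mc = np.mean(np.exp(X @ t))

print(mgf_formula, mgf_mc)   # should be close
```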

Because the MGF uniquely identifies a distribution, it can be used to prove remarkable results, such as the one below. We know from Proposition 25.2 that if two random variables are independent, then their covariance is zero. The converse is not true in general; in Example 15.2, we encountered two random variables that had zero covariance but were not independent. However, if two random variables within a multivariate normal have covariance zero, then they are independent.

Theorem 44.1 Let \(\vec Y = (\vec X_1, \vec X_2)\) be a multivariate normal random vector. Then, \(\vec X_1\) and \(\vec X_2\) are independent if and only if \(\text{Cov}\!\left[ \vec X_1, \vec X_2 \right] = 0\).

Proof

The “only if” direction is true for all random vectors; see Proposition 25.2 and Definition 43.2. What is unique to the multivariate normal distribution is the converse.

To prove the converse, suppose \(\text{Cov}\!\left[ \vec X_1, \vec X_2 \right] = 0\). We partition the mean vector and covariance matrix in the same way we partitioned \(\vec Y\) into \(\vec X_1\) and \(\vec X_2\): \[\begin{align} \vec\mu &= \text{E}\!\left[ \vec Y \right] = \begin{pmatrix} \text{E}\!\left[ \vec X_1 \right] \\ \text{E}\!\left[ \vec X_2 \right] \end{pmatrix} = \begin{pmatrix} \vec{\mu_1} \\ \vec{\mu_2} \end{pmatrix} \\ \Sigma &= \text{Var}\!\left[ \vec Y \right] = \begin{bmatrix} \text{Var}\!\left[ \vec X_1 \right] & \text{Cov}\!\left[ \vec X_1, \vec X_2 \right] \\ \text{Cov}\!\left[ \vec X_1, \vec X_2 \right]^\intercal & \text{Var}\!\left[ \vec X_2 \right]\end{bmatrix} = \begin{bmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{bmatrix}, \end{align}\] where \(\Sigma_{ii} \overset{\text{def}}{=}\text{Var}\!\left[ \vec X_i \right]\).

First, let’s figure out what the MGF of \(\vec Y\) would be if \(\vec X_1\) and \(\vec X_2\) were independent: \[\begin{align} M_{\vec Y}(\vec t) &= \text{E}\!\left[ e^{\vec t \cdot \vec Y} \right] \\ &= \text{E}\!\left[ e^{\vec t_1 \cdot \vec X_1 + \vec t_2 \cdot \vec X_2} \right] & (\vec t = (\vec t_1, \vec t_2)) \\ &= \text{E}\!\left[ e^{\vec t_1 \cdot \vec X_1} \right] \text{E}\!\left[ e^{\vec t_2 \cdot \vec X_2} \right] & \text{(by independence)} \\ &= e^{\vec t_1 \cdot \vec \mu_1 + \frac{1}{2} \vec t_1^\intercal \Sigma_{11} \vec t_1} e^{\vec t_2 \cdot \vec \mu_2 + \frac{1}{2} \vec t_2^\intercal \Sigma_{22} \vec t_2} & \text{(multivariate normal MGF)}. \end{align}\]

Now, let’s determine the MGF of \(\vec Y\) when the covariance is zero. We will see that it matches the MGF above, implying that \(\vec X_1\) and \(\vec X_2\) are independent. Recalling that \[\begin{align} M_{\vec Y}(\vec t) &= e^{\vec t \cdot \vec\mu + \frac{1}{2} \vec t^\intercal \Sigma \vec t}, \end{align}\] the exponent is \[\begin{align} \vec t \cdot \vec\mu + \frac{1}{2} \vec t^\intercal \Sigma \vec t &= \begin{pmatrix} \vec t_1 \\ \vec t_2 \end{pmatrix} \cdot \begin{pmatrix} \vec \mu_1 \\ \vec \mu_2 \end{pmatrix} + \frac{1}{2} \begin{pmatrix} \vec t_1 & \vec t_2 \end{pmatrix} \begin{bmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{bmatrix} \begin{pmatrix} \vec t_1 \\ \vec t_2 \end{pmatrix} \\ &= \vec t_1 \cdot \vec\mu_1 + \vec t_2 \cdot \vec\mu_2 + \frac{1}{2} \vec t_1^\intercal \Sigma_{11} \vec t_1 + \frac{1}{2} \vec t_2^\intercal \Sigma_{22} \vec t_2. \end{align}\]

Substituting this back into the MGF above and regrouping terms, we obtain \[\begin{align} M_{\vec Y}(\vec t) &= e^{\vec t_1 \cdot \vec\mu_1 + \frac{1}{2} \vec t_1^\intercal \Sigma_{11} \vec t_1} e^{ \vec t_2 \cdot \vec\mu_2 + \frac{1}{2} \vec t_2^\intercal \Sigma_{22} \vec t_2}, \end{align}\] which matches the MGF that we calculated above under the assumption of independence. Since the MGF uniquely determines the distribution, \(\vec X_1\) and \(\vec X_2\) must be independent when \(\text{Cov}\!\left[ \vec X_1, \vec X_2 \right] = 0\).

Theorem 44.1 is an incredibly useful result. It says that we can check independence within a multivariate normal vector by simply calculating a covariance, which is usually much easier.

Here is one fundamental and surprising application. If \(X_1, \dots, X_n\) are i.i.d. normal, then the sample mean \(\bar X\) and the sample variance \(S^2\) are actually independent.

Theorem 44.2 (Independence of the sample mean and variance for normal data) If \(X_1, \dots, X_n\) are i.i.d. \(\text{Normal}(\mu, \sigma^2)\) random variables, then \(\bar X\) and \(S^2\) are independent.

Proof

First, notice that a vector \(\vec X = (X_1, \dots, X_n)\) of i.i.d. normals is multivariate normal because it can be written as \[ \vec X = \sigma \vec Z + \mu \vec 1, \] where \(\vec Z\) is a vector of i.i.d. standard normal random variables and \(\vec 1\) is a vector of ones.

Next, we will consider the \((n+1)\)-dimensional random vector \[ \begin{pmatrix} \bar X \\ X_1 - \bar X \\ X_2 - \bar X \\ \vdots \\ X_n - \bar X \end{pmatrix} = \begin{bmatrix} \rule{2em}{0.5pt}& \frac{\vec 1^\intercal}{n} & \rule{2em}{0.5pt}\\ \\ & I - \frac{\vec 1 \vec 1^\intercal}{n} \\ \\ \\ \end{bmatrix} \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix}, \] which is also multivariate normal by Proposition 44.3.

Note that \(\bar X\) is the first element of this vector, and \(S^2\) depends on the remaining elements. If we can show that the cross-covariance between the first element and the remaining elements is zero, then they must be independent by Theorem 44.1.

\[ \begin{align} \text{Cov}\!\left[ \begin{pmatrix} X_1 - \bar X \\ X_2 - \bar X \\ \vdots \\ X_n - \bar X \end{pmatrix}, \bar X \right] &= \text{Cov}\!\left[ \left(I - \frac{\vec 1 \vec 1^\intercal}{n}\right) \vec X, \frac{\vec 1^\intercal}{n} \vec X \right]\\ &= \left(I - \frac{\vec 1 \vec 1^\intercal}{n}\right) \underbrace{\text{Var}\!\left[ \vec X \right]}_{\sigma^2 I} \frac{\vec 1}{n} \\ &= \sigma^2 \left(\frac{\vec 1}{n}- \frac{\vec 1 \vec 1^\intercal}{n} \frac{\vec 1}{n} \right) \\ &= \vec 0. \end{align} \]

In the second equality above, we used properties of covariance for random vectors (Proposition 43.3).
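
Theorem 44.2 is easy to probe by simulation. The sketch below (NumPy, with illustrative \(n\), \(\mu\), and \(\sigma\)) estimates the correlation between \(\bar X\) and \(S^2\) across many simulated samples; for normal data it should be near zero, whereas for a skewed distribution it is clearly not. (Zero correlation alone does not establish independence, but the simulation is consistent with the theorem.)

```python
import numpy as np

rng = np.random.default_rng(0)

n, mu, sigma = 10, 1.0, 2.0               # illustrative values
n_sims = 100_000

X = rng.normal(mu, sigma, size=(n_sims, n))
xbar = X.mean(axis=1)                     # sample mean of each simulated sample
s2 = X.var(axis=1, ddof=1)                # sample variance (divides by n - 1)

print(np.corrcoef(xbar, s2)[0, 1])        # close to 0 for normal data

# Contrast: for skewed (e.g., exponential) data, the correlation is clearly nonzero.
Y = rng.exponential(scale=1.0, size=(n_sims, n))
print(np.corrcoef(Y.mean(axis=1), Y.var(axis=1, ddof=1))[0, 1])
```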

44.4 Multivariate Central Limit Theorem

The multivariate normal distribution is important in statistics for the same reason the normal distribution is. Just as the normal distribution arises from the sum of i.i.d. random variables by the Central Limit Theorem (Theorem 36.2), the multivariate normal distribution arises from the sum of i.i.d. random vectors.

Theorem 44.3 (Multivariate Central Limit Theorem) Let \(\vec X_1, \dots, \vec X_n\) be i.i.d. random vectors with \(\text{E}\!\left[ \vec X_1 \right] = \vec\mu\) and \(\text{Var}\!\left[ \vec X_1 \right] = \Sigma\). Then, \[ \frac{1}{\sqrt{n}} \left(\sum_{i=1}^n \vec X_i - n\vec \mu\right) \overset{d}{\to} \textrm{MVN}(\vec 0, \Sigma), \tag{44.4}\] or equivalently, \[ \sqrt{n} \left(\vec{\bar X} - \vec\mu\right) \overset{d}{\to} \textrm{MVN}(\vec 0, \Sigma). \tag{44.5}\]
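
Here is a minimal illustration of Theorem 44.3 (NumPy/SciPy; the choice of i.i.d. vectors with exponential components is an arbitrary non-normal example): the standardized sum has mean near \(\vec 0\), covariance near \(\Sigma\), and component probabilities that closely match the normal approximation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# i.i.d. non-normal random vectors, purely for illustration:
# X_i = (E, E + F), where E and F are independent Exponential(1).
mu = np.array([1.0, 2.0])                 # E[X_1]
Sigma = np.array([[1.0, 1.0],
                  [1.0, 2.0]])            # Var[X_1]

n, n_sims = 200, 50_000
E = rng.exponential(size=(n_sims, n))
F = rng.exponential(size=(n_sims, n))
X_sum = np.stack([E.sum(axis=1), (E + F).sum(axis=1)], axis=1)

W = (X_sum - n * mu) / np.sqrt(n)         # standardized sums, one per row

print("mean:      ", W.mean(axis=0))                 # close to (0, 0)
print("covariance:\n", np.cov(W, rowvar=False))      # close to Sigma

# The first component should be approximately Normal(0, Sigma[0, 0]).
print(np.mean(W[:, 0] <= 1.0), norm.cdf(1.0, scale=np.sqrt(Sigma[0, 0])))
```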

We saw in the proof of Proposition 43.8 that a multinomial random vector \(\vec X\) can be expressed as a sum of i.i.d. multinoullis. Therefore, Theorem 44.3 applies to the multinomial distribution.

Example 44.2 (Multivariate normal approximation to the multinomial) Let \(\vec X \sim \text{Multinomial}(n, \vec p)\). What can we say about the approximate distribution of \(\vec X\) when \(n\) is large?

We showed in Proposition 43.8 that \(\vec X = \vec I_1 + \cdots + \vec I_n\), where \(\vec I_1, \dots, \vec I_n\) are i.i.d. with \[\begin{align} \text{E}\!\left[ \vec I_1 \right] &= \vec p \\ \text{Var}\!\left[ \vec I_1 \right] &= \text{diag}(\vec p) - \vec p \vec p^\intercal. \end{align}\]

Therefore, by Theorem 44.3, \[ \frac{1}{\sqrt{n}} (\vec X - n\vec p) \overset{d}{\to} \textrm{MVN}(\vec 0, \text{diag}(\vec p) - \vec p \vec p^\intercal). \]

In other words, we can approximate \(\vec X\) as \[ \textrm{MVN}\big(n\vec p, n(\text{diag}(\vec p) - \vec p \vec p^\intercal)\big). \]
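
The sketch below (NumPy, with an illustrative \(n\) and \(\vec p\)) draws many multinomial vectors and compares their sample mean and covariance to \(n\vec p\) and \(n(\text{diag}(\vec p) - \vec p \vec p^\intercal)\).

```python
import numpy as np

rng = np.random.default_rng(0)

n = 500
p = np.array([0.2, 0.3, 0.5])             # illustrative values
n_sims = 100_000

X = rng.multinomial(n, p, size=n_sims)    # each row is one Multinomial(n, p) draw

print("sample mean:      ", X.mean(axis=0))                   # close to n p
print("n p:              ", n * p)
print("sample covariance:\n", np.cov(X, rowvar=False))
print("n (diag(p) - p p^T):\n", n * (np.diag(p) - np.outer(p, p)))
```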

44.5 Exercises

Exercise 44.1 Let \(X \sim \text{Normal}(0,1)\). Let \(S\) be a random variable such that \[ P(S = 1) = \frac{1}{2} \qquad \text{and} \qquad P(S = -1) = \frac{1}{2}. \] In other words, \(S\) is a random sign, independent of \(X\).

  1. Show that \(Y = SX\) is standard normal.
  2. Show that \((X,Y)\) is not bivariate normal.

Remark. This is a counterexample to “if \(X\) and \(Y\) are each normally distributed, then \((X,Y)\) is bivariate normal.”

Exercise 44.2 Let \((X,Y)\) be bivariate normal with correlation coefficient \(\rho\), where \(X\) and \(Y\) are both standard normal.

  1. Show that \((X+Y, X-Y)\) is also bivariate normal.
  2. Find the joint PDF of \(X+Y\) and \(X-Y\) (without using calculus), assuming that \(-1 < \rho < 1\).

Exercise 44.3 Let \(X, Y, Z\) be i.i.d. \(\text{Normal}(0,1)\). Find the joint MGF of \((X+2Y, Y+3Z, Z+4X)\).

Exercise 44.4 Suppose \((X,Y)\) is bivariate normal with \(X \sim \text{Normal}(0,\sigma_1^2)\) and \(Y \sim \text{Normal}(0,\sigma_2^2)\) marginally and \(\text{Corr}[X,Y] = \rho\). Find a constant \(c\) so that \(Y - cX\) is independent of \(X\).