In this chapter, we extend the normal distribution to random vectors. We already encountered the bivariate normal distribution in Example 43.4, which is the two-dimensional generalization of the normal distribution. Now, we generalize the normal distribution to \(k\) dimensions.
Definition and Properties
Definition 45.1 (Multivariate normal distribution) A random vector \(\vec X = (X_1, \dots, X_k)\) is said to have a multivariate normal distribution if it can be expressed as \[
\vec X = \underset{k \times m}{A} \vec Z + \vec \mu,
\tag{45.1}\] where \(\vec Z = (Z_1, \dots, Z_m)\) is a random vector of i.i.d. standard normal random variables, \(A\) is a \(k \times m\) matrix, and \(\vec\mu\) is a constant vector in \(\mathbb{R}^k\).
In other words, the definition of the multivariate normal distribution is exactly analogous to the definition of the general normal distribution in Definition 22.4.
The mean vector and covariance matrix are easily determined from Equation 45.1.
Proposition 45.1 (Multivariate normal expectation and variance) \[\begin{align}
\text{E}\!\left[ \vec X \right] &= \vec\mu & \text{Var}\!\left[ \vec X \right] &= AA^\intercal
\end{align}\]
Since \(Z_1, \dots, Z_m\) are i.i.d. standard normal, it follows that \[
\text{E}\!\left[ \vec Z \right] = \vec{0} \qquad \text{and} \qquad \text{Var}\!\left[ \vec Z \right] = I_{m \times m}.
\] Hence, \[
\text{E}\!\left[ \vec X \right] = A \text{E}\!\left[ \vec Z \right] + \vec{\mu} = \vec{\mu}
\] and \[
\text{Var}\!\left[ \vec X \right] = A \text{Var}\!\left[ \vec Z \right] A^\intercal = AA^\intercal
\] by Proposition 44.1.
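To make Proposition 45.1 concrete, here is a short simulation sketch in Python (using NumPy). The matrix \(A\), the vector \(\vec\mu\), and the number of draws are arbitrary choices for illustration; the empirical mean and covariance of the simulated draws should be close to \(\vec\mu\) and \(AA^\intercal\).

```python
import numpy as np

# A minimal simulation of Definition 45.1: X = A Z + mu.
# The matrix A, the vector mu, and the number of draws are arbitrary
# illustrative choices, not values from the text.
rng = np.random.default_rng(0)

A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])   # k x m with k = 2, m = 3
mu = np.array([1.0, -2.0])        # mean vector in R^k

Z = rng.standard_normal((200_000, A.shape[1]))  # rows of i.i.d. standard normals
X = Z @ A.T + mu                                # each row is one draw of X

print(X.mean(axis=0))             # should be close to mu
print(np.cov(X, rowvar=False))    # should be close to A @ A.T
print(A @ A.T)
```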
A multivariate normal random vector does not necessarily have a PDF. But it does when \(A\) is invertible, as we show next.
Proposition 45.2 (PDF of the multivariate normal distribution) Let \(\vec X\) be defined as in Equation 45.1, where \(A\) is a \(k \times k\) invertible matrix. Then, the PDF of \(\vec X\) is \[
f_{\vec X}(\vec x) = \frac{1}{(2\pi)^{k/2} |\det(A)|} e^{-\frac{1}{2} (\vec x - \vec\mu)^\intercal (AA^\intercal)^{-1} (\vec x - \vec\mu)}.
\]
If we let \(\Sigma \overset{\text{def}}{=}AA^\intercal\) be the covariance matrix of \(\vec X\), then \[
f_{\vec X}(\vec x) = \frac{1}{(2\pi)^{k/2} \det(\Sigma)^{1/2}} e^{-\frac{1}{2} (\vec x - \vec\mu)^\intercal \Sigma^{-1} (\vec x - \vec\mu)}.
\tag{45.2}\]
Note that this PDF is only well-defined when the covariance matrix \(\Sigma\) is invertible.
First, notice that the PDF of \(\vec Z\) is \[ f_{\vec Z}(\vec z) = \prod_{j=1}^k \frac{1}{\sqrt{2\pi}} e^{-z_j^2/2} = \frac{1}{(2\pi)^{k/2}} e^{-\frac{1}{2} \vec z^\intercal \vec z}. \]
Since \(\vec X\) is a one-to-one transformation of \(\vec Z\), we can determine its PDF using Jacobians, as discussed in Chapter 43. Notice that \(\vec Z = A^{-1}(\vec X - \vec\mu)\).
\[\begin{align}
f_{\vec X}(\vec x) &= f_{\vec Z}(\vec z) \left|\det \left[\frac{\partial \vec z}{\partial \vec x} \right]\right| \\
&= f_{\vec Z}\big(A^{-1}(\vec x - \vec\mu)\big) \left|\det \left[\frac{\partial \vec z}{\partial \vec x} \right]\right| \\
&= \frac{1}{(2\pi)^{k/2}} e^{-\frac{1}{2}(A^{-1}(\vec x - \vec\mu))^\intercal (A^{-1}(\vec x - \vec\mu))} \left|\frac{1}{\det A}\right| \\
&= \frac{1}{(2\pi)^{k/2} |\det A|} e^{-\frac{1}{2}(\vec x - \vec\mu)^\intercal (AA^\intercal)^{-1} (\vec x - \vec\mu)}.
\end{align}\] In the last step, we used the fact that \((A^{-1})^\intercal A^{-1} = (AA^\intercal)^{-1}\).
We can obtain the final form (Equation 45.2) by substituting \(\Sigma\) for \(AA^\intercal\) and noting that \[
\det\Sigma = \det(AA^\intercal) = \det(A)\det(A^\intercal) = \det(A)^2.
\]
Proposition 45.2 suggests that the multivariate normal distribution can be parametrized by the mean vector \(\vec\mu\) and covariance matrix \(\Sigma\). We write \[ \vec X \sim \textrm{MVN}(\vec\mu, \Sigma). \]
Check for yourself that Equation 45.2 reduces to the normal PDF when \(k=1\), with \(\sigma^2\) playing the role of \(\Sigma\).
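As a sanity check on Equation 45.2, the following sketch evaluates the formula directly and compares it with SciPy's built-in multivariate normal PDF. The parameters \(\vec\mu\) and \(A\) below are arbitrary illustrative choices (with \(A\) invertible, so that the PDF exists).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Evaluate Equation 45.2 directly and compare with SciPy's implementation.
# mu and A are arbitrary illustrative choices; A is invertible, so the PDF exists.
mu = np.array([1.0, -2.0])
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
Sigma = A @ A.T

def mvn_pdf(x, mu, Sigma):
    """Equation 45.2: the MVN PDF with mean mu and covariance Sigma."""
    k = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** (k / 2) * np.sqrt(np.linalg.det(Sigma)))

x = np.array([0.5, -1.0])
print(mvn_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree
```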
Although \(A\) (and by extension, \(\Sigma\)) must be invertible for a multivariate normal to have a PDF, we do not assume that \(A\) is invertible in general.
Linear Combinations
Proposition 45.3 (Linear combinations of the multivariate normal) If \(\vec X = (X_1, \dots, X_k)\) is a multivariate normal random vector, then for any \(j \times k\) matrix \(C\) and \(j\)-vector \(\vec d\), \[
\vec Y = C \vec X + \vec d
\] is also multivariate normal.
By definition, \(\vec X = A \vec Z + \vec \mu\) for some matrix \(A\) and vector \(\vec\mu\). We see that \[
\vec Y = C (A \vec Z + \vec \mu) + \vec d = (CA) \vec Z + (\vec\mu + \vec d)
\] can also be expressed in such a form, with the matrix \(CA\) and the vector \(\vec\mu + \vec d\), so it is also multivariate normal.
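Here is a minimal numerical sketch of Proposition 45.3, with \(A\), \(\vec\mu\), \(C\), and \(\vec d\) chosen arbitrarily for illustration: the transformed draws \(\vec Y = C\vec X + \vec d\) should have empirical mean close to \(C\vec\mu + \vec d\) and empirical covariance close to \(C\Sigma C^\intercal\).

```python
import numpy as np

# Numerical sketch of Proposition 45.3 with arbitrary illustrative matrices:
# Y = C X + d should have mean C mu + d and covariance C Sigma C^T.
rng = np.random.default_rng(1)

A = np.array([[1.0, 0.5],
              [0.0, 2.0],
              [1.0, 1.0]])        # k = 3, m = 2
mu = np.array([0.0, 1.0, -1.0])
C = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, 2.0]])   # j = 2
d = np.array([3.0, 0.0])
Sigma = A @ A.T

Z = rng.standard_normal((200_000, A.shape[1]))
X = Z @ A.T + mu                  # draws from MVN(mu, Sigma)
Y = X @ C.T + d                   # the transformed vector C X + d

print(Y.mean(axis=0), C @ mu + d)   # means should agree
print(np.cov(Y, rowvar=False))      # should be close to C Sigma C^T
print(C @ Sigma @ C.T)
```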
Proposition 45.3 has many applications to statistics. For example, it can be used to study the distribution of \(X_i - \bar X\), a quantity known as the residual.
Example 45.1 (Distribution of the residual) Let \(X_1, \dots, X_n\) be i.i.d. \(\text{Normal}(\mu, \sigma^2)\). Let \(\bar X\) be the sample mean. We will determine the distribution of the vector \[ \begin{pmatrix} X_1 - \bar X \\ X_2 - \bar X \\ \vdots \\ X_n - \bar X \end{pmatrix}, \] which is called the vector of residuals.
If we let \(\vec X \overset{\text{def}}{=}(X_1, \dots, X_n)\), then we can write the vector of residuals as \[
\vec X - \vec 1 \bar X = \vec X - \frac{\vec 1 \vec 1^\intercal}{n} \vec X = (I - P) \vec X,
\] where \(P \overset{\text{def}}{=}\frac{\vec 1 \vec 1^\intercal}{n}\). As we will see in Section 46.1, \(P\) is a special kind of matrix called a projection matrix.
By Proposition 45.3, \((I - P) \vec X\) is multivariate normal, with mean vector \[
\text{E}\!\left[ (I - P)\vec X \right] = (I - P) \text{E}\!\left[ \vec X \right] = (I - P) \vec 1 \mu = \vec 0
\] (since \(P \vec 1 = \frac{\vec 1 \vec 1^\intercal}{n} \vec 1 = \vec 1\)) and covariance matrix \[
\text{Var}\!\left[ (I - P)\vec X \right] = (I - P) \underbrace{\text{Var}\!\left[ \vec X \right]}_{\sigma^2 I} (I - P)^\intercal = \sigma^2 \underbrace{(I - P)(I - P)^\intercal}_{I - P} = \sigma^2 (I - P).
\] The last step above can be verified directly from the definition of \(P\): \[
\begin{align}
(I - P)(I - P)^\intercal &= \left(I - \frac{\vec 1 \vec 1^\intercal}{n} \right) \left(I - \frac{\vec 1 \vec 1^\intercal}{n} \right) \\
&= I - 2 \frac{\vec 1 \vec 1^\intercal}{n} + \frac{\vec 1 \overbrace{\vec 1^\intercal \vec 1}^n \vec 1^\intercal}{n^2} \\
&= I - \frac{\vec 1 \vec 1^\intercal}{n} \\
&= I - P,
\end{align}
\] but it is a more general property of all projection matrices \(P\).
Therefore, the vector of residuals has the following distribution: \[
\begin{pmatrix} X_1 - \bar X \\ X_2 - \bar X \\ \vdots \\ X_n - \bar X \end{pmatrix} \sim \textrm{MVN}\left( \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \sigma^2\begin{bmatrix} 1 - \frac{1}{n} & -\frac{1}{n} & \dots & -\frac{1}{n} \\ -\frac{1}{n} & 1 - \frac{1}{n} & \dots & -\frac{1}{n} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{n} & -\frac{1}{n} & \dots & 1 - \frac{1}{n} \end{bmatrix} \right).
\] In particular, although the \(X_i\)s are i.i.d., subtracting \(\bar X\) from all of them makes them negatively correlated.
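The covariance structure in Example 45.1 can be checked by simulation. In the sketch below, \(n\), \(\mu\), \(\sigma\), and the number of simulated datasets are arbitrary illustrative choices; the empirical covariance of the residual vectors should be close to \(\sigma^2(I - P)\), and the idempotence of \(I - P\) is verified directly.

```python
import numpy as np

# Numerical sketch of Example 45.1 with arbitrary n, mu, sigma:
# the residual vector (I - P) X has mean 0 and covariance sigma^2 (I - P).
rng = np.random.default_rng(2)

n, mu, sigma = 4, 5.0, 2.0
I = np.eye(n)
P = np.ones((n, n)) / n                       # P = 1 1^T / n

# P is a projection matrix, so (I - P)(I - P)^T = I - P.
print(np.allclose((I - P) @ (I - P).T, I - P))

X = rng.normal(mu, sigma, size=(200_000, n))  # each row: n i.i.d. normals
R = X @ (I - P).T                             # each row: the residuals X_i - X_bar
print(R.mean(axis=0))                         # close to the zero vector
print(np.cov(R, rowvar=False))                # close to sigma^2 (I - P)
print(sigma**2 * (I - P))
```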
Here are some other properties of the multivariate normal distribution that follow from Proposition 45.3.
Corollary 45.1 (Subvectors of a multivariate normal are also multivariate normal) Let \(\vec X = (X_1, \dots, X_k)\) be \(\textrm{MVN}(\vec\mu, \Sigma)\) and \(\vec X_m \overset{\text{def}}{=}(X_1, \dots, X_m)\) be a subvector of \(\vec X\) for \(m < k\). Then,
\[ \vec X_m \sim \textrm{MVN}(\vec\mu_m, \Sigma_{m}), \]
where \(\vec\mu_m \overset{\text{def}}{=}(\mu_1, \dots, \mu_m)\) is a subvector of \(\vec\mu\) and \(\Sigma_{m} \overset{\text{def}}{=}\begin{pmatrix} \Sigma_{11} & \dots & \Sigma_{1m} \\ \vdots & \ddots & \vdots \\ \Sigma_{m1} & \dots & \Sigma_{mm} \end{pmatrix}\) is a submatrix of \(\Sigma\).
Although we assumed that the subvector \(\vec X_m\) was the first \(m\) elements of \(\vec X\), the same result holds for any subset of elements from \(\vec X\).
Choose \(C\) to be the \(m \times k\) matrix of the following form: \[
C = \begin{bmatrix} 1 & 0 & \dots & 0 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 & 0 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 & 0 & \dots & 0 \end{bmatrix}.
\]
It is easy to see that \(\vec X_m = C\vec X\), since multiplying by \(C\) simply “picks out” the first \(m\) elements of \(\vec X\). Therefore, by Proposition 45.3, \(\vec X_m\) is also multivariate normal.
To determine the parameters of this multivariate normal, we cite Proposition 44.1. The mean vector of \(\vec X_m\) will be \(\vec \mu_m = C\vec\mu\), and its covariance matrix will be \(\Sigma_m = C\Sigma C^\intercal\). It is straightforward to check that these matrix operations simply “pick out” the corresponding elements from \(\vec\mu\) and \(\Sigma\).
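A quick numerical sketch (with an arbitrary mean vector and an arbitrary covariance matrix built as \(BB^\intercal\), purely for illustration) confirms that multiplying by such a selector matrix \(C\) simply extracts the leading entries of \(\vec\mu\) and the leading block of \(\Sigma\).

```python
import numpy as np

# Sketch of Corollary 45.1: the selector matrix C = [I_m 0] picks out the
# first m entries of mu and the leading m x m block of Sigma.
# B (and hence Sigma = B B^T) and mu are arbitrary illustrative choices.
rng = np.random.default_rng(3)

k, m = 4, 2
mu = np.array([1.0, 2.0, 3.0, 4.0])
B = rng.standard_normal((k, k))
Sigma = B @ B.T

C = np.hstack([np.eye(m), np.zeros((m, k - m))])
print(C @ mu)            # the first m entries of mu
print(C @ Sigma @ C.T)   # equals the leading m x m block of Sigma
print(Sigma[:m, :m])
```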
One important case of Proposition 45.3 is when \(C\) is a row vector, in which case we are taking linear combinations of the components of a multivariate normal.
Corollary 45.2 (Linear combination of the components of a multivariate normal) Let \(\vec X\) be a multivariate normal random vector. Then, any linear combination of the components of \(\vec X\) is normally distributed. That is, for any vector \(\vec c\), \[\vec c^\intercal \vec X = c_1 X_1 + \dots + c_k X_k \tag{45.3}\] follows a normal distribution. (Its expectation and variance can be derived using the usual properties of expectation and covariance.)
Many interesting probabilities can be expressed in terms of Equation 45.3.
Moment Generating Function and Independence
Proposition 45.4 (Moment generating function of the multivariate normal) If \(\vec X \sim \textrm{MVN}(\vec\mu, \Sigma)\), then the MGF is \[
M_{\vec X}(\vec t) = e^{\vec t \cdot \vec\mu + \frac{1}{2} \vec t^\intercal\Sigma \vec t}.
\]
By definition \[
M_{\vec X}(\vec t) \overset{\text{def}}{=}\text{E}\!\left[ e^{\vec t \cdot \vec X} \right].
\] By Corollary 45.2, \(Y = \vec t \cdot \vec X\) follows a normal distribution, with mean \[ \text{E}\!\left[ Y \right] = \vec t \cdot \vec \mu \] and variance \[\begin{align}
\text{Var}\!\left[ Y \right] &= \text{Cov}\!\left[ \vec t \cdot \vec X, \vec t \cdot \vec X \right] \\
&= \text{Cov}\!\left[ \sum_{i=1}^k t_i X_i, \sum_{j=1}^k t_j X_j \right] \\
&= \sum_{i=1}^k \sum_{j=1}^k t_i t_j \underbrace{\text{Cov}\!\left[ X_i, X_j \right]}_{\Sigma_{ij}} \\
&= \vec t^\intercal \Sigma \vec t.
\end{align}\]
The simplest way to evaluate \(\text{E}\!\left[ e^Y \right]\) is to plug \(s = 1\) into the MGF of the normal distribution (Example 34.2), \(M_Y(s) = e^{s \text{E}\left[ Y \right] + \frac{1}{2} s^2 \text{Var}\left[ Y \right]}\). Therefore, \[
M_{\vec X}(\vec t) = \text{E}\!\left[ e^Y \right] = e^{\vec t \cdot \vec \mu + \frac{1}{2} \vec t^\intercal \Sigma \vec t}.
\]
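Proposition 45.4 can also be checked by Monte Carlo: estimate \(\text{E}\!\left[ e^{\vec t \cdot \vec X} \right]\) from simulated draws and compare it with the closed-form expression. The parameters in the sketch below are arbitrary illustrative choices; a small \(\vec t\) keeps the Monte Carlo estimate stable.

```python
import numpy as np

# Monte Carlo check of Proposition 45.4 with arbitrary illustrative parameters:
# estimate E[exp(t . X)] by simulation and compare with the closed form.
rng = np.random.default_rng(4)

A = np.array([[1.0, 0.0],
              [0.5, 1.5]])
mu = np.array([0.5, -0.5])
Sigma = A @ A.T
t = np.array([0.3, -0.2])          # a small t keeps the estimate stable

Z = rng.standard_normal((1_000_000, 2))
X = Z @ A.T + mu

print(np.exp(X @ t).mean())                   # Monte Carlo estimate of E[exp(t . X)]
print(np.exp(t @ mu + 0.5 * t @ Sigma @ t))   # MGF formula; should be close
```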
Because the MGF uniquely identifies a distribution, it can be used to prove remarkable results, such as the following. We know that if two random variables are independent, then their covariance is zero. However, the converse is not true in general; in Example 15.3, we encountered two random variables that had zero covariance but were not independent. Within a multivariate normal random vector, on the other hand, zero covariance does imply independence.
Theorem 45.1 Let \(\vec Y = (\vec X_1, \vec X_2)\) be a multivariate normal random vector. Then, \(\vec X_1\) and \(\vec X_2\) are independent if and only if \(\text{Cov}\!\left[ \vec X_1, \vec X_2 \right] = 0\).
The “only if” direction is true for all random vectors; covariance is defined so that it is zero for independent random variables. What is unique to the multivariate normal distribution is the converse.
First, we partition the mean vector and covariance matrix, in the same way we partitioned \(\vec Y\) into \(\vec X_1\) and \(\vec X_2\). \[\begin{align}
\vec\mu &= \text{E}\!\left[ \vec Y \right] = \begin{pmatrix} \text{E}\!\left[ \vec X_1 \right] \\ \text{E}\!\left[ \vec X_2 \right] \end{pmatrix} = \begin{pmatrix} \vec\mu_1 \\ \vec\mu_2 \end{pmatrix} \\
\Sigma &= \text{Var}\!\left[ \vec Y \right] = \begin{bmatrix} \text{Var}\!\left[ \vec X_1 \right] & \text{Cov}\!\left[ \vec X_1, \vec X_2 \right] \\ \text{Cov}\!\left[ \vec X_1, \vec X_2 \right]^\intercal & \text{Var}\!\left[ \vec X_2 \right]\end{bmatrix} = \begin{bmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{bmatrix},
\end{align}\] where \(\Sigma_{ii} \overset{\text{def}}{=}\text{Var}\!\left[ \vec X_i \right]\).
First, let’s figure out what the MGF of \(\vec Y\) would be if \(\vec X_1\) and \(\vec X_2\) were independent: \[\begin{align}
M_{\vec Y}(\vec t) &= \text{E}\!\left[ e^{\vec t \cdot \vec Y} \right] \\
&= \text{E}\!\left[ e^{\vec t_1 \cdot \vec X_1 + \vec t_2 \cdot \vec X_2} \right] & (\vec t = (\vec t_1, \vec t_2)) \\
&= \text{E}\!\left[ e^{\vec t_1 \cdot \vec X_1} \right] \text{E}\!\left[ e^{\vec t_2 \cdot \vec X_2} \right] & \text{(by independence)} \\
&= e^{\vec t_1 \cdot \vec \mu_1 + \frac{1}{2} \vec t_1^\intercal \Sigma_{11} \vec t_1}
e^{\vec t_2 \cdot \vec \mu_2 + \frac{1}{2} \vec t_2^\intercal \Sigma_{22} \vec t_2} & \text{(multivariate normal MGF)}.
\end{align}\]
Now, let’s determine the MGF of \(\vec Y\) when the covariance is zero. We will see that it matches the MGF above, implying that \(\vec X_1\) and \(\vec X_2\) are independent. Recalling that \[\begin{align}
M_{\vec Y}(\vec t) &= e^{\vec t \cdot \vec\mu + \frac{1}{2} \vec t^\intercal \Sigma \vec t},
\end{align}\] the exponent is \[\begin{align}
\vec t \cdot \vec\mu + \frac{1}{2} \vec t^\intercal \Sigma \vec t &= \begin{pmatrix} \vec t_1 \\ \vec t_2 \end{pmatrix} \cdot \begin{pmatrix} \vec \mu_1 \\ \vec \mu_2 \end{pmatrix} + \frac{1}{2} \begin{pmatrix} \vec t_1 & \vec t_2 \end{pmatrix} \begin{bmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{bmatrix} \begin{pmatrix} \vec t_1 \\ \vec t_2 \end{pmatrix} \\
&= \vec t_1 \cdot \vec\mu_1 + \vec t_2 \cdot \vec\mu_2 + \frac{1}{2} \vec t_1^\intercal \Sigma_{11} \vec t_1 + \frac{1}{2} \vec t_2^\intercal \Sigma_{22} \vec t_2.
\end{align}\]
Substituting this back into the MGF above and regrouping terms, we obtain \[\begin{align}
M_{\vec Y}(\vec t) &= e^{\vec t_1 \cdot \vec\mu_1 + \frac{1}{2} \vec t_1^\intercal \Sigma_{11} \vec t_1} e^{ \vec t_2 \cdot \vec\mu_2 + \frac{1}{2} \vec t_2^\intercal \Sigma_{22} \vec t_2},
\end{align}\] which matches the MGF that we calculated above under the assumption of independence. Since the MGF uniquely determines the distribution, \(\vec X_1\) and \(\vec X_2\) must be independent when \(\text{Cov}\!\left[ \vec X_1, \vec X_2 \right] = 0\).
Theorem 45.1 is an incredibly useful result. It says that we can check independence within a multivariate normal vector by simply calculating a covariance, which is usually much easier.
Here is one fundamental and surprising application. If \(X_1, \dots, X_n\) are i.i.d. normal, then the sample mean \(\bar X\) and the sample variance \(S^2\) are actually independent.
Theorem 45.2 (Independence of the sample mean and variance for normal data) If \(X_1, \dots, X_n\) are i.i.d. \(\text{Normal}(\mu, \sigma^2)\) random variables, then \(\bar X\) and \(S^2\) are independent.
First, notice that a vector \(\vec X = (X_1, \dots, X_n)\) of i.i.d. normals is multivariate normal because it can be written as \[
\vec X = \sigma \vec Z + \mu \vec 1,
\] where \(\vec Z\) is a vector of i.i.d. standard normal random variables and \(\vec 1\) is a vector of ones.
Next, we will consider the \((n+1)\)-dimensional random vector \[
\begin{pmatrix} \bar X \\ X_1 - \bar X \\ X_2 - \bar X \\ \vdots \\ X_n - \bar X \end{pmatrix} = \begin{bmatrix} \rule{2em}{0.5pt}& \frac{\vec 1^\intercal}{n} & \rule{2em}{0.5pt}\\ \\ & I - \frac{\vec 1 \vec 1^\intercal}{n} \\ \\ \\ \end{bmatrix} \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix},
\] which is also multivariate normal by Proposition 45.3.
Note that \(\bar X\) is the first element of this vector, and \(S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2\) is a function of the remaining elements. If we can show that the cross-covariance between the first element and the remaining elements is zero, then they must be independent by Theorem 45.1.
\[
\begin{align}
\text{Cov}\!\left[ \begin{pmatrix} X_1 - \bar X \\ X_2 - \bar X \\ \vdots \\ X_n - \bar X \end{pmatrix}, \bar X \right] &= \text{Cov}\!\left[ \left(I - \frac{\vec 1 \vec 1^\intercal}{n}\right) \vec X, \frac{\vec 1^\intercal}{n} \vec X \right]\\
&= \left(I - \frac{\vec 1 \vec 1^\intercal}{n}\right) \underbrace{\text{Var}\!\left[ \vec X \right]}_{\sigma^2 I} \frac{\vec 1}{n} \\
&= \sigma^2 \left(\frac{\vec 1}{n}- \frac{\vec 1 \vec 1^\intercal}{n} \frac{\vec 1}{n} \right) \\
&= \vec 0. & (\vec 1^\intercal \vec 1 = n)
\end{align}
\]
In the second equality above, we used properties of covariance for random vectors (Proposition 44.3).
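A simulation sketch is consistent with Theorem 45.2. In the code below (with arbitrary illustrative parameters), the sample correlation between \(\bar X\) and \(S^2\) is close to zero for normal data, whereas for skewed data such as exponential samples it is not. (Zero correlation alone does not establish independence, so this is only a sanity check, not a proof.)

```python
import numpy as np

# Simulation consistent with Theorem 45.2 (arbitrary illustrative parameters):
# for normal data, the sample correlation between X_bar and S^2 is near zero.
# Zero correlation alone does not prove independence; this is only a sanity check.
rng = np.random.default_rng(5)

n, mu, sigma = 10, 3.0, 2.0
samples = rng.normal(mu, sigma, size=(100_000, n))   # each row: one dataset

xbar = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)        # sample variance with the 1/(n-1) factor
print(np.corrcoef(xbar, s2)[0, 1])      # close to 0 for normal data

# For contrast: with skewed (exponential) data, X_bar and S^2 are correlated.
expo = rng.exponential(1.0, size=(100_000, n))
print(np.corrcoef(expo.mean(axis=1), expo.var(axis=1, ddof=1))[0, 1])
```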
Multivariate Central Limit Theorem
The multivariate normal distribution is important in statistics for the same reason the normal distribution is. Just as the normal distribution arises from the sum of i.i.d. random variables by the Central Limit Theorem (Theorem 36.1), the multivariate normal distribution arises from the sum of i.i.d. random vectors.
Theorem 45.3 (Multivariate Central Limit Theorem) Let \(\vec X_1, \dots, \vec X_n\) be i.i.d. random vectors with \(\text{E}\!\left[ \vec X_1 \right] = \vec\mu\) and \(\text{Var}\!\left[ \vec X_1 \right] = \Sigma\). Then, \[
\frac{1}{\sqrt{n}} \left(\sum_{i=1}^n \vec X_i - n\vec \mu\right) \overset{d}{\to} \textrm{MVN}(\vec 0, \Sigma),
\tag{45.4}\] or equivalently, \[
\sqrt{n} \left(\vec{\bar X} - \vec\mu\right) \overset{d}{\to} \textrm{MVN}(\vec 0, \Sigma).
\tag{45.5}\]
We saw in the proof of Proposition 44.8 that a multinomial random vector \(\vec X\) can be expressed as a sum of i.i.d. multinoullis. Therefore, Theorem 45.3 applies to the multinomial distribution.
Example 45.2 (Multivariate normal approximation to the multinomial) Let \(\vec X \sim \text{Multinomial}(n, \vec p)\). What can we say about the approximate distribution of \(\vec X\) when \(n\) is large?
We showed in Proposition 44.8 that \(\vec X = \vec I_1 + \dots + \vec I_n\), where \(\vec I_1, \dots, \vec I_n\) are i.i.d. with \[\begin{align}
\text{E}\!\left[ \vec I_1 \right] &= \vec p \\
\text{Var}\!\left[ \vec I_1 \right] &= \text{diag}(\vec p) - \vec p \vec p^\intercal.
\end{align}\]
Therefore, by Theorem 45.3, \[
\frac{1}{\sqrt{n}} (\vec X - n\vec p) \overset{d}{\to} \textrm{MVN}(\vec 0, \text{diag}(\vec p) - \vec p \vec p^\intercal).
\]
In other words, we can approximate \(\vec X\) as \[
\textrm{MVN}\big(n\vec p, n(\text{diag}(\vec p) - \vec p \vec p^\intercal)\big).
\]
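Here is a brief numerical sketch of Example 45.2, with \(n\) and \(\vec p\) chosen arbitrarily for illustration: the empirical mean and covariance of simulated multinomial draws should be close to \(n\vec p\) and \(n(\text{diag}(\vec p) - \vec p \vec p^\intercal)\). (Note that this covariance matrix is singular, since the counts sum to \(n\), so the approximating multivariate normal has no PDF.)

```python
import numpy as np

# Sketch of Example 45.2 with arbitrary n and p: the empirical mean and
# covariance of multinomial draws should match n p and n (diag(p) - p p^T).
rng = np.random.default_rng(6)

n = 500
p = np.array([0.2, 0.3, 0.5])
draws = rng.multinomial(n, p, size=200_000)   # each row: one multinomial vector

print(draws.mean(axis=0))                     # close to n * p
print(n * p)
print(np.cov(draws, rowvar=False))            # close to n * (diag(p) - p p^T)
print(n * (np.diag(p) - np.outer(p, p)))
```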
Exercises
Exercise 45.1 Let \(X \sim \text{Normal}(0,1)\). Let \(S\) be a random variable such that \[
P(S = 1) = \frac{1}{2} \qquad \text{and} \qquad P(S = -1) = \frac{1}{2}.
\] In other words, \(S\) is a random sign, independent of \(X\).
- Show that \(Y = SX\) is standard normal.
- Show that \((X,Y)\) is not bivariate normal.
Remark. This is a counterexample to the claim that “if \(X\) and \(Y\) are each normally distributed, then \((X, Y)\) is bivariate normal.”
Exercise 45.4 Suppose \((X,Y)\) is bivariate normal with \(X \sim \text{Normal}(0,\sigma_1^2)\) and \(Y \sim \text{Normal}(0,\sigma_2^2)\) marginally and \(\text{Corr}[X,Y] = \rho\). Find a constant \(c\) so that \(Y - cX\) is independent of \(X\).