Random Vectors
So far, we have only considered scalar-valued random variables. Even when we considered the joint distribution of multiple random variables \(X_1, \dots, X_n\), we wrote their joint PMF or PDF as \[ f_{X_1, \dots, X_n}(x_1, \dots, x_n). \]
It is often more convenient to think of multiple random variables as a single random vector \[ \vec X = (X_1, \dots, X_n) \] with joint PMF or PDF \(f_{\vec X}(\vec x)\).
Expectations and variances can also be defined for random vectors.
Definition 43.1 (Expectation and variance of a random vector) A random vector \(\vec X = (X_1, \dots, X_n)\) has expectation defined by the mean vector \[
\text{E}\!\left[ \vec X \right] \overset{\text{def}}{=}(\text{E}\!\left[ X_1 \right], \dots, \text{E}\!\left[ X_n \right])
\tag{43.1}\] and variance defined by the covariance matrix \[
\text{Var}\!\left[ \vec X \right] \overset{\text{def}}{=}\begin{bmatrix} \text{Var}\!\left[ X_1 \right] & \text{Cov}\!\left[ X_1, X_2 \right] & \dots & \text{Cov}\!\left[ X_1, X_n \right] \\
\text{Cov}\!\left[ X_2, X_1 \right] & \text{Var}\!\left[ X_2 \right] & \dots & \text{Cov}\!\left[ X_2, X_n \right] \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}\!\left[ X_n, X_1 \right] & \text{Cov}\!\left[ X_n, X_2 \right] & \dots & \text{Var}\!\left[ X_n \right]
\end{bmatrix}.
\tag{43.2}\]
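To make Definition 43.1 concrete, here is a minimal simulation sketch, assuming numpy is available; the distribution of \(\vec X\) below is just an illustrative choice. We draw many realizations of a random vector and estimate its mean vector and covariance matrix from the samples.

```python
# Minimal sketch (assumes numpy): estimate the mean vector and covariance
# matrix of a random vector X = (X1, X2) from simulated samples.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

# Illustrative choice: X1 ~ Normal(0, 1), X2 = X1 + Normal(0, 1),
# so Cov[X1, X2] = 1 and Var[X2] = 2.
x1 = rng.normal(0, 1, n_samples)
x2 = x1 + rng.normal(0, 1, n_samples)
X = np.column_stack([x1, x2])        # each row is one realization of the vector

print(X.mean(axis=0))                # approx. (0, 0), the mean vector
print(np.cov(X, rowvar=False))       # approx. [[1, 1], [1, 2]], the covariance matrix
```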
A linear transformation of a random vector is of the form \[ g(\vec X) = A\vec X + \vec b, \] where \(A\) is a matrix and \(\vec b\) is a vector. Note that \(g(\vec X)\) may also be a random vector. Expectations and variances behave as you might expect with respect to linear transformations.
Proposition 43.1 (Linear transformation of a random vector) Let \(\vec X = (X_1, \dots, X_n)\) be a random vector, let \(A\) be a constant \(m \times n\) matrix, and let \(\vec b\) be a constant \(m\)-vector. Then, the mean vector of \(A \vec X + \vec b\) is \[
\text{E}\!\left[ A \vec X + \vec b \right] = A \text{E}\!\left[ \vec X \right] + \vec b,
\tag{43.3}\] and its covariance matrix is \[
\text{Var}\!\left[ A \vec X + \vec b \right] = A \text{Var}\!\left[ \vec X \right] A^\intercal.
\tag{43.4}\]
Let \(\vec Y = A \vec X + \vec b\) be an \(m\)-vector. The proof follows by considering the individual entries \(Y_j\) of \(\vec Y\) and applying properties of expectation and covariance to these entries.
Then, the \(j\)th entry of the mean vector \(\text{E}\!\left[ \vec Y \right]\) is \[
\text{E}\!\left[ Y_j \right] = \text{E}\!\left[ \sum_{i=1}^n A_{ji} X_i + b_j \right] = \sum_{i=1}^n A_{ji} \text{E}\!\left[ X_i \right] + b_j.
\] Rearranging this back into vector form, we obtain Equation 43.3.
Similarly, the \((j, k)\)th entry of the covariance matrix \(\text{Var}\!\left[ \vec Y \right]\) is \[
\begin{align}
\text{Cov}\!\left[ Y_j, Y_k \right] &= \text{Cov}\!\left[ \sum_{i=1}^n A_{ji} X_i + b_j, \sum_{i'=1}^n A_{ki'} X_{i'} + b_k \right] \\
&= \sum_{i=1}^n \sum_{i'=1}^n A_{ji} \text{Cov}\!\left[ X_i, X_{i'} \right] A_{ki'}.
\end{align}
\] This can be written in matrix form as Equation 43.4.
Notice that Proposition 43.1 reduces to Proposition 21.2 and Proposition 21.3 in the scalar case, when \(m = n = 1\). The only formula that does not obviously resemble its scalar counterpart is Equation 43.4. One way to see that Equation 43.4 must be correct is to consider the dimensions of the matrices involved: \[ \underset{m\times n}{A} \underset{n\times n}{\text{Var}[\vec X]} \underset{n\times m}{A^\intercal}. \] Unlike the scalar formula \(a^2 \text{Var}\!\left[ X \right]\), we cannot write \(A^2 \text{Var}[\vec X]\), since \(A^2\) is not even well-defined unless \(A\) is square, and no other arrangement of \(A\), \(A^\intercal\), and \(\text{Var}[\vec X]\) multiplies to an \(m \times m\) matrix.
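Here is a quick simulation check of Proposition 43.1, assuming numpy is available; the matrix \(A\), vector \(\vec b\), and distribution of \(\vec X\) are arbitrary illustrative choices.

```python
# Sketch (assumes numpy): for Y = A X + b, the sample mean of Y should be close
# to A E[X] + b and the sample covariance of Y close to A Var[X] A^T.
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0, 0.5])                  # illustrative E[X]
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.4],
                  [0.0, 0.4, 1.5]])              # illustrative Var[X]
A = np.array([[1.0, 2.0, -1.0],
              [0.5, 0.0,  3.0]])                 # m = 2, n = 3
b = np.array([10.0, -5.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b                                  # each row is A x + b

print(Y.mean(axis=0), A @ mu + b)                # should agree (Equation 43.3)
print(np.cov(Y, rowvar=False))                   # should be close to...
print(A @ Sigma @ A.T)                           # ...A Var[X] A^T (Equation 43.4)
```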
Multiple random vectors behave just like multiple random variables. The proofs of the following results follow from considering each entry of a random vector, which is a random variable, and applying known properties of random variables.
Proposition 43.2 (Linearity of expectation for random vectors) Let \(\vec X = (X_1, \dots, X_n)\) and \(\vec Y = (Y_1, \dots, Y_n)\) be random vectors. Then, \[ \text{E}\!\left[ \vec X + \vec Y \right] = \text{E}\!\left[ \vec X \right] + \text{E}\!\left[ \vec Y \right]. \]
If we have two random vectors (not necessarily the same length), we can summarize their relationship by the cross-covariance matrix.
Definition 43.2 (Cross-covariance between two random vectors) Let \(\vec X = (X_1, \dots, X_m)\) and \(\vec Y = (Y_1, \dots, Y_n)\) be random vectors. Then, the cross-covariance matrix is an \(m\times n\) matrix defined by \[
\text{Cov}\!\left[ \vec X, \vec Y \right] \overset{\text{def}}{=}\begin{bmatrix} \text{Cov}\!\left[ X_1, Y_1 \right] & \text{Cov}\!\left[ X_1, Y_2 \right] & \dots & \text{Cov}\!\left[ X_1, Y_n \right] \\
\text{Cov}\!\left[ X_2, Y_1 \right] & \text{Cov}\!\left[ X_2, Y_2 \right] & \dots & \text{Cov}\!\left[ X_2, Y_n \right] \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}\!\left[ X_m, Y_1 \right] & \text{Cov}\!\left[ X_m, Y_2 \right] & \dots & \text{Cov}\!\left[ X_m, Y_n \right]
\end{bmatrix}.
\tag{43.5}\]
The cross-covariance of any random vector with itself is the covariance matrix: \[ \text{Var}\!\left[ \vec X \right] = \text{Cov}\!\left[ \vec X, \vec X \right]. \]
If \(\vec X\) and \(\vec Y\) are independent, then \(\text{Cov}\!\left[ \vec X, \vec Y \right] = \underset{m \times n}{0}\) (the zero matrix).
Cross-covariance behaves like covariance.
Proposition 43.3 (Properties of covariance for random vectors) Let \(\vec X, \vec Z\) be random vectors of length \(m\), and let \(\vec Y\) be a random vector of length \(n\). Then, the following are true.
- (Symmetry) \(\text{Cov}\!\left[ \vec X, \vec Y \right] = \text{Cov}\!\left[ \vec Y, \vec X \right]^\intercal\).
- (Constants cannot covary) If \(\vec a\) is a constant vector of length \(n\), then \(\text{Cov}\!\left[ \vec X, \vec a \right] = 0\).
- (Multiplying by a constant) If \(A\) is a constant matrix with dimensions \(k \times m\), then \(\text{Cov}\!\left[ A \vec X, \vec Y \right] = A \text{Cov}\!\left[ \vec X, \vec Y \right]\). If \(B\) is a constant matrix with dimensions \(k \times n\), then \(\text{Cov}\!\left[ \vec X, B \vec Y \right] = \text{Cov}\!\left[ \vec X, \vec Y \right] B^\intercal\).
- (Distributive property) \(\text{Cov}\!\left[ \vec X+\vec Z, \vec Y \right] = \text{Cov}\!\left[ \vec X, \vec Y \right] + \text{Cov}\!\left[ \vec Z, \vec Y \right]\).
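The following sketch, assuming numpy is available, checks two of these properties (symmetry and multiplying by a constant) on simulated data; the particular distributions and the matrix \(A\) are arbitrary illustrative choices.

```python
# Sketch (assumes numpy): sample cross-covariance matrices satisfy
# Cov[X, Y] = Cov[Y, X]^T and Cov[A X, Y] = A Cov[X, Y].
import numpy as np

rng = np.random.default_rng(2)
n_samples = 200_000

# Build dependent vectors: X has length m = 2, Y has length n = 3.
Z = rng.normal(size=(n_samples, 3))
X = Z[:, :2] + rng.normal(size=(n_samples, 2))
Y = Z + rng.normal(size=(n_samples, 3))

def cross_cov(U, V):
    """Sample cross-covariance matrix between the columns of U and of V."""
    Uc, Vc = U - U.mean(axis=0), V - V.mean(axis=0)
    return Uc.T @ Vc / (len(U) - 1)

A = np.array([[1.0, -1.0],
              [2.0,  0.5],
              [0.0,  3.0]])                                     # k = 3, m = 2

print(np.allclose(cross_cov(X, Y), cross_cov(Y, X).T))          # symmetry
print(np.allclose(cross_cov(X @ A.T, Y), A @ cross_cov(X, Y)))  # multiplying by a constant
```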
Using Proposition 43.3, we obtain the following result.
Proposition 43.4 (Variance of sum of independent random vectors) Let \(\vec{X_1}, \dots, \vec{X_n}\) be independent random vectors. Then, \[
\text{Var}\!\left[ \vec{X_1} + \cdots + \vec{X_n} \right] = \text{Var}\!\left[ \vec{X_1} \right] + \cdots + \text{Var}\!\left[ \vec{X_n} \right].
\]
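As a quick sanity check of Proposition 43.4, assuming numpy is available, we can simulate two independent random vectors and compare the covariance matrix of their sum with the sum of their covariance matrices; the covariance matrices below are illustrative.

```python
# Sketch (assumes numpy): for independent random vectors, the covariance matrix
# of the sum should be close to the sum of the covariance matrices.
import numpy as np

rng = np.random.default_rng(3)
n_samples = 500_000

Sigma1 = np.array([[1.0, 0.5], [0.5, 2.0]])      # illustrative Var[X1]
Sigma2 = np.array([[3.0, -1.0], [-1.0, 1.5]])    # illustrative Var[X2]

X1 = rng.multivariate_normal([0, 0], Sigma1, size=n_samples)
X2 = rng.multivariate_normal([0, 0], Sigma2, size=n_samples)   # independent of X1

print(np.cov(X1 + X2, rowvar=False))             # approx. Sigma1 + Sigma2
print(Sigma1 + Sigma2)
```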
Just as with random variables, the distribution of a random vector can be uniquely characterized by the moment generating function.
Definition 43.3 (Moment generating function of a random vector) The moment generating function (MGF) of a random vector \(\vec X\) is defined as \[ M_{\vec X}(\vec t) = \text{E}\!\left[ e^{\vec t \cdot \vec X} \right]. \]
All of the usual properties of MGFs hold. For example, a sum of independent random vectors translates to a product of MGFs, the vector analog of Proposition 34.2.
Proposition 43.5 (MGF of a sum of random vectors) Let \(\vec X\) and \(\vec Y\) be independent random vectors. Then, \[
M_{\vec X + \vec Y}(\vec t) = M_{\vec X}(\vec t) M_{\vec Y}(\vec t).
\tag{43.6}\]
\[\begin{align}
M_{\vec X + \vec Y}(\vec t) &= \text{E}\!\left[ e^{\vec t \cdot (\vec X + \vec Y)} \right] \\
&= \text{E}\!\left[ e^{\vec t \cdot \vec X} e^{\vec t \cdot \vec Y} \right] \\
&= \text{E}\!\left[ e^{\vec t \cdot \vec X} \right] \text{E}\!\left[ e^{\vec t \cdot \vec Y} \right] \\
&= M_{\vec X}(\vec t) M_{\vec Y}(\vec t).
\end{align}\]
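For a numerical illustration of Proposition 43.5, assuming numpy is available, we can estimate the MGFs by Monte Carlo at a single fixed \(\vec t\); the distributions and the value of \(\vec t\) are arbitrary choices.

```python
# Sketch (assumes numpy): estimate the MGFs of two independent random vectors
# and of their sum at one fixed t, and compare E[e^{t.(X+Y)}] with the product.
import numpy as np

rng = np.random.default_rng(4)
n_samples = 1_000_000
t = np.array([0.2, -0.1])                        # illustrative choice of t

X = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=n_samples)
Y = rng.exponential(scale=1.0, size=(n_samples, 2))            # independent of X

def mgf_estimate(Z, t):
    """Monte Carlo estimate of E[exp(t . Z)] from the rows of Z."""
    return np.mean(np.exp(Z @ t))

print(mgf_estimate(X + Y, t))                    # approx. ...
print(mgf_estimate(X, t) * mgf_estimate(Y, t))   # ...the same value
```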
We can also calculate moments from this MGF, noting that the first moment is a vector and the second moment is a matrix.
Proposition 43.6 (Generating moments of a random vector with the MGF) The first moment, the mean vector, can be generated by calculating the gradient of the MGF:
\[
\nabla M \rvert_{\vec{t} = \vec{0}} = \text{E}\!\left[ \vec{X} \right].
\]
The second moment, a matrix, can be generated by calculating the Hessian:
\[
H_M \rvert_{\vec{t} = \vec{0}} = \text{E}\!\left[ \vec{X} \vec{X}^\intercal \right].
\]
The \(i\)th component of \(\displaystyle \left. \nabla M \right\rvert_{\vec{t} = \vec{0}}\) is \[\begin{align*}
\left( \left. \nabla M \right\rvert_{\vec{t} = \vec{0}} \right)_i &= \left( \frac{\partial}{\partial t_i} \text{E}\!\left[ e^{t_1 X_1 + \cdots + t_n X_n} \right] \right)_{\vec{t} = \vec{0}} \\
&= \left( \text{E}\!\left[ \frac{\partial}{\partial t_i} e^{t_1 X_1 + \cdots + t_n X_n} \right] \right)_{\vec{t} = \vec{0}} \\
&= \left( \text{E}\!\left[ X_i e^{t_1 X_1 + \cdots + t_n X_n} \right] \right)_{\vec{t} = \vec{0}} \\
&= \text{E}\!\left[ X_i \right].
\end{align*}\] Thus, it follows that \[
\left. \nabla M \right\rvert_{\vec{t} = \vec{0}} = \text{E}\!\left[ \vec{X} \right].
\]
Similarly, the \(ij\)-entry of the Hessian \(H_M\) is \[\begin{align*}
\left( \left. H_M \right\rvert_{\vec{t} = \vec{0}} \right)_{ij} &= \left( \frac{\partial^2}{\partial t_i \partial t_j} \text{E}\!\left[ e^{t_1 X_1 + \cdots + t_n X_n} \right] \right)_{\vec{t} = \vec{0}} \\
&= \left( \text{E}\!\left[ \frac{\partial^2}{\partial t_i \partial t_j} e^{t_1 X_1 + \cdots + t_n X_n} \right] \right)_{\vec{t} = \vec{0}} \\
&= \left( \text{E}\!\left[ X_j \frac{\partial}{\partial t_i} e^{t_1 X_1 + \cdots + t_n X_n} \right] \right)_{\vec{t} = \vec{0}} \\
&= \left( \text{E}\!\left[ X_i X_j e^{t_1 X_1 + \cdots + t_n X_n} \right] \right)_{\vec{t} = \vec{0}} \\
&= \text{E}\!\left[ X_i X_j \right]
\end{align*}\] for any \(i\) and \(j\). Thus, it follows that \[
\left. H_M \right\rvert_{\vec{t} = \vec{0}} = \text{E}\!\left[ \vec{X} \vec{X}^\intercal \right].
\]
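To see Proposition 43.6 in action on a toy example, consider a random vector \(\vec X\) that equals \((1, 0)\) with probability \(p\) and \((0, 1)\) with probability \(1 - p\), so that \(M(\vec t) = p e^{t_1} + (1-p) e^{t_2}\). The sketch below, assuming sympy is available for symbolic differentiation, evaluates the gradient and Hessian at \(\vec t = \vec 0\).

```python
# Sketch (assumes sympy): differentiate a toy MGF symbolically and evaluate at 0.
import sympy as sp

t1, t2, p = sp.symbols("t1 t2 p", positive=True)

# MGF of X, where X = (1, 0) with probability p and (0, 1) with probability 1 - p:
# E[e^{t . X}] = p e^{t1} + (1 - p) e^{t2}.
M = p * sp.exp(t1) + (1 - p) * sp.exp(t2)

grad = sp.Matrix([sp.diff(M, t) for t in (t1, t2)]).subs({t1: 0, t2: 0})
hess = sp.hessian(M, (t1, t2)).subs({t1: 0, t2: 0})

print(grad)   # Matrix([[p], [1 - p]])        -> E[X]
print(hess)   # Matrix([[p, 0], [0, 1 - p]])  -> E[X X^T]
```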
Multinomial Distribution
We have seen how the number of reds in \(n\) spins of a roulette wheel can be modeled as a binomial random variable \(X\). This is because on each spin of the roulette wheel, there are two possible outcomes, “red” or “not red,” and the spins are independent.
What if we instead wanted to model the number of times each color (red, black, or green) comes up? This information cannot be represented by a single random variable. We need a random vector \[ \vec X = (X_1, X_2, X_3), \] whose elements represent the counts of red, black, and green, respectively.
What is the PMF of \(\vec X\)? To answer this question, we need to calculate \[
f_{\vec X}(\vec x) = P(\vec X = \vec x) = P(X_1 = x_1, X_2 = x_2, X_3 = x_3).
\] This probability is zero unless \(x_1 + x_2 + x_3 = n\). In that case, any particular sequence of \(x_1\) reds, \(x_2\) blacks, and \(x_3\) greens has probability \[ \left(\frac{18}{38}\right)^{x_1} \left(\frac{18}{38}\right)^{x_2} \left(\frac{2}{38}\right)^{x_3}, \] and there are \(\frac{n!}{x_1! x_2! x_3!}\) ways to order these \(n\) outcomes. Therefore, the PMF is \[
f_{\vec X}(\vec x) = \frac{n!}{x_1! x_2! x_3!} \left(\frac{18}{38}\right)^{x_1} \left(\frac{18}{38}\right)^{x_2} \left(\frac{2}{38}\right)^{x_3}.
\] This is an example of a named distribution called the multinomial distribution.
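We can evaluate this PMF numerically and confirm that it matches scipy's built-in multinomial distribution, assuming scipy is available; the values \(n = 10\) and \(\vec x = (4, 4, 2)\) are arbitrary.

```python
# Sketch (assumes scipy): probability of exactly 4 reds, 4 blacks, and 2 greens
# in n = 10 spins, computed by hand and with scipy.stats.multinomial.
from math import factorial
from scipy.stats import multinomial

n = 10
p = [18/38, 18/38, 2/38]
x = [4, 4, 2]

by_hand = (factorial(n) / (factorial(4) * factorial(4) * factorial(2))
           * p[0]**4 * p[1]**4 * p[2]**2)
print(by_hand)
print(multinomial.pmf(x, n=n, p=p))    # same value
```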
Definition 43.4 (Multinomial distribution) If each of \(n\) independent trials results in one of \(k\) outcomes, with probabilities \(\vec p = (p_1, \dots, p_k)\), then the random vector \[
\vec X = (X_1, \dots, X_k)
\] of counts of each outcome is said to follow a \(\text{Multinomial}(n, \vec p)\) distribution.
The PMF of \(\vec X\) is \[
f_{\vec X}(\vec x) = \frac{n!}{x_1! x_2! \dots x_k!} p_1^{x_1} p_2^{x_2} \dots p_k^{x_k}; \qquad \sum_{i=1}^k x_i = n.
\tag{43.7}\]
The coefficient in Equation 43.7 is called a multinomial coefficient and is sometimes written as \[
\binom{n}{x_1, \dots, x_k} \overset{\text{def}}{=}\frac{n!}{x_1! x_2! \dots x_k!}.
\]
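As a quick arithmetic illustration (standard library only; the counts are an arbitrary example), the multinomial coefficient can be computed either from the factorial formula or as a product of binomial coefficients, choosing which trials fall in each category one category at a time.

```python
# Sketch: two equivalent ways to compute the multinomial coefficient (10 choose 4, 4, 2).
from math import comb, factorial

n, counts = 10, [4, 4, 2]

from_factorials = factorial(n)
for x in counts:
    from_factorials //= factorial(x)

# Choose which 4 trials are category 1, then 4 of the remaining 6, then the last 2.
as_binomials = comb(10, 4) * comb(6, 4) * comb(2, 2)

print(from_factorials, as_binomials)   # both are 3150
```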
The binomial distribution is a special case of the multinomial when \(k=2\). However, in the binomial distribution, we only keep track of the first element \(X_1\) of the random vector \(\vec X = (X_1, X_2)\), since \(X_2 = n - X_1\).
Previously, we lumped the black and green categories into a single “not red” category to get a binomial (which is a special case of the multinomial). This lumping works more generally: if we lump categories of a multinomial together, we get another multinomial.
Proposition 43.7 (Multinomial lumping) Let \(\vec X\) be \(\text{Multinomial}(n, \vec p = (p_1, p_2, \dots, p_k))\). Suppose we lump the last two categories into a single category. That is, define \(\vec Y = (X_1, X_2, \dots, X_{k-2}, X_{k-1}+X_k)\). Then, \[
\vec Y \sim \text{Multinomial}(n, \vec p = (p_1, p_2, \dots, p_{k-2}, p_{k-1} + p_k)).
\]
The PMF of \(\vec{Y}\) is, for \(y_1 + \cdots + y_{k-1} = n\), \[\begin{alignat*}{2}
f_{\vec{Y}}(\vec{y}) &= P(X_1 = y_1, X_2 = y_2, \dots, X_{k-2} = y_{k-2}, X_{k-1}+X_k = y_{k-1}) \\
&= \frac{n!}{y_1! y_2! \cdots y_{k-1}!} p_1^{y_1} p_2^{y_2} \cdots p_{k-2}^{y_{k-2}} \sum_{j=0}^{y_{k-1}} \binom{y_{k-1}}{j} p_{k-1}^j p_k^{y_{k-1}-j} \qquad \qquad &&(\text{summing through all $X_{k-1} + X_k = y_{k-1}$ cases}) \\
&= \frac{n!}{y_1! y_2! \cdots y_{k-1}!} p_1^{y_1} p_2^{y_2} \cdots p_{k-2}^{y_{k-2}} (p_{k-1}+p_k)^{y_{k-1}} && (\text{binomial theorem}) \\
\end{alignat*}\] Thus, it follows that \(\vec{Y} \sim \text{Multinomial}(n, \vec{p} = (p_1, p_2, \dots, p_{k-2}, p_{k-1}+p_k))\).
Although Proposition 43.7 requires a formal proof, the result is intuitive. If we treat the last two categories as a single category, then the chance that each trial lands in this new category is \(p_{k-1} + p_k\), and nothing else changes.
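As a simulation sanity check of Proposition 43.7, assuming numpy is available, we can lump the last two categories of multinomial draws and compare them with draws generated directly from the lumped distribution; the parameters below are illustrative.

```python
# Sketch (assumes numpy): lumping the last two categories of a
# Multinomial(20, (0.5, 0.3, 0.15, 0.05)) should behave like a
# Multinomial(20, (0.5, 0.3, 0.2)).
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, [0.5, 0.3, 0.15, 0.05]

X = rng.multinomial(n, p, size=200_000)                       # k = 4 categories
Y = np.column_stack([X[:, :2], X[:, 2] + X[:, 3]])            # lump the last two

Y_direct = rng.multinomial(n, [0.5, 0.3, 0.2], size=200_000)  # Proposition 43.7

print(Y.mean(axis=0), Y_direct.mean(axis=0))                  # both approx. (10, 6, 4)
print(np.mean(Y[:, 2] == 4), np.mean(Y_direct[:, 2] == 4))    # similar probabilities
```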
A useful corollary of Proposition 43.7 is that \(X_1\) is binomial, since we can lump the rest of the categories together to obtain the random vector \[
(X_1, X_2 + \dots + X_k) \sim \text{Multinomial}(n, \vec p = (p_1, \underbrace{p_2 + \dots + p_k}_{1 - p_1})),
\] which is really just a binomial in disguise. Of course, there is nothing special about the first category; the same is true for any component of \(\vec X\).
Corollary 43.1 (Multinomial marginals) Let \(\vec X\) be \(\text{Multinomial}(n, \vec p = (p_1, p_2, \dots, p_k))\). Then, each component \(X_j\) of \(\vec X\) follows a \(\text{Binomial}(n, p_j)\) distribution.
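A small exact check of Corollary 43.1, assuming scipy is available: for an illustrative \(\text{Multinomial}(5, (0.2, 0.5, 0.3))\), summing the joint PMF over the other categories recovers the \(\text{Binomial}(5, 0.2)\) PMF of \(X_1\).

```python
# Sketch (assumes scipy): the marginal PMF of X1 matches Binomial(n, p1).
from scipy.stats import binom, multinomial

n, p = 5, [0.2, 0.5, 0.3]

for x1 in range(n + 1):
    # Sum the joint PMF over all (x2, x3) with x1 + x2 + x3 = n.
    marginal = sum(multinomial.pmf([x1, x2, n - x1 - x2], n=n, p=p)
                   for x2 in range(n - x1 + 1))
    print(x1, round(marginal, 6), round(binom.pmf(x1, n, p[0]), 6))
```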
Finally, we derive the expectation (mean vector) and variance (covariance matrix) of the multinomial distribution.
Proposition 43.8 (Multinomial expectation and variance) Let \(\vec X\) be \(\text{Multinomial}(n, \vec p = (p_1, p_2, \dots, p_k))\). Then, \[\begin{align}
\text{E}\!\left[ \vec X \right] &= n\vec p & \text{Var}\!\left[ \vec X \right] &= n(\text{diag}(\vec p) - \vec p \vec p^\intercal).
\end{align}\]
We can represent the outcome of each trial by a binary vector of the form \[
\vec I = (0, \dots, 0, 1, 0, \dots, 0),
\] which has a \(1\) in the position corresponding to the category that was observed and a \(0\) in all other positions.
Then, we can express \[ \vec X = \vec I_1 + \vec I_2 + \dots + \vec I_n, \] where \(\vec I_1, \vec I_2, \dots, \vec I_n\) are i.i.d. This is exactly the vector version of the argument we presented in Example 14.3, where we expressed a binomial random variable \(X\) as a sum of Bernoulli random variables \(I_1, \dots, I_n\). The vectors \(\vec I_1, \dots, \vec I_n\) are sometimes called multinoulli random variables by analogy.
Next, we determine the mean vector and covariance matrix of any one of the multinoulli random variables, which are i.i.d. The mean vector is simply \[
\text{E}\!\left[ \vec I_1 \right] = \vec p,
\] since \(\text{E}\!\left[ I_{1j} \right] = p_j\).
The covariance matrix requires more work. The diagonal elements are \[ \text{Var}\!\left[ I_{1j} \right] = p_j (1 - p_j), \] while the off-diagonal elements are \[ \text{Cov}\!\left[ I_{1j}, I_{1j'} \right] = \text{E}\big[\underbrace{I_{1j} I_{1j'}}_0\big] - \text{E}\!\left[ I_{1j} \right] \text{E}\!\left[ I_{1j'} \right] = -p_j p_{j'}. \] (The first term is zero because all but one entry of \(\vec I_1\) is \(0\), so the product of any two distinct entries must be \(0\).) We can write both cases concisely in matrix form as \[ \text{Var}\!\left[ \vec I_{1} \right] = \text{diag}(\vec p) - \vec p \vec p^\intercal. \]
Thus, it follows that \[
\text{E}\!\left[ \vec I_j \right] = \vec{p} \qquad \text{and} \qquad \text{Var}\!\left[ \vec I_j \right] = \text{diag}(\vec p) - \vec{p} \vec{p}^\intercal.
\] Since the \(I_j\)’s are independent, it follows that \[
\text{E}\!\left[ \vec{X} \right] = \sum_{j=1}^n \text{E}\!\left[ \vec{I_j} \right] = n \vec{p}
\] by Proposition 43.2, and \[
\text{Var}\!\left[ \vec{X} \right] = \sum_{j=1}^n \text{Var}\!\left[ \vec{I_j} \right] = n ( \text{diag}(\vec{p}) - \vec{p} \vec{p}^\intercal)
\] by Proposition 43.4.
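A numerical check of Proposition 43.8, assuming numpy is available: the sample mean vector and covariance matrix of simulated multinomial draws should be close to \(n \vec p\) and \(n(\text{diag}(\vec p) - \vec p \vec p^\intercal)\); the parameters below are illustrative.

```python
# Sketch (assumes numpy): compare sample moments of multinomial draws with
# the formulas n p and n (diag(p) - p p^T).
import numpy as np

rng = np.random.default_rng(6)
n = 30
p = np.array([0.5, 0.3, 0.2])

X = rng.multinomial(n, p, size=500_000)

print(X.mean(axis=0))                      # approx. n p = (15, 9, 6)
print(n * p)
print(np.cov(X, rowvar=False))             # approx. n (diag(p) - p p^T)
print(n * (np.diag(p) - np.outer(p, p)))
```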
Finally, we conclude by calculating the MGF of the multinomial distribution.
Proposition 43.9 (MGF of the multinomial distribution) If \(\vec X\) is \(\text{Multinomial}(n, \vec p = (p_1, \dots, p_k))\), then its moment generating function is \[
M_{\vec X}(\vec t) = \left(\sum_{j=1}^k p_j e^{t_j}\right)^n.
\tag{43.8}\]
Perhaps the easiest way to show this is to use the representation of \(\vec X\) as \(\vec I_1 + \dots + \vec I_n\), where \(\vec I_1, \dots, \vec I_n\) are i.i.d. multinoulli random vectors.
The MGF of \(\vec I_1\) is \[ M_{\vec I_1}(\vec t) = \text{E}\!\left[ e^{\vec t \cdot \vec I_1} \right] = \sum_{j=1}^k p_j e^{t_j}, \] since exactly one element of \(\vec I_1\) is \(1\), and the probability that it is the \(j\)th entry is \(p_j\), in which case \(e^{\vec t \cdot \vec I_1}\) evaluates to \(e^{t_j}\).
The MGFs of \(\vec I_1, \dots, \vec I_n\) are all the same, so by Proposition 43.5, the MGF of \(\vec X\) is \[ M_{\vec X}(\vec t) = M_{\vec I_1}(\vec t)^n = \left(\sum_{j=1}^k p_j e^{t_j}\right)^n. \]
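As one last check, assuming numpy is available, we can estimate \(\text{E}\!\left[ e^{\vec t \cdot \vec X} \right]\) by Monte Carlo and compare it with Equation 43.8 at an arbitrary \(\vec t\); the parameters below are illustrative.

```python
# Sketch (assumes numpy): Monte Carlo estimate of the multinomial MGF versus
# the closed form (sum_j p_j e^{t_j})^n.
import numpy as np

rng = np.random.default_rng(7)
n = 8
p = np.array([0.5, 0.3, 0.2])
t = np.array([0.1, -0.2, 0.05])            # illustrative choice of t

X = rng.multinomial(n, p, size=1_000_000)

print(np.mean(np.exp(X @ t)))              # Monte Carlo estimate of the MGF at t
print(np.sum(p * np.exp(t)) ** n)          # Equation 43.8
```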
Exercises
Exercise 43.1 (Alternative proof of multinomial variance) In Proposition 43.8, we used indicator random variables to show that \(\text{Cov}\!\left[ X_j, X_{j'} \right] = -n p_j p_{j'}\) for \(j \neq j'\).
- Expand \(\text{Var}\!\left[ X_j + X_{j'} \right]\) using properties of covariance.
- Use what you know about the distribution of \(X_j + X_{j'}\) to determine its variance directly, and use this to solve for \(\text{Cov}\!\left[ X_j, X_{j'} \right]\).
Exercise 43.2 (Conditional distribution in a multinomial) If \(\vec X\) is \(\text{Multinomial}(n, \vec p = (p_1, p_2, \dots, p_k))\), what is the conditional distribution of \((X_2, \dots, X_k)\) given \(X_1 = x_1\)? Provide the formal calculation, as well as an intuitive explanation.
Exercise 43.3 (Using the multinomial MGF) Use the moment generating function of the multinomial distribution (Equation 43.8) to calculate the mean vector and covariance matrix of the multinomial distribution.