Random Vectors
So far, we have only considered scalar-valued random variables. Even when we considered the joint distribution of multiple random variables \(X_1, \dots, X_n\), we wrote their joint PMF or PDF as \[ f_{X_1, \dots, X_n}(x_1, \dots, x_n). \]
It is often more convenient to think of multiple random variables as a single random vector \[ \vec X = (X_1, \dots, X_n) \] with joint PMF or PDF \(f_{\vec X}(\vec x)\).
Expectations and variances can also be defined for random vectors.
Definition 43.1 (Expectation and variance of a random vector) A random vector \(\vec X = (X_1, \dots, X_n)\) has expectation defined by the mean vector \[
\text{E}\!\left[ \vec X \right] \overset{\text{def}}{=}(\text{E}\!\left[ X_1 \right], \dots, \text{E}\!\left[ X_n \right])
\tag{43.1}\] and variance defined by the covariance matrix \[
\text{Var}\!\left[ \vec X \right] \overset{\text{def}}{=}\begin{bmatrix} \text{Var}\!\left[ X_1 \right] & \text{Cov}\!\left[ X_1, X_2 \right] & \dots & \text{Cov}\!\left[ X_1, X_n \right] \\
\text{Cov}\!\left[ X_2, X_1 \right] & \text{Var}\!\left[ X_2 \right] & \dots & \text{Cov}\!\left[ X_2, X_n \right] \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}\!\left[ X_n, X_1 \right] & \text{Cov}\!\left[ X_n, X_2 \right] & \dots & \text{Var}\!\left[ X_n \right]
\end{bmatrix}.
\tag{43.2}\]
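To make Definition 43.1 concrete, here is a minimal simulation sketch, assuming numpy is available; the distribution of \(\vec X\) below is just an illustrative choice. We draw many realizations of a random vector and estimate its mean vector and covariance matrix from the samples.

```python
# Minimal sketch (assumes numpy): estimate the mean vector and covariance
# matrix of a random vector X = (X1, X2) from simulated samples.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

# Illustrative choice: X1 ~ Normal(0, 1), X2 = X1 + Normal(0, 1),
# so Cov[X1, X2] = 1 and Var[X2] = 2.
x1 = rng.normal(0, 1, n_samples)
x2 = x1 + rng.normal(0, 1, n_samples)
X = np.column_stack([x1, x2])        # each row is one realization of the vector

print(X.mean(axis=0))                # approx. (0, 0), the mean vector
print(np.cov(X, rowvar=False))       # approx. [[1, 1], [1, 2]], the covariance matrix
```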
A linear transformation of a random vector is of the form \[ g(\vec X) = A\vec X + \vec b, \] where \(A\) is a matrix and \(\vec b\) is a vector. Note that \(g(\vec X)\) may also be a random vector. Expectations and variances behave as you might expect with respect to linear transformations.
Proposition 43.1 (Linear transformation of a random vector) Let \(\vec X = (X_1, \dots, X_n)\) be a random vector, let \(A\) be a constant \(m \times n\) matrix, and let \(\vec b\) be a constant \(m\)-vector. Then, the mean vector of \(A \vec X + \vec b\) is \[
\text{E}\!\left[ A \vec X + \vec b \right] = A \text{E}\!\left[ \vec X \right] + \vec b,
\tag{43.3}\] and its covariance matrix is \[
\text{Var}\!\left[ A \vec X + \vec b \right] = A \text{Var}\!\left[ \vec X \right] A^\intercal.
\tag{43.4}\]
Let \(\vec Y = A \vec X + \vec b\) be an \(m\)-vector. The proof follows by considering the individual entries \(Y_j\) of \(\vec Y\) and applying properties of expectation and covariance to these entries.
Then, the \(j\)th entry of the mean vector \(\text{E}\!\left[ \vec Y \right]\) is \[
\text{E}\!\left[ Y_j \right] = \text{E}\!\left[ \sum_{i=1}^n A_{ji} X_i + b_j \right] = \sum_{i=1}^n A_{ji} \text{E}\!\left[ X_i \right] + b_j.
\] Rearranging this back into vector form, we obtain Equation 43.3.
Similarly, the \((j, k)\)th entry of the covariance matrix \(\text{Var}\!\left[ \vec Y \right]\) is \[
\begin{align}
\text{Cov}\!\left[ Y_j, Y_k \right] &= \text{Cov}\!\left[ \sum_{i=1}^n A_{ji} X_i + b_j, \sum_{i'=1}^n A_{ki'} X_{i'} + b_k \right] \\
&= \sum_{i=1}^n \sum_{i'=1}^n A_{ji} \text{Cov}\!\left[ X_i, X_{i'} \right] A_{ki'}.
\end{align}
\] This can be written in matrix form as Equation 43.4.
Notice that Proposition 43.1 reduces to Proposition 21.2 and Proposition 21.3 in the scalar case, when \(m = n = 1\). The only formula that does not obviously resemble its scalar counterpart is Equation 43.4. One way to see that Equation 43.4 must be correct is to consider the dimensions of the matrices involved: \[ \underset{m\times n}{A} \underset{n\times n}{\text{Var}[\vec X]} \underset{n\times m}{A^\intercal}. \] Unlike the scalar formula \(a^2 \text{Var}\!\left[ X \right]\), we cannot write \(A^2 \text{Var}[\vec X]\), since \(A^2\) is not even well-defined unless \(A\) is square, and no other arrangement of \(A\), \(A^\intercal\), and \(\text{Var}[\vec X]\) multiplies to an \(m \times m\) matrix.
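Here is a quick simulation check of Proposition 43.1, assuming numpy is available; the matrix \(A\), vector \(\vec b\), and distribution of \(\vec X\) are arbitrary illustrative choices.

```python
# Sketch (assumes numpy): for Y = A X + b, the sample mean of Y should be close
# to A E[X] + b and the sample covariance of Y close to A Var[X] A^T.
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0, 0.5])                  # illustrative E[X]
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.4],
                  [0.0, 0.4, 1.5]])              # illustrative Var[X]
A = np.array([[1.0, 2.0, -1.0],
              [0.5, 0.0,  3.0]])                 # m = 2, n = 3
b = np.array([10.0, -5.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b                                  # each row is A x + b

print(Y.mean(axis=0), A @ mu + b)                # should agree (Equation 43.3)
print(np.cov(Y, rowvar=False))                   # should be close to...
print(A @ Sigma @ A.T)                           # ...A Var[X] A^T (Equation 43.4)
```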
Multiple random vectors behave just like multiple random variables. The proofs of the following results follow from considering each entry of a random vector, which is a random variable, and applying known properties of random variables.
Proposition 43.2 (Linearity of expectation for random vectors) Let \(\vec X = (X_1, \dots, X_n)\) and \(\vec Y = (Y_1, \dots, Y_n)\) be random vectors. Then, \[ \text{E}\!\left[ \vec X + \vec Y \right] = \text{E}\!\left[ \vec X \right] + \text{E}\!\left[ \vec Y \right]. \]
If we have two random vectors (not necessarily the same length), we can summarize their relationship by the cross-covariance matrix.
Definition 43.2 (Cross-covariance between two random vectors) Let \(\vec X = (X_1, \dots, X_m)\) and \(\vec Y = (Y_1, \dots, Y_n)\) be random vectors. Then, the cross-covariance matrix is an \(m\times n\) matrix defined by \[
\text{Cov}\!\left[ \vec X, \vec Y \right] \overset{\text{def}}{=}\begin{bmatrix} \text{Cov}\!\left[ X_1, Y_1 \right] & \text{Cov}\!\left[ X_1, Y_2 \right] & \dots & \text{Cov}\!\left[ X_1, Y_n \right] \\
\text{Cov}\!\left[ X_2, Y_1 \right] & \text{Cov}\!\left[ X_2, Y_2 \right] & \dots & \text{Cov}\!\left[ X_2, Y_n \right] \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}\!\left[ X_m, Y_1 \right] & \text{Cov}\!\left[ X_m, Y_2 \right] & \dots & \text{Cov}\!\left[ X_m, Y_n \right]
\end{bmatrix}.
\tag{43.5}\]
The cross-covariance of any random vector with itself is the covariance matrix: \[ \text{Var}\!\left[ \vec X \right] = \text{Cov}\!\left[ \vec X, \vec X \right]. \]
If \(\vec X\) and \(\vec Y\) are independent, then \(\text{Cov}\!\left[ \vec X, \vec Y \right] = \underset{m \times n}{0}\) (the zero matrix).
Cross-covariance behaves like covariance.
Proposition 43.3 (Properties of covariance for random vectors) Let \(\vec X, \vec Z\) be random vectors of length \(m\), and let \(\vec Y\) be a random vector of length \(n\). Then, the following are true.
- (Symmetry) \(\text{Cov}\!\left[ \vec X, \vec Y \right] = \text{Cov}\!\left[ \vec Y, \vec X \right]^\intercal\).
- (Constants cannot covary) If \(\vec a\) is a constant vector of length \(n\), then \(\text{Cov}\!\left[ \vec X, \vec a \right] = 0\).
- (Multiplying by a constant) If \(A\) is a constant matrix with dimensions \(k \times m\), then \(\text{Cov}\!\left[ A \vec X, \vec Y \right] = A \text{Cov}\!\left[ \vec X, \vec Y \right]\). If \(B\) is a constant matrix with dimensions \(k \times n\), then \(\text{Cov}\!\left[ \vec X, B \vec Y \right] = \text{Cov}\!\left[ \vec X, \vec Y \right] B^\intercal\).
- (Distributive property) \(\text{Cov}\!\left[ \vec X+\vec Z, \vec Y \right] = \text{Cov}\!\left[ \vec X, \vec Y \right] + \text{Cov}\!\left[ \vec Z, \vec Y \right]\).
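The following sketch, assuming numpy is available, checks two of these properties (symmetry and multiplying by a constant) on simulated data; the particular distributions and the matrix \(A\) are arbitrary illustrative choices.

```python
# Sketch (assumes numpy): sample cross-covariance matrices satisfy
# Cov[X, Y] = Cov[Y, X]^T and Cov[A X, Y] = A Cov[X, Y].
import numpy as np

rng = np.random.default_rng(2)
n_samples = 200_000

# Build dependent vectors: X has length m = 2, Y has length n = 3.
Z = rng.normal(size=(n_samples, 3))
X = Z[:, :2] + rng.normal(size=(n_samples, 2))
Y = Z + rng.normal(size=(n_samples, 3))

def cross_cov(U, V):
    """Sample cross-covariance matrix between the columns of U and of V."""
    Uc, Vc = U - U.mean(axis=0), V - V.mean(axis=0)
    return Uc.T @ Vc / (len(U) - 1)

A = np.array([[1.0, -1.0],
              [2.0,  0.5],
              [0.0,  3.0]])                                     # k = 3, m = 2

print(np.allclose(cross_cov(X, Y), cross_cov(Y, X).T))          # symmetry
print(np.allclose(cross_cov(X @ A.T, Y), A @ cross_cov(X, Y)))  # multiplying by a constant
```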
Using Proposition 43.3, we obtain the following result.
Proposition 43.4 (Variance of sum of independent random vectors) Let \(\vec{X_1}, \dots, \vec{X_n}\) be independent random vectors. Then, \[
\text{Var}\!\left[ \vec{X_1} + \cdots + \vec{X_n} \right] = \text{Var}\!\left[ \vec{X_1} \right] + \cdots + \text{Var}\!\left[ \vec{X_n} \right].
\]
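As a quick sanity check of Proposition 43.4, assuming numpy is available, we can simulate two independent random vectors and compare the covariance matrix of their sum with the sum of their covariance matrices; the covariance matrices below are illustrative.

```python
# Sketch (assumes numpy): for independent random vectors, the covariance matrix
# of the sum should be close to the sum of the covariance matrices.
import numpy as np

rng = np.random.default_rng(3)
n_samples = 500_000

Sigma1 = np.array([[1.0, 0.5], [0.5, 2.0]])      # illustrative Var[X1]
Sigma2 = np.array([[3.0, -1.0], [-1.0, 1.5]])    # illustrative Var[X2]

X1 = rng.multivariate_normal([0, 0], Sigma1, size=n_samples)
X2 = rng.multivariate_normal([0, 0], Sigma2, size=n_samples)   # independent of X1

print(np.cov(X1 + X2, rowvar=False))             # approx. Sigma1 + Sigma2
print(Sigma1 + Sigma2)
```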
Just as with random variables, the distribution of a random vector can be uniquely characterized by the moment generating function.
Definition 43.3 (Moment generating function of a random vector) The moment generating function (MGF) of a random vector \(\vec X\) is defined as \[ M_{\vec X}(\vec t) = \text{E}\!\left[ e^{\vec t \cdot \vec X} \right]. \]
All of the usual properties of MGFs hold. For example, a sum of independent random vectors translates to a product of MGFs, the vector analog of Proposition 34.2.
Proposition 43.5 (MGF of a sum of random vectors) Let \(\vec X\) and \(\vec Y\) be independent random vectors. Then, \[
M_{\vec X + \vec Y}(\vec t) = M_{\vec X}(\vec t) M_{\vec Y}(\vec t).
\tag{43.6}\]
\[\begin{align}
M_{\vec X + \vec Y}(\vec t) &= \text{E}\!\left[ e^{\vec t \cdot (\vec X + \vec Y)} \right] \\
&= \text{E}\!\left[ e^{\vec t \cdot \vec X} e^{\vec t \cdot \vec Y} \right] \\
&= \text{E}\!\left[ e^{\vec t \cdot \vec X} \right] \text{E}\!\left[ e^{\vec t \cdot \vec Y} \right] \\
&= M_{\vec X}(\vec t) M_{\vec Y}(\vec t).
\end{align}\]
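For a numerical illustration of Proposition 43.5, assuming numpy is available, we can estimate the MGFs by Monte Carlo at a single fixed \(\vec t\); the distributions and the value of \(\vec t\) are arbitrary choices.

```python
# Sketch (assumes numpy): estimate the MGFs of two independent random vectors
# and of their sum at one fixed t, and compare E[e^{t.(X+Y)}] with the product.
import numpy as np

rng = np.random.default_rng(4)
n_samples = 1_000_000
t = np.array([0.2, -0.1])                        # illustrative choice of t

X = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=n_samples)
Y = rng.exponential(scale=1.0, size=(n_samples, 2))            # independent of X

def mgf_estimate(Z, t):
    """Monte Carlo estimate of E[exp(t . Z)] from the rows of Z."""
    return np.mean(np.exp(Z @ t))

print(mgf_estimate(X + Y, t))                    # approx. ...
print(mgf_estimate(X, t) * mgf_estimate(Y, t))   # ...the same value
```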
We can also calculate moments from this MGF, noting that the first moment is a vector and the second moment is a matrix.
Proposition 43.6 (Generating moments of a random vector with the MGF) The first moment, the mean vector, can be generated by calculating the gradient of the MGF:
\[
\nabla M \rvert_{\vec{t} = \vec{0}} = \text{E}\!\left[ \vec{X} \right].
\]
The second moment, a matrix, can be generated by calculating the Hessian:
\[
H_M \rvert_{\vec{t} = \vec{0}} = \text{E}\!\left[ \vec{X} \vec{X}^\intercal \right].
\]
The \(i\)th component of \(\displaystyle \left. \nabla M \right\rvert_{\vec{t} = \vec{0}}\) is \[\begin{align*}
\left( \left. \nabla M \right\rvert_{\vec{t} = \vec{0}} \right)_i &= \left( \frac{\partial}{\partial t_i} \text{E}\!\left[ e^{t_1 X_1 + \cdots + t_n X_n} \right] \right)_{\vec{t} = \vec{0}} \\
&= \left( \text{E}\!\left[ \frac{\partial}{\partial t_i} e^{t_1 X_1 + \cdots + t_n X_n} \right] \right)_{\vec{t} = \vec{0}} \\
&= \left( \text{E}\!\left[ X_i e^{t_1 X_1 + \cdots + t_n X_n} \right] \right)_{\vec{t} = \vec{0}} \\
&= \text{E}\!\left[ X_i \right].
\end{align*}\] Thus, it follows that \[
\left. \nabla M \right\rvert_{\vec{t} = \vec{0}} = \text{E}\!\left[ \vec{X} \right].
\]
Similarly, the \(ij\)-entry of the Hessian \(H_M\) is \[\begin{align*}
\left( \left. H_M \right\rvert_{\vec{t} = \vec{0}} \right)_{ij} &= \left( \frac{\partial^2}{\partial t_i \partial t_j} \text{E}\!\left[ e^{t_1 X_1 + \cdots + t_n X_n} \right] \right)_{\vec{t} = \vec{0}} \\
&= \left( \text{E}\!\left[ \frac{\partial^2}{\partial t_i \partial t_j} e^{t_1 X_1 + \cdots + t_n X_n} \right] \right)_{\vec{t} = \vec{0}} \\
&= \left( \text{E}\!\left[ X_j \frac{\partial}{\partial t_i} e^{t_1 X_1 + \cdots + t_n X_n} \right] \right)_{\vec{t} = \vec{0}} \\
&= \left( \text{E}\!\left[ X_i X_j e^{t_1 X_1 + \cdots + t_n X_n} \right] \right)_{\vec{t} = \vec{0}} \\
&= \text{E}\!\left[ X_i X_j \right]
\end{align*}\] for any \(i\) and \(j\). Thus, it follows that \[
\left. H_M \right\rvert_{\vec{t} = \vec{0}} = \text{E}\!\left[ \vec{X} \vec{X}^\intercal \right].
\]
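To see Proposition 43.6 in action on a toy example, consider a random vector \(\vec X\) that equals \((1, 0)\) with probability \(p\) and \((0, 1)\) with probability \(1 - p\), so that \(M(\vec t) = p e^{t_1} + (1-p) e^{t_2}\). The sketch below, assuming sympy is available for symbolic differentiation, evaluates the gradient and Hessian at \(\vec t = \vec 0\).

```python
# Sketch (assumes sympy): differentiate a toy MGF symbolically and evaluate at 0.
import sympy as sp

t1, t2, p = sp.symbols("t1 t2 p", positive=True)

# MGF of X, where X = (1, 0) with probability p and (0, 1) with probability 1 - p:
# E[e^{t . X}] = p e^{t1} + (1 - p) e^{t2}.
M = p * sp.exp(t1) + (1 - p) * sp.exp(t2)

grad = sp.Matrix([sp.diff(M, t) for t in (t1, t2)]).subs({t1: 0, t2: 0})
hess = sp.hessian(M, (t1, t2)).subs({t1: 0, t2: 0})

print(grad)   # Matrix([[p], [1 - p]])        -> E[X]
print(hess)   # Matrix([[p, 0], [0, 1 - p]])  -> E[X X^T]
```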
Multinomial Distribution
We have seen how the number of reds in \(n\) spins of a roulette wheel can be modeled as a binomial random variable \(X\). This is because on each spin of the roulette wheel, there are two possible outcomes, “red” or “not red,” and the spins are independent.
What if we instead wanted to model the number of times each color (red, black, or green) comes up? This information cannot be represented by a single random variable. We need a random vector \[ \vec X = (X_1, X_2, X_3), \] whose elements represent the counts of red, black, and green, respectively.
What is the PMF of \(\vec X\)? To answer this question, we need to calculate \[
f_{\vec X}(\vec x) = P(\vec X = \vec x) = P(X_1 = x_1, X_2 = x_2, X_3 = x_3).
\] This probability is zero unless \(x_1 + x_2 + x_3 = n\). In that case, any particular sequence of \(x_1\) reds, \(x_2\) blacks, and \(x_3\) greens has probability \[ \left(\frac{18}{38}\right)^{x_1} \left(\frac{18}{38}\right)^{x_2} \left(\frac{2}{38}\right)^{x_3}, \] and there are \(\frac{n!}{x_1! x_2! x_3!}\) ways to order these \(n\) outcomes. Therefore, the PMF is \[
f_{\vec X}(\vec x) = \frac{n!}{x_1! x_2! x_3!} \left(\frac{18}{38}\right)^{x_1} \left(\frac{18}{38}\right)^{x_2} \left(\frac{2}{38}\right)^{x_3}.
\] This is an example of a named distribution called the multinomial distribution.
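We can evaluate this PMF numerically and confirm that it matches scipy's built-in multinomial distribution, assuming scipy is available; the values \(n = 10\) and \(\vec x = (4, 4, 2)\) are arbitrary.

```python
# Sketch (assumes scipy): probability of exactly 4 reds, 4 blacks, and 2 greens
# in n = 10 spins, computed by hand and with scipy.stats.multinomial.
from math import factorial
from scipy.stats import multinomial

n = 10
p = [18/38, 18/38, 2/38]
x = [4, 4, 2]

by_hand = (factorial(n) / (factorial(4) * factorial(4) * factorial(2))
           * p[0]**4 * p[1]**4 * p[2]**2)
print(by_hand)
print(multinomial.pmf(x, n=n, p=p))    # same value
```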
Definition 43.4 (Multinomial distribution) If each of \(n\) independent trials results in one of \(k\) outcomes, with probabilities \(\vec p = (p_1, \dots, p_k)\), then the random vector \[
\vec X = (X_1, \dots, X_k)
\] of counts of each outcome is said to follow a \(\text{Multinomial}(n, \vec p)\) distribution.
The PMF of \(\vec X\) is \[
f_{\vec X}(\vec x) = \frac{n!}{x_1! x_2! \dots x_k!} p_1^{x_1} p_2^{x_2} \dots p_k^{x_k}; \qquad \sum_{i=1}^k x_i = n.
\tag{43.7}\]
The coefficient in Equation 43.7 is called a multinomial coefficient and is sometimes written as \[
\binom{n}{x_1, \dots, x_k} \overset{\text{def}}{=}\frac{n!}{x_1! x_2! \dots x_k!}.
\]
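As a quick arithmetic illustration (standard library only; the counts are an arbitrary example), the multinomial coefficient can be computed either from the factorial formula or as a product of binomial coefficients, choosing which trials fall in each category one category at a time.

```python
# Sketch: two equivalent ways to compute the multinomial coefficient (10 choose 4, 4, 2).
from math import comb, factorial

n, counts = 10, [4, 4, 2]

from_factorials = factorial(n)
for x in counts:
    from_factorials //= factorial(x)

# Choose which 4 trials are category 1, then 4 of the remaining 6, then the last 2.
as_binomials = comb(10, 4) * comb(6, 4) * comb(2, 2)

print(from_factorials, as_binomials)   # both are 3150
```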
The binomial distribution is a special case of the multinomial when \(k=2\). However, in the binomial distribution, we only keep track of the first element \(X_1\) of the random vector \(\vec X = (X_1, X_2)\), since \(X_2 = n - X_1\).
Previously, we lumped the black and green categories into a single “not red” category to get a binomial (which is a special case of the multinomial). This lumping works more generally: if we lump categories of a multinomial together, we get another multinomial.
Proposition 43.7 (Multinomial lumping) Let \(\vec X\) be \(\text{Multinomial}(n, \vec p = (p_1, p_2, \dots, p_k))\). Suppose we lump the last two categories into a single category. That is, define \(\vec Y = (X_1, X_2, \dots, X_{k-2}, X_{k-1}+X_k)\). Then, \[
\vec Y \sim \text{Multinomial}(n, \vec p = (p_1, p_2, \dots, p_{k-2}, p_{k-1} + p_k)).
\]
The PMF of \(\vec{Y}\) is, for \(y_1 + \cdots + y_{k-1} = n\), \[\begin{alignat*}{2}
f_{\vec{Y}}(\vec{y}) &= P(X_1 = y_1, X_2 = y_2, \dots, X_{k-2} = y_{k-2}, X_{k-1}+X_k = y_{k-1}) \\
&= \frac{n!}{y_1! y_2! \cdots y_{k-1}!} p_1^{y_1} p_2^{y_2} \cdots p_{k-2}^{y_{k-2}} \sum_{j=0}^{y_{k-1}} \binom{y_{k-1}}{j} p_{k-1}^j p_k^{y_{k-1}-j} \qquad \qquad &&(\text{summing through all $X_{k-1} + X_k = y_{k-1}$ cases}) \\
&= \frac{n!}{y_1! y_2! \cdots y_{k-1}!} p_1^{y_1} p_2^{y_2} \cdots p_{k-2}^{y_{k-2}} (p_{k-1}+p_k)^{y_{k-1}} && (\text{binomial theorem}) \\
\end{alignat*}\] Thus, it follows that \(\vec{Y} \sim \text{Multinomial}(n, \vec{p} = (p_1, p_2, \dots, p_{k-2}, p_{k-1}+p_k))\).
Although Proposition 43.7 requires a formal proof, the result is intuitive. If we treat the last two categories as a single category, then the chance that each trial lands in this new category is \(p_{k-1} + p_k\), and nothing else changes.
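As a simulation sanity check of Proposition 43.7, assuming numpy is available, we can lump the last two categories of multinomial draws and compare them with draws generated directly from the lumped distribution; the parameters below are illustrative.

```python
# Sketch (assumes numpy): lumping the last two categories of a
# Multinomial(20, (0.5, 0.3, 0.15, 0.05)) should behave like a
# Multinomial(20, (0.5, 0.3, 0.2)).
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, [0.5, 0.3, 0.15, 0.05]

X = rng.multinomial(n, p, size=200_000)                       # k = 4 categories
Y = np.column_stack([X[:, :2], X[:, 2] + X[:, 3]])            # lump the last two

Y_direct = rng.multinomial(n, [0.5, 0.3, 0.2], size=200_000)  # Proposition 43.7

print(Y.mean(axis=0), Y_direct.mean(axis=0))                  # both approx. (10, 6, 4)
print(np.mean(Y[:, 2] == 4), np.mean(Y_direct[:, 2] == 4))    # similar probabilities
```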
A useful corollary of Proposition 43.7 is that \(X_1\) is binomial, since we can lump the rest of the categories together to obtain the random vector \[
(X_1, X_2 + \dots + X_k) \sim \text{Multinomial}(n, \vec p = (p_1, \underbrace{p_2 + \dots + p_k}_{1 - p_1})),
\] which is really just a binomial in disguise. Of course, there is nothing special about the first category; the same is true for any component of \(\vec X\).
Corollary 43.1 (Multinomial marginals) Let \(\vec X\) be \(\text{Multinomial}(n, \vec p = (p_1, p_2, \dots, p_k))\). Then, each component \(X_j\) of \(\vec X\) follows a \(\text{Binomial}(n, p_j)\) distribution.
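A small exact check of Corollary 43.1, assuming scipy is available: for an illustrative \(\text{Multinomial}(5, (0.2, 0.5, 0.3))\), summing the joint PMF over the other categories recovers the \(\text{Binomial}(5, 0.2)\) PMF of \(X_1\).

```python
# Sketch (assumes scipy): the marginal PMF of X1 matches Binomial(n, p1).
from scipy.stats import binom, multinomial

n, p = 5, [0.2, 0.5, 0.3]

for x1 in range(n + 1):
    # Sum the joint PMF over all (x2, x3) with x1 + x2 + x3 = n.
    marginal = sum(multinomial.pmf([x1, x2, n - x1 - x2], n=n, p=p)
                   for x2 in range(n - x1 + 1))
    print(x1, round(marginal, 6), round(binom.pmf(x1, n, p[0]), 6))
```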
Finally, we derive the expectation (mean vector) and variance (covariance matrix) of the multinomial distribution.
Proposition 43.8 (Multinomial expectation and variance) Let \(\vec X\) be \(\text{Multinomial}(n, \vec p = (p_1, p_2, \dots, p_k))\). Then, \[\begin{align}
\text{E}\!\left[ \vec X \right] &= n\vec p & \text{Var}\!\left[ \vec X \right] &= n(\text{diag}(\vec p) - \vec p \vec p^\intercal).
\end{align}\]
We can represent the outcome of each trial by a binary vector of the form \[
\vec I = (0, \dots, 0, 1, 0, \dots, 0),
\] which has a \(1\) in the position corresponding to the category that was observed and a \(0\) in all other positions.
Then, we can express \[ \vec X = \vec I_1 + \vec I_2 + \dots + \vec I_n, \] where \(\vec I_1, \vec I_2, \dots, \vec I_n\) are i.i.d. This is exactly the vector version of the argument we presented in Example 14.3, where we expressed a binomial random variable \(X\) as a sum of Bernoulli random variables \(I_1, \dots, I_n\). The vectors \(\vec I_1, \dots, \vec I_n\) are sometimes called multinoulli random variables by analogy.
Next, we determine the mean vector and covariance matrix of any one of the multinoulli random variables, which are i.i.d. The mean vector is simply \[
\text{E}\!\left[ \vec I_1 \right] = \vec p,
\] since \(\text{E}\!\left[ I_{1j} \right] = p_j\).
The covariance matrix requires more work. The diagonal elements are \[ \text{Var}\!\left[ I_{1j} \right] = p_j (1 - p_j), \] while the off-diagonal elements are \[ \text{Cov}\!\left[ I_{1j}, I_{1j'} \right] = \text{E}\big[\underbrace{I_{1j} I_{1j'}}_0\big] - \text{E}\!\left[ I_{1j} \right] \text{E}\!\left[ I_{1j'} \right] = -p_j p_{j'}. \] (The first term is zero because all but one entry of \(\vec I_1\) is \(0\), so the product of any two distinct entries must be \(0\).) We can write both cases concisely in matrix form as \[ \text{Var}\!\left[ \vec I_{1} \right] = \text{diag}(\vec p) - \vec p \vec p^\intercal. \]
Thus, it follows that \[
\text{E}\!\left[ \vec I_j \right] = \vec{p} \qquad \text{and} \qquad \text{Var}\!\left[ \vec I_j \right] = \text{diag}(\vec p) - \vec{p} \vec{p}^\intercal.
\] Since the \(I_j\)’s are independent, it follows that \[
\text{E}\!\left[ \vec{X} \right] = \sum_{j=1}^n \text{E}\!\left[ \vec{I_j} \right] = n \vec{p}
\] by Proposition 43.2, and \[
\text{Var}\!\left[ \vec{X} \right] = \sum_{j=1}^n \text{Var}\!\left[ \vec{I_j} \right] = n ( \text{diag}(\vec{p}) - \vec{p} \vec{p}^\intercal)
\] by Proposition 43.4.
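A numerical check of Proposition 43.8, assuming numpy is available: the sample mean vector and covariance matrix of simulated multinomial draws should be close to \(n \vec p\) and \(n(\text{diag}(\vec p) - \vec p \vec p^\intercal)\); the parameters below are illustrative.

```python
# Sketch (assumes numpy): compare sample moments of multinomial draws with
# the formulas n p and n (diag(p) - p p^T).
import numpy as np

rng = np.random.default_rng(6)
n = 30
p = np.array([0.5, 0.3, 0.2])

X = rng.multinomial(n, p, size=500_000)

print(X.mean(axis=0))                      # approx. n p = (15, 9, 6)
print(n * p)
print(np.cov(X, rowvar=False))             # approx. n (diag(p) - p p^T)
print(n * (np.diag(p) - np.outer(p, p)))
```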
Finally, we conclude by calculating the MGF of the multinomial distribution.
Proposition 43.9 (MGF of the multinomial distribution) If \(\vec X\) is \(\text{Multinomial}(n, \vec p = (p_1, \dots, p_k))\), then its moment generating function is \[
M_{\vec X}(\vec t) = \left(\sum_{j=1}^k p_j e^{t_j}\right)^n.
\tag{43.8}\]
Perhaps the easiest way to show this is to use the representation of \(\vec X\) as \(\vec I_1 + \dots + \vec I_n\), where \(\vec I_1, \dots, \vec I_n\) are i.i.d. multinoulli random vectors.
The MGF of \(\vec I_1\) is \[ M_{\vec I_1}(\vec t) = \text{E}\!\left[ e^{\vec t \cdot \vec I_1} \right] = \sum_{j=1}^k p_j e^{t_j}, \] since exactly one element of \(\vec I_1\) is \(1\), and the probability that it is the \(j\)th entry is \(p_j\), in which case \(e^{\vec t \cdot \vec I_1}\) evaluates to \(e^{t_j}\).
The MGFs of \(\vec I_1, \dots, \vec I_n\) are all the same, so by Proposition 43.5, the MGF of \(\vec X\) is \[ M_{\vec X}(\vec t) = M_{\vec I_1}(\vec t)^n = \left(\sum_{j=1}^k p_j e^{t_j}\right)^n. \]
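As one last check, assuming numpy is available, we can estimate \(\text{E}\!\left[ e^{\vec t \cdot \vec X} \right]\) by Monte Carlo and compare it with Equation 43.8 at an arbitrary \(\vec t\); the parameters below are illustrative.

```python
# Sketch (assumes numpy): Monte Carlo estimate of the multinomial MGF versus
# the closed form (sum_j p_j e^{t_j})^n.
import numpy as np

rng = np.random.default_rng(7)
n = 8
p = np.array([0.5, 0.3, 0.2])
t = np.array([0.1, -0.2, 0.05])            # illustrative choice of t

X = rng.multinomial(n, p, size=1_000_000)

print(np.mean(np.exp(X @ t)))              # Monte Carlo estimate of the MGF at t
print(np.sum(p * np.exp(t)) ** n)          # Equation 43.8
```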
Exercises
Exercise 43.1 (Alternative proof of multinomial variance) In Proposition 43.8, we used indicator random variables to show that \(\text{Cov}\!\left[ X_j, X_{j'} \right] = -n p_j p_{j'}\) for \(j \neq j'\).
- Expand \(\text{Var}\!\left[ X_j + X_{j'} \right]\) using properties of covariance.
- Use what you know about the distribution of \(X_j + X_{j'}\) to determine its variance directly, and use this to solve for \(\text{Cov}\!\left[ X_j, X_{j'} \right]\).
Exercise 43.2 (Conditional distribution in a multinomial) If \(\vec X\) is \(\text{Multinomial}(n, \vec p = (p_1, p_2, \dots, p_k))\), what is the conditional distribution of \((X_2, \dots, X_k)\) given \(X_1 = x_1\)? Provide the formal calculation, as well as an intuitive explanation.
Exercise 43.3 (Using the multinomial MGF) Use the moment generating function of the multinomial distribution (Equation 43.8) to calculate the mean vector and covariance matrix of the multinomial distribution.