43  Multivariate Transformations

Given i.i.d. data \(X_1, \dots, X_n\), we already know how to determine the distributions of certain statistics, such as the sum \(X_1 + \dots + X_n\).

A general statistic has the form \(g(X_1, \dots, X_n)\). In this chapter, we will learn a general strategy for deriving the distribution of such a statistic when the random variables are continuous and \(g\) is differentiable.

43.1 Univariate Transformations

As a warm-up, consider the case of a single continuous random variable \(X\) with PDF \(f_X(x)\). What is the PDF of \(Y = g(X)\)?

In Chapter 20, we presented a general strategy for deriving the PDF of \(Y\). First, find the CDF of \(Y\), then differentiate to obtain the PDF. However, when \(g(x)\) is a differentiable function that is strictly increasing (or decreasing), there is a simple recipe.

Theorem 43.1 (Univariate transformations) Let \(X\) be a continuous random variable with PDF \(f_X(x)\) and support \((a, b)\) (where \(a\) and \(b\) may be \(\pm\infty\)). Let \(g(x)\) be a differentiable one-to-one function such that \(g'(x) \neq 0\) on \((a, b)\). Then, the PDF of \(Y = g(X)\) is \[ f_Y(y) = f_X(x) \left| \frac{dx}{dy} \right| = f_X(x) \left| \frac{1}{g'(x)} \right|, \tag{43.1}\] where \(x = g^{-1}(y)\).

Proof

By assumption, \(g'(x) \neq 0\) on \((a, b)\). Moreover, since \(g\) is one-to-one and continuous on the interval \((a, b)\), it must be strictly monotone, so either \(g'(x) > 0\) for all \(x \in (a, b)\) or \(g'(x) < 0\) for all \(x \in (a, b)\).

First, suppose \(g'(x) > 0\). Then, the CDF of \(Y\) is \[ F_Y(y) = P(Y \leq y) = P(g(X) \leq y) = P(X \leq g^{-1}(y)) = F_X(g^{-1}(y)) \] for all \(y\) in the support of \(Y\). Note that we used the fact that \(g\) is strictly increasing to assert that the events \(\{ g(X) \leq y \}\) and \(\{ X \leq g^{-1}(y) \}\) are equivalent.

Now, we differentiate to obtain the PDF: \[ f_Y(y) = f_X(g^{-1}(y)) \frac{dx}{dy}, \] where we applied the chain rule to \(x = g^{-1}(y)\).

On the other hand, if \(g'(x) < 0\), then \[ F_Y(y) = P(Y \leq y) = P(g(X) \leq y) = P(X \geq g^{-1}(y)) = 1 - F_X(g^{-1}(y)), \] and \[ f_Y(y) = - f_X(g^{-1}(y)) \frac{dx}{dy}. \]

Note that \(\frac{dx}{dy} > 0\) in the case where \(g'(x) > 0\), and \(\frac{dx}{dy} < 0\) in the case where \(g'(x) < 0\). If we add an absolute value sign, we can cover both cases with a single formula: \[ f_X(g^{-1}(y)) \left| \frac{dx}{dy} \right|. \]

Notice that by the inverse function theorem, we can write \[ \frac{dx}{dy} = \frac{1}{\frac{dy}{dx}} = \frac{1}{g'(x)}. \]

One natural application of Theorem 43.1 is to location-scale transformations.

Example 43.1 (Location-scale transformations revisited) Here is another proof of Proposition 20.3.

Let \(Y = aX + b\), where \(a \neq 0\), and consider \(g(x) = ax + b\). Then, \(g'(x) = a\) and by Theorem 43.1, \[ f_Y(y) = f_X(x) \left| \frac{1}{g'(x)} \right| = f_X(x) \frac{1}{|a|}. \]

Although this is technically correct, we usually express the PDF of \(Y\) in terms of \(y\). Since \[ x = g^{-1}(y) = \frac{y-b}{a}, \] we can substitute to obtain \[ f_Y(y) = f_X \left( \frac{y-b}{a} \right) \frac{1}{|a|}, \] which matches the result from Proposition 20.3.
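
To see Theorem 43.1 in action numerically, here is a minimal Python sketch (using NumPy; the specific choice \(X \sim \text{Exponential}(\lambda = 2)\), \(a = 3\), \(b = -1\) is ours for illustration) that compares a histogram of simulated values of \(Y = aX + b\) with the density \(f_X\!\left(\frac{y-b}{a}\right)\frac{1}{|a|}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative example: X ~ Exponential(lambda = 2) and Y = g(X) = a*X + b.
lam, a, b = 2.0, 3.0, -1.0

def f_X(x):
    """PDF of the Exponential(lambda) distribution."""
    return np.where(x >= 0, lam * np.exp(-lam * x), 0.0)

x = rng.exponential(scale=1 / lam, size=200_000)
y = a * x + b

# Histogram-based density estimate of the simulated Y.
hist, edges = np.histogram(y, bins=100, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Density predicted by Theorem 43.1: f_Y(y) = f_X((y - b) / a) / |a|.
f_Y = f_X((centers - b) / a) / abs(a)

# The two should agree up to Monte Carlo error.
print(np.max(np.abs(hist - f_Y)))
```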

43.2 Multivariate Transformations

Now, we consider the case where we have \(n\) random variables \(X_1, \dots, X_n\), and \(g\) is a differentiable one-to-one function from \(A \subset \mathbb{R}^n\) to \(B \subset \mathbb{R}^n\). That is, \[ (Y_1, \dots, Y_n) = g(X_1, \dots, X_n). \] Note that the “derivative” of a function from \(\mathbb{R}^n\) to \(\mathbb{R}^n\) is an \(n\times n\) matrix of partial derivatives, \[ \left[\frac{\partial (y_1, \dots, y_n)}{\partial (x_1, \dots, x_n)}\right] \overset{\text{def}}{=}\begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \dots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \dots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_n}{\partial x_1} & \frac{\partial y_n}{\partial x_2} & \dots & \frac{\partial y_n}{\partial x_n} \end{bmatrix}, \] called the Jacobian matrix.

Then, the natural generalization of Theorem 43.1 is the following.

Theorem 43.2 (Multivariate transformations) Let \(X_1, \dots, X_n\) be continuous random variables with joint PDF \(f_{X_1, \dots, X_n}(x_1, \dots, x_n)\) and support \(A \subset \mathbb{R}^n\). Let \(g\) be a differentiable one-to-one function from \(A\) to \(B \subset \mathbb{R}^n\). Assume that the Jacobian matrix is invertible at every point in \(A\).

Then, the joint PDF of \((Y_1, \dots, Y_n) = g(X_1, \dots, X_n)\) is \[ f_{Y_1, \dots, Y_n}(y_1, \dots, y_n) = f_{X_1, \dots, X_n}(x_1, \dots, x_n) \left| \det \left[\frac{\partial (x_1, \dots, x_n)}{\partial (y_1, \dots, y_n)}\right] \right|, \tag{43.2}\] where \((x_1, \dots, x_n) = g^{-1}(y_1, \dots, y_n)\).

Note that by the inverse function theorem, \[ \det \left[\frac{\partial (x_1, \dots, x_n)}{\partial (y_1, \dots, y_n)}\right] = \det \left[\frac{\partial (y_1, \dots, y_n)}{\partial (x_1, \dots, x_n)}\right]^{-1} = \frac{1}{\det \left[\frac{\partial (y_1, \dots, y_n)}{\partial (x_1, \dots, x_n)}\right]}, \] so we can calculate the Jacobian of the \(x\)s with respect to the \(y\)s, or the Jacobian of the \(y\)s with respect to the \(x\)s, whichever is easier.

Let’s apply Theorem 43.2 to some examples.

Example 43.2 (Gamma-beta connection revisited) In Proposition 42.3, we showed that if \(X\) and \(Y\) are independent \(\text{Gamma}(\alpha, \lambda)\) and \(\text{Gamma}(\beta, \lambda)\) random variables, then

\[ T \overset{\text{def}}{=}\frac{X}{X + Y} \]

follows a \(\text{Beta}(\alpha, \beta)\) distribution and is independent of \(X + Y\). Another way to show this is using Jacobians, and this approach will yield an additional insight.

Let \(S = X + Y\). Then, by Theorem 43.2, the joint PDF of \(S\) and \(T\) is \[ f_{S, T}(s, t) = f_{X, Y}(x, y) \left| \det\left[ \frac{\partial (x, y)}{\partial(s, t)} \right] \right|. \]

In this case, it is easier to calculate the inverse Jacobian, since we already know \((s, t) = g(x, y)\), but we would need to do some work to determine \((x, y) = g^{-1}(s, t)\).

\[\begin{align} \det\left[ \frac{\partial (s, t)}{\partial(x, y)} \right] &= \det \begin{bmatrix} \frac{\partial s}{\partial x} & \frac{\partial s}{\partial y} \\ \frac{\partial t}{\partial x} & \frac{\partial t}{\partial y} \end{bmatrix} \\ &= \det \begin{bmatrix} 1 & 1 \\ \frac{y}{(x + y)^2} & -\frac{x}{(x + y)^2} \end{bmatrix} \\ &= -\frac{1}{x + y}. \end{align}\]

We next solve for \(x\) and \(y\) in terms of \(s\) and \(t\). Since \(s = x + y\), we see that \[ t = \frac{x}{x+y} = \frac{x}{s}, \] and hence, \(x = st\) and \(y = s - x = s - st\). Next,

\[\begin{align} f_{S, T}(s, t) &= f_{X, Y}(x, y) \left| \frac{1}{\frac{-1}{x + y}} \right| \\ &= \left( \frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x} \right) \left( \frac{\lambda^\beta}{\Gamma(\beta)} y^{\beta - 1} e^{-\lambda y} \right) (x + y) \\ &= \frac{\lambda^\alpha}{\Gamma(\alpha)} (st)^{\alpha - 1} e^{-\lambda st} \frac{\lambda^\beta}{\Gamma(\beta)} (s - st)^{\beta - 1} e^{-\lambda (s - st)} s & (x = st, y = s - st) \\ &= \frac{\lambda^\alpha} {\Gamma(\alpha)} s^{\alpha - 1} t^{\alpha - 1} \frac{\lambda^\beta}{\Gamma(\beta)} s^{\beta-1} (1-t)^{\beta-1} e^{-\lambda s} s \\ &= \frac{\lambda^{\alpha+\beta}}{\Gamma(\alpha)\Gamma(\beta)} \left( s^{\alpha+\beta-1} e^{-\lambda s} \right) \left( t^{\alpha-1}(1-t)^{\beta-1} \right). \end{align}\] We recognize the terms involving \(s\) as proportional to a \(\text{Gamma}(\alpha+\beta,\lambda)\) PDF and the terms involving \(t\) as proportional to a \(\text{Beta}(\alpha, \beta)\) PDF. We can multiply and divide by \(\Gamma(\alpha + \beta)\) in order to see the two PDFs explicitly: \[ f_{S,T}(s,t) = \underbrace{\frac{\lambda^{\alpha+\beta}}{\Gamma(\alpha + \beta)} s^{\alpha + \beta - 1} e^{-\lambda s}}_{f_S(s)} \cdot \underbrace{\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} t^{\alpha - 1} (1 - t)^{\beta - 1}}_{f_T(t)}. \] This factorization shows that \(S \sim \text{Gamma}(\alpha + \beta, \lambda)\) and \(T \sim \text{Beta}(\alpha, \beta)\). But it shows something more: \(S\) and \(T\) are also independent!
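
The factorization above can also be checked by simulation. Here is a minimal Python sketch (using NumPy; the parameter values are illustrative) that draws independent gamma random variables, forms \(S = X + Y\) and \(T = X/(X+Y)\), and compares their sample means and correlation with what the factorization predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, lam, n = 2.0, 3.0, 1.5, 500_000

# X ~ Gamma(alpha, lam) and Y ~ Gamma(beta, lam), independent (scale = 1 / rate).
x = rng.gamma(shape=alpha, scale=1 / lam, size=n)
y = rng.gamma(shape=beta, scale=1 / lam, size=n)

s = x + y          # should be Gamma(alpha + beta, lam)
t = x / (x + y)    # should be Beta(alpha, beta)

print(s.mean(), (alpha + beta) / lam)      # E[S] = (alpha + beta) / lambda
print(t.mean(), alpha / (alpha + beta))    # E[T] = alpha / (alpha + beta)
print(np.corrcoef(s, t)[0, 1])             # roughly 0, since S and T are independent
```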

43.2.1 Simulating normal random variables

Theorem 43.2 provides a basis for simulating normal random variables. We begin by discussing how it can be used to simulate a single normal random variable, which is harder than it sounds.

Example 43.3 (Box-Muller transform) How does a computer simulate values from a normal distribution? We saw a general way to simulate continuous random variables in Proposition 20.4. According to that result, if we start with a \(\textrm{Uniform}(a= 0, b= 1)\) random variable \(U\), then
\[ Z \overset{\text{def}}{=}\Phi^{-1}(U) \tag{43.3}\] follows a \(\textrm{Normal}(\mu= 0, \sigma^2= 1)\) distribution.

However, the standard normal CDF \(\Phi(z)\) has no closed-form formula, so Equation 43.3 is not practical. Here is a simpler way that relies only on standard mathematical functions such as logarithms and cosine, due to Box and Muller (1958).

Let \(U \sim \textrm{Uniform}(a= 0, b= 1)\) and \(T \sim \textrm{Exponential}(\lambda=1)\) be independent random variables. Note that \(T\) is easy to generate using Proposition 20.4 (see Example 20.3). Box and Muller (1958) showed that \[\begin{align} X &= \sqrt{2 T} \cos (2\pi U) \\ Y &= \sqrt{2 T} \sin (2\pi U) \end{align}\] are independent standard normal random variables.

To see this, we apply Theorem 43.2. We first need the Jacobian of the Box-Muller transform: \[\begin{align*} \det\left[ \frac{\partial(x,y)}{\partial(t,u)} \right] &= \det \left[ \begin{matrix} \frac{\partial x}{\partial t} & \frac{\partial x}{\partial u} \\ \frac{\partial y}{\partial t} & \frac{\partial y}{\partial u} \end{matrix} \right] \\ &= \det\left[ \begin{matrix} \frac{1}{\sqrt{2t}} \cos(2\pi u) & - 2\pi \sqrt{2t} \sin(2\pi u) \\ \frac{1}{\sqrt{2t}} \sin(2\pi u) & 2\pi \sqrt{2t} \cos(2\pi u) \end{matrix} \right] \\ &= 2\pi \cos^2(2\pi u) + 2\pi \sin^2(2\pi u) \\ &= 2\pi, \end{align*}\] which implies \(\displaystyle \det\left[ \frac{\partial(t,u)}{\partial(x,y)} \right] = \frac{1}{2\pi}\).

Therefore, the joint PDF of \(X\) and \(Y\) is \[ f_{X,Y}(x,y) = f_{T,U}(t,u) \cdot \left|\frac{1}{2\pi}\right| = e^{-t} \cdot 1 \cdot \frac{1}{2\pi}. \]

In order to express this joint PDF in terms of \(x\) and \(y\), we observe that \[ \begin{align} x^2 + y^2 &= 2t \cos^2(2\pi u) + 2t \sin^2(2\pi u) = 2t & \Rightarrow & & t &= \frac{x^2 + y^2}{2}. \end{align} \] Substituting this into the expression above, we obtain \[\begin{align*} f_{X,Y}(x,y) &= \frac{1}{2\pi} e^{-\frac{x^2+y^2}{2}} \\ &= \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \cdot \frac{1}{\sqrt{2\pi}} e^{-y^2/2}. \end{align*}\]

This factorization implies that \(X\) and \(Y\) are independent standard normal. In particular, if we integrate over \(y\), we obtain \[ f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}. \]

The code below sketches one way to implement the Box-Muller transform in Python, assuming NumPy for the uniform and exponential draws.
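
```python
import numpy as np

rng = np.random.default_rng(0)

def box_muller(n):
    """Return two arrays of n i.i.d. standard normal values via Box-Muller."""
    u = rng.uniform(size=n)        # U ~ Uniform(0, 1)
    t = rng.exponential(size=n)    # T ~ Exponential(1)
    x = np.sqrt(2 * t) * np.cos(2 * np.pi * u)
    y = np.sqrt(2 * t) * np.sin(2 * np.pi * u)
    return x, y

x, y = box_muller(100_000)
print(x.mean(), x.std(), np.corrcoef(x, y)[0, 1])   # approximately 0, 1, 0
```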

Notice that we can now simulate a normal random variable with any mean and variance by simply applying a location-scale transform.

With the Box-Muller transform, we get two independent normal random variables for the price of one. We now discuss how to transform these into correlated normal random variables.

Example 43.4 (Bivariate normal distribution) Let \(Z\) and \(W\) be independent standard normal random variables (perhaps generated by the Box-Muller transform above). Define \[ \begin{align} X &= Z \\ Y &= \rho Z + \sqrt{1 - \rho^2} W. \end{align} \tag{43.4}\] Now, \(X\) is clearly standard normal, and \(Y\) is a linear combination of independent normals, so it is also normal (by Example 34.7), with mean \(0\) and variance \(\rho^2 + (1 - \rho^2) = 1\); that is, \(Y\) is standard normal as well. But \(X\) and \(Y\) are no longer independent; their covariance is \[ \begin{align} \text{Cov}\!\left[ X, Y \right] &= \text{Cov}\!\left[ Z, \rho Z + \sqrt{1 - \rho^2} W \right] \\ &= \rho \underbrace{\text{Cov}\!\left[ Z, Z \right]}_{\text{Var}\!\left[ Z \right] = 1} + \sqrt{1 - \rho^2} \underbrace{\text{Cov}\!\left[ Z, W \right]}_0 \\ &= \rho. \end{align} \]

A random vector \((X, Y)\) generated according to Equation 43.4 is said to follow a standard bivariate normal distribution. What is the joint PDF of \(X\) and \(Y\)?

The joint PDF of \(Z\) and \(W\) is clearly \[ f_{Z, W}(z, w) = f_Z(z) \cdot f_W(w) = \frac{1}{2\pi} e^{-(z^2 + w^2)/2}. \] To determine the joint PDF of \(X\) and \(Y\), we apply Theorem 43.2, which says that the two PDFs are related by \[ f_{X,Y}(x,y) = f_{Z,W}(z,w) \left| \det \left[\frac{\partial (z, w)}{\partial (x, y)}\right] \right|. \] In order to apply this result, we need to determine the Jacobian of the transformation, as well as the inverse of the transformation (in order to write \((z, w)\) in terms of \((x, y)\)):

  • The transformation \(g: \mathbb{R}^2 \to \mathbb{R}^2\) implied by Equation 43.4 is: \[ \begin{pmatrix} x \\ y \end{pmatrix} = g(z, w) = \begin{pmatrix} z \\ \rho z + \sqrt{1 - \rho^2} w \end{pmatrix}. \] The determinant of the Jacobian matrix corresponding to \(g\) is \[\begin{align*} \det\left[ \frac{\partial(x,y)}{\partial(z,w)} \right] &= \det\left[ \begin{matrix} \frac{\partial x}{\partial z} & \frac{\partial x}{\partial w} \\ \frac{\partial y}{\partial z} & \frac{\partial y}{\partial w} \end{matrix} \right] \\ &= \det \left[ \begin{matrix} 1 & 0 \\ \rho & \sqrt{1-\rho^2} \end{matrix} \right] \\ &= \sqrt{1-\rho^2}. \end{align*}\]
  • The inverse transform \(g^{-1}\) is: \[ z = x \qquad \text{and} \qquad w = \frac{y - \rho x}{\sqrt{1-\rho^2}}. \]

Now, by Theorem 43.2, the joint PDF of \(X\) and \(Y\) is \[\begin{align*} f_{X,Y}(x,y) &= f_{Z,W}(z,w) \frac{1}{\sqrt{1-\rho^2}} & \text{(substitute Jacobian)} \\ &= \frac{1}{\sqrt{1-\rho^2}} \cdot \frac{1}{2\pi} \exp\left\{ -\frac{z^2 + w^2}{2} \right\} \\ &= \frac{1}{2\pi \sqrt{1-\rho^2}} \exp \left\{ -\frac{1}{2} \left( x^2 + \left( \frac{y-\rho x}{\sqrt{1 - \rho^2}} \right)^2 \right) \right\} & \text{(substitute inverse transform)} \\ \\ &= \frac{1}{2\pi \sqrt{1-\rho^2}} \exp \left\{ -\frac{1}{2} \frac{(1-\rho^2)x^2 + (y-\rho x)^2}{1-\rho^2} \right\} \\ &= \frac{1}{2\pi \sqrt{1-\rho^2}} \exp \left\{ -\frac{x^2 - 2\rho xy + y^2}{2(1-\rho^2)} \right\}. \end{align*}\]

We can see that if \(\rho \neq 0\), then \(X\) and \(Y\) are not independent; if \(\rho = 0\), we get the joint PDF of independent standard normals.

Notice that we can further apply a location-scale transformation (Section 20.2) to \(X\) and \(Y\) to obtain a bivariate normal distribution with any mean and variance.
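
Here is a minimal Python sketch of this recipe (using NumPy; the particular values of \(\rho\), the means, and the standard deviations are illustrative), which generates standard bivariate normal pairs via Equation 43.4 and then applies a location-scale transform:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.7, 500_000

# Start from independent standard normals Z and W (e.g., via Box-Muller).
z = rng.standard_normal(n)
w = rng.standard_normal(n)

# Equation 43.4: a standard bivariate normal pair with correlation rho.
x = z
y = rho * z + np.sqrt(1 - rho**2) * w
print(np.corrcoef(x, y)[0, 1])   # approximately rho

# Location-scale transform to general means and standard deviations.
mu_x, mu_y, sigma_x, sigma_y = 1.0, -2.0, 3.0, 0.5
x2, y2 = mu_x + sigma_x * x, mu_y + sigma_y * y
print(x2.mean(), y2.std(), np.corrcoef(x2, y2)[0, 1])  # approx mu_x, sigma_y, rho
```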

43.3 Exercises

Exercise 43.1 (Polar coordinates of a random point in the unit circle) Let \(X\) and \(Y\) denote the coordinates of a point chosen uniformly in the unit circle. That is, the joint density is \[ f_{X,Y}(x,y) = \frac{1}{\pi}, \qquad x^2 + y^2 \leq 1. \] Find the joint density of the polar coordinates \(R = \sqrt{X^2+Y^2}\) and \(\Theta = \tan^{-1}(Y/X)\).

Exercise 43.2 (Polar coordinates of independent standard normals) Let \((X,Y)\) denote a random point in the plane, and assume the rectangular coordinates \(X\) and \(Y\) are independent standard normal random variables. Find the joint distribution of \(R\) and \(\Theta\), the polar coordinates.

Exercise 43.3 (Rederiving the distribution of a sum of i.i.d. exponentials) Let \(X_1, \dots, X_n\) be i.i.d. \(\text{Exponential}(\lambda)\). Define \[ Y_k = X_1 + \cdots + X_k \] for \(1 \leq k \leq n\).

  1. Find the joint density of \(Y_1, \dots, Y_n\).
  2. Use your answer to part 1 to find the density of \(Y_n\). Does your result make sense?
  3. Find the conditional density of \(Y_1, \dots, Y_{n-1}\) given \(Y_n = t\).

Exercise 43.4 (Transforming standard normals) Let \(Z_1\) and \(Z_2\) be i.i.d. \(\textrm{Normal}(\mu= 0, \sigma^2= 1)\). Let \(X = Z_1 + Z_2\) and \(Y = Z_1 - Z_2\). We know that both are \(\textrm{Normal}(\mu= 0, \sigma^2= 2)\). But are they independent?

Exercise 43.5 (Transforming exponentials) Let \(X\) and \(Y\) be i.i.d. \(\text{Exponential}(\lambda)\). Let \(T = X + Y\) and \(\displaystyle W = \frac{X}{Y}\). Are \(T\) and \(W\) independent? Find the marginal densities of \(T\) and \(W\).