20  Transformations

\[ \def\mean{\textcolor{red}{1.2}} \]

In some applications, we are interested in a transformation of \(X\), \(g(X)\). This is a new random variable with its own distribution.

One situation in which transformations naturally arise is unit conversion. For example, if \(X\) is a temperature measured in Celsius, then the same temperature measured in Fahrenheit is a transformation of \(X\).

In this chapter, we will learn how to derive the PDF of \(Y = g(X)\) from the PDF of \(X\).

20.1 The PDF of a Transformed Random Variable

Suppose we construct a “random” square as follows. First, we pick a length \(X\) which is “equally likely” to be any number between \(0\) and \(1\). That is, the PDF of \(X\) is \[ f_X(x) = \begin{cases} 1 & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases}. \] Then, we draw a square with sides of length \(X\). The area of the square is \[ Y = X^2. \] Notice that \(Y\) is a transformation of \(X\). What is the PDF of \(Y\)?

If we square a number between \(0\) and \(1\), the result is also a number between \(0\) and \(1\). But is \(Y\) also “equally likely” to be any number between \(0\) and \(1\)? We can find out with a simulation.
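One way to run such a simulation is sketched below in Python with numpy (the seed is an arbitrary choice for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 10,000 random squares: X is uniform on (0, 1), and Y = X^2 is the area.
x = rng.uniform(0, 1, size=10_000)
y = x ** 2

# Estimate the density of Y on each tenth of (0, 1).
density, _ = np.histogram(y, bins=10, range=(0, 1), density=True)
print(density.round(2))
```

The estimated density is far higher in the bins near \(0\) than in the bins near \(1\).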

Clearly, the probability is more concentrated near \(0\), so the PDF of \(Y\) is not the same as the PDF of \(X\). In hindsight, this is not surprising: when a number between \(0\) and \(1\) is squared, the result is smaller. So if the original numbers were equally likely to be between \(0\) and \(1\), the squared numbers should be more likely to be near \(0\).

Here is another way to explain the observation above: the probability that \(X\) is below \(0.5\) is 50%. But values of \(X\) below \(0.5\) correspond to values of \(Y\) below \((0.5)^2 = 0.25\). That is, although \(Y\) can be any number between \(0\) and \(1\), 50% of the probability is below \(0.25\).

Now, we develop a strategy for deriving the PDF of \(Y\), which is based on what we learned in Chapter 18.

General Strategy for Deriving the PDF of \(Y = g(X)\)
  1. First, determine the CDF of \(Y\).
  2. Take the derivative of the CDF to get the PDF.

Let’s apply this strategy to the example above.

Example 20.1 The CDF of \(Y\), as a function of \(y \in (0, 1)\), is \[ \begin{align} F_Y(y) &= P(Y \leq y) \\ &= P(X^2 \leq y) \\ &= P(X \leq \sqrt{y}) \\ &= \int_0^{\sqrt{y}} \underbrace{f_X(x)}_{1}\,dx \\ &= \sqrt{y}. \end{align} \] To be precise, this formula is only valid for \(y \in (0, 1)\). The full CDF is \[ F_Y(y) = \begin{cases} 0 & y \leq 0 \\ \sqrt{y} & 0 < y < 1 \\ 1 & y \geq 1 \end{cases}. \]

Therefore, the PDF of \(Y\) is \[ f_Y(y) = F'_Y(y) = \begin{cases} \frac{1}{2\sqrt{y}} & 0 < y < 1 \\ 0 & \text{otherwise} \end{cases}. \]

We can check our answer by graphing this PDF on top of our simulations from before.
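Even without a graph, we can compare the simulation to the derived formula numerically; this Python sketch checks the empirical CDF against \(F_Y(y) = \sqrt{y}\) at a few values:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(0, 1, size=100_000) ** 2

# The derived CDF says P(Y <= y) = sqrt(y) for 0 < y < 1.
for q in [0.1, 0.25, 0.5, 0.9]:
    print(f"P(Y <= {q}): simulated {np.mean(y <= q):.3f}, derived {np.sqrt(q):.3f}")
```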

In general, we apply the strategy in this section to determine the PDF of a transformed random variable \(Y = g(X)\). However, for specific classes of transformations \(g\), there are simple formulas for the PDF. The next two sections explore two important classes of transformations.

20.2 Location-Scale Transformations

In this section, we will explore an important class of transformations called location-scale transformations.

Definition 20.1 (Location-Scale Transformation) Let \(X\) be a random variable, and let \(a\) and \(b\) be constants.

  • A location transformation is a transformation of the form \[ g(X) = X + b. \]
  • A scale transformation is a transformation of the form \[ g(X) = aX. \] Scale transformations arise commonly when we change units. For example, if \(X\) is measured in meters, and we want to convert it to centimeters, we would multiply \(X\) by \(a = 100\).
  • A location-scale transformation is a combination of the two. It is a transformation of the form \[ g(X) = aX + b. \]

Let’s examine location transformations first. If we add \(b\) to a random variable \(X\), then the support and all the probabilities should shift by \(b\). This is illustrated in Figure 20.1.

Figure 20.1: What a location transformation \(Y = X + b\) does to a PDF

Now let’s develop this observation into a formula.

Proposition 20.1 (Location Transformation) Suppose \(X\) is a continuous random variable with PDF \(f_X(x)\). Let \(Y = X + b\) be a location transformation of \(X\) for some constant \(b\). Then the PDF of \(Y\) is \[ f_Y(x) = f_X(x - b). \tag{20.1}\]

Proof. We apply the strategy from Section 20.1. First, we determine the CDF of \(Y\), in terms of the CDF of \(X\).

\[\begin{align*} F_Y(x) &= P(Y \leq x) \\ &= P(X + b \leq x) \\ &= P(X \leq x - b) \\ &= F_X(x - b) \end{align*}\]

Now we take the derivative to obtain the PDF of \(Y\).

\[ f_Y(x) = \frac{d}{dx} F_Y(x) = \frac{d}{dx} F_X(x - b) = f_X(x - b). \]
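A quick simulation illustrates Proposition 20.1. This Python sketch (with \(X\) uniform on \((0, 1)\) and \(b = 3\), choices made purely for illustration) shifts the random variable and checks that the probabilities shift with it:

```python
import numpy as np

rng = np.random.default_rng(0)
b = 3.0
x = rng.uniform(0, 1, size=100_000)
y = x + b  # location transformation

# The support shifts from (0, 1) to (3, 4), and the probabilities shift with it:
# P(Y <= t) should equal F_X(t - b) = t - b for t in (3, 4).
t = 3.4
print(np.mean(y <= t))  # ≈ 0.4
```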

Next, let’s examine scale transformations. Suppose we multiply a random variable \(X\) by the constant \(a = 1.5\), producing a new random variable \(Y = 1.5 X\). The support will be stretched out by a factor of \(1.5\) so that if the possible values of \(X\) range from \(0\) to \(6\), the possible values of \(Y\) will range from \(0\) to \(9\). Furthermore, the PDF is “squashed” by a factor of \(1.5\). This effect is illustrated in Figure 20.2.

Figure 20.2: What a scale transformation \(Y = aX\) does to a PDF

While it should be clear that a scale transformation stretches the PDF, it may be less obvious why it also squashes the PDF. Here are two ways to see that the squashing is necessary:

  • If we stretched the PDF without squashing it, there would be too much area under the PDF. The total area under a PDF must equal 1, before and after the transformation.
  • Think of a scale transformation as a change of units, and consider the units on the vertical axis. If \(X\) is in meters and \(Y = 100 X\) is in centimeters, then the units on the vertical axis change from “probability per meter” to “probability per centimeter”. One centimeter is shorter than one meter, so there should be less probability per centimeter than probability per meter!

Now that we understand the intuition, we can formalize the result.

Proposition 20.2 (Scale Transformation) Suppose \(X\) is a continuous random variable with PDF \(f_X(x)\). Let \(Y = aX\) be a scale transformation of \(X\) for some constant \(a \neq 0\). Then the PDF of \(Y\) is \[ f_Y(x) = \frac{1}{|a|} f_X\Big(\frac{x}{a}\Big). \tag{20.2}\]

Proof. We will prove the result for \(a > 0\), leaving the case \(a < 0\) to Exercise 20.1.

We apply the strategy from Section 20.1. First, we determine the CDF of \(Y\), in terms of the CDF of \(X\).

\[\begin{align*} F_Y(x) &= P(Y \leq x) \\ &= P(aX \leq x) \\ &= P\Big(X \leq \frac{x}{a}\Big) \\ &= F_X\Big(\frac{x}{a}\Big) \end{align*}\]

Now we take the derivative to obtain the PDF of \(Y\), remembering to apply the Chain Rule in the last step.

\[ f_Y(x) = \frac{d}{dx} F_Y(x) = \frac{d}{dx} F_X\Big(\frac{x}{a}\Big) = \frac{1}{a} f_X\Big( \frac{x}{a} \Big). \] This matches Equation 20.2 when \(a > 0\), since \(|a| = a\) in that case.
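We can also see the stretching and squashing in a simulation. In this Python sketch (with \(a = 1.5\) and \(X\) uniform on \((0, 1)\), an assumption for illustration), the density of \(Y = 1.5X\) levels off at \(1/1.5\) across the stretched support:

```python
import numpy as np

rng = np.random.default_rng(0)
a = 1.5
x = rng.uniform(0, 1, size=100_000)
y = a * x  # scale transformation

# The support stretches from (0, 1) to (0, 1.5); to keep the total area at 1,
# the density drops from 1 to 1/1.5.
density, _ = np.histogram(y, bins=15, range=(0, 1.5), density=True)
print(density.round(2))  # each bin ≈ 1/1.5 ≈ 0.67
```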

Finally, a location-scale transformation is just a scale transformation followed by a location transformation. So we can obtain the next result by simply combining Proposition 20.1 and Proposition 20.2.

Proposition 20.3 (Location-Scale Transformation) Suppose \(X\) is a continuous random variable with PDF \(f_X(x)\). Let \(Y = aX + b\) be a location-scale transformation of \(X\) for some constants \(a \neq 0\) and \(b\). Then the PDF of \(Y\) is \[ f_Y(x) = \frac{1}{|a|} f_X\Big(\frac{x - b}{a}\Big). \tag{20.3}\]

Proof. We will define the intermediate random variable \(Z = aX\). Then, \(Y = Z + b\) is a location transformation of \(Z\), so by Proposition 20.1: \[ f_Y(x) = f_Z(x - b). \] But \(Z = aX\) is a scale transformation of \(X\), so by Proposition 20.2: \[ f_Z(x - b) = \frac{1}{|a|} f_X\Big( \frac{x - b}{a} \Big). \]

Now let’s apply location-scale transformations to an example.

Example 20.2 (Converting Celsius to Fahrenheit) In Example 18.5, we modeled the daily high temperature \(C\) as a continuous random variable with PDF \[ f_C(x) = \frac{1}{k} e^{-x^2/18}; -\infty < x < \infty, \] where \(k\) was a constant that makes the total area equal to 1. We determined the constant \(k\) to be about \(7.5\). This PDF was graphed in Figure 18.7.

An American visitor to Iqaluit might want to know the temperature in Fahrenheit. But this is just a location-scale transformation of the temperature in Celsius! In particular: \[ F = g(C) = \textcolor{green}{\frac{9}{5}} C + \textcolor{orange}{32}. \]

We can derive the PDF of \(F\) using Proposition 20.3 above, with \(a = \textcolor{green}{9/5}\) and \(b = \textcolor{orange}{32}\). Therefore, the PDF of \(F\) is:

\[ \begin{aligned} f_F(x) &= \frac{1}{\textcolor{green}{9/5}} f_C\Big(\frac{x-\textcolor{orange}{32}}{\textcolor{green}{9/5}}\Big) \\ &= \frac{1}{\textcolor{green}{9/5}} \frac{1}{k} e^{\displaystyle -\Big( \frac{x-\textcolor{orange}{32}}{\textcolor{green}{9/5}} \Big)^2 / 18} \\ &\approx \frac{1}{13.53579} e^{-(x - 32)^2 / 58.32} \end{aligned} \tag{20.4}\]

Let’s graph the PDF in Fahrenheit (Equation 20.4):

Figure 20.3: PDF of the daily high temperature (in Fahrenheit) in Iqaluit in May

Compared to Figure 18.7, this PDF is centered around the freezing point in Fahrenheit (\(32^\circ\)) instead of the freezing point in Celsius (\(0^\circ\)).
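As a numerical sanity check on Equation 20.4, we can verify that the total area under the transformed PDF is still 1. Here is a Python sketch using a simple Riemann sum:

```python
import numpy as np

# Riemann-sum check that the Fahrenheit PDF in Equation 20.4 integrates to 1.
# The grid (-50, 150) comfortably covers the support where the PDF is non-negligible.
x = np.linspace(-50, 150, 200_001)
dx = x[1] - x[0]
f = (1 / 13.53579) * np.exp(-(x - 32) ** 2 / 58.32)
print((f * dx).sum())  # ≈ 1
```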

20.3 The Probability Integral Transform

Let \(X\) be a continuous random variable with CDF \(F(x)\). Then \(U = F(X)\) is also a continuous random variable. What is the distribution of \(U\)?

This particular class of transformations, where we plug a random variable into its own CDF, is called the probability integral transformation. At first, it may not be clear why anyone would do such a thing, but we will see that it is actually one of the most useful tricks in all of probability.

Theorem 20.1 (Probability Integral Transform) Let \(X\) be a continuous random variable with CDF \(F(x)\). Then \(U = F(X)\) is uniformly distributed between 0 and 1. That is, its PDF is \[ f_U(x) = \begin{cases} 1 & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases}. \]

Proof. For simplicity, we will assume that \(F(x)\) is strictly increasing, although the theorem is valid even without this assumption.

Using the strategy from Section 20.1, we first calculate the CDF of \(U\). Note that \(U = F(X)\) is a probability, so its value must be between 0 and 1. For \(0 < x < 1\), we have: \[ \begin{align*} F_U(x) &= P(U \leq x) \\ &= P(F(X) \leq x) \\ &= P(X \leq F^{-1}(x)) \\ &= F(F^{-1}(x)) \\ &= x. \end{align*} \]

In the third line, we used the fact that \(F(x)\) is strictly increasing to conclude that the inverse CDF \(F^{-1}(p)\) is well-defined for \(0 < p < 1\), with \(F^{-1}(F(x)) = x\).

Finally, we take the derivative to obtain the PDF of \(U\): \[f_U(x) = F'_U(x) = 1,\] and this formula is valid on the support of \(U\), \(0 < x < 1\). The full PDF is \[ f_U(x) = \begin{cases} 1 & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases}. \]
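To see Theorem 20.1 in action, this Python sketch simulates from an Exponential distribution (the rate 1.2 is an arbitrary choice for illustration), plugs each draw into its own CDF, and checks that the results look uniform:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate X from an Exponential distribution with rate 1.2,
# whose CDF is F(x) = 1 - exp(-1.2 x).
x = rng.exponential(scale=1 / 1.2, size=100_000)

# Plug X into its own CDF.
u = 1 - np.exp(-1.2 * x)

# U = F(X) should be uniform on (0, 1): the density is about 1 on every tenth.
density, _ = np.histogram(u, bins=10, range=(0, 1), density=True)
print(density.round(2))  # each ≈ 1.0
```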

What use is knowing that \(U = F(X)\) has a standard uniform distribution? Some direct applications of the probability integral transform are suggested in Exercise 20.4. But the most compelling application derives from the inverse transformation: \(X = F^{-1}(U)\). This suggests that we can simulate values of \(X\) from any (continuous) distribution by first simulating \(U\) and then calculating \(F^{-1}(U)\). This trick is called inverse transform sampling, and it follows immediately from Theorem 20.1.

Proposition 20.4 (Inverse Transform Sampling) Let \(U\) be a uniform random variable on \((0, 1)\), and let \(F(x)\) be a valid CDF, with \(F^{-1}(p)\) well-defined for all \(p \in (0, 1)\). Then, \[ X = F^{-1}(U) \] is a random variable whose CDF is \(F(x)\).

Proposition 20.4 is useful because a programming language may not have a built-in function to simulate from the distribution you want, but every programming language has a function to generate uniform random numbers between 0 and 1.

Example 20.3 (Simulating the First Arrival Time) Suppose we want to simulate the time \(T\) that the Geiger counter in Example 18.6 clicks for the first time after it is turned on. We showed in Example 18.6 that the CDF of \(T\) is \[ F_T(x) = \begin{cases} 1 - e^{-\mean x} & x > 0 \\ 0 & x \leq 0 \end{cases}. \]

We can obtain the inverse CDF \(F_T^{-1}(p)\) by solving \[ p = F_T(x) = 1 - e^{-\mean x} \] for \(x\). The solution is \[ x = F_T^{-1}(p) = - \frac{1}{\mean} \ln(1 - p). \]

Therefore, by Proposition 20.4, we should be able to simulate \(T\) by first simulating a uniform random number \(U\) between 0 and 1 and then calculating \[ T = F_T^{-1}(U) = -\frac{1}{\mean} \log(1 - U). \] (Note that \(\log\) is the natural logarithm with base \(e\).)

Let’s simulate 10000 \(T\)s using this approach and see how well they agree with the PDF of \(T\), which we know from Example 18.7 to be \[ f_T(t) = \begin{cases} \mean e^{-\mean t} & t > 0 \\ 0 & \text{otherwise} \end{cases}. \]
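Here is one way to carry out this simulation, sketched in Python. Instead of overlaying the PDF graphically, the sketch checks two summaries that the distribution of \(T\) pins down: the mean of an Exponential distribution with rate 1.2 is \(1/1.2\), and \(F_T(1) = 1 - e^{-1.2}\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverse transform sampling: simulate 10,000 uniforms, then apply the
# inverse CDF derived above, F^{-1}(p) = -ln(1 - p) / 1.2.
u = rng.uniform(0, 1, size=10_000)
t = -np.log(1 - u) / 1.2

# The simulated values should behave like draws with CDF F_T:
# E[T] = 1/1.2 ≈ 0.833 and P(T <= 1) = 1 - exp(-1.2) ≈ 0.699.
print(t.mean())
print(np.mean(t <= 1.0))
```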

20.4 Exercises

Exercise 20.1 Complete the proof of Proposition 20.2 by showing that Equation 20.2 also holds when \(a < 0\).

Hint: Remember that when you multiply or divide both sides of an inequality by a negative number, the direction of the inequality flips!

Exercise 20.2 (Heptathlon score) The long jump is one of the seven contests in the heptathlon. Competitors earn points for each event in the heptathlon, and the competitor with the most total points is the winner. But the heptathlon consists of many different contests:

  • The long jump and high jump are measured in distance.
  • The shot put and javelin are also measured in distance, but the distances are much longer.
  • The 200m run, 800m run, and 100m hurdles are measured in time.

How do we put all these different contests on the same scale? The result of each heptathlon contest is transformed in a different way so that the resulting points are on a similar scale. For example, the transformation that is used for scoring the long jump in the women’s heptathlon is: \[ g(x) = 124.7435 (x - 2.1)^{1.41}; x \geq 2.1, \] where \(x\) is the distance in meters. Jackie Joyner-Kersee, whose long jump we modeled in Example 18.3, was also a gold-medal heptathlete. If we want to know how many points she earns in the heptathlon from the long jump, then we need to study the random variable \(S = g(X)\).

Let’s assume that Jackie Joyner-Kersee’s long jump \(X\) is equally likely to be any distance between 6.3 and 7.5 meters, as in Example 18.3. Determine the PDF of \(S\), the contribution to her score from the long jump.

Exercise 20.3 Write code to simulate a random variable \(X\) with the half-triangle PDF \[ \begin{equation} f(x) = \begin{cases} 1 - \frac{x}{2} & 0 \leq x < 2 \\ 0 & \text{otherwise} \end{cases}. \end{equation} \tag{20.5}\]

Exercise 20.4 A common problem in statistics is to determine whether data \(x\) is too large to have plausibly come from a distribution with CDF \(F\). One way to do this is to calculate the probability of observing \(x\) or greater, \(p = 1 - F(x)\), and if this \(p\)-value is small (say, less than \(.05\)), then we conclude that \(x\) did not come from that distribution.

Now, suppose that the data \(X\) is a random variable that really does have CDF \(F\). What is the distribution of the \(p\)-value?