In Chapter 39, we examined the distribution of the smallest (minimum) and largest (maximum) of \(n\) i.i.d. random variables. In this chapter, we will examine the distribution of the \(k\)th smallest.
Definition 40.1 Let \(X_1, \dots, X_n\) be random variables. Then, \(X_{(k)}\) denotes the \(k\)th smallest number and is called the \(k\)th order statistic.
For example, if we observe \[ X_1 = 0.87, X_2 = 1.34, X_3 = 0.22, X_4 = 0.36, \] then the order statistics are \[ X_{(1)} = 0.22, X_{(2)} = 0.36, X_{(3)} = 0.87, X_{(4)} = 1.34. \]
We will restrict our attention to order statistics of i.i.d. continuous random variables. The theory for discrete random variables is much more difficult because of the positive probability of ties.
Proposition 40.1 (Distribution of order statistics) Let \(X_1, \dots, X_n\) be i.i.d. continuous random variables with PDF \(f(x)\) and CDF \(F(x)\). Then, the PDF of the \(k\)th order statistic, \(X_{(k)}\), is \[
f_{X_{(k)}}(t) = n \binom{n-1}{k-1} F(t)^{k-1} f(t) (1 - F(t))^{n-k}.
\tag{40.1}\]
Proof
By the probability integral transform (Theorem 20.1) \(U_i = F(X_i)\) follows a standard uniform distribution, and since \(F\) is monotone, \(U_{(k)} = F(X_{(k)})\). So we can derive the PDF of \(U_{(k)}\) and transform it to obtain the PDF of \(X_{(k)}\). First, we determine the CDF of \(U_{(k)}\). In order for \(\{ U_{(k)} \leq t \}\), \(k\) or more of \(U_1, \dots, U_n\) must be \(\leq t\). That is, \[ F_{U_{(k)}}(t) = \sum_{j=k}^n \binom{n}{j} t^j (1 - t)^{n-j}. \] Now, we take the derivative to obtain the PDF of \(U_{(k)}\). \[\begin{alignat*}{2}
f_{U_{(k)}}(t) &= F'_{U_{(k)}}(t) \\
&= \sum_{j=k}^n j\binom{n}{j} t^{j-1} (1 - t)^{n-j} - \sum_{j=k}^{n-1} (n-j)\binom{n}{j} t^j (1 - t)^{n - j - 1} && (\text{the $j =n$ term in the second sum is zero})\\
&= \sum_{j=k}^n n\binom{n-1}{j-1} t^{j-1} (1 - t)^{n-j} - \sum_{j=k}^{n-1} n\binom{n-1}{j} t^j (1 - t)^{n - j - 1} \qquad \qquad && (\text{choosing captains}) \\
&= \sum_{j=k}^n n\binom{n-1}{j-1} t^{j-1} (1 - t)^{n-j} - \sum_{j'=k+1}^{n} n\binom{n-1}{j'-1} t^{j'-1} (1 - t)^{n - j'} && (j' = j+1)
\end{alignat*}\] Now, the second sum cancels all terms in the first sum, except for the one, leaving us with \[
f_{U_{(k)}}(t) = n\binom{n-1}{k-1} t^{k-1} (1 - t)^{n-k}.
\] Now, for the general case, the CDF is \[
F_{X_{(k)}}(t) = P(X_{(k)} \leq t) = P(F^{-1}(U_{(k)}) \leq t) = P(U_{(k)} \leq F(t)) = F_{U_{(k)}}(F(t)).
\] Therefore, the PDF is \[\begin{align}
f_{X_{(k)}}(t) = F'_{X_{(k)}}(t) &= f_{U_{(k)}}(F(t)) f(t) \\
&= n\binom{n-1}{k-1} F(t)^{k-1} (1 - F(t))^{n-k} f(t).
\end{align}\]
Here is a heuristic proof that is less technical.
In order to compute the estimate the probability of \(X_{(k)}\) be near \(t\), we need \(k-1\) X’s to the left of \(t\) and \(n-k\) X’s to the right of \(t\). There are \[
\frac{n!}{(k-1)!1!(n-k)!} = n \binom{n-1}{k-1}
\] ways to arrange \(X_1, \dots, X_n\) to have the \(k\)th smallest near \(t\). For each of the \(k-1\) to the left of \(t\), the probability of each event is \(P(X \leq t) = F(t)\). For each of the \(n-k\) to the right of \(t\), the probability of each event is \(P(X \geq t) = 1 - F(t)\). Finally, the probability of the \(k\)th smallest being near \(t\) is \(f(t) \, dt\). Hence, \[\begin{align*}
f_{X_{(k)}}(t) \, dt &\approx P\left( t - \frac{\varepsilon}{2} \leq X_{(k)} \leq t + \frac{\varepsilon}{2} \right) \\
&\approx n \binom{n-1}{k-1} F(t)^{k-1} (1 - F(t))^{n-k} f(t) \, dt.
\end{align*}\] Thus, the PDF of \(X_{(k)}\) matches with the one given above.
Check that when we set \(k = 1\) or \(k = n\) in Equation 40.1, we recover the PDF of the minimum and maximum from Chapter 39.
Example 40.1 (Estimating the median of the exponential distribution) In Example 19.4, we showed that the median of an \(\text{Exponential}(\lambda)\) random variable is \(-\frac{\log(.5)}{\lambda}\).
If we wanted to estimate this median, there are two natural options:
We could estimate \(\lambda\) and plug this estimate into the above function. That is, \(-\log(.5) \bar X\).
We could estimate the median by the sample median \(X_{(n+1)/2}\) (at least when \(n\) is odd).
Which estimator is better?
The first estimator is a linear transformation of the sample mean, so we can easily calculate its expectation and variance: \[\begin{align}
\text{E}\!\left[ -\log(.5)\bar X \right] &= -\log(.5) \text{E}\!\left[ \bar X \right] = -\frac{\log(.5)}{\lambda} \\
\text{Var}\!\left[ -\log(.5)\bar X \right] &= \log(.5)^2 \text{Var}\!\left[ \bar X \right] = \frac{\log(.5)^2}{n\lambda^2}.
\end{align}\] We see that \(-\log(.5)\bar X\) is unbiased with \[
\text{MSE}\!\left[ -\log(.5)\bar X \right] = \frac{\log(.5)^2}{n\lambda^2}.
\tag{40.2}\]
The second estimator is an order statistic. To determine its expectation and variance, we will first need to determine its distribution. To make things simpler, we note that each \(X_i\) can be represented as a scale transformation of a standard exponential, \(X_i = \frac{1}{\lambda} Z_i\), so \[X_{(n+1)/2} = \frac{1}{\lambda} Z_{(n+1)/2}.\] So we just need to determine the distribution of the order statistics of i.i.d. \(\textrm{Exponential}(\lambda=1)\) random variables.
By Proposition 40.1, \(Z_{(n+1)/2}\) has PDF \[\begin{align}
f_{X_{(n+1)/2}}(t) &= n \binom{n-1}{(n-1)/2} F(t)^{(n-1)/2} f(t) (1 - F(t))^{(n-1)/2} \\
&= n \binom{n-1}{(n-1)/2} (1 - e^{-t})^{(n-1)/2} e^{-t} (e^{-t})^{(n-1)/2} \\
&= n \binom{n-1}{(n-1)/2} (1 - e^{-t})^{(n-1)/2} (e^{-t})^{(n+1)/2}.
\end{align}\] This PDF is impractical to integrate by hand, so we will use R to integrate it numerically for a particular value of \(n\). When \(n=9\), the first two moments of \(Z_{(n+1)/2}\) are
so the expectation and variance are \[\begin{align}
\text{E}\!\left[ Z_{(n+1)/2} \right] &\approx 0.7456 \\
\text{Var}\!\left[ Z_{(n+1)/2} \right] &= \text{E}\!\left[ Z_{(n+1)/2}^2 \right] - \text{E}\!\left[ Z_{(n+1)/2} \right]^2 \approx 0.6721 - (0.7456)^2 \approx 0.1162.
\end{align}\]
Comparing this with Equation 40.2 for \(n = 9\), this estimator has a larger MSE, so the first estimator was better.
Other order statistics besides the median, minimum, and maximum can be useful.
Example 40.2 (Interquartile range) If we observe \(X_1, \dots, X_n\), what is the best way to visualize the data? One way is using a boxplot.
A boxplot displays five order statistics: the minimum, the first quartile, the median, the third quartile, and the maximum. The width of the “box” is the distance between the first and third quartiles, called the interquartile range.
How large is the interquartile range if \(X_1, \dots, X_n\) are i.i.d. \(\textrm{Uniform}(a= 0, b= 1)\)? To be concrete, suppose \(n = 9\) so that the first and third quartiles are \(X_{(3)}\) and \(X_{(7)}\), respectively, and the interquartile range is \(X_{(7)} - X_{(3)}\).
To calculate \(\text{E}\!\left[ X_{(7)} - X_{(3)} \right]\), we need the distributions of \(X_{(3)}\) and \(X_{(7)}\). By Proposition 40.1, the PDF of the \(k\)th order statistic of \(n=9\) i.i.d. standard uniform random variables is \[
f_{X_{(k)}}(t) = 9\binom{8}{k-1} t^{k-1} (1 - t)^{9 - k}.
\tag{40.3}\]
Therefore, the expected interquartile range is \[ \text{E}\!\left[ X_{(7)} - X_{(3)} \right] = \text{E}\!\left[ X_{(7)} \right] - \text{E}\!\left[ X_{(3)} \right] = \frac{7}{10} - \frac{3}{10} = 0.4. \]
Note that the “true” interquartile range for the standard uniform distribution is \[ F^{-1}(0.75) - F^{-1}(0.25) = 0.75 - 0.25 = 0.5, \] so the estimator above is negatively biased for the “true” interquartile range.
40.1 Joint Distributions of Order Statistics
What about the joint distribution of two or more order statistics? Even if \(X_1, \dots, X_n\) are assumed independent, the corresponding order statistics \(X_{(1)}, \dots, X_{(n)}\) are not independent because each order statistic cannot be less than the previous one. In this section, we derive a formula for the exact joint distribution.
Proposition 40.2 (Joint distribution of two order statistics) Let \(X_1, \dots, X_n\) be i.i.d. continuous random variables with PDF \(f(x)\) and CDF \(F(x)\). Then, the joint PDF of the \(j\)th and \(k\)th order statistics, \(X_{(j)}\) and \(X_{(k)}\), for \(j < k\), is \[
f_{X_{(j)}, X_{(k)}}(t_j, t_k) = \frac{n!}{(j-1)! (k-j-1)! (n-k)!} F(t_j)^{j-1} f(t_j) (F(t_k) - F(t_j))^{k-j-1} f(t_k) (1 - F(t_k))^{n-k}; t_j < t_k.
\tag{40.4}\]
Proof
The rigorous proof is too technical to present here; instead, we will present a heuristic proof similar to the one in Proposition 40.1.
There are \[
\frac{n!}{(j-1)!1!(k-j-1)!1!(n-k)!}
\] ways to arrange \(X_1, \dots, X_n\) so that \(j-1\) are to the left of \(t_j\), one is near \(t_j\), \(k-j-1\) are between \(t_j\) and \(t_k\), one is near \(t_k\), and \(n-k\) are to the right of \(t_k\).
Then, the probability that \(X_{(j)}\) and \(X_{(k)}\) are near \(t_j\) and \(t_k\), respectively, can be approximated as \[\begin{align*}
f_{X_{(j)}, X_{(k)}}(t_j, t_k) \, dt_j \, dt_k &\approx P\left( t_j - \frac{\varepsilon}{2} < X_{(j)} < t_j + \frac{\varepsilon}{2}, t_k - \frac{\varepsilon}{2} < X_{(k)} < t_k + \frac{\varepsilon}{2} \right) \\
&\approx \frac{n!}{(j-1)!1!(k-j-1)!1!(n-k)!} (F(t_j))^{j-1} f(t_j) (F(t_k) - F(t_j))^{k-j-1} f(t_k) (1 - F(t_k))^{n-k} \, dt_j \, dt_k.
\end{align*}\] Thus, we recover the claimed PDF.
Example 40.3 (Variance of the interquartile range) Continuing Example 40.2, how variable is the interquartile range \(X_{(7)} - X_{(3)}\)?
To calculate \(\text{Var}\!\left[ X_{(7)} - X_{(3)} \right]\), we expand using properties of covariance: \[\begin{align}
\text{Var}\!\left[ X_{(7)} - X_{(3)} \right] &= \text{Cov}\!\left[ X_{(7)} - X_{(3)}, X_{(7)} - X_{(3)} \right] \\
&= \text{Var}\!\left[ X_{(7)} \right] + \text{Var}\!\left[ X_{(3)} \right] - 2 \text{Cov}\!\left[ X_{(3)}, X_{(7)} \right].
\end{align}\] The variances can be calculated using Equation 40.3, but the covariance requires the joint distribution of \(X_{(3)}\) and \(X_{(7)}\).
Finally, we can calculate the variance of the interquartile range: \[ \text{Var}\!\left[ X_{(7)} - X_{(3)} \right] = \frac{21}{1100} + \frac{21}{1100} - 2\frac{9}{1100} = \frac{6}{275}. \]
We can derive the joint distribution of three or more order statistics in the same way. The next result states the joint distribution of all\(n\) order statistics.
Proposition 40.3 (Full joint distribution of order statistics) Let \(X_1, \dots, X_n\) be i.i.d. continuous random variables with PDF \(f(x)\) and CDF \(F(x)\). Then, the joint PDF of the order statistics is \[
f_{X_{(1)}, X_{(2)}, \dots, X_{(n)}}(t_1, t_2, \dots, t_n) = n! f(t_1) f(t_2) \cdots f(t_n); t_1 < t_2 < \dots < t_n.
\tag{40.5}\]
Proof
We present another heuristic proof.
There are \(n!\) ways to arrange which of the \(X_1, \dots, X_n\) will be \(X_{(1)}, \dots, X_{(n)}\). Hence, \[\begin{align*}
f_{X_{(1)}, \dots, X_{(n)}} \, dt_1 \cdots dt_n &\approx P\left( t_1 - \frac{\varepsilon}{2} < X_{(1)} < t_1 + \frac{\varepsilon}{2}, \dots, t_n - \frac{\varepsilon}{2} < X_{(n)} < t_n + \frac{\varepsilon}{2} \right) \\
&\approx n! f(t_1) \cdots f(t_n) \, dt_1 \cdots dt_n,
\end{align*}\] and we recover the desired PDF.
Example 40.4 (Distance between trucks) Three trucks are traveling along a 1-mile stretch of road. If each truck is equally likely to be anywhere along the stretch of road, independently of the other trucks, what is the probability that no 2 trucks are within \(d\) miles of one another.
Let \(X_i\) denote the position of the \(i\)th truck.
Exercise 40.1 (Median for an even number of observations) Let \(X_1, \dots, X_n\) be i.i.d. continuous random variables. When \(n\) is even, the median is defined as the average of the two middle observations: \[ \frac{X_{\left(\frac{n}{2}\right)} + X_{\left( \frac{n}{2} + 1 \right)}}{2}. \]
Exercise 40.2 (Estimating the center of the normal distribution) Let \(X_1, \dots, X_7\) be i.i.d. \(\text{Normal}(\mu, \sigma)\). The MLE of \(\mu\) is \(\bar X\), but another estimator of \(\mu\) is the sample median \(X_{(4)}\).
Show that the sample median \(X_{(4)}\) is also unbiased for estimating \(\mu\), and compare its MSE with that of \(\bar X\).
Exercise 40.3 (Estimating the center of the uniform distribution) If we observe i.i.d. random variables \(X_1, \dots, X_n\) from a \(\textrm{Uniform}(a= \theta - 1, b= \theta + 1)\), how do we estimate the center \(\theta\)? For simplicity, we will assume that \(n\) is odd.
First, let \(U_1, \dots, U_n\) be i.i.d. \(\textrm{Uniform}(a= 0, b= 1)\). Calculate the expectation and variance of the mean \(\bar U\), median \(U_{\left( \frac{n+1}{2} \right)}\), and the midrange\(\displaystyle \frac{U_{(1)} + U_{(n)}}{2}\).
Now, let \(X_1, \dots, X_n\) be i.i.d. \(\textrm{Uniform}(a= \theta - 1, b= \theta + 1)\). Calculate the MSEs of the mean \(\bar X\), median \(X_{\left( \frac{n+1}{2} \right)}\), and the midrange \(\displaystyle \frac{X_{(1)} + X_{(n)}}{2}\) for estimating \(\theta\). Which estimator is best?
Hint: Let \(X_i = 2 U_i - 1\) so that you can use the results from part a.