41  Beta Distribution

In this chapter, we introduce a new named continuous distribution, the beta distribution. It arises most naturally as the distribution of an order statistic of i.i.d. standard uniform random variables.

Let \(U_1, \dots, U_n\) be i.i.d. \(\textrm{Uniform}(a= 0, b= 1)\) random variables. What is the distribution of \(U_{(k)}\)? We already computed this in Example 40.2 using Proposition 40.1. The next proposition restates this result.

Proposition 41.1 (Order statistics of the uniform) Let \(U_1, \dots, U_n\) be i.i.d. \(\textrm{Uniform}(a= 0, b= 1)\) random variables. The PDF of \(U_{(k)}\) is \[ f_{U_{(k)}}(t) = \frac{n!}{(k-1)! (n-k)!} t^{k-1} (1 - t)^{n-k}; \qquad 0 < t < 1. \tag{41.1}\]

If we let \(\alpha=k\) and \(\beta=n-k+1\), we can write Equation 41.1 as \[ f_{U_{(k)}}(t) = \frac{(\alpha + \beta - 1)!}{(\alpha - 1)! (\beta - 1)!} t^{\alpha-1} (1 - t)^{\beta-1}; \qquad 0 < t < 1. \tag{41.2}\]

The distribution of \(U_{(k)}\) is called the \(\textrm{Beta}(\alpha= k, \beta= n-k+1)\) distribution. Notice that the standard uniform distribution is a special case of the beta distribution when \(n = k = 1\) so that \(\alpha = \beta = 1\).
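We can check this identification by simulation. The sketch below, with the arbitrary choices \(n = 5\) and \(k = 2\), compares simulated order statistics to the \(\text{Beta}(2, 4)\) distribution.

```r
# Simulation sketch: the k-th order statistic of n i.i.d. Uniform(0, 1) draws
# should follow a Beta(k, n - k + 1) distribution. The choices n = 5, k = 2
# are arbitrary.
set.seed(1)
n <- 5; k <- 2
u_k <- replicate(10000, sort(runif(n))[k])

# Compare the simulated draws to the Beta(2, 4) distribution.
ks.test(u_k, pbeta, shape1 = k, shape2 = n - k + 1)
```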

41.1 Generalizing the Beta Distribution

Because the beta distribution is supported on \((0, 1)\), it can be used to model data that are naturally proportions, such as relative humidity. Suppose we want to model the (daily average) relative humidity in Austin, Texas using a beta distribution. The daily averages from 2023 are shown below. Two candidates are \(\text{Beta}(\alpha=7, \beta=1)\) (shown in blue) and \(\text{Beta}(\alpha=7, \beta=2)\) (shown in orange).

Unfortunately, \(\beta=2\) places the distribution a bit too far to the left, while \(\beta=1\) places it too far to the right. Ideally, we would pick a value of \(\beta\) in between. Can we model the data using fractional values of \(\beta\) (and \(\alpha\))?

If we look at the PDF of the beta distribution (Equation 41.2), we see that \(\alpha\) and \(\beta\) appear inside factorials. We can generalize the beta distribution to non-integer values of \(\alpha\) and \(\beta\) by replacing the factorial by the gamma function (Definition 38.1), as we did for the gamma distribution in Definition 38.2.

Definition 41.1 (Beta distribution) A random variable \(X\) is said to follow a \(\text{Beta}(\alpha, \beta)\) distribution if its PDF is \[ f_X(x) = \begin{cases} \displaystyle\frac{\Gamma(\alpha +\beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} & 0 < x < 1\\ 0 & \text{otherwise} \end{cases}, \tag{41.3}\] where \(\alpha, \beta > 0\).
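In R, this PDF is available as the built-in dbeta(). As a sanity check, the sketch below compares a direct translation of Equation 41.3 to the built-in function at a non-integer (and arbitrary) choice of parameters.

```r
# A direct translation of Equation 41.3 using R's gamma() function. It should
# agree with the built-in dbeta(), including at non-integer parameters.
beta_pdf <- function(x, alpha, beta) {
  gamma(alpha + beta) / (gamma(alpha) * gamma(beta)) *
    x^(alpha - 1) * (1 - x)^(beta - 1)
}

x <- seq(0.01, 0.99, by = 0.01)
max(abs(beta_pdf(x, 7, 1.5) - dbeta(x, shape1 = 7, shape2 = 1.5)))
# essentially zero, up to floating-point error
```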

When \(\alpha\) and \(\beta\) are not integers, the beta distribution does not have a simple interpretation (unlike for integer values, where it describes an order statistic of i.i.d. uniforms). Nevertheless, non-integer parameters afford us flexibility in modeling.

Next, we calculate formulas for the expected value and variance.

Proposition 41.2 (Expectation and variance of beta) Let \(X\) be a \(\text{Beta}(\alpha, \beta)\) random variable.

\[ \begin{aligned} \text{E}\!\left[ X \right] &= \frac{\alpha}{\alpha+\beta} & \text{Var}\!\left[ X \right] &= \frac{\alpha \beta}{(\alpha+\beta+1)(\alpha+\beta)^2}. \end{aligned} \]

We start with the first moment. \[\begin{align*} \text{E}\!\left[ X \right] &= \int_0^1 x \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1} \, dx \\ &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \int_0^1 x^\alpha (1-x)^{\beta - 1} \, dx \\ &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \cdot \frac{\Gamma(\alpha+1) \Gamma(\beta)}{\Gamma(\alpha + \beta + 1)} \int_0^1 \underbrace{\frac{\Gamma(\alpha+\beta+1)}{\Gamma(\alpha+1) \Gamma(\beta)} x^{(\alpha+1)-1} (1-x)^{\beta-1}}_{\text{PDF of $\text{Beta}(\alpha+1,\beta)$}} \, dx \\ &= \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+1)} \\ &= \frac{\alpha}{\alpha+\beta}. \end{align*}\]

Similarly, we compute the second moment. \[\begin{align*} \text{E}\!\left[ X^2 \right] &= \int_0^1 x^2 \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1-x)^{\beta-1} \, dx \\ &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \int_0^1 x^{\alpha+1} (1-x)^{\beta - 1} \, dx \\ &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \cdot \frac{\Gamma(\alpha+2) \Gamma(\beta)}{\Gamma(\alpha+\beta+2)} \int_0^1 \underbrace{\frac{\Gamma(\alpha+\beta+2)}{\Gamma(\alpha+2) \Gamma(\beta)} x^{(\alpha+2)-1} (1-x)^{\beta-1}}_{\text{PDF of $\text{Beta}(\alpha+2,\beta)$}} \, dx \\ &= \frac{\Gamma(\alpha+2)}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+2)} \\ &= \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}. \end{align*}\]

Hence, the variance is \[ \text{Var}\!\left[ X \right] = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)} - \left( \frac{\alpha}{\alpha+\beta} \right)^2 = \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}. \]
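These formulas are easy to check by simulation; the sketch below uses the arbitrary (non-integer) choice \(\alpha = 7\), \(\beta = 1.5\).

```r
# Simulation check of Proposition 41.2 for one arbitrary choice of parameters.
set.seed(1)
alpha <- 7; beta <- 1.5
x <- rbeta(1e6, shape1 = alpha, shape2 = beta)

c(simulated = mean(x), formula = alpha / (alpha + beta))
c(simulated = var(x),
  formula = alpha * beta / ((alpha + beta + 1) * (alpha + beta)^2))
```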

The support of the beta distribution is \((0, 1)\), so if we shift or scale a beta random variable, it will no longer follow a beta distribution. However, location-scale transformations of a beta random variable can still be used to model random variables with bounded support. For example, to model a continuous random variable \(Y\) that only takes on values in the interval \((a, b)\), we could set \[ Y = (b - a) X + a, \] where \(X\) is a beta random variable. Note that \(Y\) itself does not follow a beta distribution, since its support is \((a, b)\) rather than \((0, 1)\).
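A minimal sketch of this kind of rescaling looks like the following; the bounds and parameters here are arbitrary illustrations.

```r
# Sketch: modeling a quantity supported on (a, b) by rescaling a beta draw.
a <- 50; b <- 90
x <- rbeta(1000, shape1 = 2, shape2 = 3)  # Beta(2, 3), supported on (0, 1)
y <- (b - a) * x + a                      # supported on (50, 90); not beta
range(y)
```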

41.2 Maximum Likelihood Estimation

Now that we can model data using the beta distribution with non-integer values of \(\alpha\) and \(\beta\), the question becomes: how do we find the best values of \(\alpha\) and \(\beta\)? The answer, of course, is maximum likelihood, just as in Section 38.2.

The likelihood for an i.i.d. sample \(X_1, \dots, X_n\) from a \(\text{Beta}(\alpha,\beta)\) distribution is \[ L(\alpha, \beta) = \prod_{i=1}^n \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} X_i^{\alpha - 1} (1 - X_i)^{\beta - 1}, \] so the log-likelihood is \[ \ell(\alpha, \beta) = n\log\Gamma(\alpha+\beta) - n\log\Gamma(\alpha) - n\log\Gamma(\beta) + (\alpha - 1)\sum_{i=1}^n \log X_i + (\beta - 1) \sum_{i=1}^n \log(1 - X_i).\]

Equivalently, we can instead minimize \[ J(\alpha, \beta) = -\frac{1}{n} \ell(\alpha, \beta). \]
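In R, \(J\) can be written directly from the formula above. The sketch below assumes the observations are stored in a vector x with values strictly between 0 and 1, and uses lgamma() (the log of the gamma function) for numerical stability.

```r
# The objective J(alpha, beta) = -(1/n) * log-likelihood, assuming the data
# are in a vector x with entries strictly between 0 and 1.
neg_loglik <- function(alpha, beta, x) {
  -(lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta) +
      (alpha - 1) * mean(log(x)) + (beta - 1) * mean(log(1 - x)))
}
```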

Recalling the digamma function \(\psi(\alpha) \overset{\text{def}}{=}\frac{d}{d\alpha} \log \Gamma(\alpha)\) from Example 38.2, we can calculate the gradient of \(J\) to obtain the first-order conditions: \[ \begin{align} \frac{\partial J}{\partial\alpha} &= -\psi(\alpha+\beta) + \psi(\alpha) - \frac{1}{n}\sum_{i=1}^n \log X_i = 0 \\ \frac{\partial J}{\partial\beta} &= -\psi(\alpha+\beta) + \psi(\beta) - \frac{1}{n}\sum_{i=1}^n \log (1 - X_i) = 0 \\ \end{align} \tag{41.4}\]

Solving these equations by hand is hopeless because there are \(\alpha\)s and \(\beta\)s nested inside digamma functions. However, for any given data set, we can find the optimal \(\alpha\) and \(\beta\) numerically using Newton’s method, as we did in Example 38.2.

Example 41.1 (Newton’s method for maximum likelihood) Since we are optimizing over two variables, \(\alpha\) and \(\beta\), we need a multivariable version of Equation 38.10. The multivariable analog is \[ \begin{pmatrix} \alpha \\ \beta \end{pmatrix} \leftarrow \begin{pmatrix} \alpha \\ \beta \end{pmatrix} - \Big[\nabla^2 J(\alpha, \beta)\Big]^{-1} \nabla J(\alpha, \beta), \tag{41.5}\] where \(\nabla J\) is the gradient vector of first (partial) derivatives and \(\nabla^2 J\) is the Hessian matrix of second (partial) derivatives.

We already calculated the gradient in Equation 41.4. That is, the gradient vector is \[ \nabla J(\alpha, \beta) = \begin{pmatrix} -\psi(\alpha+\beta) + \psi(\alpha) - \frac{1}{n}\sum_{i=1}^n \log X_i \\ -\psi(\alpha+\beta) + \psi(\beta) - \frac{1}{n}\sum_{i=1}^n \log (1 - X_i) \end{pmatrix} \]

To calculate the Hessian, we need the second partial derivatives:

\[\begin{align} \frac{\partial^2 J}{\partial\alpha^2} &= -\psi'(\alpha+\beta) + \psi'(\alpha) \\ \frac{\partial^2 J}{\partial\alpha \partial\beta} = \frac{\partial^2 J}{\partial\beta \partial\alpha} &= -\psi'(\alpha+\beta) \\ \frac{\partial^2 J}{\partial\beta^2} &= -\psi'(\alpha+\beta) + \psi'(\beta) \\ \end{align}\]

So the Hessian is the matrix \[ \nabla^2 J(\alpha, \beta) = \begin{bmatrix} \frac{\partial^2 J}{\partial \alpha^2} & \frac{\partial^2 J}{\partial \alpha \partial\beta} \\ \frac{\partial^2 J}{\partial \alpha \partial\beta} & \frac{\partial^2 J}{\partial \beta^2} \end{bmatrix} = \begin{bmatrix} -\psi'(\alpha+\beta) + \psi'(\alpha) & -\psi'(\alpha+\beta) \\ -\psi'(\alpha+\beta) & -\psi'(\alpha+\beta) + \psi'(\beta) \end{bmatrix} \]

Now, we implement Newton's method in R on the humidity data from above.
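Below is a minimal sketch of this iteration. It assumes the daily averages are stored as proportions strictly between 0 and 1 in a vector named humidity (the variable name is our assumption); R's digamma() and trigamma() functions compute \(\psi\) and \(\psi'\).

```r
# Newton's method (Equation 41.5) for the beta MLE. We assume the daily
# averages are stored as proportions in (0, 1) in a vector named `humidity`.
newton_beta_mle <- function(x, alpha = 1, beta = 1, n_iter = 100, tol = 1e-10) {
  mean_log_x  <- mean(log(x))
  mean_log_1x <- mean(log(1 - x))
  for (i in 1:n_iter) {
    # Gradient of J (Equation 41.4)
    grad <- c(digamma(alpha) - digamma(alpha + beta) - mean_log_x,
              digamma(beta)  - digamma(alpha + beta) - mean_log_1x)
    # Hessian of J; psi' is the trigamma function
    hess <- matrix(c(trigamma(alpha) - trigamma(alpha + beta), -trigamma(alpha + beta),
                     -trigamma(alpha + beta), trigamma(beta) - trigamma(alpha + beta)),
                   nrow = 2)
    # Newton update (Equation 41.5)
    step  <- solve(hess, grad)
    alpha <- alpha - step[1]
    beta  <- beta - step[2]
    if (max(abs(step)) < tol) break
  }
  c(alpha = alpha, beta = beta)
}

mle <- newton_beta_mle(humidity)
mle
```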

Notice that the MLE for \(\beta\) is approximately \(1.63\), which is in between the integer values of \(1\) and \(2\) that we tried previously.

Finally, we plot the estimated beta distribution on top of the data.
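A minimal plotting sketch, again assuming the humidity vector and the mle estimates from the sketch above:

```r
# Overlay the fitted beta density on a histogram of the data, assuming the
# `humidity` vector and `mle` estimates from the sketch above.
hist(humidity, freq = FALSE, breaks = 20,
     xlab = "daily average relative humidity", main = "")
curve(dbeta(x, shape1 = mle["alpha"], shape2 = mle["beta"]),
      from = 0, to = 1, add = TRUE, lwd = 2)
```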

41.3 Connection between the Gamma and Beta Distributions

There is a deep connection between the gamma distribution introduced in Chapter 38 and the beta distribution. In particular, if \(X\) and \(Y\) are independent gamma random variables, then it turns out that \(\displaystyle \frac{X}{X + Y}\) (which is always a number between \(0\) and \(1\)) follows a beta distribution.

Proposition 41.3 (Gamma-beta connection) Let \(X \sim \text{Gamma}(\alpha, \lambda)\) and \(Y \sim \text{Gamma}(\beta, \lambda)\) be independent random variables. Then, \[ R = \frac{X}{X + Y} \sim \text{Beta}(\alpha, \beta). \tag{41.6}\]

Moreover, \(R\) is independent of \(X + Y\).

Proof

Let \(S = X + Y\). We will determine, in order, the distributions of

  1. \(X \mid S\),
  2. \(R \mid S\), and
  3. \(R\).

First, we want to know the distribution of \(X \mid S\), but it is easier to determine the distribution of \(S \mid X\). Therefore, we should use Bayes’ rule for random variables (Proposition 26.2) to relate the two conditional distributions: \[ f_{X \mid S}(x \mid s) = \frac{f_{X}(x) f_{S \mid X}(s \mid x)}{f_S(s)}. \tag{41.7}\]

Looking at each term on the right-hand side of Equation 41.7:

  • The distribution of \(X\) is \(\text{Gamma}(\alpha, \lambda)\) by assumption.
  • The distribution of \(S\) is \(\text{Gamma}(\alpha + \beta, \lambda)\), as we determined in Equation 38.7.
  • The conditional distribution \(S \mid X\) can be derived as follows:
    • \((X + Y) \mid \{ X = x \}\) has the same distribution as \((x + Y) \mid \{ X = x \}\), and since \(X\) and \(Y\) are independent, the same distribution as \(x + Y\).
    • But \(x+Y\) is just a location transformation of \(Y\), so its PDF is \(f_{S \mid X}(s \mid x) = f_Y(s - x)\).

Substituting the corresponding PDFs into Equation 41.7, the conditional PDF of \(X \mid S\) is \[\begin{align} f_{X \mid S}(x \mid s) &= \frac{f_{X}(x) f_{Y}(s - x)}{f_S(s)} \\ &= \frac{\left(\frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x}\right) \left(\frac{\lambda^\beta}{\Gamma(\beta)} (s - x)^{\beta - 1} e^{-\lambda (s - x)}\right) }{\frac{\lambda^{\alpha+\beta}}{\Gamma(\alpha+\beta)} s^{\alpha + \beta - 1} e^{-\lambda s}} \\ &= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \left(\frac{x}{s}\right)^{\alpha - 1} \left(1 - \frac{x}{s}\right)^{\beta - 1} \frac{1}{s}; \qquad 0 < x < s. \end{align}\]

Next, we determine the distribution of \(R \mid S\) as follows:

  • Note that \(R = \frac{X}{S}\), and \(\frac{X}{S} \mid \{ S = s \}\) has the same distribution as \(\frac{X}{s} \mid \{S = s\}\).
  • But this is just a scale transformation of \(X\) (under its conditional distribution given \(S\)), so its PDF is \(f_{R \mid S}(r \mid s) = s f_{X \mid S}(sr \mid s)\).

Substituting the known expression for \(f_{X \mid S}\) from above, we obtain the conditional PDF of \(R \mid S\): \[\begin{align} f_{R \mid S}(r \mid s) &= s \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} r^{\alpha - 1} \left(1 - r\right)^{\beta - 1} \frac{1}{s} \\ &= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} r^{\alpha - 1} \left(1 - r\right)^{\beta - 1}; & 0 < r < 1. \end{align}\]

But this conditional PDF does not involve \(s\) at all! This implies that \(R\) is independent of \(S\), so this must also be the marginal PDF of \(R\). To be precise, we can use the Law of Total Probability to obtain the marginal PDF: \[ f_R(r) = \int_{-\infty}^\infty f_S(s) f_{R \mid S}(r \mid s)\,ds = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} r^{\alpha - 1} \left(1 - r\right)^{\beta - 1} \underbrace{\int_{-\infty}^\infty f_S(s) \,ds}_1, \] which we recognize from Equation 41.3 as the PDF of a \(\text{Beta}(\alpha, \beta)\) random variable.

Proposition 41.3 is a beautiful connection with a simple interpretation. If \(X\) and \(Y\) represent “times” spent doing two activities (e.g., doing homework and studying for exams), then \(\frac{X}{X + Y}\) represents the fraction of time in the first activity. If we model \(X\) and \(Y\) as independent gamma random variables, then Proposition 41.3 says that \(\frac{X}{X + Y}\) will be beta and, moreover, will be independent of \(X + Y\), the total time spent doing both activities.
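This connection is also easy to check by simulation; the sketch below uses the arbitrary parameters \(\alpha = 3\), \(\beta = 2\), \(\lambda = 1\).

```r
# Simulation sketch of Proposition 41.3 with arbitrary parameters.
set.seed(1)
alpha <- 3; beta <- 2; lambda <- 1
x <- rgamma(1e5, shape = alpha, rate = lambda)
y <- rgamma(1e5, shape = beta, rate = lambda)
r <- x / (x + y)

ks.test(r, pbeta, shape1 = alpha, shape2 = beta)  # consistent with Beta(3, 2)
cor(r, x + y)                                     # near zero, as independence suggests
```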

Proposition 41.3 offers a charming calculus-free derivation of the expectation of the beta distribution.

Example 41.2 (Another proof of the beta expectation) Let \(R\) be \(\text{Beta}(\alpha, \beta)\). Then, by Proposition 41.3, we can write \[ R = \frac{X}{X + Y}, \] where \(X\) and \(Y\) are independent \(\text{Gamma}(\alpha, \lambda)\) and \(\text{Gamma}(\beta, \lambda)\), respectively.

Moreover, \(R\) is independent of \(S = X + Y\), so by Theorem 24.3, \[\begin{align} \text{E}\big[\underbrace{RS}_{X}\big] &= \text{E}\!\left[ R \right] \text{E}\big[\underbrace{S}_{X+Y}\big]. \end{align}\] Rearranging terms, we see that \[ \text{E}\!\left[ R \right] = \frac{\text{E}\!\left[ X \right]}{\text{E}\!\left[ X + Y \right]} = \frac{\frac{\alpha}{\lambda}}{\frac{\alpha+\beta}{\lambda}} = \frac{\alpha}{\alpha + \beta}, \] where we used the known expectation of the gamma distribution from Proposition 38.2.

41.4 Exercises

Exercise 41.1 (Second-price auction) In a second-price auction, the winner is the highest bidder, but they only pay the amount of the next-highest bidder. If there are five participants in an auction, and their bids are independent and uniformly distributed between \(a\) and \(b\), how much is the winner expected to pay?

Hint: Start with the case where \(a = 0\) and \(b = 1\).

Exercise 41.2 (Moments of the beta distribution) Let \(X \sim \text{Beta}(\alpha,\beta)\). Show that the \(k\)th moment of \(X\) is \[ \text{E}\!\left[ X^k \right] = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+k)} \cdot \frac{\Gamma(\alpha+k)}{\Gamma(\alpha)}. \]

Exercise 41.3 (Mode of the beta distribution) Suppose \(\alpha, \beta > 1\). Show that the beta PDF reaches its maximum value at \(x = \frac{\alpha-1}{\alpha+\beta-2}\).

Exercise 41.4 (Power of uniform is beta) Suppose \(U \sim \textrm{Uniform}(a= 0, b= 1)\). Show that \(X = U^k\) follows a beta distribution (with what parameters?) for \(k > 0\).

Exercise 41.5 (Generalizing the gamma-beta connection) Suppose that \(X_1, \dots, X_n\) are independent \(\text{Gamma}(\alpha_i, \lambda)\) random variables. Determine the distribution of \(\displaystyle \frac{X_1}{X_1 + \dots + X_n}\), and show that it is independent of \(X_1 + \dots + X_n\).

Hint: You should be able to do this with minimal calculation if you use the result of Proposition 41.3 and properties of the gamma distribution.

Exercise 41.6 (Another proof of beta variance) Let \(X\) and \(Y\) be independent \(\text{Gamma}(\alpha, \lambda)\) and \(\text{Gamma}(\beta, \lambda)\) random variables.

  1. Argue that \[ \text{E}\!\left[ \frac{X^c}{(X + Y)^c} \right] = \frac{\text{E}\!\left[ X^c \right]}{\text{E}\!\left[ (X + Y)^c \right]} \] for all real numbers \(c\) for which these expectations are finite.
  2. Use this fact to rederive the variance of the \(\text{Beta}(\alpha, \beta)\) distribution.

Exercise 41.7 (The \(F\)-distribution) Let \(X\) and \(Y\) be independent \(\text{Gamma}(\alpha, \lambda)\) and \(\text{Gamma}(\beta, \lambda)\) random variables.

  1. Express \(\frac{X}{Y}\) in terms of a beta random variable \(B\). Specify the distribution of \(B\). (Hint: Use the results of Proposition 41.3.)
  2. Use the result from part (a) to find the PDF of \(\frac{X}{Y}\).
  3. Find the PDF of \(W \overset{\text{def}}{=}\frac{X / \alpha}{Y / \beta}\). \(W\) is said to follow an \(F\)-distribution with \(2\alpha\) and \(2\beta\) degrees of freedom. (Hint: This is simply a scale-transformation of the random variable from part (b).)

Exercise 41.8 (Limiting distribution of order statistics) Let \(U_1, \dots, U_n\) be i.i.d. \(\textrm{Uniform}(a= 0, b= 1)\). For a fixed integer \(1 \leq k \leq n\), show that \(n U_{(k)}\) converges in distribution as \(n \to \infty\). Identify the limiting distribution.

Hint: Combine Proposition 41.1 with Proposition 41.3. You should be able to apply the Law of Large Numbers to the denominator.