41  Beta Distribution

In this chapter, we introduce a new named continuous distribution, the beta distribution. This distribution arises most naturally as the distribution of the order statistics of i.i.d. standard uniform random variables.

Let \(U_1, \dots, U_n\) be i.i.d. \(\textrm{Uniform}(a= 0, b= 1)\) random variables. What is the distribution of \(U_{(k)}\)? We already computed this in Example 40.2 using Proposition 40.1. The next proposition summarizes the conclusion.

Proposition 41.1 (Order statistics of the uniform) Let \(U_1, \dots, U_n\) be i.i.d. \(\textrm{Uniform}(a= 0, b= 1)\) random variables. The PDF of \(U_{(k)}\) is \[ f_{U_{(k)}}(t) = \frac{n!}{(k-1)! (n-k)!} t^{k-1} (1 - t)^{n-k}; 0 < t < 1. \tag{41.1}\]

If we let \(\alpha=k\) and \(\beta=n-k+1\), we can write Equation 41.1 as \[ f_{U_{(k)}}(t) = \frac{(\alpha + \beta - 1)!}{(\alpha - 1)! (\beta - 1)!} t^{\alpha-1} (1 - t)^{\beta-1}; 0 < t < 1. \tag{41.2}\]

The distribution of \(U_{(k)}\) is called the \(\textrm{Beta}(\alpha= k, \beta= n-k+1)\) distribution. Notice that the standard uniform distribution is a special case of the beta distribution when \(n = k = 1\) so that \(\alpha = \beta = 1\).
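
As a quick numerical check of Proposition 41.1, the short R sketch below simulates the \(k\)th order statistic of \(n\) standard uniforms and overlays the \(\text{Beta}(k, n-k+1)\) PDF; the values of \(n\), \(k\), and the number of replications are arbitrary illustrative choices.

```r
# Simulate U_(k) for n i.i.d. Uniform(0, 1) draws and compare its empirical
# density with the Beta(k, n - k + 1) PDF. (Illustrative parameter choices.)
set.seed(1)
n <- 5
k <- 2

u_k <- replicate(10000, sort(runif(n))[k])  # k-th order statistic, many times

hist(u_k, breaks = 50, freq = FALSE,
     main = "Second order statistic of 5 uniforms", xlab = "u")
curve(dbeta(x, shape1 = k, shape2 = n - k + 1), add = TRUE, lwd = 2)
```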

41.1 Generalizing the Beta Distribution

Because the beta distribution is supported on \((0, 1)\), it can be used to model data that is naturally a proportion, such as relative humidity. Suppose we want to model the (daily average) relative humidity in Austin, Texas using a beta distribution. The daily averages from 2023 are below. Some candidates are \(\text{Beta}(\alpha=7, \beta=1)\) (shown in blue) and \(\text{Beta}(\alpha=7, \beta=2)\) (shown in orange).

Unfortunately, \(\beta=2\) is a bit too far to the left, but \(\beta=1\) is too far to the right. Ideally, we would pick a value of \(\beta\) in between. Can we model the data using fractional values of \(\beta\) (and \(\alpha\))?

If we look at the PDF of the beta distribution (Equation 41.2), we see that \(\alpha\) and \(\beta\) appear inside factorials. We can generalize the beta distribution to non-integer values of \(\alpha\) and \(\beta\) by replacing the factorial by the gamma function (Definition 35.1), as we did for the gamma distribution in Definition 35.2.

Definition 41.1 (Beta distribution) A random variable \(X\) is said to follow a \(\text{Beta}(\alpha, \beta)\) distribution if its PDF is \[ f_X(x) = \begin{cases} \displaystyle\frac{\Gamma(\alpha +\beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} & 0 < x < 1\\ 0 & \text{otherwise} \end{cases}. \tag{41.3}\]
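
For a concrete check, Equation 41.3 can be compared against R's built-in `dbeta()` function for a non-integer choice of parameters; the parameter values below are arbitrary.

```r
# Compare Equation 41.3, written out directly, with R's dbeta().
alpha <- 7
beta <- 1.5
x <- seq(0.01, 0.99, by = 0.01)

f <- gamma(alpha + beta) / (gamma(alpha) * gamma(beta)) *
  x^(alpha - 1) * (1 - x)^(beta - 1)

max(abs(f - dbeta(x, shape1 = alpha, shape2 = beta)))  # should be essentially 0
```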

When \(\alpha\) and \(\beta\) are not integers, the beta distribution does not have a simple interpretation (unlike for integer values, where it can be interpreted as the order statistics of i.i.d. uniforms). Nevertheless, non-integer parameters afford us flexibility in modeling.

Next, we calculate formulas for the expected value and variance.

Proposition 41.2 (Expectation and variance of beta) Let \(X\) be a \(\text{Beta}(\alpha, \beta)\) random variable.

\[ \begin{aligned} \text{E}\!\left[ X \right] &= \frac{\alpha}{\alpha+\beta} & \text{Var}\!\left[ X \right] &= \frac{\alpha \beta}{(\alpha+\beta+1)(\alpha+\beta)^2}. \end{aligned} \]

We start with the first moment. \[\begin{align*} \text{E}\!\left[ X \right] &= \int_0^1 x \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1} \, dx \\ &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \int_0^1 x^\alpha (1-x)^{\beta - 1} \, dx \\ &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \cdot \frac{\Gamma(\alpha+1) \Gamma(\beta)}{\Gamma(\alpha + \beta + 1)} \int_0^1 \underbrace{\frac{\Gamma(\alpha+\beta+1)}{\Gamma(\alpha+1) \Gamma(\beta)} x^{(\alpha+1)-1} (1-x)^{\beta-1}}_{\text{PDF of $\text{Beta}(\alpha+1,\beta)$}} \, dx \\ &= \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+1)} \\ &= \frac{\alpha}{\alpha+\beta}. \end{align*}\]

Similarly, we compute the second moment. \[\begin{align*} \text{E}\!\left[ X^2 \right] &= \int_0^1 x^2 \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1-x)^{\beta-1} \, dx \\ &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \int_0^1 x^{\alpha+1} (1-x)^{\beta - 1} \, dx \\ &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \cdot \frac{\Gamma(\alpha+2) \Gamma(\beta)}{\Gamma(\alpha+\beta+2)} \int_0^1 \underbrace{\frac{\Gamma(\alpha+\beta+2)}{\Gamma(\alpha+2) \Gamma(\beta)} x^{(\alpha+2)-1} (1-x)^{\beta-1}}_{\text{PDF of $\text{Beta}(\alpha+2,\beta)$}} \, dx \\ &= \frac{\Gamma(\alpha+2)}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+2)} \\ &= \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}. \end{align*}\]

Hence, the variance is \[ \text{Var}\!\left[ X \right] = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)} - \left( \frac{\alpha}{\alpha+\beta} \right)^2 = \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}. \]
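
A Monte Carlo simulation provides a quick check of these formulas; the parameter values and sample size below are arbitrary.

```r
# Check the mean and variance formulas in Proposition 41.2 by simulation.
set.seed(1)
alpha <- 2.5
beta <- 4
x <- rbeta(100000, shape1 = alpha, shape2 = beta)

c(simulated = mean(x), formula = alpha / (alpha + beta))
c(simulated = var(x),
  formula = alpha * beta / ((alpha + beta + 1) * (alpha + beta)^2))
```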

The support of the beta distribution is \((0, 1)\), so a beta random variable that is shifted or scaled no longer follows a beta distribution. Nevertheless, location-scale transformations of the beta distribution can be used to model random variables with bounded support. For example, to model a continuous random variable \(Y\) that only takes on values in the interval \((a, b)\), we could write \[ Y = (b - a) X + a, \] where \(X\) is a beta random variable. Note that \(Y\) itself does not follow a beta distribution, since its support is \((a, b)\) rather than \((0, 1)\).
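
As a small illustration (with made-up endpoints and parameters), a variable supported on \((a, b)\) can be simulated by rescaling beta draws:

```r
# Simulate a variable supported on (a, b) = (50, 100) by rescaling Beta(7, 2) draws.
set.seed(1)
a <- 50
b <- 100
x <- rbeta(10000, shape1 = 7, shape2 = 2)  # supported on (0, 1)
y <- (b - a) * x + a                       # supported on (a, b)
range(y)
```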

41.2 Maximum Likelihood Estimation

Now that we can model data using the beta distribution with non-integer values of \(\alpha\) and \(\beta\), the question is how to find the best values of \(\alpha\) and \(\beta\). The answer, of course, is maximum likelihood, just as in Section 35.2.

The likelihood for an i.i.d. sample \(X_1, \dots, X_n\) from a \(\text{Beta}(\alpha,\beta)\) distribution is \[ L(\alpha, \beta) = \prod_{i=1}^n \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} X_i^{\alpha - 1} (1 - X_i)^{\beta - 1}, \] so the log-likelihood is \[ \ell(\alpha, \beta) = n\log\Gamma(\alpha+\beta) - n\log\Gamma(\alpha) - n\log\Gamma(\beta) + (\alpha - 1)\sum_{i=1}^n \log X_i + (\beta - 1) \sum_{i=1}^n \log(1 - X_i).\]
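
A direct transcription of this log-likelihood into R (a sketch, using `lgamma()` for \(\log\Gamma\); the data vector `x` is any sample of values strictly between 0 and 1) looks like this:

```r
# Log-likelihood of a Beta(alpha, beta) sample, transcribed from the formula above.
loglik_beta <- function(alpha, beta, x) {
  n <- length(x)
  n * lgamma(alpha + beta) - n * lgamma(alpha) - n * lgamma(beta) +
    (alpha - 1) * sum(log(x)) + (beta - 1) * sum(log(1 - x))
}

# Example on simulated data: evaluate the log-likelihood at the true parameters.
set.seed(1)
x <- rbeta(100, shape1 = 7, shape2 = 2)
loglik_beta(7, 2, x)
```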

Equivalently, to eliminate extraneous factors of \(n\), we can instead maximize \[ J(\alpha, \beta) = \frac{1}{n} \ell(\alpha, \beta). \]

Recalling the digamma function \(\psi(\alpha) \overset{\text{def}}{=}\frac{d}{d\alpha} \log \Gamma(\alpha)\) from Example 35.1, we can calculate the gradient of \(J\) to obtain the first-order conditions: \[ \begin{align} \frac{\partial J}{\partial\alpha} &= \psi(\alpha+\beta) - \psi(\alpha) + \frac{1}{n}\sum_{i=1}^n \log X_i = 0 \\ \frac{\partial J}{\partial\beta} &= \psi(\alpha+\beta) - \psi(\beta) + \frac{1}{n}\sum_{i=1}^n \log (1 - X_i) = 0 \\ \end{align} \tag{41.4}\]

Solving these equations by hand is hopeless because there are \(\alpha\)s and \(\beta\)s nested inside digamma functions. However, for any given data set, we can find the optimal \(\alpha\) and \(\beta\) numerically using Newton’s method, as we did in Example 35.1.

Example 41.1 (Newton’s method for maximum likelihood) Since we are optimizing over two variables, \(\alpha\) and \(\beta\), we need a multivariable version of Equation 35.10. The multivariable analog is \[ \begin{pmatrix} \alpha \\ \beta \end{pmatrix} \leftarrow \begin{pmatrix} \alpha \\ \beta \end{pmatrix} - \Big[J''(\alpha, \beta)\Big]^{-1} J'(\alpha, \beta), \tag{41.5}\] where \(J'\) is the gradient vector of first (partial) derivatives and \(J''\) is the Hessian matrix of second (partial) derivatives.

We already calculated the gradient in Equation 41.4. That is, the gradient vector is \[ J'(\alpha, \beta) = \begin{pmatrix} \psi(\alpha+\beta) - \psi(\alpha) + \frac{1}{n}\sum_{i=1}^n \log X_i \\ \psi(\alpha+\beta) - \psi(\beta) + \frac{1}{n}\sum_{i=1}^n \log (1 - X_i) \end{pmatrix} \]

To calculate the Hessian, we need the second partial derivatives:

\[\begin{align} \frac{\partial^2 J}{\partial\alpha^2} &= \psi'(\alpha+\beta) - \psi'(\alpha) \\ \frac{\partial^2 J}{\partial\alpha \partial\beta} = \frac{\partial^2 J}{\partial\beta \partial\alpha} &= \psi'(\alpha+\beta) \\ \frac{\partial^2 J}{\partial\beta^2} &= \psi'(\alpha+\beta) - \psi'(\beta) \\ \end{align}\]

So the Hessian is the matrix \[ \big[J''(\alpha, \beta)\big] = \begin{bmatrix} \psi'(\alpha+\beta) - \psi'(\alpha) & \psi'(\alpha+\beta) \\ \psi'(\alpha+\beta) & \psi'(\alpha+\beta) - \psi'(\beta) \end{bmatrix} \]

Now, we implement Newton’s method on the humidity data from above in R.
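
A minimal sketch of this implementation is below. It assumes the daily humidity values, as proportions in \((0, 1)\), are stored in a vector named `humidity` (a placeholder name), and the starting values and tolerance are illustrative choices.

```r
# Newton's method for the beta MLE (Equation 41.5), using digamma() and
# trigamma() for the digamma function and its derivative.
newton_beta_mle <- function(x, alpha = 1, beta = 1, tol = 1e-10, max_iter = 100) {
  mean_log_x  <- mean(log(x))      # (1/n) sum log X_i
  mean_log_1x <- mean(log(1 - x))  # (1/n) sum log(1 - X_i)

  theta <- c(alpha, beta)
  for (i in 1:max_iter) {
    a <- theta[1]
    b <- theta[2]

    # gradient of J (Equation 41.4)
    grad <- c(digamma(a + b) - digamma(a) + mean_log_x,
              digamma(a + b) - digamma(b) + mean_log_1x)
    if (max(abs(grad)) < tol) break

    # Hessian of J
    hess <- matrix(c(trigamma(a + b) - trigamma(a), trigamma(a + b),
                     trigamma(a + b), trigamma(a + b) - trigamma(b)),
                   nrow = 2)

    # Newton update (Equation 41.5)
    theta <- theta - solve(hess, grad)
  }
  c(alpha = theta[1], beta = theta[2])
}

# Usage, once the humidity data are loaded as proportions in (0, 1):
# newton_beta_mle(humidity)
```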

Notice that the MLE for \(\beta\) is approximately \(1.63\), which is in between the integer values of \(1\) and \(2\) that we tried previously.

Finally, we plot the estimated beta distribution on top of the data.
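
A plotting sketch is shown below with simulated stand-in data (the actual humidity vector and fitted values are not reproduced here); in practice, `dat` would be the humidity data and `a`, `b` would be the MLEs from Newton's method.

```r
# Overlay a fitted beta density on a histogram of the data.
# Stand-in data and illustrative parameter values; replace with the real data and MLEs.
set.seed(1)
a <- 7
b <- 1.63
dat <- rbeta(365, shape1 = a, shape2 = b)  # stand-in for the humidity data

hist(dat, breaks = 30, freq = FALSE,
     main = "Daily average relative humidity", xlab = "proportion")
curve(dbeta(x, shape1 = a, shape2 = b), add = TRUE, lwd = 2)
```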

41.3 Connection between the Gamma and Beta Distributions

There is a deep connection between the gamma distribution introduced in Chapter 35 and the beta distribution. In particular, if \(X\) and \(Y\) are independent gamma random variables, then it turns out that \(\displaystyle \frac{X}{X + Y}\) (which is always a number between \(0\) and \(1\)) follows a beta distribution.

Proposition 41.3 (Gamma-beta connection) Let \(X\) be \(\text{Gamma}(\alpha, \lambda)\) and \(Y\) be \(\text{Gamma}(\beta, \lambda)\), independent of one another. Then, \[ T = \frac{X}{X + Y} \sim \text{Beta}(\alpha, \beta). \tag{41.6}\]

Moreover, \(T\) is independent of \(X + Y\).

Proof

Let \(S = X + Y\). We will determine, in order, the distributions of

  1. \(X | S\),
  2. \(T | S\), and
  3. \(T\).

First, we want to know the distribution of \(X | S\), but it is easier to determine the distribution of \(S | X\). Therefore, we should use Bayes’ rule for random variables (Proposition 26.3) to relate the two conditional distributions: \[ f_{X | S}(x | s) = \frac{f_{X}(x) f_{S | X}(s | x)}{f_S(s)}. \tag{41.7}\]

Looking at each term on the right-hand side of Equation 41.7:

  • The distribution of \(X\) is \(\text{Gamma}(\alpha, \lambda)\) by assumption.
  • The distribution of \(S\) is \(\text{Gamma}(\alpha + \beta, \lambda)\), as we determined in Equation 35.7.
  • The conditional distribution \(S|X\) can be derived as follows:
    • \((X + Y) | \{ X = x \}\) has the same distribution as \((x + Y) | \{ X = x \}\), and since \(X\) and \(Y\) are independent, the same distribution as \(x + Y\).
    • But \(x+Y\) is just a location transformation of \(Y\), so its PDF is \(f_{S|X}(s|x) = f_Y(s - x)\).

Substituting the corresponding PDFs into Equation 41.7, the conditional PDF of \(X | S\) is \[\begin{align} f_{X | S}(x | s) &= \frac{f_{X}(x) f_{Y}(s - x)}{f_S(s)} \\ &= \frac{\frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x}\frac{\lambda^\beta}{\Gamma(\beta)} (s - x)^{\beta - 1} e^{-\lambda (s - x)} }{\frac{\lambda^{\alpha+\beta}}{\Gamma(\alpha+\beta)} s^{\alpha + \beta - 1} e^{-\lambda s}} \\ &= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \left(\frac{x}{s}\right)^{\alpha - 1} \left(1 - \frac{x}{s}\right)^{\beta - 1} \frac{1}{s}; 0 < x < s. \end{align}\]

Next, we determine the distribution of \(T | S\) as follows:

  • Note that \(T = \frac{X}{S}\), and \(\frac{X}{S} | \{ S = s \}\) has the same distribution as \(\frac{X}{s} | \{S = s\}\).
  • But this is just a scale transformation of \(X\) (under its conditional distribution given \(S\)), so its PDF is \(f_{T|S}(t|s) = s f_{X|S}(st | s)\).

Substituting the known expression for \(f_{X|S}\) from above, we obtain the conditional PDF of \(T | S\): \[\begin{align} f_{T|S}(t | s) &= s \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} t^{\alpha - 1} \left(1 - t\right)^{\beta - 1} \frac{1}{s} \\ &= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} t^{\alpha - 1} \left(1 - t\right)^{\beta - 1}; 0 < t < 1. \end{align}\]

But this conditional PDF does not involve \(s\) at all! This implies that \(T\) is independent of \(S\), so this must also be the marginal PDF of \(T\). To be precise, we can use the Law of Total Probability to obtain the marginal PDF: \[ f_T(t) = \int_{-\infty}^\infty f_S(s) f_{T|S}(t|s)\,ds = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} t^{\alpha - 1} \left(1 - t\right)^{\beta - 1} \underbrace{\int_{-\infty}^\infty f_S(s) \,ds}_1, \] which we recognize from Equation 41.3 as the PDF of a \(\text{Beta}(\alpha, \beta)\) random variable.

Proposition 41.3 is a beautiful connection with a simple interpretation. If \(X\) and \(Y\) represent “times” spent doing two activities (e.g., doing homework and studying for exams), then \(\frac{X}{X + Y}\) represents the fraction of time in the first activity. If we model \(X\) and \(Y\) as independent gamma random variables, then Proposition 41.3 says that \(\frac{X}{X + Y}\) will be beta and, moreover, will be independent of \(X + Y\), the total time spent doing both activities.
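
A short simulation (with arbitrary parameter values) illustrates both claims of Proposition 41.3: the ratio is beta-distributed, and it is uncorrelated with the sum, consistent with independence.

```r
# Simulate independent gammas with a common rate and examine X / (X + Y).
set.seed(1)
alpha <- 3.5
beta <- 2
lambda <- 1.7

xg <- rgamma(100000, shape = alpha, rate = lambda)
yg <- rgamma(100000, shape = beta, rate = lambda)
ratio <- xg / (xg + yg)

cor(ratio, xg + yg)  # near 0, consistent with independence of T and X + Y

hist(ratio, breaks = 50, freq = FALSE, main = "X / (X + Y)", xlab = "t")
curve(dbeta(x, shape1 = alpha, shape2 = beta), add = TRUE, lwd = 2)
```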

Proposition 41.3 also allows for a charming alternative derivation of the expectation of the beta distribution.

Example 41.2 (Another proof of the beta expectation) Let \(T\) be \(\text{Beta}(\alpha, \beta)\). Then, by Proposition 41.3, we can write \[ T = \frac{X}{X + Y}, \] where \(X\) and \(Y\) are independent \(\text{Gamma}(\alpha, \lambda)\) and \(\text{Gamma}(\beta, \lambda)\), respectively.

Moreover, \(T\) is independent of \(S = X + Y\), so by Theorem 24.3, \[ \text{E}\!\left[ TS \right] = \text{E}\!\left[ T \right] \text{E}\!\left[ S \right]. \] Since \(TS = X\) and \(S = X + Y\), this says that \[ \text{E}\!\left[ X \right] = \text{E}\!\left[ T \right] \text{E}\!\left[ X + Y \right], \] and therefore \[ \text{E}\!\left[ T \right] = \frac{\text{E}\!\left[ X \right]}{\text{E}\!\left[ X + Y \right]}. \]

Since \(X\) and \(Y\) are both gamma random variables, we know their expectations by Proposition 35.2, so \[ \text{E}\!\left[ T \right] = \frac{\frac{\alpha}{\lambda}}{\frac{\alpha+\beta}{\lambda}} = \frac{\alpha}{\alpha + \beta}. \]

41.4 Exercises

Exercise 41.1 Suppose \(U \sim \textrm{Uniform}(a= 0, b= 1)\). Show that \(X = U^k\) follows a beta distribution for any \(k > 0\).

Exercise 41.2 Let \(X \sim \text{Beta}(\alpha,\beta)\). Show that the \(k\)th moment of \(X\) is \[ \text{E}\!\left[ X^k \right] = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+k)} \cdot \frac{\Gamma(\alpha+k)}{\Gamma(\alpha)}. \]

Exercise 41.3 Let \(U_1, \dots, U_n\) be i.i.d. \(\textrm{Uniform}(a= 0, b= 1)\). For a fixed integer \(1 \leq k \leq n\), show that the density of \(n U_{(k)}\) converges as \(n \to \infty\). Identify the limiting function.

Exercise 41.4 Suppose you have a coin minting machine, which generates a coin with heads probability \(p\), where \(p\) is chosen uniformly between \(0\) and \(1\). Suppose you mint a coin and toss it \(m+n\) times. What is the conditional distribution of \(p\) given that there were \(n\) heads?

Exercise 41.5 (Generalizing the gamma-beta connection) Suppose that \(X_1, \dots, X_n\) are independent \(\text{Gamma}(\alpha_i, \lambda)\) random variables. Determine the distribution of \(\displaystyle \frac{X_1}{X_1 + \dots + X_n}\), and show that it is independent of \(X_1 + \dots + X_n\).

Hint: You should be able to do this with minimal calculation if you use the result of Proposition 41.3 and properties of the gamma distribution.

Exercise 41.6 (Another proof of beta variance) Let \(X\) and \(Y\) be independent \(\text{Gamma}(\alpha, \lambda)\) and \(\text{Gamma}(\beta, \lambda)\) random variables.

  1. Argue that \[ \text{E}\!\left[ \frac{X^c}{(X + Y)^c} \right] = \frac{\text{E}\!\left[ X^c \right]}{\text{E}\!\left[ (X + Y)^c \right]} \] for all real numbers \(c\).
  2. Use this fact to rederive the variance of the \(\text{Beta}(\alpha, \beta)\) distribution.