41  Beta Distribution

In this chapter, we introduce a new named continuous distribution, the beta distribution. This distribution arises most naturally as the distribution of the order statistics of i.i.d. standard uniform random variables.

Let \(U_1, \dots, U_n\) be i.i.d. \(\textrm{Uniform}(a= 0, b= 1)\) random variables. What is the distribution of \(U_{(k)}\)? We already computed this in Example 40.2 using Proposition 40.1. The next proposition summarizes the conclusion.

Proposition 41.1 (Order statistics of the uniform) Let \(U_1, \dots, U_n\) be i.i.d. \(\textrm{Uniform}(a= 0, b= 1)\) random variables. The PDF of \(U_{(k)}\) is \[ f_{U_{(k)}}(t) = \frac{n!}{(k-1)! (n-k)!} t^{k-1} (1 - t)^{n-k}; 0 < t < 1. \tag{41.1}\]

If we let \(\alpha=k\) and \(\beta=n-k+1\), we can write Equation 41.1 as \[ f_{U_{(k)}}(t) = \frac{(\alpha + \beta - 1)!}{(\alpha - 1)! (\beta - 1)!} t^{\alpha-1} (1 - t)^{\beta-1}; 0 < t < 1. \tag{41.2}\]

The distribution of \(U_{(k)}\) is called the \(\textrm{Beta}(\alpha= k, \beta= n-k+1)\) distribution. Notice that the standard uniform distribution is a special case of the beta distribution when \(n = k = 1\) so that \(\alpha = \beta = 1\).
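
As a quick numerical check of Proposition 41.1, the short R sketch below simulates the \(k\)th order statistic of \(n\) standard uniforms and overlays the \(\text{Beta}(k, n-k+1)\) PDF; the values of \(n\), \(k\), and the number of replications are arbitrary illustrative choices.

```r
# Simulate U_(k) for n i.i.d. Uniform(0, 1) draws and compare its empirical
# density with the Beta(k, n - k + 1) PDF. (Illustrative parameter choices.)
set.seed(1)
n <- 5
k <- 2

u_k <- replicate(10000, sort(runif(n))[k])  # k-th order statistic, many times

hist(u_k, breaks = 50, freq = FALSE,
     main = "Second order statistic of 5 uniforms", xlab = "u")
curve(dbeta(x, shape1 = k, shape2 = n - k + 1), add = TRUE, lwd = 2)
```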

41.1 Generalizing the Beta Distribution

Because the beta distribution is supported on \((0, 1)\), it can be used to model data that is naturally a proportion, such as relative humidity. Suppose we want to model the (daily average) relative humidity in Austin, Texas using a beta distribution. The daily averages from 2023 are below. Some candidates are \(\text{Beta}(\alpha=7, \beta=1)\) (shown in blue) and \(\text{Beta}(\alpha=7, \beta=2)\) (shown in orange).

Unfortunately, \(\beta=2\) is a bit too far to the left, but \(\beta=1\) is too far to the right. Ideally, we would pick a value of \(\beta\) in between. Can we model the data using fractional values of \(\beta\) (and \(\alpha\))?

If we look at the PDF of the beta distribution (Equation 41.2), we see that \(\alpha\) and \(\beta\) appear inside factorials. We can generalize the beta distribution to non-integer values of \(\alpha\) and \(\beta\) by replacing the factorial by the gamma function (Definition 35.1), as we did for the gamma distribution in Definition 35.2.

Definition 41.1 (Beta distribution) A random variable \(X\) is said to follow a \(\text{Beta}(\alpha, \beta)\) distribution if its PDF is \[ f_X(x) = \begin{cases} \displaystyle\frac{\Gamma(\alpha +\beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} & 0 < x < 1\\ 0 & \text{otherwise} \end{cases}. \tag{41.3}\]
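
For a concrete check, Equation 41.3 can be compared against R's built-in `dbeta()` function for a non-integer choice of parameters; the parameter values below are arbitrary.

```r
# Compare Equation 41.3, written out directly, with R's dbeta().
alpha <- 7
beta <- 1.5
x <- seq(0.01, 0.99, by = 0.01)

f <- gamma(alpha + beta) / (gamma(alpha) * gamma(beta)) *
  x^(alpha - 1) * (1 - x)^(beta - 1)

max(abs(f - dbeta(x, shape1 = alpha, shape2 = beta)))  # should be essentially 0
```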

When \(\alpha\) and \(\beta\) are not integers, the beta distribution does not have a simple interpretation (unlike for integer values, where it can be interpreted as the order statistics of i.i.d. uniforms). Nevertheless, non-integer parameters afford us flexibility in modeling.

Next, we calculate formulas for the expected value and variance.

Proposition 41.2 (Expectation and variance of beta) Let \(X\) be a \(\text{Beta}(\alpha, \beta)\) random variable.

\[ \begin{aligned} \text{E}\!\left[ X \right] &= \frac{\alpha}{\alpha+\beta} & \text{Var}\!\left[ X \right] &= \frac{\alpha \beta}{(\alpha+\beta+1)(\alpha+\beta)^2}. \end{aligned} \]

We start with the first moment. \[\begin{align*} \text{E}\!\left[ X \right] &= \int_0^1 x \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1} \, dx \\ &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \int_0^1 x^\alpha (1-x)^{\beta - 1} \, dx \\ &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \cdot \frac{\Gamma(\alpha+1) \Gamma(\beta)}{\Gamma(\alpha + \beta + 1)} \int_0^1 \underbrace{\frac{\Gamma(\alpha+\beta+1)}{\Gamma(\alpha+1) \Gamma(\beta)} x^{(\alpha+1)-1} (1-x)^{\beta-1}}_{\text{PDF of $\text{Beta}(\alpha+1,\beta)$}} \, dx \\ &= \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+1)} \\ &= \frac{\alpha}{\alpha+\beta}. \end{align*}\]

Similarly, we compute the second moment. \[\begin{align*} \text{E}\!\left[ X^2 \right] &= \int_0^1 x^2 \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1-x)^{\beta-1} \, dx \\ &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \int_0^1 x^{\alpha+1} (1-x)^{\beta - 1} \, dx \\ &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \cdot \frac{\Gamma(\alpha+2) \Gamma(\beta)}{\Gamma(\alpha+\beta+2)} \int_0^1 \underbrace{\frac{\Gamma(\alpha+\beta+2)}{\Gamma(\alpha+2) \Gamma(\beta)} x^{(\alpha+2)-1} (1-x)^{\beta-1}}_{\text{PDF of $\text{Beta}(\alpha+2,\beta)$}} \, dx \\ &= \frac{\Gamma(\alpha+2)}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+2)} \\ &= \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}. \end{align*}\]

Hence, the variance is \[ \text{Var}\!\left[ X \right] = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)} - \left( \frac{\alpha}{\alpha+\beta} \right)^2 = \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}. \]
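
A Monte Carlo simulation provides a quick check of these formulas; the parameter values and sample size below are arbitrary.

```r
# Check the mean and variance formulas in Proposition 41.2 by simulation.
set.seed(1)
alpha <- 2.5
beta <- 4
x <- rbeta(100000, shape1 = alpha, shape2 = beta)

c(simulated = mean(x), formula = alpha / (alpha + beta))
c(simulated = var(x),
  formula = alpha * beta / ((alpha + beta + 1) * (alpha + beta)^2))
```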

The support of the beta distribution is \((0, 1)\), so a beta random variable that is shifted or scaled no longer follows a beta distribution. Nevertheless, location-scale transformations of the beta distribution can be used to model random variables with bounded support. For example, to model a continuous random variable \(Y\) that only takes on values in the interval \((a, b)\), we could write \[ Y = (b - a) X + a, \] where \(X\) is a beta random variable. Note that \(Y\) itself does not follow a beta distribution, since its support is \((a, b)\) rather than \((0, 1)\).
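
As a small illustration (with made-up endpoints and parameters), a variable supported on \((a, b)\) can be simulated by rescaling beta draws:

```r
# Simulate a variable supported on (a, b) = (50, 100) by rescaling Beta(7, 2) draws.
set.seed(1)
a <- 50
b <- 100
x <- rbeta(10000, shape1 = 7, shape2 = 2)  # supported on (0, 1)
y <- (b - a) * x + a                       # supported on (a, b)
range(y)
```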

41.2 Maximum Likelihood Estimation

Now that we can model data using the beta distribution with non-integer values of \(\alpha\) and \(\beta\), the question is how to find the best values of \(\alpha\) and \(\beta\). The answer, of course, is maximum likelihood, just as in Section 35.2.

The likelihood for an i.i.d. sample \(X_1, \dots, X_n\) from a \(\text{Beta}(\alpha,\beta)\) distribution is \[ L(\alpha, \beta) = \prod_{i=1}^n \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} X_i^{\alpha - 1} (1 - X_i)^{\beta - 1}, \] so the log-likelihood is \[ \ell(\alpha, \beta) = n\log\Gamma(\alpha+\beta) - n\log\Gamma(\alpha) - n\log\Gamma(\beta) + (\alpha - 1)\sum_{i=1}^n \log X_i + (\beta - 1) \sum_{i=1}^n \log(1 - X_i).\]
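
A direct transcription of this log-likelihood into R (a sketch, using `lgamma()` for \(\log\Gamma\); the data vector `x` is any sample of values strictly between 0 and 1) looks like this:

```r
# Log-likelihood of a Beta(alpha, beta) sample, transcribed from the formula above.
loglik_beta <- function(alpha, beta, x) {
  n <- length(x)
  n * lgamma(alpha + beta) - n * lgamma(alpha) - n * lgamma(beta) +
    (alpha - 1) * sum(log(x)) + (beta - 1) * sum(log(1 - x))
}

# Example on simulated data: evaluate the log-likelihood at the true parameters.
set.seed(1)
x <- rbeta(100, shape1 = 7, shape2 = 2)
loglik_beta(7, 2, x)
```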

Equivalently, to eliminate extraneous factors of \(n\), we can instead maximize \[ J(\alpha, \beta) = \frac{1}{n} \ell(\alpha, \beta). \]

Recalling the digamma function \(\psi(\alpha) \overset{\text{def}}{=}\frac{d}{d\alpha} \log \Gamma(\alpha)\) from Example 35.1, we can calculate the gradient of \(J\) to obtain the first-order conditions: \[ \begin{align} \frac{\partial J}{\partial\alpha} &= \psi(\alpha+\beta) - \psi(\alpha) + \frac{1}{n}\sum_{i=1}^n \log X_i = 0 \\ \frac{\partial J}{\partial\beta} &= \psi(\alpha+\beta) - \psi(\beta) + \frac{1}{n}\sum_{i=1}^n \log (1 - X_i) = 0 \\ \end{align} \tag{41.4}\]

Solving these equations by hand is hopeless because there are \(\alpha\)s and \(\beta\)s nested inside digamma functions. However, for any given data set, we can find the optimal \(\alpha\) and \(\beta\) numerically using Newton’s method, as we did in Example 35.1.

Example 41.1 (Newton’s method for maximum likelihood) Since we are optimizing over two variables, \(\alpha\) and \(\beta\), we need a multivariable version of Equation 35.10. The multivariable analog is \[ \begin{pmatrix} \alpha \\ \beta \end{pmatrix} \leftarrow \begin{pmatrix} \alpha \\ \beta \end{pmatrix} - \Big[J''(\alpha, \beta)\Big]^{-1} J'(\alpha, \beta), \tag{41.5}\] where \(J'\) is the gradient vector of first (partial) derivatives and \(J''\) is the Hessian matrix of second (partial) derivatives.

We already calculated the gradient in Equation 41.4. That is, the gradient vector is \[ J'(\alpha, \beta) = \begin{pmatrix} \psi(\alpha+\beta) - \psi(\alpha) + \frac{1}{n}\sum_{i=1}^n \log X_i \\ \psi(\alpha+\beta) - \psi(\beta) + \frac{1}{n}\sum_{i=1}^n \log (1 - X_i) \end{pmatrix} \]

To calculate the Hessian, we need the second partial derivatives:

\[\begin{align} \frac{\partial^2 J}{\partial\alpha^2} &= \psi'(\alpha+\beta) - \psi'(\alpha) \\ \frac{\partial^2 J}{\partial\alpha \partial\beta} = \frac{\partial^2 J}{\partial\beta \partial\alpha} &= \psi'(\alpha+\beta) \\ \frac{\partial^2 J}{\partial\beta^2} &= \psi'(\alpha+\beta) - \psi'(\beta) \\ \end{align}\]

So the Hessian is the matrix \[ \big[J''(\alpha, \beta)\big] = \begin{bmatrix} \psi'(\alpha+\beta) - \psi'(\alpha) & \psi'(\alpha+\beta) \\ \psi'(\alpha+\beta) & \psi'(\alpha+\beta) - \psi'(\beta) \end{bmatrix} \]

Now, we implement Newton’s method on the humidity data from above in R.
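
A minimal sketch of this implementation is below. It assumes the daily humidity values, as proportions in \((0, 1)\), are stored in a vector named `humidity` (a placeholder name), and the starting values and tolerance are illustrative choices.

```r
# Newton's method for the beta MLE (Equation 41.5), using digamma() and
# trigamma() for the digamma function and its derivative.
newton_beta_mle <- function(x, alpha = 1, beta = 1, tol = 1e-10, max_iter = 100) {
  mean_log_x  <- mean(log(x))      # (1/n) sum log X_i
  mean_log_1x <- mean(log(1 - x))  # (1/n) sum log(1 - X_i)

  theta <- c(alpha, beta)
  for (i in 1:max_iter) {
    a <- theta[1]
    b <- theta[2]

    # gradient of J (Equation 41.4)
    grad <- c(digamma(a + b) - digamma(a) + mean_log_x,
              digamma(a + b) - digamma(b) + mean_log_1x)
    if (max(abs(grad)) < tol) break

    # Hessian of J
    hess <- matrix(c(trigamma(a + b) - trigamma(a), trigamma(a + b),
                     trigamma(a + b), trigamma(a + b) - trigamma(b)),
                   nrow = 2)

    # Newton update (Equation 41.5)
    theta <- theta - solve(hess, grad)
  }
  c(alpha = theta[1], beta = theta[2])
}

# Usage, once the humidity data are loaded as proportions in (0, 1):
# newton_beta_mle(humidity)
```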

Notice that the MLE for \(\beta\) is approximately \(1.63\), which is in between the integer values of \(1\) and \(2\) that we tried previously.

Finally, we plot the estimated beta distribution on top of the data.
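
A plotting sketch is shown below with simulated stand-in data (the actual humidity vector and fitted values are not reproduced here); in practice, `dat` would be the humidity data and `a`, `b` would be the MLEs from Newton's method.

```r
# Overlay a fitted beta density on a histogram of the data.
# Stand-in data and illustrative parameter values; replace with the real data and MLEs.
set.seed(1)
a <- 7
b <- 1.63
dat <- rbeta(365, shape1 = a, shape2 = b)  # stand-in for the humidity data

hist(dat, breaks = 30, freq = FALSE,
     main = "Daily average relative humidity", xlab = "proportion")
curve(dbeta(x, shape1 = a, shape2 = b), add = TRUE, lwd = 2)
```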

41.3 Connection between the Gamma and Beta Distributions

There is a deep connection between the gamma distribution introduced in Chapter 35 and the beta distribution. In particular, if \(X\) and \(Y\) are independent gamma random variables, then it turns out that \(\displaystyle \frac{X}{X + Y}\) (which is always a number between \(0\) and \(1\)) follows a beta distribution.

Proposition 41.3 (Gamma-beta connection) Let \(X\) be \(\text{Gamma}(\alpha, \lambda)\) and \(Y\) be \(\text{Gamma}(\beta, \lambda)\), independent of one another. Then, \[ T = \frac{X}{X + Y} \sim \text{Beta}(\alpha, \beta). \tag{41.6}\]

Moreover, \(T\) is independent of \(X + Y\).

Proof

Let \(S = X + Y\). We will determine, in order, the distributions of

  1. \(X | S\),
  2. \(T | S\), and
  3. \(T\).

First, we want to know the distribution of \(X | S\), but it is easier to determine the distribution of \(S | X\). Therefore, we should use Bayes’ rule for random variables (Proposition 26.3) to relate the two conditional distributions: \[ f_{X | S}(x | s) = \frac{f_{X}(x) f_{S | X}(s | x)}{f_S(s)}. \tag{41.7}\]

Looking at each term on the right-hand side of Equation 41.7:

  • The distribution of \(X\) is \(\text{Gamma}(\alpha, \lambda)\) by assumption.
  • The distribution of \(S\) is \(\text{Gamma}(\alpha + \beta, \lambda)\), as we determined in Equation 35.7.
  • The conditional distribution \(S|X\) can be derived as follows:
    • \((X + Y) | \{ X = x \}\) has the same distribution as \((x + Y) | \{ X = x \}\), and since \(X\) and \(Y\) are independent, the same distribution as \(x + Y\).
    • But \(x+Y\) is just a location transformation of \(Y\), so its PDF is \(f_{S|X}(s|x) = f_Y(s - x)\).

Substituting the corresponding PDFs into Equation 41.7, the conditional PDF of \(X | S\) is \[\begin{align} f_{X | S}(x | s) &= \frac{f_{X}(x) f_{Y}(s - x)}{f_S(s)} \\ &= \frac{\frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x}\frac{\lambda^\beta}{\Gamma(\beta)} (s - x)^{\beta - 1} e^{-\lambda (s - x)} }{\frac{\lambda^{\alpha+\beta}}{\Gamma(\alpha+\beta)} s^{\alpha + \beta - 1} e^{-\lambda s}} \\ &= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \left(\frac{x}{s}\right)^{\alpha - 1} \left(1 - \frac{x}{s}\right)^{\beta - 1} \frac{1}{s}; 0 < x < s. \end{align}\]

Next, we determine the distribution of \(T | S\) as follows:

  • Note that \(T = \frac{X}{S}\), and \(\frac{X}{S} | \{ S = s \}\) has the same distribution as \(\frac{X}{s} | \{S = s\}\).
  • But this is just a scale transformation of \(X\) (under its conditional distribution given \(S\)), so its PDF is \(f_{T|S}(t|s) = s f_{X|S}(st | s)\).

Substituting the known expression for \(f_{X|S}\) from above, we obtain the conditional PDF of \(T | S\): \[\begin{align} f_{T|S}(t | s) &= s \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} t^{\alpha - 1} \left(1 - t\right)^{\beta - 1} \frac{1}{s} \\ &= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} t^{\alpha - 1} \left(1 - t\right)^{\beta - 1}; 0 < t < 1. \end{align}\]

But this conditional PDF does not involve \(s\) at all! This implies that \(T\) is independent of \(S\), so this must also be the marginal PDF of \(T\). To be precise, we can use the Law of Total Probability to obtain the marginal PDF: \[ f_T(t) = \int_{-\infty}^\infty f_S(s) f_{T|S}(t|s)\,ds = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} t^{\alpha - 1} \left(1 - t\right)^{\beta - 1} \underbrace{\int_{-\infty}^\infty f_S(s) \,ds}_1, \] which we recognize from Equation 41.3 as the PDF of a \(\text{Beta}(\alpha, \beta)\) random variable.

Proposition 41.3 is a beautiful connection with a simple interpretation. If \(X\) and \(Y\) represent “times” spent doing two activities (e.g., doing homework and studying for exams), then \(\frac{X}{X + Y}\) represents the fraction of time in the first activity. If we model \(X\) and \(Y\) as independent gamma random variables, then Proposition 41.3 says that \(\frac{X}{X + Y}\) will be beta and, moreover, will be independent of \(X + Y\), the total time spent doing both activities.
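
A short simulation (with arbitrary parameter values) illustrates both claims of Proposition 41.3: the ratio is beta-distributed, and it is uncorrelated with the sum, consistent with independence.

```r
# Simulate independent gammas with a common rate and examine X / (X + Y).
set.seed(1)
alpha <- 3.5
beta <- 2
lambda <- 1.7

xg <- rgamma(100000, shape = alpha, rate = lambda)
yg <- rgamma(100000, shape = beta, rate = lambda)
ratio <- xg / (xg + yg)

cor(ratio, xg + yg)  # near 0, consistent with independence of T and X + Y

hist(ratio, breaks = 50, freq = FALSE, main = "X / (X + Y)", xlab = "t")
curve(dbeta(x, shape1 = alpha, shape2 = beta), add = TRUE, lwd = 2)
```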

Proposition 41.3 also allows for a charming alternative derivation of the expectation of the beta distribution.

Example 41.2 (Another proof of the beta expectation) Let \(T\) be \(\text{Beta}(\alpha, \beta)\). Then, by Proposition 41.3, we can write \[ T = \frac{X}{X + Y}, \] where \(X\) and \(Y\) are independent \(\text{Gamma}(\alpha, \lambda)\) and \(\text{Gamma}(\beta, \lambda)\), respectively.

Moreover, \(T\) is independent of \(S = X + Y\), so by Theorem 24.3, \[ \text{E}\!\left[ TS \right] = \text{E}\!\left[ T \right] \text{E}\!\left[ S \right]. \] Since \(TS = X\) and \(S = X + Y\), this says that \[ \text{E}\!\left[ X \right] = \text{E}\!\left[ T \right] \text{E}\!\left[ X + Y \right], \] and therefore \[ \text{E}\!\left[ T \right] = \frac{\text{E}\!\left[ X \right]}{\text{E}\!\left[ X + Y \right]}. \]

Since \(X\) and \(Y\) are both gamma random variables, we know their expectations by Proposition 35.2, so \[ \text{E}\!\left[ T \right] = \frac{\frac{\alpha}{\lambda}}{\frac{\alpha+\beta}{\lambda}} = \frac{\alpha}{\alpha + \beta}. \]

41.4 Exercises

Exercise 41.1 Suppose \(U \sim \textrm{Uniform}(a= 0, b= 1)\). Show that \(X = U^k\) follows a beta distribution for any \(k > 0\).

Exercise 41.2 Let \(X \sim \text{Beta}(\alpha,\beta)\). Show that the \(k\)th moment of \(X\) is \[ \text{E}\!\left[ X^k \right] = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+k)} \cdot \frac{\Gamma(\alpha+k)}{\Gamma(\alpha)}. \]

Exercise 41.3 Let \(U_1, \dots, U_n\) be i.i.d. \(\textrm{Uniform}(a= 0, b= 1)\). For a fixed integer \(1 \leq k \leq n\), show that the density of \(n U_{(k)}\) converges as \(n \to \infty\). Identify the limiting function.

Exercise 41.4 Suppose you have a coin minting machine, which generates a coin with heads probability \(p\), where \(p\) is chosen uniformly between \(0\) and \(1\). Suppose you mint a coin and toss it \(m+n\) times. What is the conditional distribution of \(p\) given that there were \(n\) heads?

Exercise 41.5 (Generalizing the gamma-beta connection) Suppose that \(X_1, \dots, X_n\) are independent \(\text{Gamma}(\alpha_i, \lambda)\) random variables. Determine the distribution of \(\displaystyle \frac{X_1}{X_1 + \dots + X_n}\), and show that it is independent of \(X_1 + \dots + X_n\).

Hint: You should be able to do this with minimal calculation if you use the result of Proposition 41.3 and properties of the gamma distribution.

Exercise 41.6 (Another proof of beta variance) Let \(X\) and \(Y\) be independent \(\text{Gamma}(\alpha, \lambda)\) and \(\text{Gamma}(\beta, \lambda)\) random variables.

  1. Argue that \[ \text{E}\!\left[ \frac{X^c}{(X + Y)^c} \right] = \frac{\text{E}\!\left[ X^c \right]}{\text{E}\!\left[ (X + Y)^c \right]} \] for all real numbers \(c\).
  2. Use this fact to rederive the variance of the \(\text{Beta}(\alpha, \beta)\) distribution.