29 The Calculus of Maximum Likelihood

In the last chapter, we saw how the MLE enables us to learn about the unknown data generating distribution from the observed data. As we saw, however, finding the MLE can involve somewhat lengthy computations.

In this chapter, we present a standard strategy for finding MLEs that often greatly simplifies the required computation. By the end of the chapter, you’ll be well equipped to compute MLEs in a wide variety of problems!

29.1 The Log Likelihood

At the end of Chapter 28, we showed how to compute the MLE for \(p\) after observing a random variable \(X\) with a \(\text{Binomial}(n, p)\) distribution. Because \(p \in [0,1]\) has a continuous range of possible values and the likelihood

\[ L_{X}(p) = \binom{n}{X} p^X (1 - p)^{n - X} \]

is a differentiable function of \(p\), we did this by taking the derivative of the likelihood with respect to \(p\) and setting it equal to zero. This computation, however, was a bit nastier than it needed to be. Like most likelihoods we will encounter, the likelihood \(L_{X}(p)\) is the product of terms involving the parameter \(p\). Taking the derivative of the likelihood with respect to \(p\) therefore requires the product rule, which can be cumbersome and messy. If we instead take the log of the likelihood,

\[\begin{align*} \ell_X(p) &= \log(L_{X}(p))\\ &= \log\left(\binom{n}{X} p^X (1 - p)^{n - X}\right) \\ &= \log \binom{n}{X} + X \log p + (n-X) \log(1-p) \end{align*}\]

the product turns into a sum. The log of the likelihood is aptly named the log-likelihood.

Definition 29.1 (Log-likelihood) Suppose we observe data \(X\) that has PMF (or PDF) \(f_\theta(x)\) for some unknown parameter(s) \(\theta\). The log-likelihood of \(\theta\) is defined as \[ \ell_{X}(\theta) = \log(L_{X}(\theta)) = \log(f_\theta(X)) \tag{29.1}\]

where \(L_X(\theta)\) is the likelihood of \(\theta\) from Definition 28.1.

Because \(\log(\cdot)\) is a monotone function (larger values of \(y\) result in larger values of \(\log(y)\)), the same \(\theta\) that maximizes the log-likelihood also maximizes the likelihood. Since the derivative of a sum is just the sum of the derivatives, finding the maximizer of the log-likelihood is usually computationally easier.
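To make this concrete, here is a minimal sketch in Python (the observation of \(X = 7\) successes in \(n = 10\) trials is hypothetical) confirming that the likelihood and log-likelihood peak at the same \(p\):

```python
import math

n, X = 10, 7  # hypothetical data: 7 successes in 10 trials

def likelihood(p):
    return math.comb(n, X) * p**X * (1 - p)**(n - X)

def log_likelihood(p):
    return math.log(math.comb(n, X)) + X * math.log(p) + (n - X) * math.log(1 - p)

# Maximize both over the same fine grid of p values.
grid = [i / 1000 for i in range(1, 1000)]
p_hat_L = max(grid, key=likelihood)
p_hat_ll = max(grid, key=log_likelihood)

print(p_hat_L, p_hat_ll)  # both 0.7, i.e. X / n
```

Because \(\log\) is monotone, the two grid searches must agree; this is exactly the fact that lets us maximize whichever function is more convenient.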

Example 29.1 redoes our MLE computation for the binomial likelihood, but this time maximizes the log-likelihood. Although the computation is similarly lengthy, it is cleaner, and each individual step is very simple.

Example 29.1 (Binomial MLE via the log-likelihood) Suppose we observe data \(X\) which has a \(\text{Bin}(n, p)\) distribution. What is the MLE for \(p\)?

We will find the MLE by finding the \(p\) that maximizes the log-likelihood

\[ \ell_X(p) = \log \binom{n}{X} + X \log p + (n-X) \log(1-p). \]

To find the \(p\) that maximizes \(\ell_X(p)\), we first find the derivative of \(\ell_X(p)\).

\[\begin{align*} &\frac{d}{dp} \ell_{X}(p) \\ &= \frac{d}{dp} \left( \log \binom{n}{X} + X \log p + (n-X) \log(1-p) \right) & \text{(definition of log-likelihood)} \\ &= \frac{d}{dp} \log \binom{n}{X} + X \cdot \frac{d}{dp} \log p + (n-X) \cdot \frac{d}{dp} \log(1-p) & \text{(linearity of differentiation)}\\ &= \frac{X}{p} - \frac{n-X}{1-p} & \text{(evaluate derivatives)} \end{align*}\]

The MLE \(\hat{p}\), which is the maximizing value of \(\ell_X(p)\), sets this derivative to zero.

\[\begin{align*} &\frac{X}{\hat{p}} - \frac{n-X}{1-\hat{p}} = 0 & \text{(set derivative to zero)} \\ &\implies (1-\hat{p})X - \hat{p}(n-X) = 0 & \text{(multiply both sides by } \hat{p}(1-\hat{p}) \text{)}\\ &\implies X -n\hat{p} = 0 & \text{(simplify)}\\ &\implies \hat{p} = \frac{X}{n} & \text{(solve for } \hat{p} \text{)} \end{align*}\]

Example 29.2 provides another example of how working with log-likelihoods makes it easy to find MLEs.

Example 29.2 (Geometric MLE via the log-likelihood) Your friend claims to be an excellent free-throw shooter. You’re a bit skeptical, so they tell you that the last time they practiced their free throws, they made seven in a row before missing their eighth. Assuming the free throws are independent and your friend had the same probability of making each one, what’s a good estimate for their probability \(p\) of making a free throw?

Because your friend shot free throws until missing, the (random) number of free throws \(X\) they took has a \(\text{Geometric}(1 - p)\) distribution (Definition 9.2). To get an estimate, we find the MLE of \(p\) for the observed data \(X = \textcolor{blue}{8}\). The likelihood function is given by

\[ L_{\textcolor{blue}{8}}(p) = p^{\textcolor{blue}{8} - 1}(1-p). \]

Correspondingly, the log-likelihood function is \[ \ell_{\textcolor{blue}{8}}(p) = (\textcolor{blue}{8} -1) \log p + \log(1-p). \]

Identical computations to those in Example 29.1 show that the derivative of \(\ell_X(p)\) is given by

\[ \frac{d}{dp} \ell_{X}(p) = \frac{(\textcolor{blue}{8} -1)}{p} - \frac{1}{1-p} \]

The MLE \(\hat{p}\), which is the maximizing value of \(\ell_X(p)\), sets this derivative to zero. Following the computations in Example 29.1, we find that

\[ \hat{p} = \frac{\textcolor{blue}{8} -1}{\textcolor{blue}{8}} = \frac{7}{8} \]

In general, if your friend’s first miss happens on freethrow \(X\), the MLE estimate of \(p\) would be

\[\hat{p} = \frac{X - 1}{X} = 1 - \frac{1}{X}. \]
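As a quick numerical sanity check, the following Python sketch (standard library only) maximizes the log-likelihood from this example over a fine grid:

```python
import math

X = 8  # first miss came on the 8th free throw

def log_likelihood(p):
    # (X - 1) makes, each with probability p, then one miss
    return (X - 1) * math.log(p) + math.log(1 - p)

grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=log_likelihood)

print(p_hat)  # 0.875 = 7/8
```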

The MLE for \(p\) in Example 29.2 is the number of made free throws over the total number of free throws, very similar to the binomial case. In fact, if our friend told us that they made seven out of eight free throws instead of telling us that they made seven free throws before missing one, the MLE for \(p\) would remain the same. The data generating distribution in these two cases, however, is not the same (one is binomial while the other is geometric).

Example 29.3 gives another example where it’s easier to work with the log-likelihood in place of the likelihood. It also clearly illustrates an important point: when working with continuous random variables, likelihoods and probabilities are not the same.

Example 29.3 (Normal MLE via the log-likelihood) Suppose \(X\) has a normal distribution (Definition 22.4) with unknown mean \(\mu\) and standard deviation \(\sigma = \textcolor{red}{0.3}\). If we observe \(X = \textcolor{blue}{3}\), what is the MLE for \(\mu\)?

The likelihood function is given by \[ L_{\textcolor{blue}{3}}(\mu) = \frac{1}{\textcolor{red}{0.3} \cdot \sqrt{ 2 \pi}} e^{-(\textcolor{blue}{3} - \mu)^2/(2 \cdot \textcolor{red}{0.3}^2 ) }. \]
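To locate the peak numerically, here is a minimal sketch in Python (the grid endpoints, 2 and 4, are arbitrary choices; with a plotting library one could also draw the curve):

```python
import math

x_obs, sigma = 3.0, 0.3  # observed X = 3, known sigma = 0.3

def likelihood(mu):
    # normal PDF with mean mu, evaluated at the observed point
    return math.exp(-(x_obs - mu) ** 2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

mus = [2 + i / 100 for i in range(201)]  # candidate means from 2 to 4
mu_hat = max(mus, key=likelihood)

print(mu_hat)  # 3.0, the observed value
```

Notice that `likelihood(mu_hat)` comes out greater than one; more on this below.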

The likelihood of \(\mu\) is maximized at \(\mu = \textcolor{blue}{3}\), our observed value of \(X\). Given that the normal PDF achieves its peak at the mean, this shouldn’t be too surprising.

Let’s now explicitly show that the MLE is \(\hat{\mu}= \textcolor{blue}{3}\). Taking a log eliminates the \(e\) in the likelihood, and the log-likelihood

\[ \ell_{\textcolor{blue}{3}}(\mu) = -\frac{1}{2}\log(\textcolor{red}{0.3}^2) -\frac{1}{2}\log(2\pi) - \frac{1}{2\cdot \textcolor{red}{0.3}^2 }(\textcolor{blue}{3} - \mu)^2 \]

is much easier to study. It is clear that the log-likelihood achieves its maximum value of \(-\frac{1}{2}\log(\textcolor{red}{0.3}^2) -\frac{1}{2}\log(2\pi)\) when we plug in \(\mu = \textcolor{blue}{3}\). If we plug in any other value, the term \((\textcolor{blue}{3} - \mu)^2\) will be positive and the overall log-likelihood will decrease. Hence, the MLE must be \(\hat{\mu}= 3\).

In general, our argument implies that, regardless of what value of \(X\) we observe, the MLE is given by \(\hat{\mu} = X\).

Likelihoods are not probabilities

When we are working with continuous random variables, likelihoods are [not]{.underline} probabilities. In this problem, for example, the likelihood of \(\mu=3\)

\[ L_{\textcolor{blue}{3}}(3) \approx 1.33 \]

is greater than one! It’s best to think of likelihoods and probabilities as two conceptually distinct ideas. The former is used to infer something about the unknown data generating distribution, while the latter is the chance of something happening under some (often known) data generating distribution.

29.2 Estimation with Independent and Identically Distributed Random Variables

Working with log-likelihoods is particularly handy whenever we observe independent and identically distributed random variables, as defined below.

Definition 29.2 (Independent and identically distributed) We say that random variables \(X_1, \dots, X_n\) are independent and identically distributed (i.i.d) when they are independent and have the same distribution (i.e., they all have the same PMF or PDF).

When random variables \(X_1, \dots, X_n\) are i.i.d, they all have the same PMF (or PDF) \(f_{\theta}(x)\) and, due to independence, their joint PMF (or PDF) factors (recall Definition 13.2 and Definition 23.2):

\[ f_{\theta}(X_1, \dots, X_n) = f_{\theta}(X_1) \times \dots \times f_{\theta}(X_n). \]

The likelihood of \(\theta\) for the observed data \(X_1, \dots, X_n\) is hence always a big product, \[ L_{X_1, \dots, X_n}(\theta) = f_{\theta}(X_1) \times \dots \times f_{\theta}(X_n), \]

and it is almost always easier to work with the log-likelihood, \[ \ell_{X_1, \dots, X_n}(\theta) = \log(f_{\theta}(X_1)) + \dots + \log(f_{\theta}(X_n)), \] which is a sum. Because the subscript on the likelihood starts to become cumbersome when we observe \(n\) datapoints, we omit it from here on out and simply write the likelihood and log-likelihood as \(L(\theta)\) and \(\ell(\theta)\), respectively.
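The product-to-sum relationship is easy to verify numerically; in the Python sketch below, the density values standing in for \(f_\theta(X_i)\) are made-up numbers:

```python
import math

# Made-up density values f_theta(X_1), ..., f_theta(X_5)
fs = [0.42, 0.17, 0.88, 0.05, 0.61]

L = math.prod(fs)                  # likelihood: a product
ll = sum(math.log(f) for f in fs)  # log-likelihood: a sum

print(math.log(L), ll)  # equal (up to rounding)
```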

Our next example generalizes the earlier Example 29.3 by supposing we observe many i.i.d normal samples whose mean and variance are both unknown.

Example 29.4 (Normal MLE via the log-likelihood) Suppose \(X_1, \dots, X_n\) are i.i.d random variables that have a normal distribution (Definition 22.4) with unknown mean \(\mu\) and also unknown variance \(\sigma^2\). What are the MLEs for \(\mu\) and \(\sigma^2\)?

Because of independence, the likelihood for a \(\mu\) and \(\sigma^2\) pair is given by the product

\[\begin{align*} L(\mu, \sigma^2) &= f_{\mu, \sigma}(X_1, \dots, X_n) & \text{(definition of likelihood)}\\ &= \prod_{i=1}^n f_{\mu, \sigma}(X_i) & \text{(independence)}\\ \end{align*}\] where \(f_{\mu, \sigma}\) is the PDF of a normal random variable with mean \(\mu\) and variance \(\sigma^2\).

As usual, rather than deal with the product, we instead examine the log-likelihood. Using our above computations and our earlier work from Example 29.3, we find that the log-likelihood is

\[ \ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) -\frac{n}{2}\log \sigma^2 - \frac{1}{2 \sigma^2 } \sum_{i=1}^n (X_i - \mu)^2. \]

Our computations from Example 29.3 imply that

\[ \log f_{\mu, \sigma}(X_i) = \left(-\frac{1}{2}\log(2\pi) -\frac{1}{2}\log(\sigma^2) - \frac{1}{2 \sigma^2 }(X_i - \mu)^2 \right) \]

Using this fact, we can write down the log-likelihood.

\[\begin{align*} \ell(\mu, \sigma^2) &= \log L(\mu, \sigma^2) & \text{(definition of log-likelihood)}\\ &= \log \left( \prod_{i=1}^n f_{\mu, \sigma}(X_i)\right) & \text{(definition of likelihood)}\\ &= \sum_{i=1}^n \log f_{\mu, \sigma}(X_i) & \text{(properties of log)}\\ &= \sum_{i=1}^n \left( -\frac{1}{2}\log(2\pi) -\frac{1}{2}\log \sigma^2 - \frac{1}{2 \sigma^2 }(X_i - \mu)^2 \right) & \text{(earlier computation)}\\ &= -\frac{n}{2}\log(2\pi) -\frac{n}{2}\log \sigma^2 - \frac{1}{2 \sigma^2 } \sum_{i=1}^n (X_i - \mu)^2 & \text{(simplify)} \end{align*}\]

The MLEs for \(\mu\) and \(\sigma^2\) are the pair that jointly maximize this log-likelihood. Hence, they must set both the partial derivatives to zero:

\[ \frac{\partial}{\partial \mu} \ell(\hat{\mu}, \hat{\sigma}^2) = 0 \]

\[ \frac{\partial}{\partial \sigma^2} \ell(\hat{\mu}, \hat{\sigma}^2) = 0 \]

We compute these partial derivatives one at a time.

MLE of \(\mu\): We start with the derivative with respect to \(\mu\). This derivative is given by

\[ \frac{\partial}{\partial \mu} \ell(\mu, \sigma^2) = \frac{1}{\sigma^2} \left( (\sum_{i=1}^n X_i) - n\mu \right) \]

\[\begin{align*} &\frac{\partial}{\partial \mu} \ell(\mu, \sigma^2) \\ &= \frac{\partial}{\partial \mu} \left( -\frac{n}{2}\log(2\pi) -\frac{n}{2}\log \sigma^2 - \frac{1}{2 \sigma^2 } \sum_{i=1}^n (X_i - \mu)^2 \right) & \text{(earlier computation)} \\ &= - \frac{\partial}{\partial \mu}\frac{n}{2}\log(2\pi) - \frac{\partial}{\partial \mu} \frac{n}{2}\log \sigma^2 - \frac{1}{2 \sigma^2 } \sum_{i=1}^n \frac{\partial}{\partial \mu} (X_i - \mu)^2 & \text{(linearity of derivative)} \\ &= \frac{1}{\sigma^2} \sum_{i=1}^n (X_i - \mu) & \text{(evaluate derivatives)} \\ &= \frac{1}{\sigma^2} \left( (\sum_{i=1}^n X_i) - n\mu \right) & \text{(simplify)} \\ \end{align*}\]

The MLEs \(\hat{\mu}\) and \(\hat{\sigma}^2\) must set this derivative to zero, and therefore satisfy

\[\begin{align*} &\frac{1}{\hat{\sigma}^2} \left( (\sum_{i=1}^n X_i) - n \hat{\mu} \right) = 0 & \text{(set derivative to zero)}\\ &\implies \sum_{i=1}^n X_i - n \hat{\mu} = 0 & \text{(multiply both sides by $\hat{\sigma}^2$)} \\ &\implies \hat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i. & \text{(solve for $\hat{\mu}$)} \end{align*}\]

Essentially, this derivative condition determines the MLE for \(\mu\), but tells us nothing about the MLE for \(\sigma^2\).

MLE of \(\sigma^2\): To find the MLE for \(\sigma^2\), we take the partial derivative with respect to \(\sigma^2\). This derivative is given by

\[ \frac{\partial}{\partial \sigma^2} \ell(\mu, \sigma^2) = \frac{1}{2\sigma^4} \left( \sum_{i=1}^n (X_i - \mu)^2 - n\sigma^2 \right) \]

\[\begin{align*} &\frac{\partial}{\partial \sigma^2} \ell(\mu, \sigma^2) \\ &= \frac{\partial}{\partial \sigma^2} \left( -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log \sigma^2 - \frac{1}{2 \sigma^2 } \sum_{i=1}^n (X_i - \mu)^2 \right) & \text{(earlier computation)} \\ &= - \frac{\partial}{\partial \sigma^2}\frac{n}{2}\log(2\pi) - \frac{\partial}{\partial \sigma^2} \frac{n}{2}\log \sigma^2 - \frac{\partial}{\partial \sigma^2} \frac{1}{2 \sigma^2} \sum_{i=1}^n (X_i - \mu)^2 & \text{(linearity of derivative)} \\ &= - \frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^n (X_i - \mu)^2 & \text{(evaluate derivatives)} \\ &= \frac{1}{2\sigma^4} \left( \sum_{i=1}^n (X_i - \mu)^2 - n\sigma^2 \right) & \text{(simplify)} \\ \end{align*}\]

The MLEs \(\hat{\mu}\) and \(\hat{\sigma}^2\) must also set this derivative to zero, and therefore also satisfy

\[\begin{align*} & \frac{1}{2\hat{\sigma}^4} \left( \sum_{i=1}^n (X_i - \hat{\mu})^2 - n\hat{\sigma}^2 \right) = 0 & \text{(set derivative to zero)} \\ & \iff \sum_{i=1}^n (X_i - \hat{\mu})^2 - n\hat{\sigma}^2 = 0 & \text{(multiply both sides by $2\hat{\sigma}^4$)} \\ & \iff \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \hat{\mu})^2 & \text{(solve for \(\hat{\sigma}^2\))}. \end{align*}\]

In the end we find that the MLEs are

\[ \hat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i, \]

\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \hat{\mu})^2. \]
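As a sanity check, this Python sketch plugs a small made-up sample into the formulas and verifies that perturbing either estimate only lowers the log-likelihood:

```python
import math

xs = [2.1, 2.9, 3.4, 2.5, 3.1]  # hypothetical sample
n = len(xs)

mu_hat = sum(xs) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n

def log_likelihood(mu, sigma2):
    return (-n / 2 * math.log(2 * math.pi) - n / 2 * math.log(sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

best = log_likelihood(mu_hat, sigma2_hat)
# Any nearby (mu, sigma2) pair should do strictly worse.
for dmu in (-0.1, 0.1):
    for dsig in (-0.05, 0.05):
        assert log_likelihood(mu_hat + dmu, sigma2_hat + dsig) < best
```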

Our next example involves finding the MLE when we observe i.i.d exponential random variables.

Example 29.5 (Exponential scale MLE via the log-likelihood) Suppose we observe i.i.d random variables \(X_1, \dots, X_n\) that have an \(\text{Exponential}(\lambda)\) distribution (Definition 22.2). What is the MLE for \(\lambda\)?

This time, we’ll simply start by computing the log-likelihood

\[\begin{align*} \ell(\lambda) &= \sum_{i=1}^n \log(f_{\lambda}(X_i)) & \text{(definition of log-likelihood)}\\ &= \sum_{i=1}^n \log(\lambda e^{-\lambda X_i}) & \text{(exponential density)}\\ &= \sum_{i=1}^n \left( \log(\lambda) -\lambda X_i \right) & \text{ (log properties) }\\ &= n \log(\lambda) - \lambda \sum_{i=1}^n X_i. \end{align*}\]

The MLE \(\hat{\lambda}\) sets the derivative of the log-likelihood to zero. The derivative of the log-likelihood is

\[ \frac{d}{d\lambda} \ell(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^{n} X_i. \]

Setting it equal to zero and solving for \(\lambda\) yields

\[ \hat{\lambda} = \frac{n}{\sum_{i=1}^{n} X_i}. \]
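A quick grid search in Python (with hypothetical waiting times) confirms the closed-form answer:

```python
import math

xs = [0.8, 1.5, 0.3, 2.2, 1.1]  # hypothetical observations
n = len(xs)

def log_likelihood(lam):
    return n * math.log(lam) - lam * sum(xs)

lam_hat = n / sum(xs)  # closed-form MLE

grid = [i / 1000 for i in range(1, 5000)]
lam_grid = max(grid, key=log_likelihood)

print(lam_hat, lam_grid)  # agree to grid resolution
```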

Our last example also involves i.i.d random variables, but it considers a case where working with the log-likelihood isn’t necessary. In fact, by thinking carefully about the problem, we can get away with doing very little computation at all. Before taking logs and derivatives, always check if there’s a simpler way to maximize the likelihood!

Example 29.6 (Exponential location MLE) Suppose we observe i.i.d random variables \(X_1, \dots, X_n\) that follow an \(\text{Exponential}(\lambda)\) distribution shifted by a location parameter \(\mu\). The PDF of each \(X_i\) is given by:

\[ f_{\mu, \lambda}(x) = \begin{cases} \lambda e^{-\lambda (x - \mu)}, & x \geq \mu, \\ 0, & x < \mu. \end{cases} \]

Essentially, the distribution of each \(X_i\) is the same as the distribution of \(Y_i + \mu\), where \(Y_i\) has an \(\text{Exponential}(\lambda)\) distribution. What is the MLE for \(\mu\)?

To gain some intuition, we imagine observing \(X_i = 1\), fix \(\lambda=1\), and plot the density \(f_{\mu, \lambda}(x)\) for various values of \(\mu\), paying particular attention to its value at \(X_i = 1\).
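In Python, that check might look like the following sketch, which evaluates the shifted-exponential density at \(X_i = 1\) for several values of \(\mu\) (the values of `x_i` and `lam` are free to vary):

```python
import math

x_i, lam = 1.0, 1.0  # observed value and rate; try varying these

def density(mu, x):
    # shifted-exponential PDF: zero whenever x falls below mu
    return lam * math.exp(-lam * (x - mu)) if x >= mu else 0.0

for mu in [0.0, 0.5, 0.9, 1.0, 1.1]:
    print(mu, density(mu, x_i))
```

The density at \(X_i\) grows as \(\mu\) approaches \(X_i\) from below, then drops to zero once \(\mu\) passes it.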

Visually, we see that \(f_{\mu, \lambda}(X_i)\) continues to increase as we increase \(\mu\), at least until we increase \(\mu\) past \(X_i\), at which point \(f_{\mu, \lambda}(X_i)\) becomes zero. By varying the observed value \(X_i\) and the rate parameter \(\lambda\) in the above code, you can check that this remains the case more generally. Still, we ask you to formally verify our claim in Exercise 29.1.

The likelihood,

\[ L(\mu, \lambda) = \prod_{i=1}^n f_{\mu, \lambda}(X_i), \] will hence continue to increase as we increase \(\mu\), at least until \(\mu\) exceeds one of the \(X_i\), at which point the likelihood will become zero. Therefore, the value of \(\mu\) that maximizes the likelihood must be \[ \hat{\mu} = \min(X_1, X_2, \dots, X_n). \]
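A grid search in Python (with made-up observations) agrees:

```python
import math

xs = [1.4, 0.9, 2.3, 1.7]  # hypothetical observations
lam = 1.0

def likelihood(mu):
    # zero as soon as mu exceeds any observation
    if any(x < mu for x in xs):
        return 0.0
    return lam ** len(xs) * math.prod(math.exp(-lam * (x - mu)) for x in xs)

grid = [i / 100 for i in range(300)]
mu_hat = max(grid, key=likelihood)

print(mu_hat)  # 0.9 = min(xs)
```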

29.3 Exercises

Exercise 29.1 Coming Soon!