37  Functions of Means

In Example 29.5, we found that the MLE of the rate parameter \(\lambda\) is \[ \hat\lambda = \frac{n}{\sum_{i=1}^n X_i} = \frac{1}{\bar X}. \]

In other words, \(\hat\lambda\) is of the form \(g(\bar X)\), where \(g\) is a continuous function (at least over the support of \(\bar X\)). We know that \(\bar X\) is unbiased and consistent for the mean parameter \(\mu = \text{E}\!\left[ X_i \right]\), as well as asymptotically normal. What can we say about estimators like \(g(\bar X)\)?

37.1 Jensen’s Inequality

First of all, \(g(\bar X)\) is not unbiased for \(g(\mu)\) in general. To evaluate \(\text{E}\!\left[ g(\bar X) \right]\), we cannot simply pass the expectation through \(g\); that is, \[ \text{E}\!\left[ g(\bar X) \right] \neq g(\text{E}\!\left[ \bar X \right]). \] The proper way to evaluate \(\text{E}\!\left[ g(\bar X) \right]\) is to use LOTUS (Theorem 21.1). However, this requires knowing the exact distribution of \(\bar X\).

Fortunately, there is a way to determine the direction of the bias without evaluating \(\text{E}\!\left[ g(\bar X) \right]\) directly when \(g\) is a convex function.

Definition 37.1 (Convex function) A function \(g(x)\) is called convex if \(g''(x) \geq 0\) for all \(x\), or equivalently if it always lies above its tangent line: \[ g(x) \geq g(x_0) + g'(x_0) (x - x_0) \tag{37.1}\] for all \(x_0\). See Figure 37.1.

Figure 37.1: A convex function always lies above its tangent lines.

On the other hand, a function \(g(x)\) is concave if \(g''(x) \leq 0\) for all \(x\) or if it always lies below its tangent lines. Note that if \(g(x)\) is concave, then \(-g(x)\) is convex.

Jensen’s inequality describes how expectation behaves under a convex transformation.

Theorem 37.1 (Jensen’s inequality) Let \(X\) be a random variable with \(\mu = \text{E}\!\left[ X \right] < \infty\), and let \(g(x)\) be a convex function on the support of \(X\). That is, \(g''(x) \geq 0\) for all \(x\) in the support of \(X\). Then

\[ \text{E}\!\left[ g(X) \right] \geq g(\text{E}\!\left[ X \right]). \tag{37.2}\]

Moreover, equality holds only if \(g\) is a linear function of \(X\) (with probability \(1\))—that is, only if there exist \(a\) and \(b\) such that \(g(X) = aX + b\). If \(g\) is not linear, then the inequality is strict: \[ \text{E}\!\left[ g(X) \right] > g(\text{E}\!\left[ X \right]). \tag{37.3}\]

Proof

From Equation 37.1, we know that \[ g(x) \geq g(\mu) + g'(\mu) (x - \mu) \] for all \(x\). In particular, we can plug in the random variable \(X\) for \(x\) and take expectations to obtain \[ \text{E}\!\left[ g(X) \right] \geq g(\mu) + g'(\mu) \underbrace{\text{E}\!\left[ X - \mu \right]}_0 = g(\mu), \tag{37.4}\] as we wished to show.

Now, suppose that equality holds in Equation 37.2. From Equation 37.4, we know that this means the (non-negative) random variable \(Y \overset{\text{def}}{=}g(X) - g(\mu) - g'(\mu) (X - \mu)\) has expectation \(\text{E}\!\left[ Y \right] = 0\). But the only way for a non-negative random variable \(Y \geq 0\) to have expectation zero is if \(Y = 0\) (with probability 1). Therefore, \[ g(X) = g(\mu) + g'(\mu)(X - \mu), \] so \(g\) is linear in \(X\) with \(a = g'(\mu)\) and \(b = g(\mu) - g'(\mu) \mu\).

There is also a geometric proof of Jensen’s inequality (Sun 2021).

Here is a simple application of Theorem 37.1. The function \(g(x) = x^2\) is convex. Therefore, Jensen’s inequality tells us that for any random variable \(X\), \[ \text{E}\!\left[ X^2 \right] \geq \text{E}\!\left[ X \right]^2. \] We can rearrange this inequality to obtain \[ \text{E}\!\left[ X^2 \right] - \text{E}\!\left[ X \right]^2 \geq 0. \] But the left-hand side is just the shortcut formula for \(\text{Var}\!\left[ X \right]\) (Proposition 11.3). Jensen’s inequality in this case simply restates the well-known fact that variance is non-negative.
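As a quick sanity check, a short simulation can illustrate the direction of this inequality. The sketch below estimates both sides of Jensen’s inequality for \(g(x) = x^2\) by Monte Carlo; the \(\text{Exponential}(1.5)\) distribution and the simulation size are arbitrary choices for illustration.

# Monte Carlo check of Jensen's inequality for the convex function g(x) = x^2.
# The Exponential(1.5) distribution is an arbitrary choice for illustration.
x <- rexp(100000, rate=1.5)
mean(x^2)   # estimates E[X^2]
mean(x)^2   # estimates E[X]^2; should come out smaller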

Armed with Jensen’s inequality, we can easily determine the direction of the bias of \(\hat\lambda\).

Example 37.1 (Bias of the Exponential MLE) Let \(X_1, \dots, X_n\) be i.i.d. \(\text{Exponential}(\lambda)\). We showed in Example 29.5 that the MLE of \(\lambda\) is \[ \hat\lambda = \frac{1}{\bar X}. \]

Since \(g(x) = 1/x\) is a convex function for \(x > 0\) (its second derivative is \(g''(x) = 2/x^3 > 0\) there), and \(x > 0\) is the support of \(\bar X\), we can apply Jensen’s inequality. Moreover, since \(g(x)\) is not linear, the inequality will be strict.

The expectation of the MLE is \[ \text{E}\!\left[ \hat\lambda \right] = \text{E}\!\left[ \frac{1}{\bar X} \right] > \frac{1}{\text{E}\!\left[ \bar X \right]} = \frac{1}{1 / \lambda} = \lambda, \] so the MLE has positive bias. That is, it tends to overestimate \(\lambda\).
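This bias can be seen in a short simulation. The sketch below approximates \(\text{E}\!\left[ \hat\lambda \right]\) by averaging the MLE over many simulated samples; \(n = 5\) and \(\lambda = 1.5\) are arbitrary choices, made because the bias is easiest to see when \(n\) is small.

# Approximate E[lambda-hat] by averaging the MLE over many simulated samples.
# n = 5 and lambda = 1.5 are arbitrary choices for illustration.
estimates <- replicate(10000, {
  x <- rexp(5, rate=1.5)
  1 / mean(x)
})
mean(estimates)   # comes out noticeably above the true value 1.5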

37.2 Continuous Mapping Theorem

Although \(g(\bar X)\) is not, in general, unbiased for \(g(\mu)\), it is still consistent. The key result needed to establish consistency is the following.

Theorem 37.2 (Continuous Mapping Theorem) Let \(X_1, \dots, X_n\) be i.i.d. with mean \(\text{E}\!\left[ X_1 \right] = \mu\), and let \(g\) be a continuous function. Then, \[ g( \bar{X} ) \stackrel{p}{\to} g(\mu). \]

The usual proof is a real analysis exercise, which is not only beyond the scope of this book but also unenlightening.

We offer a heuristic argument that provides more insight. We will assume that \(g\) is not only continuous but also differentiable, admitting a Taylor expansion around \(\mu\): \[ g(\bar X) = g(\mu) + g'(\mu) (\bar X - \mu) + \frac{g''(\mu)}{2} (\bar X - \mu)^2 + \dots. \] Every term after the first contains at least one factor of \((\bar X - \mu)\), which converges in probability to \(0\) by Theorem 32.1, so \[ g(\bar X) \stackrel{p}{\to} g(\mu). \] Note that we used the fact that if \(A_n \stackrel{p}{\to} a\) and \(B_n \stackrel{p}{\to} b\), then \(A_n + B_n \stackrel{p}{\to} a + b\) and \(A_n B_n \stackrel{p}{\to} ab\).

Theorem 37.2 is useful for proving consistency of estimators.

Example 37.2 (Consistency of the Exponential MLE) By Theorem 37.2, the MLE \(\hat\lambda = \frac{1}{\bar X}\) converges in probability to \(\frac{1}{\mu} = \frac{1}{1 / \lambda} = \lambda\), so \(\hat\lambda\) is consistent for \(\lambda\).

Consistency means that the estimate should be very close to the truth when \(n\) is large. The code below simulates the sampling distribution of the MLE \(\hat\lambda\) when \(n = 1600\) and \(\lambda = 1.5\).

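Here is one way to carry out the simulation (a sketch; the settings match the code at the end of this chapter).

# Simulate the sampling distribution of the MLE lambda-hat with n = 1600, lambda = 1.5
estimates <- replicate(10000, {
  x <- rexp(1600, rate=1.5)   # one sample of size n = 1600 from Exponential(1.5)
  1 / mean(x)                 # the MLE for this sample
})
hist(estimates, breaks=30, freq=FALSE)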
The estimates nearly always come out to within \(0.1\) of the true value, \(\lambda = 1.5\), as predicted by consistency. Perhaps surprisingly, the sampling distribution also appears to be approximately normal. We explore this phenomenon in the next section.

37.3 Delta Method

Even though \(\bar X\) is approximately normal by the Central Limit Theorem (Theorem 36.2), it may come as a surprise that \(g(\bar X)\) is also approximately normal. We have seen examples where \(Z\) was a normal random variable, but \(g(Z)\) was very far from normal. The difference here is that \(\bar X\) is not only approximately normal; it is also consistent for \(\mu\). These two facts conspire to ensure that \(g(\bar X)\) is also approximately normal.

In this section, we derive the asymptotic distribution of \(g(\bar X)\). To fully appreciate the proof, we will need the following technical lemma, which is optional reading.

Theorem 37.3 (Slutsky’s Theorem) If \(Y_n \stackrel{d}{\to} Y\) and \(A_n \stackrel{p}{\to} a\), then

  • \(Y_n + A_n \stackrel{d}{\to} Y + a\) and
  • \(A_n Y_n \stackrel{d}{\to} aY\).

To derive the asymptotic distribution of \(g(\bar X)\), we essentially do a first-order Taylor expansion: \[ g(\bar X) \approx g(\mu) + g'(\mu) (\bar X - \mu). \] Now, the right-hand side is just a linear transformation of \(\bar X\), which is approximately normal by the Central Limit Theorem (Theorem 36.2). Since linear transformations of a normal distribution are still normal (Definition 22.4), \(g(\bar X)\) should also be approximately normally distributed. This recipe is called the delta method.
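To get a feel for how accurate this linearization is, the sketch below compares \(g(\bar X) = 1/\bar X\) with its first-order approximation \(g(\mu) + g'(\mu) (\bar X - \mu)\) on simulated exponential data; \(n = 1600\) and \(\lambda = 1.5\) match the running example but are otherwise arbitrary.

# Compare g(X-bar) = 1/X-bar with its first-order Taylor approximation around
# mu = 1/lambda, using the running example's settings (n = 1600, lambda = 1.5).
lambda <- 1.5
mu <- 1 / lambda
xbar <- replicate(10000, mean(rexp(1600, rate=lambda)))
exact  <- 1 / xbar                            # g(X-bar)
linear <- 1/mu + (-1/mu^2) * (xbar - mu)      # g(mu) + g'(mu) (X-bar - mu)
max(abs(exact - linear))                      # small relative to the spread of the estimates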

Theorem 37.4 (Delta Method) Let \(X_1, \dots, X_n\) be i.i.d. with mean \(\text{E}\!\left[ X_1 \right] = \mu\) and variance \(\text{Var}\!\left[ X_1 \right] = \sigma^2\). Let \(g\) be a differentiable function with \(g'(\mu) \neq 0\). Then, \[ \sqrt{n} (g(\bar{X}) - g(\mu)) \stackrel{d}{\to} \text{Normal}(0, g'(\mu)^2 \sigma^2). \tag{37.5}\]

Proof

We will assume that \(g\) admits a Taylor expansion around \(\mu\): \[ g(\bar X) = g(\mu) + g'(\mu) (\bar X - \mu) + \frac{g''(\mu)}{2} (\bar X - \mu)^2 + \dots. \]

Rearranging terms and multiplying by \(\sqrt{n}\), we obtain \[ \sqrt{n} (g(\bar X) - g(\mu)) = g'(\mu) \sqrt{n} (\bar X - \mu) + \frac{g''(\mu)}{2} (\bar X - \mu) \sqrt{n} (\bar X - \mu) + \dots. \]

The first term on the right-hand side converges in distribution to \(\text{Normal}(0, g'(\mu)^2 \sigma^2)\), since

  • The Central Limit Theorem (Theorem 36.2) says that \(\sqrt{n} (\bar X - \mu) \stackrel{d}{\to} \text{Normal}(0, \sigma^2)\).
  • Multiplying by the constant \(g'(\mu)\) multiplies the variance by \(g'(\mu)^2\). (This can be seen as a trivial application of Theorem 37.3.)

The remaining terms on the right-hand side all converge to \(0\) because they

  • not only contain a factor of \(Y_n = \sqrt{n} (\bar X - \mu)\), which converges in distribution to a normal distribution,
  • but also contain additional factors of \(A_n = \bar X - \mu\), which converge in probability to \(0\).

By Theorem 37.3, their product \(A_n Y_n\) converges in distribution to \(0 \cdot Y = 0\). Because the limit is a constant, this is the same as converging in probability to \(0\).

This leaves only the first term on the right-hand side, which is exactly the limiting distribution in Equation 37.5.

Armed with the delta method, we can determine the asymptotic distribution of the exponential MLE.

Example 37.3 (Asymptotic Distribution of the Exponential MLE) For the exponential distribution, \(\mu = \frac{1}{\lambda}\) and \(\sigma^2 = \frac{1}{\lambda^2}\). Since the MLE is \(\hat\lambda = g(\bar X) = \frac{1}{\bar X}\), where \(g(x) = 1/x\), we have \(g'(\mu) = -\frac{1}{\mu^2} = -\lambda^2\).

We can use the delta method (Equation 37.5) to conclude that \[ \sqrt{n}(g(\bar X) - g(\mu)) = \sqrt{n}(\hat\lambda - \lambda) \] is approximately \[ \text{Normal}\!\left(0, \lambda^4 \cdot \frac{1}{\lambda^2}\right) = \text{Normal}(0, \lambda^2). \]

Rearranging terms, we see that the distribution of \(\hat\lambda\) is approximately \[ \text{Normal}(\lambda, \frac{\lambda^2}{n}). \]

Let us add this normal curve to the simulation from earlier, where \(n=1600\) and \(\lambda=1.5\). It is a very good approximation!

estimates <- replicate(10000, {
  x <- rexp(1600, rate=1.5)   # one sample of size n = 1600 from Exponential(1.5)
  1 / mean(x)                 # the MLE for this sample
})

# Histogram of the simulated sampling distribution (on the density scale),
# with the delta method's Normal(1.5, 1.5^2 / 1600) approximation overlaid
hist(estimates, breaks=30, freq=FALSE)
curve(dnorm(x, mean=1.5, sd=sqrt(1.5^2 / 1600)),
      col="red", add=TRUE)