37  Functions of Means

In Example 29.5, we found that the MLE of the rate parameter \(\lambda\) is \[ \hat\lambda = \frac{n}{\sum_{i=1}^n X_i} = \frac{1}{\bar X}. \]

In other words, \(\hat\lambda\) is of the form \(g(\bar X)\), where \(g\) is a continuous function (at least over the support of \(\bar X\)). We know that \(\bar X\) is unbiased and consistent for the mean parameter \(\mu = \text{E}\!\left[ X_i \right]\), as well as asymptotically normal. What can we say about estimators like \(g(\bar X)\)?

37.1 Jensen’s Inequality

First of all, \(g(\bar X)\) is not unbiased for \(g(\mu)\) in general. To evaluate \(\text{E}\!\left[ g(\bar X) \right]\), we cannot simply pass the expectation through \(g\); that is, \[ \text{E}\!\left[ g(\bar X) \right] \neq g(\text{E}\!\left[ \bar X \right]). \] The proper way to evaluate \(\text{E}\!\left[ g(\bar X) \right]\) is to use LOTUS (Theorem 21.1). However, this requires knowing the exact distribution of \(\bar X\).

Fortunately, there is a way to determine the direction of the bias without evaluating \(\text{E}\!\left[ g(\bar X) \right]\) directly when \(g\) is a convex function.

Definition 37.1 (Convex function) A function \(g(x)\) is called convex if \(g''(x) \geq 0\) for all \(x\), or equivalently if it always lies above its tangent line: \[ g(x) \geq g(x_0) + g'(x_0) (x - x_0) \tag{37.1}\] for all \(x_0\). See Figure 37.1.

Figure 37.1: A convex function always lies above its tangent lines.

On the other hand, a function \(g(x)\) is concave if \(g''(x) \leq 0\) for all \(x\) or if it always lies below its tangent lines. Note that if \(g(x)\) is concave, then \(-g(x)\) is convex.

Jensen’s inequality describes how expectation behaves under a convex transformation.

Theorem 37.1 (Jensen’s inequality) Let \(X\) be a random variable with \(\mu = \text{E}\!\left[ X \right] < \infty\), and let \(g(x)\) be a convex function on the support of \(X\). That is, \(g''(x) \geq 0\) for all \(x\) in the support of \(X\). Then

\[ \text{E}\!\left[ g(X) \right] \geq g(\text{E}\!\left[ X \right]). \tag{37.2}\]

Moreover, equality holds only if \(g\) is a linear function of \(X\) (with probability \(1\))—that is, only if there exist \(a\) and \(b\) such that \(g(X) = aX + b\). If \(g\) is not linear, then the inequality is strict: \[ \text{E}\!\left[ g(X) \right] > g(\text{E}\!\left[ X \right]). \tag{37.3}\]

Proof

From Equation 37.1, we know that \[ g(x) \geq g(\mu) + g'(\mu) (x - \mu) \] for all \(x\). In particular, we can plug in the random variable \(X\) for \(x\) and take expectations to obtain \[ \text{E}\!\left[ g(X) \right] \geq g(\mu) + g'(\mu) \underbrace{\text{E}\!\left[ X - \mu \right]}_0 = g(\mu), \tag{37.4}\] as we wished to show.

Now, suppose that equality holds in Equation 37.2. From Equation 37.4, we know that this means the (non-negative) random variable \(Y \overset{\text{def}}{=}g(X) - g(\mu) - g'(\mu) (X - \mu)\) has expectation \(\text{E}\!\left[ Y \right] = 0\). But the only way for a non-negative random variable \(Y \geq 0\) to have expectation zero is if \(Y = 0\) (with probability 1). Therefore, \[ g(X) = g(\mu) + g'(\mu)(X - \mu), \] so \(g\) is linear in \(X\) with \(a = g'(\mu)\) and \(b = g(\mu) - g'(\mu) \mu\).

There is also a geometric proof of Jensen’s inequality (Sun 2021).

Here is a simple application of Theorem 37.1. The function \(g(x) = x^2\) is convex. Therefore, Jensen’s inequality tells us that for any random variable \(X\), \[ \text{E}\!\left[ X^2 \right] \geq \text{E}\!\left[ X \right]^2. \] We can rearrange this inequality to obtain \[ \text{E}\!\left[ X^2 \right] - \text{E}\!\left[ X \right]^2 \geq 0. \] But the left-hand side is just the shortcut formula for \(\text{Var}\!\left[ X \right]\) (Proposition 11.3). Jensen’s inequality in this case simply restates the well-known fact that variance is non-negative.
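As a quick sanity check, a short simulation can illustrate the direction of this inequality. The sketch below estimates both sides of Jensen’s inequality for \(g(x) = x^2\) by Monte Carlo; the \(\text{Exponential}(1.5)\) distribution and the simulation size are arbitrary choices for illustration.

# Monte Carlo check of Jensen's inequality for the convex function g(x) = x^2.
# The Exponential(1.5) distribution is an arbitrary choice for illustration.
x <- rexp(100000, rate=1.5)
mean(x^2)   # estimates E[X^2]
mean(x)^2   # estimates E[X]^2; should come out smaller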

Armed with Jensen’s inequality, we can easily determine the direction of the bias of \(\hat\lambda\).

Example 37.1 (Bias of the Exponential MLE) Let \(X_1, \dots, X_n\) be i.i.d. \(\text{Exponential}(\lambda)\). We showed in Example 29.5 that the MLE of \(\lambda\) is \[ \hat\lambda = \frac{1}{\bar X}. \]

Since \(g(x) = 1/x\) is a convex function for \(x > 0\) (its second derivative is \(g''(x) = 2/x^3 > 0\) there), and \(x > 0\) is the support of \(\bar X\), we can apply Jensen’s inequality. Moreover, since \(g(x)\) is not linear, the inequality will be strict.

The expectation of the MLE is \[ \text{E}\!\left[ \hat\lambda \right] = \text{E}\!\left[ \frac{1}{\bar X} \right] > \frac{1}{\text{E}\!\left[ \bar X \right]} = \frac{1}{1 / \lambda} = \lambda, \] so the MLE has positive bias. That is, it tends to overestimate \(\lambda\).
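This bias can be seen in a short simulation. The sketch below approximates \(\text{E}\!\left[ \hat\lambda \right]\) by averaging the MLE over many simulated samples; \(n = 5\) and \(\lambda = 1.5\) are arbitrary choices, made because the bias is easiest to see when \(n\) is small.

# Approximate E[lambda-hat] by averaging the MLE over many simulated samples.
# n = 5 and lambda = 1.5 are arbitrary choices for illustration.
estimates <- replicate(10000, {
  x <- rexp(5, rate=1.5)
  1 / mean(x)
})
mean(estimates)   # comes out noticeably above the true value 1.5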

37.2 Continuous Mapping Theorem

Although \(g(\bar X)\) is not, in general, unbiased for \(g(\mu)\), it is still consistent. The key result needed to establish consistency is the following.

Theorem 37.2 (Continuous Mapping Theorem) Let \(X_1, \dots, X_n\) be i.i.d. with mean \(\text{E}\!\left[ X_1 \right] = \mu\), and let \(g\) be a continuous function. Then, \[ g( \bar{X} ) \stackrel{p}{\to} g(\mu). \]

The usual proof is a real analysis exercise, which is not only beyond the scope of this book but also unenlightening.

We offer a heuristic argument that provides more insight. We will assume that \(g\) is not only continuous but also differentiable, admitting a Taylor expansion around \(\mu\): \[ g(\bar X) = g(\mu) + g'(\mu) (\bar X - \mu) + \frac{g''(\mu)}{2} (\bar X - \mu)^2 + \dots. \] Every term after the first contains at least one factor of \((\bar X - \mu)\), which converges in probability to \(0\) by Theorem 32.1, so \[ g(\bar X) \stackrel{p}{\to} g(\mu). \] Note that we used the fact that if \(A_n \stackrel{p}{\to} a\) and \(B_n \stackrel{p}{\to} b\), then \(A_n + B_n \stackrel{p}{\to} a + b\) and \(A_n B_n \stackrel{p}{\to} ab\).

Theorem 37.2 is useful for proving consistency of estimators.

Example 37.2 (Consistency of the Exponential MLE) By Theorem 37.2, the MLE \(\hat\lambda = \frac{1}{\bar X}\) converges in probability to \(\frac{1}{\mu} = \frac{1}{1 / \lambda} = \lambda\), so \(\hat\lambda\) is consistent for \(\lambda\).

Consistency means that the estimate should be very close to the truth when \(n\) is large. The code below simulates the sampling distribution of the MLE \(\hat\lambda\) when \(n = 1600\) and \(\lambda = 1.5\).

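Here is one way to carry out the simulation (a sketch; the settings match the code at the end of this chapter).

# Simulate the sampling distribution of the MLE lambda-hat with n = 1600, lambda = 1.5
estimates <- replicate(10000, {
  x <- rexp(1600, rate=1.5)   # one sample of size n = 1600 from Exponential(1.5)
  1 / mean(x)                 # the MLE for this sample
})
hist(estimates, breaks=30, freq=FALSE)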
The estimates nearly always come out to within \(0.1\) of the true value, \(\lambda = 1.5\), as predicted by consistency. Perhaps surprisingly, the sampling distribution also appears to be approximately normal. We explore this phenomenon in the next section.

37.3 Delta Method

Even though \(\bar X\) is approximately normal by the Central Limit Theorem (Theorem 36.2), it may come as a surprise that \(g(\bar X)\) is also approximately normal. We have seen examples where \(Z\) was a normal random variable, but \(g(Z)\) was very far from normal. The difference here is that \(\bar X\) is not only approximately normal; it is also consistent for \(\mu\). These two facts conspire to ensure that \(g(\bar X)\) is also approximately normal.

In this section, we derive the asymptotic distribution of \(g(\bar X)\). To fully appreciate the proof, we will need the following technical lemma, which is optional reading.

Theorem 37.3 (Slutsky’s Theorem) If \(Y_n \stackrel{d}{\to} Y\) and \(A_n \stackrel{p}{\to} a\), then

  • \(Y_n + A_n \stackrel{d}{\to} Y + a\) and
  • \(A_n Y_n \stackrel{d}{\to} aY\).

To derive the asymptotic distribution of \(g(\bar X)\), we essentially do a first-order Taylor expansion: \[ g(\bar X) \approx g(\mu) + g'(\mu) (\bar X - \mu). \] Now, the right-hand side is just a linear transformation of \(\bar X\), which is approximately normal by the Central Limit Theorem (Theorem 36.2). Since linear transformations of a normal distribution are still normal (Definition 22.4), \(g(\bar X)\) should also be approximately normally distributed. This recipe is called the delta method.
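To get a feel for how accurate this linearization is, the sketch below compares \(g(\bar X) = 1/\bar X\) with its first-order approximation \(g(\mu) + g'(\mu) (\bar X - \mu)\) on simulated exponential data; \(n = 1600\) and \(\lambda = 1.5\) match the running example but are otherwise arbitrary.

# Compare g(X-bar) = 1/X-bar with its first-order Taylor approximation around
# mu = 1/lambda, using the running example's settings (n = 1600, lambda = 1.5).
lambda <- 1.5
mu <- 1 / lambda
xbar <- replicate(10000, mean(rexp(1600, rate=lambda)))
exact  <- 1 / xbar                            # g(X-bar)
linear <- 1/mu + (-1/mu^2) * (xbar - mu)      # g(mu) + g'(mu) (X-bar - mu)
max(abs(exact - linear))                      # small relative to the spread of the estimates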

Theorem 37.4 (Delta Method) Let \(X_1, \dots, X_n\) be i.i.d. with mean \(\text{E}\!\left[ X_1 \right] = \mu\) and variance \(\text{Var}\!\left[ X_1 \right] = \sigma^2\). Let \(g\) be a differentiable function with \(g'(\mu) \neq 0\). Then, \[ \sqrt{n} (g(\bar{X}) - g(\mu)) \stackrel{d}{\to} \text{Normal}(0, g'(\mu)^2 \sigma^2). \tag{37.5}\]

Proof

We will assume that \(g\) admits a Taylor expansion around \(\mu\): \[ g(\bar X) = g(\mu) + g'(\mu) (\bar X - \mu) + \frac{g''(\mu)}{2} (\bar X - \mu)^2 + \dots. \]

Rearranging terms and multiplying by \(\sqrt{n}\), we obtain \[ \sqrt{n} (g(\bar X) - g(\mu)) = g'(\mu) \sqrt{n} (\bar X - \mu) + \frac{g''(\mu)}{2} (\bar X - \mu) \sqrt{n} (\bar X - \mu) + \dots. \]

The first term on the right-hand side converges in distribution to \(\text{Normal}(0, g'(\mu)^2 \sigma^2)\), since

  • The Central Limit Theorem (Theorem 36.2) says that \(\sqrt{n} (\bar X - \mu) \stackrel{d}{\to} \text{Normal}(0, \sigma^2)\).
  • Multiplying by the constant \(g'(\mu)\) multiplies the variance by \(g'(\mu)^2\). (This can be seen as a trivial application of Theorem 37.3.)

The remaining terms on the right-hand side all converge to \(0\) because they

  • not only contain a factor of \(Y_n = \sqrt{n} (\bar X - \mu)\), which converges in distribution to a normal distribution,
  • but also contain additional factors of \(A_n = \bar X - \mu\), which converge in probability to \(0\).

By Theorem 37.3, their product \(A_n Y_n\) converges in distribution to \(0 \cdot Y = 0\). Because the limit is a constant, this is the same as converging in probability to \(0\).

This leaves only the first term on the right-hand side, which is exactly the limiting distribution in Equation 37.5.

Armed with the delta method, we can determine the asymptotic distribution of the exponential MLE.

Example 37.3 (Asymptotic Distribution of the Exponential MLE) For the exponential distribution, \(\mu = \frac{1}{\lambda}\) and \(\sigma^2 = \frac{1}{\lambda^2}\). Since the MLE is \(\hat\lambda = g(\bar X) = \frac{1}{\bar X}\), where \(g(x) = 1/x\), we have \(g'(\mu) = -\frac{1}{\mu^2} = -\lambda^2\).

We can use the delta method (Equation 37.5) to conclude that \[ \sqrt{n}(g(\bar X) - g(\mu)) = \sqrt{n}(\hat\lambda - \lambda) \] is approximately \[ \text{Normal}\!\left(0, \lambda^4 \cdot \frac{1}{\lambda^2}\right) = \text{Normal}(0, \lambda^2). \]

Rearranging terms, we see that the distribution of \(\hat\lambda\) is approximately \[ \text{Normal}(\lambda, \frac{\lambda^2}{n}). \]

Let us add this normal curve to the simulation from earlier, where \(n=1600\) and \(\lambda=1.5\). It is a very good approximation!

estimates <- replicate(10000, {
  x <- rexp(1600, rate=1.5)   # one sample of size n = 1600 from Exponential(1.5)
  1 / mean(x)                 # the MLE for this sample
})

# Histogram of the simulated sampling distribution (on the density scale),
# with the delta method's Normal(1.5, 1.5^2 / 1600) approximation overlaid
hist(estimates, breaks=30, freq=FALSE)
curve(dnorm(x, mean=1.5, sd=sqrt(1.5^2 / 1600)),
      col="red", add=TRUE)