28 Probability versus Statistics
Although probability is a fascinating subject in its own right, it is perhaps most important in the modern world because of its application to statistics. In this chapter, we discuss the intricate relationship between probability and statistics.
28.1 Mark-Recapture
How do we estimate the size of a population, such as the number of snails in a state park? There may be too many to count, and it may be difficult to catch them all. An alternative is mark-recapture. First, we capture a sample, say \(50\) snails, and "mark" them, as shown at right.
Then, after some time, we recapture another sample of snails, say \(40\). Some of these snails will be marked from the first capture, while others are unmarked. The number of recaptured snails that are marked can be used to estimate the number of snails in the population.
In Chapter 12, we learned to solve probability problems like the following.
But Example 28.1 is not realistic. If we knew that there were \(300\) snails in the population, we would not bother doing mark-recapture in the first place!
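A probability calculation like the one in Example 28.1 is straightforward to carry out numerically. The sketch below assumes the setup used throughout this section — a population of \(s = 300\) snails, \(50\) of them marked, and a second sample of \(40\) — and tabulates the hypergeometric probabilities \(P(X = x)\).

```python
from math import comb

# Setup matching Example 28.1: population s = 300,
# M = 50 marked snails, and a second sample of n = 40.
s, M, n = 300, 50, 40

# P(X = x): hypergeometric probability that x of the
# recaptured snails are marked.
def pmf(x):
    return comb(M, x) * comb(s - M, n - x) / comb(s, n)

for x in range(15):
    print(f"P(X = {x:2d}) = {pmf(x):.4f}")
```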
Here is a more realistic scenario, where we have collected data and want to infer something about the population.
In Example 28.1, we assumed that \(s = 300\) and wanted to calculate the probabilities of various values of \(X\), whereas in Example 28.2, we observe \(X = 11\) and want to estimate \(s\). In other words, statistics is the inverse of probability. This idea is illustrated in Figure 28.1.
Properties of the population (or model), such as \(s\), are called parameters, while properties of the sample (or data), such as \(X\), are called statistics. How do we estimate a parameter using data? The probability distribution still plays an important role. However, the unknown quantity is now the parameter \(s\), instead of the value of the random variable \(X\). This motivates the following definition:
Let us determine the likelihood for the mark-recapture problem.
How do we use the likelihood to estimate \(s\)? Since the likelihood represents the probability of observing the data, one idea is to choose \(s\) to make this probability as large as possible. This principle is stated below.
To find the MLE for the mark-recapture problem, we find the value of \(s\) that maximizes \(L_{11}(s)\). From Example 28.3, we see that the likelihood is maximized somewhere between 150 and 200. To determine the exact value, we print out the likelihood for all values of \(s\) between 150 and 200.
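The search just described might look like the following in Python (the variable names are ours): evaluate the likelihood \(L_{11}(s)\) at each candidate value of \(s\) and keep the largest.

```python
from math import comb

M, n, x = 50, 40, 11   # marked snails, second sample size, marked recaptures

# Likelihood L_11(s): probability of observing x = 11 marked snails
# in a sample of n = 40 when the population size is s.
def likelihood(s):
    return comb(M, x) * comb(s - M, n - x) / comb(s, n)

# Print the likelihood for all values of s between 150 and 200.
for s in range(150, 201):
    print(f"s = {s}: L = {likelihood(s):.5f}")

s_hat = max(range(150, 201), key=likelihood)
```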
We see that the likelihood is maximized at \(s = 181\), where it achieves a maximum value of \(0.15854\). Therefore, the MLE for the size of the snail population is \(\hat s = 181\). This value makes intuitive sense. The data suggests that approximately \(11 / 40 = 0.275\) of all snails are marked. Since we marked \(50\) snails, the number of snails in the population should be \(50 / 0.275 \approx 181.82\), which is very close to the MLE.
This is no accident. We can derive the MLE as a function of the number of marked snails \(M\), the number of snails in the second sample \(n\), and the number of marked snails recaptured \(x\). To do this, we consider the ratio \(L_x(s) / L_x(s - 1)\):
\[ \begin{align} \frac{L_x(s)}{L_x(s - 1)} &= \frac{\frac{\binom{M}{x} \binom{s - M}{n - x}}{\binom{s}{n}}}{\frac{\binom{M}{x} \binom{s - 1 - M}{n - x}}{\binom{s - 1}{n}}} \\ &= \frac{(s - n)(s - M)}{s(s - M - n + x)}. \end{align} \]
The likelihood is increasing if and only if this ratio is greater than \(1\); that is, when \[ \begin{align*} (s - n)(s - M) &> s(s - M - n + x) \\ s^2 - sM - sn + nM &> s^2 - sM - sn + sx \\ nM &> sx. \end{align*} \] In other words, the likelihood will increase as long as \(s < \frac{nM}{x}\), and it will decrease when \(s > \frac{nM}{x}\). Therefore, the likelihood is maximized when \(s\) is the greatest integer not exceeding \(\frac{nM}{x}\): \[ \hat s = \left\lfloor \frac{nM}{x} \right\rfloor. \]
This captures the intuition that the best estimate of the population size is the value of \(s\) that makes \(\frac{M}{s} \approx \frac{x}{n}\).
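The closed form can be checked against a brute-force maximization of the likelihood. The sketch below uses the values from this section, \((M, n, x) = (50, 40, 11)\), along with two other hypothetical triples chosen so that \(\frac{nM}{x}\) is not an integer (when it is an integer, \(s = \frac{nM}{x}\) and \(s = \frac{nM}{x} - 1\) tie for the maximum).

```python
from math import comb, floor

def likelihood(s, M, n, x):
    # Hypergeometric probability of x marked snails among n,
    # when the population has size s with M marked.
    return comb(M, x) * comb(s - M, n - x) / comb(s, n)

results = []
for M, n, x in [(50, 40, 11), (100, 60, 19), (30, 25, 7)]:
    s_formula = floor(n * M / x)
    # Brute-force argmax over all feasible population sizes up to 2000.
    # Feasibility requires s - M >= n - x.
    s_brute = max(range(M + n - x, 2000),
                  key=lambda s: likelihood(s, M, n, x))
    results.append((s_formula, s_brute))
    print(M, n, x, s_formula, s_brute)
```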
28.2 Skew Dice
A skew die is one whose faces are irregular. Are skew dice fair? One way to find out is to roll the die and collect some data.
You should already be familiar with how to solve problems like the following.
But the whole point of rolling the die is to determine the probability of landing on each face. That is, the statistics question is likely more compelling than the probability question.
The MLE in Example 28.5 is intuitive. If the skew die landed on six \(7\) times in \(25\) rolls, then our best estimate for the probability of landing on six is \(\frac{7}{25}\). We can show this fact more generally by replacing \(25\) with \(n\) and \(7\) with \(x\).
The likelihood of \(p\) is \[ L_x(p) = \binom{n}{x} p^x (1 - p)^{n - x}. \] Setting the derivative with respect to \(p\) equal to zero, we obtain \[ \begin{align} 0 &= \frac{\partial}{\partial p} L_x(p) \\ &= \frac{\partial}{\partial p} \binom{n}{x} p^x (1 - p)^{n - x} \\ &= \binom{n}{x} \big(x p^{x - 1} (1 - p)^{n - x} - (n - x) p^x (1 - p)^{n - x - 1}\big) \\ &= \binom{n}{x} p^{x - 1} (1 - p)^{n - x - 1} \big(x (1 - p) - (n - x) p\big). \end{align} \] The solution to this equation corresponding to a maximum is \[ \hat p = \frac{x}{n}. \]
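The calculus can be checked numerically. The sketch below grid-searches the binomial likelihood using the counts from Example 28.5 (\(n = 25\) rolls, \(x = 7\) sixes) and confirms that the maximizer is \(\frac{x}{n} = 0.28\).

```python
from math import comb

n, x = 25, 7   # rolls and observed sixes, as in Example 28.5

# Binomial likelihood L_x(p).
def likelihood(p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Grid search over p in (0, 1).
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=likelihood)
print(p_hat)   # close to x/n = 0.28
```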