\[ \newcommand{\or}{\textrm{ or }} \newcommand{\and}{\textrm{ and }} \newcommand{\not}{\textrm{not }} \newcommand{\Pois}{\textrm{Poisson}} \newcommand{\E}{\textrm{E}} \newcommand{\var}{\textrm{Var}} \]
In Chapter 1, we defined the probability of an event \(A\) to be the ratio of the number of equally likely outcomes in \(A\) to the number of equally likely outcomes in the sample space \(\Omega\). There are two reasons this definition is inadequate. First, for some experiments it is unclear how to make a suitable sample space with equally likely outcomes:
Second, it doesn’t accommodate experiments with infinitely many outcomes:
To handle situations like those in Example 3.1 and Example 3.2, we need a more general definition of probability.
Developing a more general definition of probability requires some foundational concepts from set theory. In particular, we need to be familiar with the basic set operations. Just as operations on real numbers (e.g., addition, subtraction, inversion) take real numbers and return a new real number, set operations take events (which are sets) and return a new event.
In this section we review the basic set operations and the difference between countable and uncountable infinities. We use the random babies experiment, recalled below, as an illustrative example throughout.
The intersection \(A \cap B\) of two events \(A\) and \(B\), depicted in Figure 3.1, is the event that both \(A\) and \(B\) happen. Colloquially, we refer to \(A \cap B\) as \(A \textrm{ and }B\).
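These operations are easy to play with in code. Below is a minimal Python sketch using a hypothetical sample space (one roll of a six-sided die) and two illustrative events of our own choosing; it is not tied to any particular example from this chapter.

```python
# Hypothetical sample space: one roll of a six-sided die.
omega = {1, 2, 3, 4, 5, 6}

A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is at least four

# The intersection A and B: outcomes that are in both events.
print(A & B)  # {4, 6}
```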
Sometimes we are interested in the intersection of more than two events.
We refer to a collection of events as disjoint when their intersection is empty (i.e., they share no outcomes). Disjoint events are also referred to as mutually exclusive because no two of them can happen simultaneously. Figure 3.2 depicts two disjoint events.
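In code, disjointness amounts to checking that an intersection is empty. Continuing the hypothetical die-roll sketch from above (the events here are again our own illustrative choices):

```python
# Two hypothetical events for one roll of a six-sided die.
low = {1, 2}   # the roll is one or two
high = {5, 6}  # the roll is five or six

# Disjoint events share no outcomes, so their intersection is empty.
print(low & high)            # set()
print(low.isdisjoint(high))  # True
```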
In the next section, we will see that recognizing when events are disjoint is very important. Our next example describes a collection of three disjoint events.
The union \(A \cup B\) of two events \(A\) and \(B\), depicted in Figure 3.3, is the event either \(A\) happens or \(B\) happens (or both \(A\) and \(B\) happen). Colloquially, we refer to \(A \cup B\) as \(A \textrm{ or }B\).
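Continuing the hypothetical die-roll sketch, the union collects every outcome that appears in at least one of the events:

```python
# The same hypothetical die-roll events as before.
A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is at least four

# The union A or B: outcomes in at least one of the two events.
print(A | B)  # {2, 4, 5, 6}
```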
Like with intersections, we are often interested in the union of more than two events.
When we take the union of disjoint or mutually exclusive events, we sometimes call it a disjoint union.
The complement \(A^c\) of an event \(A\), depicted in Figure 3.4, is the event that happens whenever \(A\) doesn’t. Colloquially, we refer to \(A^c\) as \(\textrm{not }A\).
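Unlike intersections and unions, complements depend on the whole sample space. In the hypothetical die-roll sketch, the complement is the set difference between \(\Omega\) and the event:

```python
# Complements are taken relative to the full sample space.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # the roll is even

# The complement (not A): every outcome in omega that is not in A.
print(omega - A)  # {1, 3, 5}
```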
Lastly, we recall the difference between countable and uncountable infinities. This subtle distinction plays a surprisingly important role in probability theory.
An infinite collection of items is countable if the items can be enumerated in a list and uncountable if not. Example 3.10, which we’ve left as optional reading, gives a few examples of both countable and uncountable infinities.
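To give a quick taste of the distinction, the integers are countable because they can be enumerated in a single list, e.g., \[0, \; 1, \; -1, \; 2, \; -2, \; 3, \; -3, \; \dots\] By contrast, Cantor’s famous diagonal argument shows that no list can exhaust the real numbers between \(0\) and \(1\), so they are uncountable.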
In 1933, mathematician Andrey Kolmogorov developed an axiomatic definition of probability, which is still the gold-standard definition of probability today. In Kolmogorov’s definition, probabilities are determined by a probability function \(P\). This function assigns a probability \(P(A)\) between zero and one to each event \(A\) in the sample space. To be a valid probability function, \(P\) must satisfy the three Kolmogorov axioms, which we present below.
Unlike Bernoulli’s naive definition of probability, which tells us explicitly how to compute the probability of any event, Kolmogorov’s definition requires that we choose a probability function \(P\) that is suitable for the specific experiment at hand. The example below illustrates the difference between using Bernoulli’s and Kolmogorov’s definitions.
As suggested by Example 3.11, Kolmogorov’s definition is a strict generalization of Bernoulli’s. Theorem 3.1 tells us that when \(\Omega\) has finitely many outcomes, Bernoulli’s naive definition of probability corresponds to just one of the probability functions we could use under Kolmogorov’s definition: the unique probability function that makes each outcome equally likely. As such, although we’ll use only Kolmogorov’s definition moving forward, whenever \(\Omega\) is finite and we suppose its outcomes are equally likely, we are effectively using Bernoulli’s. We ask you to prove Theorem 3.1 yourself in Exercise 3.3.
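Here is a minimal Python sketch of this correspondence (the sample space and event are hypothetical choices of ours, not Theorem 3.1’s notation): with equally likely outcomes, the probability function singled out by Theorem 3.1 is exactly Bernoulli’s ratio of counts.

```python
from fractions import Fraction

# Hypothetical finite sample space: one roll of a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Equally likely outcomes: (# outcomes in event) / (# outcomes in omega)."""
    return Fraction(len(event & omega), len(omega))

print(prob({2, 4, 6}))  # 1/2, the probability of an even roll
print(prob(omega))      # 1, as Kolmogorov's axioms require
```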
As promised, the added flexibility from Kolmogorov’s definition allows us to deal with the problematic settings that we discussed at the start of the chapter. First, as we’ve already seen in the loaded die example (Example 3.11), we now can allow for outcomes that are not equally likely.
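To sketch what a probability function with unequal outcome probabilities can look like in code, here is a hypothetical loaded die (the weights below are our own and need not match Example 3.11’s): the outcome probabilities are nonnegative and sum to one, and an event’s probability is the sum of the probabilities of its outcomes.

```python
# Hypothetical loaded die: six is three times as likely as each other face.
p = {1: 1/8, 2: 1/8, 3: 1/8, 4: 1/8, 5: 1/8, 6: 3/8}

def prob(event):
    """An event's probability is the sum of its outcomes' probabilities."""
    return sum(p[outcome] for outcome in event)

print(prob({6}))        # 0.375, rather than the naive 1/6
print(prob({2, 4, 6}))  # 0.625, the probability of an even roll
print(prob(set(p)))     # 1.0, the probability of the whole sample space
```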
Kolmogorov’s definition also allows us to accommodate experiments with infinitely many outcomes.
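For a concrete (and hypothetical) illustration, take the countably infinite sample space \(\Omega = \{1, 2, 3, \dots\}\) and set \(P(\{n\}) = (1/2)^n\); this choice is ours and needn’t match the chapter’s examples. The outcome probabilities form a geometric series summing to one, so event probabilities behave as the axioms demand:

```python
def prob_outcome(n):
    """P({n}) = (1/2)**n for n = 1, 2, 3, ... -- a hypothetical choice."""
    return 0.5 ** n

# The outcome probabilities sum to 1; a partial sum gets very close.
print(sum(prob_outcome(n) for n in range(1, 60)))  # ~1.0

# P(the outcome is even) sums over n = 2, 4, 6, ... and equals 1/3.
print(sum(prob_outcome(n) for n in range(2, 60, 2)))  # ~0.3333
```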
As Example 3.13 hints, formally verifying that a probability function \(P\) satisfies all of Kolmogorov’s axioms when \(\Omega\) is uncountably infinite is difficult. For the purposes of this book, the level of rigor presented in Example 3.13 will suffice. Henceforth (unless otherwise stated), when we provide a probability function, you may assume that it’s uniquely defined on every event and doesn’t violate any of Kolmogorov’s axioms.
As a final benefit of Kolmogorov’s definition, there are cases where Bernoulli’s definition works for some appropriately chosen sample space, but Kolmogorov’s definition allows for the same analysis with a much simpler sample space. We provide such an example below.
To close out the chapter, we discuss why Kolmogorov’s axiomatic definition is an appropriate definition of probability. A good axiomatic definition should identify a minimal set of axioms, or fundamental assumptions, about how probabilities should behave. Still, this minimal set should be extensive enough to imply all the relevant properties that probabilities should have. Whether or not you believe Kolmogorov’s axioms satisfy these criteria depends on what exactly you believe probabilities represent.
In what follows, we give a brief (indeed, a whole book could be written just on this matter!) description of frequentism and Bayesianism, the two most popular philosophies on how to interpret probabilities. We argue that frequentists and Bayesians (1) both agree that probabilities should satisfy Kolmogorov’s axioms, and (2) can both derive the properties they believe probabilities should have from these axioms. An important consequence is that, whether you choose to adopt the frequentist philosophy or the Bayesian one, this book will be equally useful to you! The foundational tools that Bayesians and frequentists use to compute probabilities are exactly the same; the only difference is how those probabilities are ultimately interpreted. Each philosophy has its strengths, and different folks have their own opinions about which is better when. We leave it to you to decide when and where to be frequentist versus Bayesian.
For a frequentist to ascribe a probability \(P(A)\) to some event \(A \subseteq \Omega\), a number of conditions must hold. First, they require that the experiment can be repeated again and again over many trials. Second, they require that, as the experiment is repeated, the proportion of trials where \(A\) happens approaches some limiting value. Lastly, they require that, throughout the repeated trials, there is no pattern to when \(A\) happens or doesn’t happen. Formalizing precisely what this means is very tricky and requires extensive mathematics, so we’ll instead just give a simple motivating example to get the idea across. If the limiting proportion of trials where \(A\) happens is \(0.5\) and we mark trials where \(A\) happens with a \(1\) and trials where it doesn’t with a \(0\), then the results over repeated trials should look something like
\[0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, \dots \]
and not like
\[0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, \dots \]
If all these conditions hold, then a frequentist considers the probability of \(A\) to be the proportion of times that \(A\) happens in the limit of infinite repeated trials: \[P(A) = \lim_{\text{\# trials} \rightarrow \infty} \frac{\text{\# trials where } A \text{ happens} }{\text{\# trials}}. \] Equipped with this interpretation, we can justify why a frequentist should believe that probabilities obey Kolmogorov’s three axioms.
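Before turning to the axioms, this limit is easy to watch in a simulation. The sketch below uses a hypothetical event with \(P(A) = 0.5\) (a fair coin, in effect) and prints the running proportion of trials where \(A\) has happened; it drifts toward \(0.5\) as the trials accumulate.

```python
import random

random.seed(0)  # for reproducibility

n_trials = 100_000
count_A = 0
for trial in range(1, n_trials + 1):
    count_A += random.random() < 0.5  # A happens with probability 0.5
    if trial in (10, 100, 1_000, 10_000, 100_000):
        # Running proportion of trials where A has happened so far.
        print(f"{trial:>7} trials: {count_A / trial:.4f}")
```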
No two axioms together imply the third, so they are all truly needed. In the next chapter, we will see that Kolmogorov’s axioms imply many of the properties a frequentist would like probabilities to have, suggesting that the axioms are also sufficiently extensive.
Frequentism is the way most people interpret probabilities, in large part because it provides such an intuitive interpretation. But it has its drawbacks. In particular, it is not well suited for experiments that are hard to repeat. For example, if we ask the question “what’s the probability that the sun will explode tomorrow?”, can a frequentist give an answer? Tomorrow only happens once, and the sun will either explode or it won’t. The classic workaround is to imagine infinitely many hypothetical copies of our universe. As the laws of nature play themselves out, the sun will explode tomorrow in a (very, very small) proportion of them. That proportion is what a frequentist’s probability represents. Some find this workaround satisfactory, while others believe that frequentists simply cannot assign probabilities to these sorts of one-off events.
Even when working with seemingly repeatable experiments, frequentism runs into some philosophical issues. Consider, for example, the classic experiment of tossing a coin. If we don’t toss the coin exactly the same way, in exactly the same place, are we really repeating the experiment? Is it ever possible to exactly repeat an experiment? Even if we turn to the infinite universe workaround and imagine that the coin is flipped exactly the same way in each universe, classical mechanics tells us that the flip will have the identical outcome in each universe (see Figure 3.6). When repeated exactly, the experiment has no randomness! In the case of coin flipping, as well as many other experiments, the experimental outcome’s sensitivity to imperceptible or unknowable changes in the experiment’s initial conditions is actually what gives the appearance of randomness. Except for experiments whose outcomes vary because of true, fundamental randomness (like randomness at the quantum level, if you believe in that sort of thing), a precise frequentist definition of probability falls apart.
Still, frequentism is an incredibly important and useful philosophy. We liken it to Newton’s theory of gravity. Is it true that there is a force of attraction between objects as Newton suggested? Einstein’s theory of relativity, which makes the same predictions as Newton’s theory in standard cases but more accurate predictions when supermassive objects are present, says no. Einstein’s theory claims that gravity is a result of objects warping space-time, not attracting one another (although physicists today would point out that Einstein’s theory fails at the quantum level, and is thus also “incorrect”). Still, in appropriate settings, Newton’s theory of gravity accurately captures how the world behaves, despite being “false”. In fact, it is Newton’s theory that allowed us to land a rocket on the moon. So, is coin flipping truly random? No, it’s not. But we may as well pretend it’s random, just as we often pretend that objects exert attractive forces on one another. Like Newton’s theory of gravity, frequentism provides a view of the world that, although “incorrect”, helps us make useful predictions and draw accurate inferences when used in the appropriate settings. If we’re studying something that appears to behave randomly in the way that frequentists describe (e.g., coin flips), then frequentism is an apt and useful viewpoint to adopt. It’s up to us to determine when that is and isn’t the case.
Unlike frequentists, Bayesians consider the probability \(P(A)\) to be a degree of belief about whether the event \(A\) will happen or not. One way of making this concrete is relating beliefs to betting: a person’s degree of belief determines whether or not they are willing to buy/sell certain wagers. Let’s review what it means to buy/sell a wager. If I buy an \(x\) dollar wager on \(A\) happening that pays \(y\) dollars, that means I profit \(y\) dollars if \(A\) happens and lose \(x\) dollars if it doesn’t. Selling a wager corresponds to taking the other side of the bet. If I sell an \(x\) dollar wager on \(A\) happening that pays \(y\) dollars, then if \(A\) happens I lose \(y\) dollars, and if it doesn’t I profit \(x\) dollars.
Now, suppose I think there’s a \(50\)% chance of \(A\) happening (i.e., I believe that \(P(A) = 0.5\)). That means I believe it’s equally likely for \(A\) to happen or not happen. To a Bayesian this has nothing to do with frequencies. It simply means that I’ll accept 1-to-1 wagers on \(A\) happening versus it not happening (or wagers that are even more favorable to me). Specifically, I’d buy a \(\$1\) wager on \(A\) happening so long as it pays at least \(\$1\). If it pays less than \(\$1\), I won’t buy it. I’d also sell a \(\$1\) wager on \(A\) happening that pays at most \(\$1\), but not one that pays more than \(\$1\).
If I instead assign the probability \(P(A) = 0.2\) to \(A\), then the wagers I’m willing to buy/sell change. I now believe the event is four times less likely to happen than not. Concretely, I’ll be willing to buy a \(\$1\) wager on \(A\) happening so long as it pays at least \(\$4\), and sell a \(\$1\) wager on \(A\) happening so long as it pays at most \(\$4\). In this view, the probability of an event simply determines the tipping point at which a person is willing to buy/sell different wagers on the event happening.
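One way to motivate these tipping points is through expected profit: if my degree of belief in \(A\) is \(p\), buying a \(\$1\) wager that pays \(y\) dollars has expected profit \(p \cdot y - (1 - p) \cdot 1\), which is nonnegative exactly when \(y \geq (1-p)/p\). Here is a small Python sketch of that arithmetic (the function names are ours, chosen for illustration):

```python
def expected_profit_buy(p, stake, payout):
    """Expected profit from buying a wager on A: win `payout` with
    probability p, lose `stake` with probability 1 - p."""
    return p * payout - (1 - p) * stake

def tipping_payout(p, stake):
    """Payout at which buying the wager exactly breaks even."""
    return (1 - p) / p * stake

print(tipping_payout(0.5, 1))          # 1.0: accept 1-to-1 wagers
print(tipping_payout(0.2, 1))          # 4.0: demand at least a $4 payout
print(expected_profit_buy(0.2, 1, 4))  # 0.0: exactly break-even
```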
As you may expect, the probabilities people assign to events should satisfy some properties. If they don’t, that person can be easily exploited.
We say that our friend in Example 3.15 is susceptible to arbitrage: there is a finite collection of wagers that they’re willing to buy/sell that guarantees them a loss.
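We don’t reproduce Example 3.15’s numbers here, so the following Python sketch is a hypothetical instance of arbitrage. Suppose a friend assigns \(P(A) = 0.6\) and \(P(A^c) = 0.6\), violating Kolmogorov’s axioms (as we’ll see, the probabilities of an event and its complement must sum to one). Selling them a \(\$1\) wager on each of \(A\) and \(A^c\), at payouts they consider acceptable, guarantees they lose money no matter what happens:

```python
# Hypothetical incoherent beliefs: P(A) = 0.6 and P(not A) = 0.6,
# which cannot both hold under Kolmogorov's axioms.
p_A, p_not_A = 0.6, 0.6

# At belief p, our friend will buy a $1 wager on the event so long
# as it pays at least (1 - p) / p dollars.
payout_A = (1 - p_A) / p_A              # ~0.667
payout_not_A = (1 - p_not_A) / p_not_A  # ~0.667

# Our friend buys both $1 wagers from us. Their net profit:
if_A_happens = payout_A - 1    # wins the A wager, loses the not-A wager
if_A_fails = payout_not_A - 1  # wins the not-A wager, loses the A wager

print(round(if_A_happens, 3))  # -0.333: a guaranteed loss...
print(round(if_A_fails, 3))    # -0.333: ...no matter what happens
```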
Theorem 3.2, which we state without proof, tells us that you can avoid being arbitraged by ensuring that your beliefs satisfy Kolmogorov’s axioms.
As Theorem 3.2 illustrates, Bayesianism comes with a nice and clean philosophical justification. As an added bonus, Bayesianism also allows us to assign probabilities to any event (unlike frequentism). But the variety of Bayesianism we’ve presented comes with one big downside: it is completely subjective. If you, as a federal regulatory agency, ask two different subjective Bayesian experts for the probability that a drug will have fatal side effects, they may give you two vastly different probabilities. Both of them, however, will be able to argue that their judgements are rational. Frequentism, on the other hand, is built around something that is more objective (albeit philosophically questionable): limiting frequencies. There are variations of the Bayesian philosophy that are less subjective, but they come with their own criticisms, and we do not cover them in this book.
Exercise 3.1 Coming soon!
Exercise 3.2 Coming soon!
Exercise 3.3 Coming soon!
Exercise 3.4 Coming soon!
Exercise 3.5 Coming soon!