3  Axiomatic Definition of Probability

In Chapter 1, we defined the probability of an event \(A\) to be the ratio of the number of equally likely outcomes in \(A\) to the number of equally likely outcomes in the sample space \(\Omega\). There are two reasons this definition is inadequate. First, for some experiments it is unclear how to make a suitable sample space with equally likely outcomes:

Example 3.1 (Predicting the sex of the next born baby) What’s the probability that the next baby that’s conceived and born into the world will be male? One reasonable sample space for this “experiment” is \(\Omega = \{\text{male}, \text{female} \}\). Or perhaps we could make the sample space \(\Omega = \{XX, XY\}\), where each outcome corresponds to the sex chromosomes passed down from the baby’s parents. Either way, the naive definition of probability (Definition 1.4) gives a probability of \(0.5\). In reality, we know that the probability is closer to \(0.51\). Although about \(50\%\) of fetuses are male at conception, male fetuses tend to experience fewer complications than female fetuses during pregnancy. Because slightly more female fetuses don’t make it through pregnancy, more male babies are born than female ones. The difference is small, but the naive definition of probability cannot account for it, and it gives an incorrect answer.

Second, it doesn’t accommodate experiments with infinitely many outcomes:

Example 3.2 (Throwing a dart at a dartboard) Suppose a complete novice is throwing a dart at a dartboard, and we want to know the probability that they will hit the inner bullseye. As simple as this example is, it poses serious challenges for our naive definition of probability. We depict the dartboard, drawn in the \(xy\)-plane, below.

For simplicity we’ll suppose that the thrower hits the board, so the throw’s possible outcomes are the \((x, y)\) pairs that make up the dartboard. Considering that the thrower is a complete novice, we may reasonably assume that all the outcomes are equally likely. Unfortunately, there are infinitely many outcomes in both the sample space \(\Omega\) and the event \(A\) of hitting the inner bullseye. Trying to apply the naive definition of probability (Definition 1.4), we get the very unhelpful answer that \(P(A) = |A|/|\Omega| = \infty / \infty\).

To handle situations like those in Example 3.1 and Example 3.2, we need a more general definition of probability.

3.1 Set Theory Background

Developing a more general definition of probability requires knowledge of some foundational concepts from set theory. Mainly, we need to be familiar with the basic set operations. In the same way that operations on real numbers (e.g., addition, subtraction, inversion) take real numbers and return a new real number, set operations take events (which are sets) and return a new event.

In this section we review the basic set operations and the difference between countable and uncountable infinities. We use the random babies experiment, recalled below, as an illustrative example throughout.

Example 3.3 (Returning random babies) Recall the generalized random babies experiment (Example 2.4) from Chapter 2, where we randomly return \(n\) babies to \(n\) sets of parents. Imagine ordering the parents and babies so that the \(i\)th baby belongs to the \(i\)th set of parents, and let \(E_i\) be the event that the \(i\)th couple gets their baby back. Labelling the couple that receives the \(i\)th baby as \(\omega_i\), we denote the experiment’s outcomes as \(\omega = (\omega_1, \dots, \omega_n)\). For example, when \(n=3\), the outcome \((2, 1, 3)\) corresponds to the third couple getting their baby back and the first two couples swapping babies. This outcome is therefore in \(E_3\), but not in \(E_1\) or \(E_2\).
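
To make this setup concrete, here is a small Python sketch (the variable names `Omega` and `E` are our own choices) that enumerates the sample space and the events \(E_i\) for \(n = 3\) using `itertools.permutations`:

```python
from itertools import permutations

n = 3

# Each outcome is a tuple (w_1, ..., w_n), where w_i is the couple
# that receives the i-th baby; the sample space is all permutations.
Omega = set(permutations(range(1, n + 1)))

# E[i] is the event that the i-th couple gets their baby back (w_i = i).
E = {i: {w for w in Omega if w[i - 1] == i} for i in range(1, n + 1)}

print((2, 1, 3) in E[3])  # True: the third couple gets their baby back
print((2, 1, 3) in E[1])  # False: the first two couples swapped babies
```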

3.1.1 Intersections and Mutually Exclusive Events

The intersection \(A \cap B\) of two events \(A\) and \(B\), depicted in Figure 3.1, is the event that both \(A\) and \(B\) happen. Colloquially, we refer to \(A \cap B\) as \(A \textrm{ and }B\).

Figure 3.1: Two events \(A\) and \(B\) in a sample space \(\Omega\) with their intersection \(A \cap B\) marked in blue.

Example 3.4 (First two babies returned correctly) The event \(E_1 \cap E_2\) happens when both the first and second babies are returned to their biological parents. It is made up of all outcomes \(\omega = (\omega_1, \dots, \omega_n)\) such that \(\omega_1 = 1\) and \(\omega_2 = 2\).

Sometimes we are interested in the intersection of more than two events.

Definition 3.1 (Intersection) The intersection of a finite collection \(A_1, \dots, A_n\) of events, denoted by \(A_1 \cap \dots \cap A_n\) or \(\bigcap_{i=1}^n A_i\), is the event consisting of outcomes that are in all of the \(A_i\). The intersection of an infinite collection \(A_1, A_2, \dots\) of events, denoted by \(A_1 \cap A_2 \cap \dots\) or \(\bigcap_{i=1}^{\infty} A_i\), is defined identically.

Example 3.5 (All babies returned correctly) The event \(\bigcap_{i=1}^n E_i\) is the event that all the babies are returned to their biological parents. It contains the single outcome \(\omega = (1, 2, \dots, n)\).
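
Continuing the Python sketch from Example 3.3 (with the setup repeated so the snippet runs on its own), intersections of events are just intersections of Python sets:

```python
from itertools import permutations

n = 3
Omega = set(permutations(range(1, n + 1)))
E = {i: {w for w in Omega if w[i - 1] == i} for i in range(1, n + 1)}

first_two_correct = E[1] & E[2]   # Example 3.4: E_1 intersected with E_2
all_correct = E[1] & E[2] & E[3]  # Example 3.5: the intersection of all E_i

print(all_correct)  # {(1, 2, 3)}: the single outcome (1, ..., n)
```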

We refer to a collection of events as disjoint when their intersection is empty (i.e., they share no outcomes). Disjoint events are also referred to as mutually exclusive because no two of them can happen simultaneously. Figure 3.2 depicts two disjoint events.

Figure 3.2: Two mutually exclusive events \(A\) and \(B\) in a sample space \(\Omega\).

In the next section, we will see that recognizing when events are disjoint is very important. Our next example describes a collection of three disjoint events.

Example 3.6 (Mutually exclusive events) The three events

  1. \(E_1^c \cap E_2 \cap E_3\): first baby returned incorrectly and second and third babies returned correctly,

  2. \(E_1 \cap E_2^c \cap E_3\): second baby returned incorrectly and first and third babies returned correctly,

  3. \(E_1 \cap E_2 \cap E_3^c\): third baby returned incorrectly and first and second babies returned correctly,

are mutually exclusive. Any one of these events requires incorrectly returning a baby that is returned correctly in the others, so no two of them can happen at the same time.
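
As a computational sanity check of this disjointness claim, the sketch below rebuilds the events with \(n = 4\) (for \(n = 3\) all three events are empty, since returning two of three babies correctly forces the third to be correct as well) and verifies that no two of them share an outcome:

```python
from itertools import permutations

n = 4
Omega = set(permutations(range(1, n + 1)))
E = {i: {w for w in Omega if w[i - 1] == i} for i in range(1, n + 1)}

# Build the three events; Omega - E[i] plays the role of the complement.
A1 = (Omega - E[1]) & E[2] & E[3]
A2 = E[1] & (Omega - E[2]) & E[3]
A3 = E[1] & E[2] & (Omega - E[3])

# Every pairwise intersection is empty, so the events are mutually exclusive.
print(A1 & A2, A2 & A3, A1 & A3)  # set() set() set()
```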

3.1.2 Unions and Disjoint Unions

The union \(A \cup B\) of two events \(A\) and \(B\), depicted in Figure 3.3, is the event that either \(A\) happens or \(B\) happens (or both \(A\) and \(B\) happen). Colloquially, we refer to \(A \cup B\) as \(A \textrm{ or }B\).

Figure 3.3: Two events \(A\) and \(B\) in a sample space \(\Omega\) with their union \(A \cup B\) marked in blue.

Example 3.7 (At least one of first two babies returned correctly) The event \(E_1 \cup E_2\) is the event that at least one of the first two babies is returned to their biological parents. It is made up of outcomes \(\omega = (\omega_1, \dots, \omega_n)\) where either \(\omega_1= 1\) or \(\omega_2=2\) (or both \(\omega_1= 1\) and \(\omega_2=2\)).

Like with intersections, we are often interested in the union of more than two events.

Definition 3.2 (Union of many events) The union of a finite collection \(A_1, \dots, A_n\) of events, denoted by \(A_1 \cup \dots \cup A_n\) or \(\bigcup_{i=1}^n A_i\), is the event consisting of outcomes that are in at least one of the \(A_i\). The union of an infinite collection \(A_1, A_2, \dots\) of events, denoted by \(A_1 \cup A_2 \cup \dots\) or \(\bigcup_{i=1}^{\infty} A_i\), is defined identically.

Example 3.8 (At least one baby returned correctly) The event \(\bigcup_{i=1}^n E_i\) is the event that at least one of the babies is returned to their biological parents. It is made up of outcomes \(\omega = (\omega_1, \dots, \omega_n)\) where \(\omega_i = i\) for at least one \(i \in \{1, \dots, n\}\).
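
In the Python sketch, unions of events are Python set unions:

```python
from itertools import permutations

n = 3
Omega = set(permutations(range(1, n + 1)))
E = {i: {w for w in Omega if w[i - 1] == i} for i in range(1, n + 1)}

at_least_one_of_first_two = E[1] | E[2]  # Example 3.7
at_least_one = E[1] | E[2] | E[3]        # Example 3.8 for n = 3

print(len(at_least_one), len(Omega))  # 4 6: four of the six outcomes
```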

When we take the union of disjoint or mutually exclusive events, we sometimes call it a disjoint union.

3.1.3 Complements

The complement \(A^c\) of an event \(A\), depicted in Figure 3.4, is the event that happens whenever \(A\) doesn’t. Colloquially, we refer to \(A^c\) as \(\textrm{not }A\).

Figure 3.4: An event \(A\) in a sample space \(\Omega\) with its complement \(A^c\) marked in blue.

Definition 3.3 (Complement of an event) The complement of an event \(A\), denoted by \(A^c\), is the event consisting of all the outcomes that are not in \(A\).

Example 3.9 (Incorrectly returning the first baby) The complement \(E_1^c\) of the event \(E_1\) is the event that the first baby is returned incorrectly. It consists of all the outcomes \(\omega = (\omega_1, \dots, \omega_n)\) with \(\omega_1 \neq 1\).
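
In the sketch, the complement is set difference from \(\Omega\):

```python
from itertools import permutations

n = 3
Omega = set(permutations(range(1, n + 1)))
E1 = {w for w in Omega if w[0] == 1}  # first couple gets baby 1 back

E1_complement = Omega - E1  # Example 3.9: every outcome not in E_1
print(all(w[0] != 1 for w in E1_complement))  # True
```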

3.1.4 Countable and Uncountable Infinities

Lastly, we recall the difference between countable and uncountable infinities. This subtle distinction plays a surprisingly important role in probability theory.

An infinite collection of items is countable if the items can be enumerated in a list and uncountable if not. Example 3.10, which we’ve left as optional reading, gives a few examples of both countable and uncountable infinities.

Example 3.10 (Countable and uncountable infinities) An obvious example of a countably infinite collection of items is the natural numbers \(\mathbb{N} = \{1, 2, 3, \dots\}\). As written, we see that they are already enumerated in a list. The set of integers \(\mathbb{Z}\) is also countable. To see this, consider the list \[0, 1, -1, 2, -2, 3, -3, \dots\] In Exercise 3.5, we ask you to argue that the rational numbers \(\mathbb{Q}\) (the set of all fractions) are also countable.

A classic example of an uncountably infinite collection of items is the set of infinite binary strings. To show that this collection is uncountable, we must prove that it cannot be enumerated in a list. For the interested reader, we show how to do this via Cantor’s diagonalization argument.

Suppose for the sake of contradiction that we could enumerate all infinite binary strings in a list:

\[\begin{matrix} \textbf{1.} & 0 & 1 & 0 & 1 & 0 & \dots \\ \textbf{2.} & 1 & 1 & 0 & 1 & 1 & \dots \\ \textbf{3.} & 1 & 1 & 0 & 0 & 0 & \dots \\ \textbf{4.} & 0 & 0 & 1 & 0 & 1 & \dots \\ \textbf{5.} & 1 & 1 & 1 & 1 & 0 & \dots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{matrix}\]

Consider the infinite binary string whose \(i\)th entry is the \(i\)th entry of the \(i\)th binary string in our list (i.e., it is given by the diagonal of our list). In our example, this string is \(01000\dots\). Make a new string from this one by flipping every \(0\) to a \(1\) and every \(1\) to a \(0\). In our example, we get \(10111 \dots\). Since we’ve enumerated all infinite binary strings, this new string must be the \(j\)th string in our list for some positive integer \(j\). But, by how we’ve constructed this new string, its \(j\)th entry cannot match that of the \(j\)th string in our list, and therefore we have a contradiction.
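
For readers who’d like to see the construction concretely, here is a small Python illustration of the flipping step, applied to the \(5 \times 5\) corner of the (hypothetical) table above:

```python
# The visible 5x5 corner of the supposed enumeration of binary strings.
table = [
    [0, 1, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 1, 1, 0],
]

diagonal = [table[i][i] for i in range(len(table))]  # [0, 1, 0, 0, 0]
flipped = [1 - b for b in diagonal]                  # [1, 0, 1, 1, 1]

# The flipped string differs from the i-th row in its i-th entry,
# so it cannot equal any row of the table.
print(all(flipped[i] != table[i][i] for i in range(len(table))))  # True
```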

In Exercise 3.5, we ask you to use the same argument to show that \([0, 1]\), the set of real numbers between zero and one, is also uncountable. As a consequence, sets like the real line \(\mathbb{R}\) and the \(xy\)-plane \(\mathbb{R}^2\) are uncountable as well.

3.2 The Axioms of Probability

Figure 3.5: Andrey Kolmogorov

Andrey Kolmogorov (1903-1987) was a Soviet mathematician who is best known for his contributions to probability theory. His 1933 book Foundations of the Theory of Probability laid mathematical foundations for the field, and he is considered the inventor of modern probability theory.

Outside of probability theory, Kolmogorov also helped found the field of algorithmic information theory. Kolmogorov complexity, the focus of the field’s study, is named after him.

In 1933, mathematician Andrey Kolmogorov developed an axiomatic definition of probability, which is still the gold-standard definition of probability today. In Kolmogorov’s definition, probabilities are determined by a probability function \(P\). This function assigns a probability \(P(A)\) between zero and one to each event \(A\) in the sample space. To be a valid probability function, \(P\) must satisfy the three Kolmogorov axioms, which we present below.

Definition 3.4 (Axiomatic definition of probability) Consider a sample space \(\Omega\) and function \(P\) that maps events \(A \subseteq \Omega\) to real numbers. We say that \(P\) is a probability function if it satisfies the following three axioms:

  1. Non-negativity: Probabilities are never negative: \(P(A) \geq 0\).

  2. Normalization: The probability of the entire sample space is one and the probability of the empty set is zero,

\[ P(\Omega) = 1 \quad \text{and} \quad P(\emptyset) = 0. \]

  3. Countable additivity: If \(A_1, A_2, A_3, \dots\) are disjoint sets (mutually exclusive events), then

\[ P\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i). \]

Finite additivity

Axiom 3 also implies finite additivity, i.e., if \(A_1, \dots, A_n\) are disjoint sets (mutually exclusive events) then \(P(\bigcup_{i=1}^{n} A_i) = \sum_{i=1}^{n} P(A_i)\). We ask you to show this in Exercise 3.1.

Admittedly, this definition elides a small mathematical technicality. Sometimes, when the sample space \(\Omega\) is uncountably infinite, there are some events that we cannot assign probabilities to. These events are never of any practical significance, so in this book we ignore them and pretend that we can always assign probabilities to every event. A fully rigorous treatment of Kolmogorov’s axiomatic definition is beyond the scope of this book and typically not covered until graduate-level studies in probability theory.
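
While a fully rigorous treatment is out of reach, the finite case is easy to experiment with in code. Below is a sketch of our own (`make_probability_function` is a name we’ve made up) that builds a probability function on a finite sample space from per-outcome probabilities, checking non-negativity and normalization up front; additivity then holds automatically because an event’s probability is defined as the sum over its outcomes:

```python
import math

def make_probability_function(outcome_probs):
    """Build P from a dict mapping each outcome to its probability."""
    if any(p < 0 for p in outcome_probs.values()):
        raise ValueError("Axiom 1 violated: a probability is negative")
    if not math.isclose(sum(outcome_probs.values()), 1.0):
        raise ValueError("Axiom 2 violated: probabilities must sum to one")

    def P(event):
        # Additivity: an event's probability is the sum of the
        # probabilities of the outcomes it contains.
        return sum(outcome_probs[w] for w in event)

    return P
```

We’ll reuse this sketch in the examples that follow.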

Unlike Bernoulli’s naive definition of probability, which tells us explicitly how to compute the probability of any event, Kolmogorov’s definition requires that we choose a probability function \(P\) that is suitable for the specific experiment at hand. The example below illustrates the difference between using Bernoulli’s and Kolmogorov’s definitions.

Example 3.11 (Rolling a loaded die) Suppose a gambler adds more weight to the face of a six-sided die that’s opposite the six, thereby making the six slightly more likely to come up. The possible outcomes of the die roll are \(\Omega = \{1, 2, 3, 4, 5, 6\}\). If we used Bernoulli’s naive definition of probability, the probability of rolling any particular number would need to be \(1/6 \approx 0.167\), which we know is incorrect.

Instead, Kolmogorov’s definition allows us to choose a probability function \(P\) that is more suitable for this particular setting. We could, for example, opt to select the probability function \(P\) that assigns probabilities

\[\begin{align*} & P(\{1\}) = 0.10, \ \ P(\{2\}) = 0.15 , \ \ P(\{3\}) = 0.15, \\ & P(\{4\}) = 0.15, \ \ P(\{5\}) = 0.15, \ \ P(\{6\}) = 0.30,\\ \end{align*}\]

to the different individual outcomes. Because these probabilities sum to one, Exercise 3.2 tells us that a unique such probability function exists. We point out that for \(P\) to satisfy Kolmogorov’s axioms, it is crucial that these probabilities sum to exactly one:

\[\begin{align*} 1 &= P(\Omega) & \text{(Axiom 2)}\\ &= P(\{1\} \cup \dots \cup \{6\}) & \text{(write } \Omega \text{ as a disjoint union)}\\ &= P(\{1\}) + \dots + P(\{6\}) & \text{(Axiom 3)} \end{align*}\]

Under Kolmogorov’s definition, we can compute the probability of rolling an even number by writing the event of rolling an even number as the disjoint union \(\{2\} \cup \{4\} \cup \{6\}\) and then applying Axiom 3 (see Definition 3.4): \[ \begin{align*} P(\text{roll an even number}) &= P(\{2\} \cup \{4\} \cup \{6\}) \\ &= P(\{2\}) + P(\{4\}) + P(\{6\}) & \text{(Axiom 3)} \\ &= 0.15 + 0.15 + 0.3 & \text{(choice of } P \text{)} \\ &= 0.6. & \text{(simplify)} \end{align*} \] Similar reasoning allows us to compute the probability of any event under \(P\).

Bernoulli’s definition would have required that the probability of rolling an even number be \(0.5\), which is too low given the nature of the gambler’s die.
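
Using the hypothetical `make_probability_function` sketch from above, Example 3.11’s calculation takes only a few lines:

```python
# The loaded die from Example 3.11.
P = make_probability_function({1: 0.10, 2: 0.15, 3: 0.15,
                               4: 0.15, 5: 0.15, 6: 0.30})

print(P({2, 4, 6}))           # approx. 0.6: probability of an even number
print(P({1, 2, 3, 4, 5, 6}))  # approx. 1.0: Axiom 2
```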

As suggested by Example 3.11, Kolmogorov’s definition is a strict generalization of Bernoulli’s. Theorem 3.1 tells us that when \(\Omega\) has finitely many outcomes, Bernoulli’s naive definition of probability corresponds to just one of the probability functions we could use under Kolmogorov’s definition. Specifically, it corresponds to the unique probability function that makes each outcome equally likely. As such, although we’ll only use Kolmogorov’s definition moving forward, supposing that a finite \(\Omega\) has equally likely outcomes is equivalent to using Bernoulli’s definition. We ask you to prove Theorem 3.1 yourself in Exercise 3.3.

Theorem 3.1 (Naive definition of probability is a probability function) Consider a sample space \(\Omega\) with finitely many outcomes. The function \[P(A) = |A|/|\Omega| \] is not only a probability function, but it is the only probability function that assigns equal probabilities to all events containing just a single outcome. Furthermore, the probability it assigns to these events is \(1/|\Omega|\).

As promised, the added flexibility from Kolmogorov’s definition allows us to deal with the problematic settings that we discussed at the start of the chapter. First, as we’ve already seen in the loaded die example (Example 3.11), we now can allow for outcomes that are not equally likely.

Example 3.12 (Predicting the sex of the next born baby) What’s the probability that the next baby that’s conceived and born into the world will be male? Considering the outcomes \(\Omega = \{\text{male}, \text{female}\}\), we can use the following probability function \(P\), which we define explicitly on each subset of \(\Omega\): \[ P(\{\text{male}, \text{female}\}) = 1, \ \ P(\{\text{male} \}) = 0.51, \ \ P(\{\text{female} \}) = 0.49, \ \ P(\emptyset) = 0. \]

It’s not hard to verify that \(P\) is indeed a valid probability function in the sense of Definition 3.4 (see Exercise 3.2). Unlike our first attempt (Example 3.1), \(P\) provides probabilities that are reflective of reality.

Kolmogorov’s definition also allows us to accommodate experiments with infinitely many outcomes.

Example 3.13 (Throwing a dart at a dartboard) Let’s revisit the example of a complete novice throwing a dart at a dartboard. Consider regions like the \(\textcolor{purple}{\text{purple}}\) and \(\textcolor{orange}{\text{orange}}\) ones we’ve drawn in the top right and bottom left corners, respectively:

We’ll use a probability function \(P\) such that the probability of a dart landing in some region is proportional to the region’s area, e.g.,

\[\begin{align*} &P(\text{dart lands in \textcolor{purple}{purple} region}) = \frac{\text{area of \textcolor{purple}{purple} region}}{\text{area of dartboard}}, \\ &P(\text{dart lands in \textcolor{orange}{orange} region}) = \frac{\text{area of \textcolor{orange}{orange} region}}{\text{area of dartboard}}. \end{align*}\]

This corresponds to all the outcomes being “equally likely” (there’s no preference for a particular part of the dartboard). We give intuition, but not a formal proof, for why \(P\) should satisfy Definition 3.4’s axioms:

  1. Axiom 1: Areas are never negative, so our probabilities will all be non-negative.

  2. Axiom 2: The probability of the whole sample space is given by

\[ P(\Omega) = \frac{\text{area of dartboard}}{\text{area of dartboard}} = 1. \]

  3. Axiom 3: Because the regions have no overlap, the event that the dart hits the \(\textcolor{orange}{\text{orange}}\) region and the event that it hits the \(\textcolor{purple}{\text{purple}}\) region are disjoint. In line with this,

\[\begin{align*} &P(\text{dart lands in \textcolor{orange}{orange} region} \textrm{ or }\text{dart lands in \textcolor{purple}{purple} region}) & \\ &= P(\text{dart lands in region covered by \textcolor{orange}{orange} and/or \textcolor{purple}{purple}}) & (\text{same event})\\ &= \frac{\text{area of region covered by \textcolor{orange}{orange} and/or \textcolor{purple}{purple} }}{\text{area of dartboard}} & \text{(choice of } P \text{)}\\ &= \frac{\text{area of \textcolor{orange}{orange} region}}{\text{area of dartboard}} + \frac{\text{area of \textcolor{purple}{purple} region}}{\text{area of dartboard}} & \text{(no overlap)}\\ &= P(\text{dart lands in \textcolor{orange}{orange} region}) + P(\text{dart lands in \textcolor{purple}{purple} region}). & \text{(choice of } P \text{)} \end{align*}\]

Equipped with \(P\), we can compute the probability that the novice thrower will hit the inner bullseye. A standard dartboard has a radius of \(8.875\) (inches) and an inner bullseye radius of \(0.25\) (inches). The probability of the dart hitting the inner bullseye is therefore \[P(\text{inner bullseye}) = \frac{\pi \cdot (0.25)^2}{\pi \cdot 8.875^2} \approx 0.0008,\] or just under a tenth of a percent.
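
As a sanity check on this calculation, we can simulate the novice’s throws. The sketch below draws points uniformly from the board via rejection sampling (all outcomes “equally likely”) and counts how often they land in the inner bullseye:

```python
import random

RADIUS, BULLSEYE = 8.875, 0.25  # board and inner-bullseye radii, in inches

def random_point_on_board():
    # Rejection sampling: draw uniformly from the bounding square and
    # keep only points that actually land on the board.
    while True:
        x, y = random.uniform(-RADIUS, RADIUS), random.uniform(-RADIUS, RADIUS)
        if x * x + y * y <= RADIUS * RADIUS:
            return x, y

trials = 1_000_000
hits = sum(x * x + y * y <= BULLSEYE ** 2
           for x, y in (random_point_on_board() for _ in range(trials)))
print(hits / trials)  # approx. 0.0008, matching the area ratio above
```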

Interestingly, many events have zero probability under this probability function. For example, the probability of the dart hitting somewhere exactly on the \(x\)-axis is zero, because the area of any one-dimensional line is zero. Considering that this line is infinitely thin, it may not seem unreasonable that the dart will never exactly hit it. But even further, the same is true for the probability of hitting any specific \((x, y)\) point (e.g., the origin). This may seem surprising or even incorrect, because we know the dart is definitely going to land at some \((x, y)\) point. Still, the probability that it will land at any specific \((x, y)\) point is zero. Although uncomfortable to grapple with, this is an unavoidable and fundamental fact about probability theory!

A faulty argument

Many people initially feel that the probability of the dart hitting a specific \((x, y)\) point being zero must lead to a contradiction. They give the following faulty argument: The sample space \(\Omega\) can be written as the union of all the different events \(\{(x, y)\}\) that contain exactly one \((x, y)\) point in \(\Omega\). Since these events are disjoint, Axiom 3 tells us that \(P(\Omega)\) is the sum of the probabilities of these events. These events all have probability zero, so we must have \(P(\Omega) = 0\). This contradicts Axiom 2’s requirement that \(P(\Omega) = 1\). This argument is false, however, because there are an uncountably infinite number of \((x, y)\) points in \(\Omega\) (see Exercise 3.5). Axiom 3 only applies for countably infinite collections of disjoint sets, and we can assure you that no such contradiction exists!

As Example 3.13 hints, formally verifying that a probability function \(P\) satisfies all of Kolmogorov’s axioms when \(\Omega\) is uncountably infinite is difficult. For the purposes of this book, the level of rigor presented in Example 3.13 will suffice. Henceforth (unless otherwise stated), when we provide you with a probability function, you may assume that it’s uniquely defined on every event and doesn’t violate any of Kolmogorov’s axioms.

As a final benefit of Kolmogorov’s definition, there are cases where Bernoulli’s definition works for some appropriately chosen sample space, but Kolmogorov’s definition allows for the same analysis with a much simpler sample space. We provide such an example below.

Example 3.14 (Randomly chosen blood type) Suppose we want to know the probability that a uniformly randomly selected person from the United States (each person is equally likely to be selected) has a specific blood type (e.g., \(O-\)). Letting \(N\) be the number of people in the United States, we imagine using the sample space \(\Omega = \{1, \dots, N \}\) where each outcome acts as an ID number for the person we select. With this sample space, Bernoulli’s definition will give reasonable probabilities. But the event of selecting someone with \(O-\) blood is, technically speaking, the set of IDs of people that have \(O-\) blood. This is perhaps a bit overly complicated.

If we instead use Kolmogorov’s definition with the sample space \[ \Omega = \{O+ ,O-, A+, A-, B+, B-, AB+, AB-\} \] whose outcomes are the different blood types, then the probability function \(P\) that assigns probabilities \[\begin{align*} & P(\{O+\}) = 0.38, \ \ P(\{O-\}) = 0.07, \ \ P(\{A+\}) = 0.34, \ \ P(\{A-\}) = 0.06\\ & P(\{B+\}) = 0.09, \ \ P(\{B-\}) = 0.02, \ \ P(\{AB+\}) = 0.03, \ \ P(\{AB-\}) = 0.01\\ \end{align*}\] to each individual blood type will also give probabilities that are reflective of reality. Now, the event of selecting someone with \(O-\) blood is simply \(\{O-\}\). Given the initial question we posed, this set-up feels simpler and more informative.
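
This probability function is again easy to encode with the `make_probability_function` sketch from earlier in the chapter:

```python
P = make_probability_function({
    "O+": 0.38, "O-": 0.07, "A+": 0.34, "A-": 0.06,
    "B+": 0.09, "B-": 0.02, "AB+": 0.03, "AB-": 0.01,
})

print(P({"O-"}))                     # 0.07: the event of O- blood
print(P({"O-", "A-", "B-", "AB-"}))  # approx. 0.16: any Rh-negative type
```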

3.3 What Does Probability Mean?

To close out the chapter, we discuss why Kolmogorov’s axiomatic definition is an appropriate definition of probability. A good axiomatic definition should identify a minimal set of axioms, or fundamental assumptions, about how probabilities should behave. Still, this minimal set should be extensive enough that it implies all the relevant properties that probabilities should have. Whether or not you believe Kolmogorov’s axioms satisfy these criteria depends on what exactly you believe probabilities represent.

In what follows, we give a brief (indeed, a whole book could be written just on this matter!) description of frequentism and Bayesianism, the two most popular philosophies on how to interpret probabilities. We argue that frequentists and Bayesians (1) both agree that probabilities should satisfy Kolmogorov’s axioms, and (2) can both derive the properties they believe probabilities should have from these axioms. An important consequence is that, whether you choose to adopt the frequentist philosophy or the Bayesian one, this book will be equally useful to you! The foundational tools that Bayesians and frequentists use to compute probabilities are exactly the same. The only difference is how those probabilities are ultimately interpreted. Each philosophy has its strengths, and different folks have their own opinions about which is better when. We leave it to you to decide when and where to be frequentist versus Bayesian.

3.3.1 Frequentism

For a frequentist to ascribe a probability \(P(A)\) to some event \(A \subseteq \Omega\), a number of conditions must be true. First, they require that the experiment can be repeated again and again over many trials. Second, they require that, as the experiment is repeated, the proportion of trials where \(A\) happens approaches some limiting value. Lastly, they require that, throughout the repeated trials, there is no pattern to when \(A\) happens or doesn’t happen. Formalizing precisely what this means is very tricky and requires extensive mathematics, so we’ll instead just give a simple motivating example to get the idea across. If the limiting proportion of trials where \(A\) happens is \(0.5\) and we mark trials where \(A\) happens with \(1\) and trials where it doesn’t with a \(0\), then the results over repeated trials should look something like

\[0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, \dots \]

and not like

\[0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, \dots \]

If all these conditions hold, then a frequentist considers the probability of \(A\) to be the proportion of times that \(A\) happens in the limit of infinite repeated trials: \[P(A) = \lim_{\text{\# trials} \rightarrow \infty} \frac{\text{\# trials where } A \text{ happens} }{\text{\# trials}}. \] Equipped with this interpretation, we can justify why a frequentist should believe that probabilities obey Kolmogorov’s three axioms.

  1. Axiom 1: Since \(P(A)\) is the limiting value of a frequency, it must be non-negative.
  2. Axiom 2: The event \(\Omega\) happens on every trial, so \[P(\Omega) = \lim_{\text{\# trials} \rightarrow \infty} \frac{\text{\# trials where } \Omega \text{ happens} }{\text{\# trials}} = \lim_{\text{\# trials} \rightarrow \infty} \frac{\text{\# trials}}{\text{\# trials}} = \lim_{\text{\# trials} \rightarrow \infty} 1 = 1.\]
  3. Axiom 3: Consider two disjoint events \(\textcolor{blue}{A}\) and \(\textcolor{red}{B}\). Since the events never happen at the same time, we can mark trials where \(\textcolor{blue}{A}\) happens with \(\textcolor{blue}{1}\), \(\textcolor{red}{B}\) happens with \(\textcolor{red}{1}\), and neither happens with \(0\). The results from repeated trials will look something like \[\textcolor{blue}{1}, \textcolor{blue}{1}, 0, 0, \textcolor{red}{1}, \textcolor{blue}{1}, \textcolor{red}{1}, \textcolor{red}{1}, \textcolor{blue}{1}, 0, 0, \textcolor{blue}{1}, 0, \textcolor{red}{1}, \textcolor{blue}{1}, 0, 0, \textcolor{blue}{1}, 0, 0, 0, \textcolor{blue}{1}, 0, 0, \dots \] Therefore,

\[ \begin{align*} &P(\textcolor{blue}{A} \cup \textcolor{red}{B}) \\ &= \lim_{\text{\# trials} \rightarrow \infty} \frac{\text{\# trials where } \textcolor{blue}{A} \text{ or } \textcolor{red}{B} \text{ happens} }{\text{\# trials}} & \text{(frequentist interpretation)}\\ &= \lim_{\text{\# tr.} \rightarrow \infty} \left(\frac{\text{\# tr. where } \textcolor{blue}{A} \text{ happens} }{\text{\# trials}} + \frac{\text{\# tr. where } \textcolor{red}{B} \text{ happens} }{\text{\# trials}} \right) & \text{(disjoint events)} \\ &= \lim_{\text{\# tr.} \rightarrow \infty} \frac{\text{\# tr. } \textcolor{blue}{A} \text{ happens} }{\text{\# trials}}+\lim_{\text{\# tr.} \rightarrow \infty}\frac{\text{\# tr. } \textcolor{red}{B} \text{ happens} }{\text{\# trials}} & \hspace{-100pt} \text{(property of limits)} \\ &= P(\textcolor{blue}{A}) + P(\textcolor{red}{B}) & \text{(frequentist interpretation)} \end{align*} \]

In our argument for Axiom 3 we only considered two disjoint sets, but the same argument applies for an arbitrary finite number of them. This, however, only justifies the need for finite additivity, not countable additivity as is required by Axiom 3 (see the note in Definition 3.4 for a reminder on the difference between finite and countable additivity). It turns out countable additivity is needed to prove many desirable properties of probabilities, and we will ourselves make use of countable additivity later on.

No two axioms together imply the third, so they are all truly needed. In the next chapter, we will see that Kolmogorov’s axioms imply many of the properties a frequentist would like probabilities to have, suggesting that the axioms are also sufficiently extensive.
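
To build intuition for the limiting-frequency interpretation, here is a small simulation (our own sketch) that tracks the running proportion of heads over many fair coin flips; the proportion drifts toward \(0.5\) as the number of trials grows:

```python
import random

flips = [random.randint(0, 1) for _ in range(100_000)]  # 1 marks heads

# Running proportion of heads after n trials, for increasing n.
for n in [10, 100, 1_000, 10_000, 100_000]:
    print(n, sum(flips[:n]) / n)  # values settle near 0.5
```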

Frequentism is the way most people interpret probabilities, in large part because it provides such an intuitive interpretation. But, it has its drawbacks. In particular, it is not well suited for experiments that are hard to repeat. For example, if we ask the question “what’s the probability that the sun will explode tomorrow?”, can a frequentist give an answer? Tomorrow only happens once, and the sun will either explode or it won’t. The classic workaround is to imagine infinite hypothetical copies of our universe. As the laws of nature play themselves out, the sun will explode tomorrow in a (very, very small) proportion of them. That proportion is what a frequentist’s probability represents. Some find this workaround satisfactory, while others believe that frequentists simply cannot assign probabilities to these sorts of one-off events.

Even when working with seemingly repeatable experiments, frequentism runs into some philosophical issues. Consider, for example, the classic experiment of tossing a coin. If we don’t toss the coin exactly the same way, in exactly the same place, are we really repeating the experiment? Is it ever possible to exactly repeat an experiment? Even if we turn to the infinite universe workaround and imagine that the coin is flipped exactly the same way in each universe, classical mechanics tells us that the flip will have the identical outcome in each universe (see Figure 3.6). When repeated exactly, the experiment has no randomness! In the case of coin flipping, as well as many other experiments, the experimental outcome’s sensitivity to imperceptible or unknowable changes in the experiment’s initial conditions is actually what gives the appearance of randomness. Except for experiments whose outcomes vary because of true, fundamental randomness (like randomness at the quantum level, if you believe in that sort of thing), a precise frequentist definition of probability falls apart.

Figure 3.6: Coin flip outcome as function of initial spin rate (y-axis, in 1/seconds) and air time (x-axis). Spin rate corresponds to initial angular velocity, air time to initial upward velocity. Initial conditions leading to heads are hatched, tails are white. Coin flips appear random when there is large initial angular and/or vertical velocity, because the outcome is very sensitive to the initial conditions in that regime.

Still, frequentism is an incredibly important and useful philosophy. We liken it to Newton’s theory of gravity. Is it true that there is a force of attraction between objects as Newton suggested? Einstein’s theory of relativity, which makes the same predictions as Newton’s theory in standard cases but more accurate predictions when supermassive objects are present, says no. Einstein’s theory claims that gravity is a result of objects warping space-time, not attracting one another (although physicists today would point out that Einstein’s theory fails at the quantum level, and is thus also “incorrect”). Still, in appropriate settings, Newton’s theory of gravity accurately captures how the world behaves, despite being “false”. In fact, it is Newton’s theory that allowed us to land a rocket on the moon. So, is coin flipping truly random? No, it’s not. But we may as well pretend it’s random, just as we often pretend that objects exert attractive forces on one another. Like Newton’s theory of gravity, frequentism provides a view of the world that, although “incorrect”, helps us make useful predictions and draw accurate inferences when used in the appropriate settings. If we’re studying something that appears to behave randomly in the way that frequentists describe (e.g., coin flips), then frequentism is an apt and useful viewpoint to adopt. It’s up to us to determine when that is and isn’t the case.

Computer generated randomness and frequentism

Just as with coin flips, many computer random number generators (RNGs) are totally deterministic. If you’re privy to the right information (e.g., the random seed, the formula relating previous numbers to the next one), you can exactly predict the next value. In fact, some cunning gamblers used to (and likely still do) infer this information by observing a few hundred pulls of a slot machine, and would then predict when the next big jackpot would hit. Still, RNGs are designed to produce sequences that appear random by a frequentist’s standards, and for most typical cases they do so very well. When working with randomness from a computer, you can almost always be confident that being frequentist is a reasonable choice.
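
A quick demonstration: seeding Python’s generator identically reproduces the exact same “random” draws.

```python
import random

random.seed(42)
first = [random.random() for _ in range(3)]

random.seed(42)  # same seed, same "random" numbers
second = [random.random() for _ in range(3)]

print(first == second)  # True: nothing is random once the seed is known
```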

3.3.2 Bayesianism

Unlike frequentists, Bayesians consider the probability \(P(A)\) to be a degree of belief about whether the event \(A\) will happen or not. One way of making this concrete is relating beliefs to betting: a person’s degree of belief determines whether or not they are willing to buy/sell certain wagers. Let’s review what it means to buy/sell a wager. If I buy an \(x\) dollar wager on \(A\) happening that pays \(y\) dollars, I profit \(y\) dollars if \(A\) happens and lose \(x\) dollars if it doesn’t. Selling a wager corresponds to taking the other side of the bet: if I sell an \(x\) dollar wager on \(A\) happening that pays \(y\) dollars, then I lose \(y\) dollars if \(A\) happens and profit \(x\) dollars if it doesn’t.

Now, suppose I think there’s a \(50\%\) chance of \(A\) happening (i.e., I believe that \(P(A) = 0.5\)). That means I believe it’s equally likely for \(A\) to happen or not happen. To a Bayesian, this has nothing to do with frequencies. It simply means that I’ll accept 1-to-1 wagers on \(A\) happening versus it not happening (or wagers that are even more favorable to me). Specifically, I’d buy a \(\$1\) wager on \(A\) happening so long as it pays at least \(\$1\). If it pays less than \(\$1\), I won’t buy it. I’d also sell a \(\$1\) wager on \(A\) happening that pays at most \(\$1\), but not one that pays more than \(\$1\).

If I instead assign the probability \(P(A) = 0.2\) to \(A\), then the wagers I’m willing to buy/sell change. I now believe the event is four times more likely not to happen than to happen. Concretely, I’ll be willing to buy a \(\$1\) wager on \(A\) happening so long as it pays at least \(\$4\), and sell a \(\$1\) wager on \(A\) happening so long as it pays at most \(\$4\). In this view, the probability of an event simply determines the tipping point at which a person is willing to buy/sell different wagers on the event happening.
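
This tipping point is easy to express in code. Under the belief \(P(A) = p\), buying an \(x\) dollar wager that pays \(y\) has expected profit \(py - (1-p)x\), which is non-negative exactly when \(y \geq x(1-p)/p\). (A small sketch; the function name is our own.)

```python
def minimum_acceptable_payout(p, stake):
    # Smallest payout that makes buying a `stake` dollar wager on A
    # acceptable when you believe P(A) = p.
    return stake * (1 - p) / p

print(minimum_acceptable_payout(0.5, 1))  # 1.0: 1-to-1 odds
print(minimum_acceptable_payout(0.2, 1))  # approx. 4.0: must pay at least $4
```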

As you may expect, the probabilities people assign to events should satisfy some properties. If they don’t, that person can be easily exploited.

Example 3.15 (Irrational beliefs cost money) Suppose there’s a sports game tomorrow, and the home team will either win or lose the game; there’s no possibility of a tie. Our friend says that they think there’s an 80% chance the home team will win, but only a 10% chance that the away team will win. Something feels off. Let’s examine why these beliefs are, in a strong sense, unacceptable.

Let \(A\) be the event that the home team wins. Its complement \(A^c\) is the event that the away team wins. We will buy two wagers from our friend:

  1. First wager: Because our friend believes that \(P(A) = 0.8\), they think the home team is four times more likely to win than to lose. We can buy a \(\$6\) wager from them that pays \(\$1.50\) if the home team wins.

  2. Second wager: Because our friend believes that \(P(A^c) = 0.1\), they think the away team is nine times more likely to lose than win. We can buy a \(\$1\) wager from them that pays \(\$9\) if the away team wins.

Now let’s compute our net profit/loss depending on whether the home team wins or loses:

  1. Home team wins: We make a \(\$1.50\) profit from the first wager and incur a \(\$1\) loss from the second wager, leaving us with a \(\$0.50\) profit.

  2. Away team wins: We make a \(\$9\) profit from our second wager and incur a \(\$6\) loss from our first wager, leaving us with a \(\$3\) profit.

No matter the outcome of tomorrow’s game, our friend is guaranteed to lose money and we’re guaranteed to profit! By scaling up these bets, we can make our profit (and therefore our friend’s loss) arbitrarily large.

It indeed turns out that our friend’s probabilities don’t satisfy Kolmogorov’s axioms (Definition 3.4). We’ll see early in the next chapter that, for their beliefs to satisfy the axioms, \(P(A)\) and \(P(A^c)\) must sum to one.
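
For concreteness, here is the arbitrage arithmetic from Example 3.15 written out in code:

```python
# Wager 1: we risk $6, and collect $1.50 if the home team wins.
# Wager 2: we risk $1, and collect $9 if the away team wins.
profit_if_home_wins = 1.50 - 1.00  # win wager 1, lose wager 2
profit_if_away_wins = 9.00 - 6.00  # win wager 2, lose wager 1

print(profit_if_home_wins, profit_if_away_wins)  # 0.5 3.0: profit either way
```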

We say that our friend in Example 3.15 is susceptible to arbitrage: there is a finite collection of wagers that they’re willing to buy/sell that guarantees them a loss.

Theorem 3.2, which we state without proof, tells us that you can avoid being arbitraged by ensuring that your beliefs satisfy Kolmogorov’s axioms.

Theorem 3.2 (Coherence and Kolmogorov’s axioms) If a Bayesian’s beliefs satisfy the Kolmogorov axioms, then they cannot be arbitraged.

It turns out that just finite additivity, not countable additivity, is sufficient to prevent arbitrage (see the note in Definition 3.4 for a reminder on the difference between finite and countable additivity). Finite additivity and the first two axioms are also necessary to prevent arbitrage: a person can always be arbitraged if their beliefs don’t satisfy the first two axioms and finite additivity. For this reason, some Bayesians, most notably Italian probabilist Bruno de Finetti, have supported replacing countable additivity with just finite additivity. Countable additivity, however, plays an essential role in proving many properties that other probabilists deem important. We ourselves will make use of countable additivity later on.

As Theorem 3.2 illustrates, Bayesianism comes with a nice and clean philosophical justification. As an added bonus, Bayesianism also allows us to assign probabilities to any event (unlike frequentism). But the variety of Bayesianism we’ve presented comes with one big downside: it is completely subjective. If you, as a federal regulatory agency, ask two different subjective Bayesian experts for the probability that a drug will have fatal side effects, they may give you two vastly different probabilities. Both of them, however, will be able to argue that their judgements are rational. Frequentism, on the other hand, is built around something that is more objective (albeit philosophically questionable): limiting frequencies. There are variations of the Bayesian philosophy that are less subjective, but they come with their own criticisms, and we do not cover them in this book.

3.4 Exercises

Exercise 3.1 Coming soon!

Exercise 3.2 Coming soon!

Exercise 3.3 Coming soon!

Exercise 3.4 Coming soon!

Exercise 3.5 Coming soon!