1  Naive Definition of Probability

It’s 1964 in Los Angeles. A blonde-haired woman with a ponytail snatches another woman’s handbag. She flees the scene, but is spotted entering a yellow car driven by a Black man with a beard and a mustache. Later, you spot a woman with blonde hair in a ponytail sharing a yellow car with a bearded and mustachioed Black man. Is this a coincidence? Or, given the extensive set of matching characteristics, is this sufficient evidence to implicate the couple you’ve spotted?

This exact situation was the subject of the famed court case People v. Collins. The prosecution contended that the probability of such a couple existing was so astronomically low (one in 12 million) that the accused must be the actual culprits. The argument compelled the jurors, who reached a guilty verdict. The case was then appealed to the California Supreme Court, where the defense pointed out that the relevant probability was not that of such a couple existing, but that of more than one such couple existing. If multiple such couples existed, the prosecution couldn’t be sure that the accused was the guilty party. For a city the size of Los Angeles, the defense computed this latter probability to be over 40% (even while granting the prosecution’s own questionable assumptions). Persuaded by the defense’s argument, the court found the evidence insufficient to implicate the accused and overturned the guilty verdict.

Probability theory, the subject of this book, provides a mathematical framework for reasoning about and making decisions under uncertainty. Probabilities allow us to quantify the randomness in real-world systems. We encounter probability statements in everyday life, such as:

Also, the ability to quantitatively reason about uncertainty is crucial in many high-impact fields. We provide just a few examples below:

The impacts of these judgments can range from millions of dollars to millions of lives.

Yet probability theory has origins in a more frivolous, but timeless, endeavor: games. Although games of chance are as old as civilization itself (Figure 1.1 shows a 4,000-year-old die that looks surprisingly modern), it was not until the 16th century that such games were analyzed systematically.

Figure 1.1: Six-sided die from the Indus Valley Civilization, c.2000 BCE

1.1 A Need for a Mathematical Framework

Figure 1.2: Gerolamo Cardano

Gerolamo Cardano (1501-1576) was an Italian mathematician. He is best known for his 1545 book Ars Magna, in which he worked with negative numbers and introduced calculations involving imaginary numbers.

In the same book, he published formulas for the roots of cubic and quartic polynomials. These formulas extend to 3rd and 4th degree polynomials the familiar quadratic formula, which tells us that the roots of the quadratic polynomial \(ax^2 + bx + c\) are \(x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}\).

One of the first people to analyze games of chance (and hence, one of the fathers of probability theory) was Gerolamo Cardano (1501-1576). In 1564, Cardano wrote a book, Liber de ludo aleae (“Book on Games of Chance”), in which he analyzed problems like the following.

Example 1.1 (Cardano’s problem) How many throws of a fair die do we need to have a \(50\%\) chance of at least one six?

Cardano’s incorrect solution

The probability of a six is \(1/6\). With \(k=3\) throws, a six should come up \(3 \times 1/6 = 1/2\) of the time.

Although Cardano’s solution sounds plausible, we will see that it is incorrect. It turns out that \(k=3\) throws gives only a 42.1% chance of at least one six, and it takes \(k=4\) throws to achieve a better than 50% chance.
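For readers who like to verify such claims by computer, here is a minimal Python sketch (the function name and structure are our own illustrative choices, not part of the historical discussion) that enumerates all equally likely sequences of \(k\) rolls and counts those containing at least one six, in the spirit of the counting arguments developed in the next chapter.

```python
from itertools import product
from fractions import Fraction

def prob_at_least_one_six(k):
    """Exact chance of at least one six in k rolls, by enumerating all 6**k equally likely outcomes."""
    outcomes = list(product(range(1, 7), repeat=k))   # the sample space
    favorable = [o for o in outcomes if 6 in o]       # outcomes containing a six
    return Fraction(len(favorable), len(outcomes))    # |A| / |Omega|

for k in (3, 4):
    p = prob_at_least_one_six(k)
    print(k, p, float(p))   # k=3 -> 91/216 ~ 0.421,  k=4 -> 671/1296 ~ 0.518
```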

Figure 1.3: Gottfried Leibniz

Gottfried Leibniz (1646-1716) was a German mathematician. He is best known for developing differential and integral calculus, contemporaneously with but independently of Isaac Newton.

Another early pioneer in probability theory was Gottfried Leibniz (1646-1716). Leibniz is best remembered today as the co-inventor of calculus (with Isaac Newton). Leibniz is also credited with documenting the first modern binary number system, an achievement for which some consider him to be the first computer scientist and information theorist. His far-reaching accomplishments have led to him being labeled the “last universal genius”.

Leibniz also dabbled in probability theory. In his Opera Omnia (“Complete Works”), he studied the following question.

Example 1.2 (Leibniz’s problem) If we throw two fair dice, which outcome is more likely: an 11 or a 12?

Leibniz’s incorrect solution

Because there is only one way to roll an 11 (one five and one six) and only one way to roll a 12 (two sixes), they are equally likely.

Leibniz’s solution is also wrong, and his mistake is perhaps easier to spot (we point it out later, but see if you can spot it now!). Interestingly, a similar but harder problem (see Exercise 2.1) was correctly answered by renowned astronomer and scientist Galileo Galilei (1564-1642). Had Leibniz been aware of Galileo’s solution, he likely wouldn’t have made this famed mistake!

Figure 1.4: Jean le Rond d’Alembert

Jean le Rond d’Alembert (1717-1783) was a French mathematician who is best known for d’Alembert’s principle, a fundamental statement about the classical laws of motion. He also invented the ratio test for the convergence of a series and is known for his work on the fundamental theorem of algebra, which is sometimes called d’Alembert’s theorem in honor of his pioneering attempt to prove it.

Another famed erroneous probability calculation is that of French physicist and mathematician Jean le Rond d’Alembert (1717-1783). D’Alembert studied the following problem in his 1754 article Croix ou Pile (“Heads or Tails”).

Example 1.3 (D’Alembert’s problem) In two tosses of a fair coin, what’s the probability of seeing at least one heads?

D’Alembert’s incorrect solution

Once a head is observed, there’s no need for a second throw. Thus, the possible observable outcomes are H, TH, TT. We see a head in 2 of these 3 outcomes, so the answer is 2/3.

D’Alembert’s idea to consider the proportion of outcomes where a heads appears was reasonable and in line with contemporary methods. Unfortunately, as we will see, it wasn’t executed correctly.

Our purpose in recounting these historical blunders is threefold. First, they emphasize that probability needs a mathematical foundation. If, over the span of hundreds of years, history’s greatest minds incorrectly computed probabilities when relying on intuition alone, we likely will too. Second, they remind us how far probability theory has come, and how grateful we should be to enjoy the fruits of its development. By the end of this chapter we’ll fully understand Leibniz’s and d’Alembert’s errors, and by the end of the next we’ll be able to compute the probabilities that eluded Cardano. Lastly, and most importantly, these examples illustrate that probability can be tricky. If Leibniz, who invented a subject as complex as calculus, could botch a basic probability calculation, then we might all struggle with the subject. As you go through this book, don’t be discouraged if concepts seem challenging or counterintuitive at first. With enough practice, anyone can master probability theory!

1.2 A Naive Definition of Probability

Figure 1.5: Jacob Bernoulli

Jacob Bernoulli (1654-1705) was a Swiss mathematician. He was an early luminary in a family whose lineage boasts many notable academics. Bernoulli made a number of important contributions to mathematics, including helping found the calculus of variations and discovering the mathematical constant \(e\).

Bernoulli is most famed, however, for his contributions to probability theory. His book Ars Conjectandi, published 8 years after his death, included a derivation of the binomial distribution (discussed in Chapter 8) and, most importantly, the first law of large numbers. Bernoulli, who spent 20 years attempting to prove this law, referred to it as his “golden theorem”. The Bernoulli distribution (also discussed in Chapter 8) is aptly named in his honor.

Jacob Bernoulli’s Ars Conjectandi (“The Art of Conjecturing”), published posthumously in 1713, provided the first mathematically rigorous treatment of probability theory. Although limited by today’s standards, Bernoulli’s treatment is still incredibly insightful and powerful. It will allow us to correctly reason about non-trivial problems, including the ones that duped Cardano, Leibniz, and d’Alembert.

Bernoulli’s definition considers an experiment that has a finite number of possible outcomes, all equally likely.

Definition 1.1 (Sample space) The sample space of an experiment, denoted by \(\Omega\), is the set of all the experiment’s possible outcomes.

Definition 1.2 (Events) An event \(A \subseteq \Omega\) is a subset of the sample space, i.e., it is a collection of some outcome(s).

An event \(A\) “happens” whenever the experiment’s actual realized outcome belongs to the subset \(A\). Our interest lies in determining the probabilities with which particular events happen.

Example 1.4 (Rolling a die) Consider the experiment of rolling a six-sided die. If the die is fair, i.e., it is symmetric and hasn’t been tampered with in some malicious way, then the six possible outcomes \(\Omega = \{1, 2, 3, 4, 5, 6\}\) should be equally likely. We may be interested in the event \(A\) of rolling an even number: \[ A = \{2, 4, 6\}. \] The event \(A\) “happens” whenever we roll a \(2\), \(4\), or \(6\).
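If it helps to see these objects concretely, here is a small Python sketch (purely our own illustration, not part of Bernoulli’s formalism) representing the sample space and the event as sets; the event “happens” exactly when the realized outcome is a member of the set \(A\).

```python
# Sample space and event for one roll of a fair die, written as Python sets.
omega = {1, 2, 3, 4, 5, 6}                  # all possible outcomes
A = {x for x in omega if x % 2 == 0}        # event "roll an even number" = {2, 4, 6}

roll = 4                                    # a hypothetical realized outcome
print(roll in A)                            # True: the event A "happens"
```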

We provide a couple more examples of experiments, sample spaces, and events below. They further illustrate that, although Bernoulli’s formalization is mathematical, these objects are still best understood and most easily discussed in plain English. For the remainder of the book, when we describe something as being random or happening randomly, it means that the different outcomes are equally likely. There are plenty of situations where random outcomes are not equally likely (you’ll see some later in this chapter), and we’ll avoid any confusion by explicitly specifying whenever this is the case.

Example 1.5 (Selecting a card) A slightly more complicated experiment involves randomly selecting a card from a complete deck of playing cards (sans jokers). Denoting each card by its rank (\(2\) through \(10\) for numbered cards, \(\text{J}\) for Jack, \(\text{Q}\) for Queen, \(\text{K}\) for King, and \(\text{A}\) for Ace) and suit (\(\clubsuit\) for Clubs, \(\diamondsuit\) for Diamonds, \(\spadesuit\) for Spades, and \(\heartsuit\) for Hearts), the sample space \[ \Omega = \{\text{A} \heartsuit, \dots, \text{K} \heartsuit, \text{A} \clubsuit, \dots, \text{K} \clubsuit, \text{A} \diamondsuit, \dots, \text{K} \diamondsuit, \text{A} \spadesuit, \dots, \text{K} \spadesuit \} \] consists of the \(52\) cards in the deck. Some possible events of interest are \[ \{ \text{selecting a king} \} = \{\text{K} \clubsuit, \text{K} \diamondsuit, \text{K}\spadesuit, \text{K}\heartsuit\}, \]

\[ \{ \text{selecting a spade} \} = \{\text{A} \spadesuit, 2 \spadesuit, 3 \spadesuit, 4 \spadesuit, 5 \spadesuit, 6 \spadesuit, 7 \spadesuit, 8 \spadesuit, 9 \spadesuit, 10 \spadesuit, \text{J} \spadesuit, \text{Q} \spadesuit, \text{K} \spadesuit \}, \]

\[ \{\text{selecting a heart or diamond face card}\} = \{ \text{J} \heartsuit, \text{Q} \heartsuit, \text{K} \heartsuit, \text{J} \diamondsuit, \text{Q} \diamondsuit, \text{K} \diamondsuit\}, \]

\[ \{\text{selecting the $3$ of spades}\} = \{3 \spadesuit \}. \]

Events with a single outcome

The event \(\{3 \spadesuit \}\), which is a set containing a single outcome, is technically not the same thing as the individual outcome \(3 \spadesuit\).

Example 1.6 (Three random babies) There’s been a mix-up at the hospital and the doctors don’t remember which newborn babies belong to which parents! There are three babies and three sets of parents. Not knowing what to do, the doctors return the babies randomly (each possible way of returning the babies is equally likely). Let’s say the baby \(\text{B}1\) is born to the first couple, \(\text{B}2\) to the second, and \(\text{B}3\) to the third. We can write the sample space of this “experiment” as \[ \Omega = \{(\text{B}1, \text{B}2, \text{B}3), (\text{B}1, \text{B}3, \text{B}2), (\text{B}2, \text{B}1, \text{B}3), (\text{B}2, \text{B}3, \text{B}1), (\text{B}3, \text{B}1, \text{B}2), (\text{B}3, \text{B}2, \text{B}1) \}, \]
where the slot the baby is in determines which couple they are returned to. For example, \((\text{B}1, \text{B}3, \text{B}2)\) corresponds to the first couple receiving their baby, but the second and third couples receiving each other’s babies. There are a number of events we may be interested in: \[\{\text{no baby is returned correctly}\} = \{(\text{B}2, \text{B}3, \text{B}1), (\text{B}3, \text{B}1, \text{B}2)\} \] \[\{\text{some baby is returned correctly} \} = \{(\text{B}1, \text{B}2, \text{B}3), (\text{B}1, \text{B}3, \text{B}2), (\text{B}2, \text{B}1, \text{B}3), (\text{B}3, \text{B}2, \text{B}1) \} \] \[\{\text{all babies are returned correctly} \} = \{(\text{B}1, \text{B}2, \text{B}3)\} \] \[\{\text{exactly two babies are returned correctly} \} = \emptyset\]

The empty set

We use \(\emptyset\) to denote the empty set \(\{\}\).
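As a computational sanity check on Example 1.6, we can enumerate all six outcomes in Python and filter each of the events listed above (the helper name `num_correct` below is our own illustrative choice).

```python
from itertools import permutations

babies = ("B1", "B2", "B3")                      # baby i belongs to couple i
omega = list(permutations(babies))               # the 6 equally likely outcomes

def num_correct(outcome):
    """How many couples received their own baby."""
    return sum(outcome[i] == babies[i] for i in range(3))

none_right  = [o for o in omega if num_correct(o) == 0]
some_right  = [o for o in omega if num_correct(o) >= 1]
all_right   = [o for o in omega if num_correct(o) == 3]
exactly_two = [o for o in omega if num_correct(o) == 2]

print(len(none_right), len(some_right), len(all_right), len(exactly_two))
# 2 4 1 0  -- matching the four events above; "exactly two" is the empty set
```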

Within this framework, Bernoulli defines the probability of an event to be the ratio of the number of outcomes in the event to the total number of outcomes in the sample space.

Definition 1.3 (Cardinality) The cardinality or size of a set \(A\), denoted by \(|A|\), is the number of elements in the set.

Definition 1.4 (Naive definition of probability) For an experiment with sample space \(\Omega\), the probability \(P(A)\) of the event \(A \subseteq \Omega\) is the ratio of the number of outcomes in the event to the total number of outcomes in the sample space: \[P(A) = |A|/|\Omega|. \]
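Definition 1.4 translates directly into a one-line computation. Below is a minimal Python sketch (the helper `naive_prob` and the card encoding are our own illustrative choices) that applies the definition to the deck of Example 1.5.

```python
from fractions import Fraction

def naive_prob(event, sample_space):
    """P(A) = |A| / |Omega|, valid when the outcomes in Omega are equally likely."""
    assert event <= sample_space                  # an event must be a subset of the sample space
    return Fraction(len(event), len(sample_space))

ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["clubs", "diamonds", "spades", "hearts"]
deck = {(rank, suit) for rank in ranks for suit in suits}   # 52 equally likely outcomes
kings = {card for card in deck if card[0] == "K"}

print(naive_prob(kings, deck))                    # 1/13
```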

What exactly is the probability \(P(A)\) meant to represent? For now we’ll adopt the frequentist viewpoint. Frequentists believe that probabilities should measure the long run frequencies of random events. A frequentist would find Bernoulli’s formalization suitable if, as we repeat the experiment again and again, the proportion of experiments in which the event \(A\) happens approaches \(P(A)\). We elaborate more on frequentism and contrast it to Bayesianism, which offers a totally different interpretation of probabilities, in Chapter 3.

An incorrect interpretation of probability

A common misconception is that if the probability of our favored political candidate winning an election is 80% then the candidate will get 80% of the votes. This interpretation is incorrect and aligns with neither the frequentist nor Bayesian philosophy!

So how should one interpret this probability? Even for a frequentist it can be tricky. One interpretation is that, if we were to repeat the election again and again, then our favored candidate would win 80% of the time. We’ll learn more about how to interpret this probability in Chapter 3.

Equipped with Definition 1.4, we can revisit our earlier examples and compute some probabilities.

Example 1.7 (Computing some probabilities)  

  • Consider the die rolling example (Example 1.4) and let \(A\) be the event of rolling an even number. Then, \[ P(A) = |A|/|\Omega| = 3/6 = 1/2 \] Since half the numbers on the die are even, it makes sense that we would roll an even number around half the time if we repeatedly rolled the die (the simulation sketch after this list illustrates this long-run behavior).

  • Consider the card selecting example (Example 1.5) and let \(A\) be the event of selecting a king. Then, \[ P(A) = |A|/|\Omega| = 4/52 = 1/13 \] Since \(1/13\) of the cards are kings, it makes sense that we would pick a king \(1/13\) of the time if we repeatedly drew a random card from the deck.

  • Consider the random babies example (Example 1.6) and let \(A\) be the event that exactly two babies are returned correctly. Then, \[ P(A) = |A|/|\Omega| = 0/6 = 0 \] It makes sense that we would never return exactly two babies correctly, no matter how many times we repeated the experiment, because there is no way of doing so!
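The frequentist reading above is easy to check empirically. Here is a tiny simulation sketch (the seed and the number of trials are arbitrary choices of ours) showing that the long-run proportion of even rolls settles near the \(P(A) = 1/2\) computed in the first bullet.

```python
import random

random.seed(0)                       # arbitrary seed, for a reproducible illustration
trials = 100_000
even_rolls = sum(random.randint(1, 6) % 2 == 0 for _ in range(trials))

print(even_rolls / trials)           # close to 0.5: the long-run frequency approaches P(A) = 1/2
```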

1.3 More Examples

Even with just Bernoulli’s naive definition of probability, we can make reasonable quantitative judgments about a number of interesting problems involving chance and uncertainty.

Our first example comes from craps, a game of chance played with dice. We provide a brief introduction to craps for those who are unfamiliar with the game.

Figure 1.6: A craps table

Craps is a gambling game played by repeatedly rolling a pair of fair six-sided dice. If the come out roll (the first roll) is a \(2\), \(3\), \(7\), \(11\), or \(12\), the game ends immediately. Otherwise, the come out roll (which must be a \(4\), \(5\), \(6\), \(8\), \(9\), or \(10\)) is set as the point, and the shooter (person rolling the dice) keeps rolling until they either seven out, meaning they roll a \(7\), or they roll the point again.

We consider just a couple of the many bets players can make:

  • Pass-line bet: A pass-line bet pays \(\$1\) for every \(\$1\) wagered. The bettor wins automatically if the come out roll is a \(7\) or \(11\) and loses automatically if it is a \(2\), \(3\), or \(12\). Otherwise, the bettor wins if the shooter rolls the point again before sevening out.

  • Don’t pass bet: A don’t pass bet also pays \(\$1\) for every \(\$1\) wagered, but it is almost the opposite of a pass-line bet. The bettor loses automatically if the come out roll is a \(7\) or \(11\) and wins automatically if it is a \(2\) or \(3\). When the come out roll is a \(12\), there is a push (a tie) and the bettor is simply returned their wager with no profit. Otherwise, the bettor wins if the shooter sevens out before rolling the point again.

Example 1.8 (Winning a pass-line bet on the come out roll) If we make a pass-line bet in craps, what’s the probability of winning on the come out roll?

We’ll treat the game of craps as our “experiment” and imagine that one of the dice we’re playing with is black and the other is white. Since we only care about the come out roll, we can denote all the possible outcomes as \((x, y)\) where \(x\) is black die’s value on the come out roll and \(y\) is the white’s. The sample space \(\Omega\), depicted in Figure 1.7, then contains \(36\) equally likely outcomes: all possible pairs \((x, y)\) where \(x = 1,\dots, 6\) and \(y = 1, \dots, 6\).

The event \(A\) of winning a pass-line bet on the come out roll contains exactly the outcomes \((x, y)\) such that \(x + y = 7\) or \(x + y = 11\). Figure 1.7 highlights all the outcomes in \(A\):

\[ A = \{(1, 6), (6, 1), (2, 5), (5, 2), (3, 4), (4, 3), (5, 6), (6, 5)\}. \]

Therefore, the probability of winning on the come out roll is

\[ P(A) = |A|/|\Omega| = 8/36 = 2/9. \]

Figure 1.7: Sample space for the experiment of rolling two six-sided dice. Outcomes in the event \(A\) are highlighted in red.
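The enumeration behind Figure 1.7 is small enough to reproduce directly in a few lines of Python (the variable names are our own), confirming that \(|A| = 8\) and \(P(A) = 2/9\).

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))         # 36 equally likely (black, white) outcomes
A = [(x, y) for x, y in omega if x + y in (7, 11)]   # come out rolls that win a pass-line bet

print(len(A), Fraction(len(A), len(omega)))          # 8 2/9
```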

We can now also correctly solve Leibniz’s problem and identify his mistake.

Example 1.9 (Solving Leibniz’s problem) Considering the experiment of rolling two fair six-sided dice, Leibniz wanted to know if it is more likely to roll an \(11\) or a \(12\). He himself erroneously claimed that the two were equally likely.

Adopting the sample space depicted in Figure 1.7 from our craps example (Example 1.8), it is easy to spot Leibniz’s mistake. Although he was correct that the only way to roll an \(11\) is to roll one five and one six, there are two ways, \((5, 6)\) and \((6, 5)\), of doing so. In contrast, there’s only one way, \((6, 6)\), to roll a 12. Therefore, the probability of rolling an \(11\),

\[ P(\{\text{rolling an 11}\}) = P(\{(5, 6),(6, 5)\}) = 2/36 = 1/18 \]

is twice that of rolling a \(12\),

\[ P(\{\text{rolling a 12}\}) = P(\{(6, 6)\}) = 1/36. \]
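The same enumeration settles Leibniz’s question by brute force; the short Python sketch below (again, just an illustrative check of ours) counts the ordered pairs summing to \(11\) and to \(12\).

```python
from itertools import product

omega = list(product(range(1, 7), repeat=2))     # 36 equally likely ordered pairs of rolls
elevens = [o for o in omega if sum(o) == 11]     # (5, 6) and (6, 5)
twelves = [o for o in omega if sum(o) == 12]     # only (6, 6)

print(len(elevens), len(twelves))                # 2 1, i.e. probabilities 2/36 and 1/36
```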

In both our study of craps (Example 1.8) and Leibniz’s problem (Example 1.9), we were interested in the sum of the two rolled dice. Why then did we parameterize our sample space in terms of the individual die rolls rather than their sum? Why not use the sample space \(\Omega = \{2, \dots, 12\}\)?

Bernoulli’s definition of probability (Definition 1.4) is only reasonable if the outcomes in our sample space are equally likely. When using it, we need to be careful to pick a sample space for which that is the case. Otherwise, as the example below illustrates, it can lead to some nonsensical conclusions.

Example 1.10 (An overly optimistic lottery computation) Suppose we want to compute the probability of winning the jackpot in the next Mega-Millions lottery. Our “experiment” has two outcomes, winning the jackpot or losing it. Therefore, Definition 1.4 tells us that the probability of winning is \(1/2\).

In the case of rolling two dice, there’s no basis for assuming that the different possible sums of the two rolled dice are equally likely. Conversely, the dice’s symmetric build and the manner in which they are tossed support the assumption that the outcomes in Figure 1.7 should be.

Failing to recognize that certain outcomes were not equally likely was exactly d’Alembert’s mistake.

Example 1.11 (Solving d’Alembert’s problem) D’Alembert wanted to know the probability of seeing at least one head in two consecutive tosses of a fair coin. For the experiment of tossing a fair coin twice, there are four equally likely outcomes: \(\Omega =\{HH, HT, TH, TT\}\). We can compute d’Alembert’s probability of interest to be

\[ P(\text{at least one head}) = P(\{HH, HT, TH\}) = 3/4. \]

How then did d’Alembert arrive at \(2/3\)? He claimed that the experimenter would stop the experiment upon seeing the first head, and the experiment’s possible outcomes are therefore \(\{H, TH, TT\}\). Only \(2\) of these \(3\) outcomes contain a head. D’Alembert, however, failed to recognize that these outcomes are not equally likely and are therefore not suited for the sort of probability computation Bernoulli suggested.

Without relying on Bernoulli’s definition, we provide an intuitive argument for why d’Alembert’s outcomes \(H\) and \(TH\) are not equally likely. For a fair coin, the probability of getting a heads \(H\) on the first flip should be the same as that of getting a tails \(T\). But getting a tails and then a heads \(TH\) in two consecutive flips should certainly be less likely than just getting a tails \(T\) on the first flip, because there are situations in which we get a tails first and NOT a heads on the second (specifically, \(TT\)). Thus it is not reasonable to think that getting a heads \(H\) on the first flip is just as likely as getting a tails and then a heads \(TH\) in two consecutive flips.
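We can also let the computer act out d’Alembert’s stopping rule. The simulation sketch below (the seed and trial count are arbitrary choices of ours) shows that his three outcomes occur with long-run frequencies of roughly \(1/2\), \(1/4\), and \(1/4\), so they are far from equally likely, while a head still appears about \(3/4\) of the time.

```python
import random
from collections import Counter

random.seed(1)                                   # arbitrary seed, for reproducibility
counts = Counter()
for _ in range(100_000):
    first = random.choice("HT")
    if first == "H":
        counts["H"] += 1                         # stop after the first head, as d'Alembert imagined
    else:
        counts["T" + random.choice("HT")] += 1   # a second flip happens only after a tail

print(counts)
# Roughly H: 50%, TH: 25%, TT: 25% -- not equally likely; a head appears about 75% of the time.
```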

Our last example considers a tricky card game that we wager would still fool some great minds today.

Example 1.12 (A tricky card game) A gambler on the side of the road stops you and asks you to play a game involving three cards. One card is black on both sides, one is red on both sides, and one is black on one side and red on the other. The gambler shuffles the cards in a hat, asks you to randomly select one, and then places your selected card on a table. You both see that the side facing up is red, and neither of you has seen the other side. The gambler then says, “The card you selected must either be red on both sides or red on one side and black on the other. The selection was perfectly random, so these two cases should be equally likely. I’ll wager that the other side of your selected card is red”. If the gambler is wrong, they will pay you \(\$1\) for every \(\$1\) you wager. Should you take the bet?

To properly analyze the game we need to know the gambler’s strategy: whenever the side facing up is red the gambler wagers that the opposite side is also red, and whenever it is black they wager the opposite side is also black. We can view the game as an experiment with three equally likely outcomes, each corresponding to the card you select:

\[ \Omega = \{\text{\textcolor{red}{red}-\textcolor{red}{red}, \textcolor{red}{red}-black, black-black}\} \]

The gambler wins whenever the card you pick has the same color on both sides, i.e.,

\[ P(\text{gambler wins}) = P(\{\text{\textcolor{red}{red}-\textcolor{red}{red}, black-black}\}) = 2/3. \]

Since the gambler will win \(2/3\) of the time in the long run, you should not take the bet!
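A quick simulation of the game (the encoding of the cards and the seed below are our own illustrative choices) confirms that, under the gambler’s stated strategy, they win about \(2/3\) of the time.

```python
import random

random.seed(2)                                   # arbitrary seed, for reproducibility
cards = [("red", "red"), ("red", "black"), ("black", "black")]

trials, gambler_wins = 100_000, 0
for _ in range(trials):
    card = random.choice(cards)                  # you pick a card at random
    up, down = random.sample(card, 2)            # a random side ends up facing the table
    if down == up:                               # gambler bets the hidden side matches the visible one
        gambler_wins += 1

print(gambler_wins / trials)                     # close to 2/3
```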

Even after working carefully through Example 1.12, it may be difficult to spot the flaw in the gambler’s purposefully misleading reasoning. But worry not! We’ll clearly pinpoint this flaw in Chapter 5 when we cover conditional probabilities. We’ll even be able to redo Example 1.12’s analysis without specifying the gambler’s strategy.

1.4 Exercises

Coming soon!