The Art of Chance
A Beginner’s Guide to Probability
Preface
This textbook was designed for a two-quarter undergraduate sequence in probability (and some statistics) at Stanford University. Its goal is to introduce the concepts and applications of probability to a broad audience of scientists, social scientists, engineers, mathematicians, statisticians, and data scientists.
Why this Book?
There are already many introductory probability textbooks. Why another one?
The main reason we wrote this book was to better fit the Stanford academic calendar, which is divided into 10-week quarters rather than 15-week semesters. We found it difficult to cover in 10 weeks the same material that schools on the semester system cover in 15 weeks, so we ended up dividing our probability course into two courses.
- The first course corresponds to the first half of this book, “Concepts and Applications”, which is a self-contained treatment of probability essentials. The intuition, the problem solving techniques, and the many applications are all here, as are the puzzles and paradoxes that make the subject lively.
- This part of the book assumes only single-variable calculus, making it more accessible to biology and economics students (who may not have taken multivariable calculus) and even to advanced high-school students.
- The second course corresponds to the second half of this book, “Theory and Techniques”. Because we teach this material over two quarter-length courses, we are able to cover more material than a one-semester course can. We use this extra time to introduce some elements of statistical inference (maximum likelihood estimators, bias and variance, hypothesis testing) that do not appear in traditional probability textbooks but serve as wonderful motivation for abstract probability concepts, such as convolutions, limit theorems, and order statistics.
- This part of the book assumes multivariable calculus and linear algebra, both of which are necessary for a comprehensive understanding of probability and statistics. Linear algebra is not used until the “Multivariate Distributions” chapters, so it is possible to take linear algebra concurrently with a course covering this material.
Distinguishing Features
In addition to the considerations above, we made several other pedagogical decisions in writing this book. We highlight some of these below.
- We first cover all of discrete probability, followed by all of continuous probability (although the book need not be read this way; see below). This means that every concept is treated twice, once for discrete random variables and again for continuous random variables.
- We find that difficult concepts, such as joint distributions and conditional expectation, are easier to grasp if they are first introduced for discrete random variables, without the added complication of calculus.
- One concern is that by separating discrete and continuous random variables, learners may fail to see the connections between them. To help make these connections, the structure of the continuous chapters mirrors the structure of the discrete chapters, with explicit signposting in the continuous chapters directing the reader to the corresponding results for discrete random variables. Also, the final chapters, 26 From Conditionals to Marginals and 27 Conditional Expectations, feature examples that mix discrete and continuous random variables, preparing learners to apply concepts in both settings.
- Calculus is de-emphasized in favor of arguments that offer more statistical intuition. For example:
- The famous continuous families (uniform, exponential, normal) can all be derived from location-scale transformations of a single representative. So we only need to derive the expectation and variance of one representative member of each family; the general formulas then follow from properties of expectation.
- Although we write some double integrals for the sake of completeness, we show how geometry, symmetry, and conditioning allow us to avoid double integrals (or calculus altogether)! See 23 Joint Distributions and 24 Expectations Involving Multiple Random Variables for examples.
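As a concrete illustration of the location-scale idea, the following R sketch checks by simulation that a location-scale transformation of a standard normal shifts the mean and rescales the variance exactly as linearity of expectation predicts. (The values mu = 3 and sigma = 2 are our own illustrative choices, not from the text.)

```r
# Location-scale sketch: if Z is standard normal, with E[Z] = 0 and Var(Z) = 1,
# then X = mu + sigma * Z has E[X] = mu and Var(X) = sigma^2,
# by linearity of expectation -- no new integrals required.
set.seed(42)
z <- rnorm(100000)     # a representative member: the standard normal
mu <- 3                # illustrative location parameter
sigma <- 2             # illustrative scale parameter
x <- mu + sigma * z    # location-scale transformation
mean(x)                # approximately mu = 3
var(x)                 # approximately sigma^2 = 4
```

The same one-line trick yields any member of the family from the representative, which is why only one derivation per family is needed.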
- We favor models that are specified hierarchically (i.e., by specifying first the distribution of \(X\), then the conditional distribution of \(Y | X\)), rather than jointly. We dedicate two entire chapters, 16 From Conditionals to Marginals and 26 From Conditionals to Marginals, to the use of the Law of Total Probability for such models, a topic that many textbooks omit or treat as an afterthought.
- Hierarchical specifications are more common in statistics, especially in Bayesian statistics.
- Hierarchical models provide a more natural way to describe “mixed” distributions that are neither discrete nor continuous.
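A minimal R sketch of such a hierarchical specification (the Poisson/binomial setup and parameter values are our own illustration, not the book's): draw \(N\) first, then draw \(Y\) given \(N\), and compare the simulated mean of \(Y\) to what the Law of Total Expectation predicts.

```r
# Hierarchical model sketch: N ~ Poisson(lambda), then Y | N ~ Binomial(N, p).
# The marginal mean is E[Y] = E[E[Y | N]] = E[N * p] = lambda * p.
set.seed(1)
lambda <- 10
p <- 0.3
n <- rpois(100000, lambda)                 # stage 1: distribution of N
y <- rbinom(100000, size = n, prob = p)    # stage 2: conditional distribution of Y | N
mean(y)                                    # approximately lambda * p = 3
```

Note that the model is simulated exactly as it is specified, stage by stage, which is part of what makes hierarchical specifications so natural.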
- Code snippets in the programming language R are integrated into the exposition of the book.
- R is used to do simulations to motivate concepts.
- R is used to perform calculations that are impractical to do by hand. In the online version of this book, the R snippets can even be run in the browser.
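As a sketch of the second use, consider the classic birthday problem (our illustrative choice, not necessarily one of the book's snippets): the probability that at least two of 23 people share a birthday involves a product of 23 factors that is tedious by hand but a one-liner in R.

```r
# P(at least one shared birthday among k = 23 people), assuming 365 equally
# likely birthdays: one minus the product of the "no collision yet" factors.
k <- 23
p_no_match <- prod((365 - 0:(k - 1)) / 365)  # P(all k birthdays distinct)
1 - p_no_match                               # approximately 0.507
```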
How to Use this Book?
For Instructors
Part One
Each chapter can be covered thoroughly in an 80-minute lecture or outlined in a 50-minute lecture. At Stanford, we cover this material in a 10-week quarter with three 50-minute lectures per week. Schools on a 15-week semester system would be able to cover this material more completely (or cover a selection of topics from Part Two).
We have designed the book to be modular so that chapters can be read in any (reasonable) order. For example, one pedagogical decision is whether joint distributions should be covered before or after expected value. We have written those chapters so that they can be read in either order. The dependency graph illustrates the relationships between the chapters in the discrete and continuous sections, in case you wish to cover the chapters in an order different from ours.
Part Two
Each chapter is designed to be covered in one 80-minute lecture. At Stanford, we cover this material in a 10-week quarter with two 80-minute lectures per week.
A 15-week semester probability course, with approximately forty 50-minute lectures, might be able to cover all of Part One, in addition to Chapters 28-36. This covers all the topics in a “traditional” probability course, except for order statistics (39 Minima and Maxima, 40 General Order Statistics), the beta distribution (41 Beta Distribution), and Jacobians (42 Multivariate Transformations). The chapters on “Estimation Theory” are unorthodox for a pure probability course, but they provide excellent motivation for the probability concepts that students find most abstract, such as limit theorems.
For Students
This book is meant to be read. We have tried to choose examples that we think you will find interesting.
Definitions, theorems, and examples all appear in colored boxes. Proofs are included inside the box for the corresponding theorem. When a proof is not important, it is collapsed. We recommend that you skip proofs that are collapsed, especially on a first reading, unless you are interested.
You should run the code that is provided and try modifying it to see what it does.
Acknowledgements
Several colleagues, including John Duchi, Trevor Hastie, and Timothy Sun, provided feedback that improved this book.
Several students in our courses also provided useful feedback, including Jack Hlavka, Viet Vu, and Ricky Rojas.
The influence of teachers and colleagues who have shaped the way we teach probability is unmistakable. Thank you to Joe Blitzstein, Matt Carlton, Kevin Ross, and Allan Rossman.
We also acknowledge the support of a Curriculum Transformation Seed Grant from the Stanford Vice Provost for Undergraduate Education and Center for Teaching and Learning, which funded the writing of this book.