Appendix B — Math Review

This appendix provides a concise review of mathematical concepts that are assumed in this book.

The first half of this book does not assume advanced mathematics. Fluency in single-variable calculus, including Taylor series, should be sufficient. A student who has completed AP Calculus BC is prepared to learn this material. Any standard calculus textbook, such as OpenStax Calculus Volumes 1 and 2, can serve as useful review.

Nevertheless, the book (and the subject of probability in general) requires mathematical maturity in problem solving, which is not necessarily correlated with mathematical knowledge. One way to build mathematical maturity is to read Concrete Mathematics by Graham, Knuth, and Patashnik.

The second half of this book assumes fluency in linear algebra and differential multivariable calculus. The best preparation for this material is the Stanford Math 51 textbook.

B.1 Discrete Mathematics

Summation notation is shorthand for a sum of many terms. We write \[ \sum_{i=a}^{b} f(i) = f(a) + f(a+1) + \cdots + f(b-1) + f(b), \] where \(a\) and \(b\) are integers with \(a \leq b\). The variable \(i\) is called the index of summation.

Useful summation formulas:

  • Sum of consecutive integers: \[ \sum_{i=1}^{n} i = 1 + 2 + \cdots + n = \frac{n(n+1)}{2}. \tag{B.1}\]
  • Sum of consecutive squares: \[ \sum_{i=1}^{n} i^2 = 1^2 + 2^2 + \cdots + n^2 = \frac{n(n+1)(2n+1)}{6}. \tag{B.2}\]
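
Both formulas are easy to sanity-check numerically. The following sketch (plain Python, no libraries) verifies Equations B.1 and B.2 for a few values of \(n\):

```python
# A quick numerical check of Equations B.1 and B.2.
for n in (1, 5, 100):
    assert sum(range(1, n + 1)) == n * (n + 1) // 2                              # B.1
    assert sum(i**2 for i in range(1, n + 1)) == n * (n + 1) * (2 * n + 1) // 6  # B.2
print("Equations B.1 and B.2 hold for n = 1, 5, 100")
```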

Useful properties of summation:

  • Pulling out constants: \(\displaystyle \sum_{i=a}^b c\,f(i) = c \sum_{i=a}^b f(i)\)
  • Splitting a sum: \(\displaystyle \sum_{i=a}^b \left(f(i) + g(i)\right) = \sum_{i=a}^b f(i) + \sum_{i=a}^b g(i)\)
  • Reindexing: You can change the index variable via substitution. For example, let \(j = i + 1\), then \[ \sum_{i=a}^b f(i) = \sum_{j=a+1}^{b+1} f(j - 1) \]
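
The reindexing identity can likewise be checked directly; in the sketch below, the choices of \(f\), \(a\), and \(b\) are arbitrary:

```python
# A direct check of the reindexing identity with f(i) = i**2, a = 3, b = 7.
def f(i):
    return i**2

a, b = 3, 7
assert sum(f(i) for i in range(a, b + 1)) == sum(f(j - 1) for j in range(a + 1, b + 2))
```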

B.2 Logic and Proofs

Given a statement “If \(P\), then \(Q\)” or “\(Q\) if \(P\)”, we can form related statements:

  1. Converse: “If \(Q\), then \(P\)” or “\(Q\) only if \(P\)” (not logically equivalent to the original statement)
  2. Contrapositive: “If not \(Q\), then not \(P\)” (logically equivalent to the original statement)

As an example, consider the statement “If it is raining, then the ground is wet.” (“The ground is wet if it is raining.”)

  1. Converse: “If the ground is wet, then it is raining.” (“The ground is wet only if it is raining.”) This is not logically equivalent to the original statement. The ground may be wet for other reasons.
  2. Contrapositive: “If the ground is not wet, then it is not raining.” This is logically equivalent to the original statement.

If both a statement and its converse are true, then we say “\(Q\) if and only if \(P\)”. This is really two statements in one: “\(Q\) if \(P\)” and “\(Q\) only if \(P\)”. We usually must establish the truth of both statements separately.

A proof is an argument that establishes the truth of a logical statement, starting with accepted truths and using logical steps to arrive at the desired conclusion. Once a logical statement has been proven and becomes accepted truth, it becomes a theorem. (Lesser theorems may be called propositions or lemmas.)

An accepted truth may either be another theorem or an axiom, which is a statement that is assumed to be true without proof. Axioms are chosen so that they are simple, self-evident, and fundamental. All theorems are ultimately derived from axioms using logical rules. For example, probability theory is derived from the three axioms described in Definition 3.1.

Most logical steps are self-explanatory. However, some logical patterns are less obvious and deserve special attention.

B.2.1 Mathematical Induction

Mathematical induction is a technique for proving statements that are true for all natural numbers \(n\). It consists of two steps:

  1. Base case: Prove that the statement is true for the smallest value of \(n\) (typically \(n = 1\) or \(n = 0\)).
  2. Inductive step: Assume the statement is true for \(n = k\) (this assumption is called the inductive hypothesis), and then prove that it must also be true for \(n = k+1\).

If both steps are completed, then by the principle of mathematical induction, the statement is true for all natural numbers \(n\).

Theorem B.1 (Principle of Mathematical Induction) Let \(S(n)\) be a statement that depends on a natural number \(n\). If:

  1. \(S(1)\) is true (base case), and
  2. For all \(k \geq 1\), if \(S(k)\) is true, then \(S(k+1)\) is true (inductive step),

then \(S(n)\) is true for all natural numbers \(n\).

As an example of induction, we will prove Equation B.1 for all \(n \geq 1\).

  1. Base case (\(n = 1\)): \(\sum_{i=1}^{1} i = 1 = \frac{1(1+1)}{2}\), so the statement is true.
  2. Inductive step: Assume the statement is true for \(n = k\), i.e., \(\sum_{i=1}^{k} i = \frac{k(k+1)}{2}\). Then \[ \sum_{i=1}^{k+1} i = \sum_{i=1}^{k} i + (k+1) = \frac{k(k+1)}{2} + (k+1) = \frac{(k+1)(k+2)}{2}, \] which shows the statement is true for \(n = k+1\).

By induction, Equation B.1 holds for all \(n \geq 1\).

B.2.2 Proof by Contradiction

Proof by contradiction (also called reductio ad absurdum) establishes a statement by showing that assuming its negation leads to a logical contradiction.

The general strategy:

  1. Assume the statement to be proved is false.
  2. Show that this assumption leads to a contradiction (something that is logically impossible or contradicts a known fact).
  3. Conclude that the original statement must be true.

Theorem B.2 (Proof by Contradiction) To prove a statement \(S\), assume \(\neg S\) (not \(S\)) and derive a contradiction. Then \(S\) must be true.

As an example, we will prove that it is impossible to have a random variable \(X\) with a uniform distribution on the natural numbers (i.e., where every natural number has the same non-zero probability).

  1. Assume, for contradiction, that there is a random variable \(X\) such that \(P(X = k) = c\) for some constant \(c > 0\) and every natural number \(k = 0, 1, 2, \dots\).
  2. Then, \(P(X \geq 0) = \sum_{k=0}^\infty P(X = k) = \sum_{k=0}^\infty c = \infty\), which is greater than \(1\), so it is not a valid probability.
  3. Therefore, it is impossible for a random variable to have a uniform distribution on the natural numbers.
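
The contradiction in step 2 can be made concrete: however small the constant \(c > 0\) is, finitely many terms already push the total probability past 1. Here is a small Python sketch (the value of \(c\) is an arbitrary choice):

```python
# For any constant c > 0, a finite number of terms of the sum already exceeds 1.
c = 1e-6  # an arbitrary candidate probability
total, k = 0.0, 0
while total <= 1:
    total += c
    k += 1
print(f"partial sum exceeds 1 after k = {k} terms (roughly 1/c)")
```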

B.3 Calculus

B.3.1 Derivatives of a single-variable function

The derivative of a function \(f(x)\) at a point \(x\) is defined as \[ f'(x) \overset{\text{def}}{=}\lim_{h \to 0} \frac{f(x+h) - f(x)}{h}, \tag{B.3}\] provided this limit exists.

It is also common to write the derivative as \(\frac{dy}{dx}\), where \(y = f(x)\).
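
Equation B.3 also suggests a numerical approximation: for small \(h\), the difference quotient is close to \(f'(x)\). A Python sketch (the helper name is ours):

```python
# A numerical illustration of Equation B.3.
import math

def difference_quotient(f, x, h):
    return (f(x + h) - f(x)) / h

# f(x) = e^x is its own derivative, so at x = 1 the quotient should approach e.
for h in (1e-2, 1e-4, 1e-6):
    print(h, difference_quotient(math.exp, 1.0, h), "vs", math.e)
```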

Derivatives of common functions:

  • Powers: \(\displaystyle \frac{d}{dx}[x^n] = nx^{n-1}\) for any real number \(n\)
  • Exponential: \(\displaystyle \frac{d}{dx}[e^x] = e^x\)
  • Logarithm: \(\displaystyle \frac{d}{dx}[\log x] = \frac{1}{x}\) for \(x > 0\)

Techniques of differentiation:

  • Constant rule: \(\displaystyle\frac{d}{dx} [a f(x)] = af'(x)\)
  • Sum rule: \(\displaystyle\frac{d}{dx} [f(x) + g(x)] = f'(x) + g'(x)\)
  • Chain rule: \(\displaystyle\frac{d}{dx}[f(g(x))] = f'(g(x)) \cdot g'(x)\)
  • Product rule: \(\displaystyle\frac{d}{dx} [f(x)g(x)] = f'(x)g(x) + f(x)g'(x)\)
  • Quotient rule: \(\displaystyle\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x)g(x) - f(x)g'(x)}{[g(x)]^2}\)
  • Inverse function theorem: If \(\frac{dy}{dx} \neq 0\), \(\displaystyle \frac{dx}{dy} = \frac{1}{\frac{dy}{dx}}\).

The second derivative is the derivative of the first derivative: \(\displaystyle f''(x) = \frac{d}{dx}[f'(x)]\).

Derivatives can be used to find the maximum or minimum of a function. If \(f'(x^*) = 0\) and

  • \(f''(x^*) > 0\), then \(x^*\) is a local minimum.
  • \(f''(x^*) < 0\), then \(x^*\) is a local maximum.
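
As a concrete instance of this second-derivative test, here is a SymPy sketch; the function \(f\) is a toy example chosen so the answer is known in advance:

```python
# The second-derivative test for f(x) = (x - 2)**2 + 1, which has its minimum at x = 2.
import sympy as sp

x = sp.symbols("x")
f = (x - 2) ** 2 + 1
for x_star in sp.solve(sp.diff(f, x), x):      # critical points: f'(x) = 0
    second = sp.diff(f, x, 2).subs(x, x_star)  # f''(x*)
    print(x_star, "local minimum" if second > 0 else "local maximum")
```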

B.3.2 Derivatives of a multivariable function

For a function \(f(x_1, x_2, \ldots, x_n)\) of multiple variables, the partial derivative with respect to \(x_i\) is the derivative with respect to \(x_i\) while treating all other variables as constants. It is denoted \[ \frac{\partial f}{\partial x_i}. \]

For example, if \(f(x, y) = x^2y + 3xy^2\), then \[ \frac{\partial f}{\partial x} = 2xy + 3y^2 \quad \text{and} \quad \frac{\partial f}{\partial y} = x^2 + 6xy. \]
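
Partial derivatives like these can be checked with a computer algebra system; here is a SymPy sketch for the example above:

```python
# Verifying the partial derivatives of f(x, y) = x**2 * y + 3*x*y**2 with SymPy.
import sympy as sp

x, y = sp.symbols("x y")
f = x**2 * y + 3 * x * y**2
print(sp.diff(f, x))  # 2*x*y + 3*y**2
print(sp.diff(f, y))  # x**2 + 6*x*y
```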

The gradient of \(f\) is the vector of all partial derivatives: \[ \nabla f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right). \]

The Hessian matrix is the matrix of second partial derivatives: \[ \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}. \]

Gradients and Hessians can be used to find the maximum or minimum of a function. If \(\nabla f(x_1^*, \dots, x_n^*) = 0\) and

  • \(\nabla^2 f(x_1^*, \dots, x_n^*)\) is positive-definite, then \((x_1^*, \dots, x_n^*)\) is a local minimum.
  • \(\nabla^2 f(x_1^*, \dots, x_n^*)\) is negative-definite, then \((x_1^*, \dots, x_n^*)\) is a local maximum.
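
Here is a SymPy/NumPy sketch of this test for a toy quadratic of our own choosing (a symmetric matrix is positive-definite exactly when all its eigenvalues are positive):

```python
# The gradient/Hessian test for f(x1, x2) = x1**2 + x1*x2 + x2**2,
# which has its only critical point at the origin.
import numpy as np
import sympy as sp

x1, x2 = sp.symbols("x1 x2")
f = x1**2 + x1 * x2 + x2**2
gradient = [sp.diff(f, v) for v in (x1, x2)]
print(sp.solve(gradient, (x1, x2)))                  # {x1: 0, x2: 0}
H = sp.hessian(f, (x1, x2))                          # [[2, 1], [1, 2]]
eigenvalues = np.linalg.eigvalsh(np.array(H.tolist(), dtype=float))
print(eigenvalues)  # all positive => Hessian is positive-definite => local minimum
```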

B.3.3 Integrals of a single-variable function

The definite integral of a function \(f(x)\) over an interval \([a, b]\) is defined as the limit of a sum (called the Riemann sum): \[ \int_a^b f(x) \, dx = \lim_{n \to \infty} \sum_{i=1}^{n} f(x_i^*) \Delta x, \] where the interval \([a, b]\) is divided into \(n\) subintervals of width \(\Delta x = \frac{b-a}{n}\) and \(x_i^*\) is a point in the \(i\)th subinterval. The integral represents the signed area under the curve \(y = f(x)\) between \(x = a\) and \(x = b\).

If \(a\) is \(-\infty\) or \(b\) is \(\infty\), then the integral is said to be an improper integral. For example, if \(b\) is \(\infty\), then the improper integral is defined as: \[ \int_a^{\infty} f(x) \, dx = \lim_{b \to \infty} \int_a^b f(x) \, dx. \tag{B.4}\]

To compute integrals, we use the following result.

Theorem B.3 (Fundamental Theorem of Calculus) If \(F(x)\) is an antiderivative of \(f(x)\) (i.e., \(F'(x) = f(x)\)), then \[ \int_a^b f(x) \, dx = F(b) - F(a). \]
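
Theorem B.3 is easy to check numerically. In the sketch below (assuming SciPy is available), \(f(x) = 3x^2\) has antiderivative \(F(x) = x^3\), so the integral over \([1, 2]\) should be \(F(2) - F(1) = 7\):

```python
# A numerical check of the Fundamental Theorem of Calculus with SciPy.
from scipy.integrate import quad

numeric, _error_estimate = quad(lambda x: 3 * x**2, 1, 2)
print(numeric, "vs", 2**3 - 1**3)  # both 7, up to numerical error
```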

Antiderivatives of common functions:

  • \(\int x^n \, dx = \frac{x^{n+1}}{n+1} + C\) for \(n \neq -1\)
  • \(\int e^x \, dx = e^x + C\)
  • \(\int \frac{1}{x} \, dx = \ln|x| + C\) for \(x \neq 0\)

Techniques of integration:

  • Constant rule: \(\displaystyle \int a f(x)\,dx = a \int f(x)\,dx\)
  • Sum rule: \(\displaystyle \int \left(f(x) + g(x)\right)dx = \int f(x)\,dx + \int g(x)\,dx\)
  • Substitution: \(\displaystyle \int f(g(x)) g'(x)\,dx = \int f(u) \,du\), where \(u = g(x)\).
  • Integration by parts: \(\displaystyle \int f(x)g'(x)\,dx = f(x)g(x) - \int g(x) f'(x)dx\)
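
As a check of integration by parts, the SymPy sketch below takes \(f(x) = x\) and \(g'(x) = e^x\), for which the formula predicts \(\int x e^x\,dx = x e^x - e^x + C\):

```python
# SymPy agrees with the integration-by-parts result for x * e**x.
import sympy as sp

x = sp.symbols("x")
antiderivative = sp.integrate(x * sp.exp(x), x)
print(sp.simplify(antiderivative - (x * sp.exp(x) - sp.exp(x))))  # 0
```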

B.3.4 Integrals of a multivariable function

The definite integral of a multivariable function \(f(x_1, x_2, \dots, x_n)\) over a region \(R\) in \(\mathbb{R}^n\) is written as \[ \int_R f(x_1, x_2, \dots, x_n) \, d x_1\, d x_2\, \cdots\, d x_n. \]

A common way to compute such integrals is by expressing them as iterated integrals over one variable at a time. For example, an integral of a function of three variables can be computed as: \[ \int_{a_1}^{b_1} \int_{a_2(x)}^{b_2(x)} \int_{a_3(x, y)}^{b_3(x, y)} f(x, y, z) \, dz\, dy\, dx. \tag{B.5}\] Note that the limits of integration of each integral may depend on the variables of the outer integrals.

To evaluate Equation B.5, we start from the innermost integral (i.e., over \(z\)) and evaluate the integral, treating all other variables as constants (i.e., \(x\) and \(y\)). We repeat this process with the next integral (i.e., over \(y\)), and move our way outwards.
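
Here is a SymPy sketch of this procedure for \(f(x, y, z) = 1\), with limits of our own choosing; SymPy integrates the first-listed (innermost) variable first:

```python
# An iterated integral in the spirit of Equation B.5, over the region
# 0 <= x <= 1, 0 <= y <= x, 0 <= z <= x + y.
import sympy as sp

x, y, z = sp.symbols("x y z")
print(sp.integrate(1, (z, 0, x + y), (y, 0, x), (x, 0, 1)))  # 1/2
```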

B.3.5 Limits

The limit of a function \(f(x)\) as \(x\) approaches \(a\) is the value that \(f(x)\) approaches as \(x\) gets arbitrarily close to \(a\), denoted \[ \lim_{x \to a} f(x). \]

Limits, besides being fundamental to calculus, appear throughout probability theory.

B.3.5.1 Properties of Limits

Limits obey several properties that allow us to compute limits of combinations of functions.

Theorem B.4 (Properties of limits) Suppose \(\lim_{x\to a} f(x) = L\) and \(\lim_{x\to a} g(x) = M\) exist (where \(L\) and \(M\) are real numbers or \(\pm\infty\)). Then:

  • Sum: \(\lim_{x\to a} (f(x) + g(x)) = L + M\), provided this is not an indeterminate form (e.g., \(\infty - \infty\)).
  • Product: \(\lim_{x\to a} (f(x) \cdot g(x)) = L \cdot M\), provided this is not an indeterminate form (e.g., \(0 \cdot \infty\)).
  • Quotient: \(\lim_{x\to a} \frac{f(x)}{g(x)} = \frac{L}{M}\), provided that \(M \neq 0\) and this is not an indeterminate form (e.g., \(\frac{0}{0}\) or \(\frac{\infty}{\infty}\)).
  • Composition: If \(f\) is continuous at \(M\), then \(\lim_{x\to a} f(g(x)) = f(M)\).

B.3.5.2 Indeterminate Forms and L’Hôpital’s Rule

Some limits cannot be evaluated directly using the properties above because they take an indeterminate form, such as:

  • \(\frac{0}{0}\) (e.g., \(\lim_{x \to 0} \frac{\sin x}{x}\))
  • \(\frac{\infty}{\infty}\) (e.g., \(\lim_{x \to \infty} \frac{x^2}{e^x}\))
  • \(0 \cdot \infty\) (e.g., \(\lim_{x \to 0^+} x \ln x\))
  • \(1^\infty\) (e.g., \(\lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n\))
  • \(\infty - \infty\)

The following theorem provides a way to evaluate limits that take the indeterminate forms \(\frac{0}{0}\) or \(\frac{\infty}{\infty}\).

Theorem B.5 (L’Hôpital’s Rule) If \(\lim_{x \to a} f(x) = 0\) and \(\lim_{x \to a} g(x) = 0\) (or both are \(\pm\infty\)), and the derivatives exist, then \[ \lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}, \tag{B.6}\] provided the limit on the right exists. If the limit on the right is still indeterminate, L’Hôpital’s rule can be applied repeatedly.

For other indeterminate forms, we can often transform them into \(\frac{0}{0}\) or \(\frac{\infty}{\infty}\):

  • For \(0 \cdot \infty\): Rewrite the product \(f(x)g(x)\) as the quotient \(\frac{f(x)}{1/g(x)}\) (or \(\frac{g(x)}{1/f(x)}\)).
  • For \(1^\infty\): Take the logarithm to convert to \(0 \cdot \infty\), then rewrite as a quotient.
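
Computer algebra systems apply these ideas automatically; the SymPy sketch below evaluates one example of each indeterminate form listed above:

```python
# SymPy evaluates each of the indeterminate forms from the examples above.
import sympy as sp

x, n = sp.symbols("x n")
print(sp.limit(sp.sin(x) / x, x, 0))            # 1   (0/0)
print(sp.limit(x**2 / sp.exp(x), x, sp.oo))     # 0   (oo/oo)
print(sp.limit(x * sp.log(x), x, 0, "+"))       # 0   (0 * oo)
print(sp.limit((1 + 1 / n) ** n, n, sp.oo))     # E   (1**oo)
```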

B.3.5.3 Limits as \(n \to \infty\) (Asymptotic Limits)

In probability and statistics, we frequently study the behavior of sequences as the sample size \(n\) grows to infinity. These asymptotic limits are central to understanding convergence results like the Law of Large Numbers and the Central Limit Theorem.

Common asymptotic limits used in this book:

  • \(\lim_{n \to \infty} \left(1 + \frac{a}{n}\right)^n = e^a\) for any constant \(a\)
  • \(\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = e^{-1}\)
  • \(\lim_{n \to \infty} \frac{n!}{(n-k)! n^k} = 1\) for any fixed integer \(k\)
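
The first of these limits is easy to check numerically; in the Python sketch below, \(a = 2\) is an arbitrary choice:

```python
# (1 + a/n)**n approaches e**a as n grows.
import math

a = 2.0
for n in (10, 1_000, 100_000):
    print(n, (1 + a / n) ** n, "vs", math.exp(a))
```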

When evaluating limits as \(n \to \infty\), it is often helpful to reparametrize by letting \(x = 1/n\) (or similar) and taking the limit as \(x \to 0\), which may allow the use of L’Hôpital’s rule or Taylor series expansions.

B.3.5.4 Limits in Improper Integrals

Limits are also used to define improper integrals when the interval of integration is unbounded. For example, if \(b\) is \(\infty\), then \[ \int_a^{\infty} f(x) \, dx = \lim_{b \to \infty} \int_a^b f(x) \, dx, \] as previously stated in Equation B.4.

B.3.6 Taylor Series

A Taylor series is a representation of a function as an “infinite polynomial”. The Taylor series of a function \(f(x)\) about a point \(a\) is \[ \begin{align*} f(x) &= \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x-a)^n \\ &= f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + \cdots, \end{align*} \tag{B.7}\] where \(f^{(n)}(a)\) denotes the \(n\)th derivative of \(f\) evaluated at \(a\), and \(f^{(0)}(a) = f(a)\).

Taylor series of common functions:

  • Geometric series: \(\displaystyle \frac{1}{1 - x} = \sum_{n=0}^\infty x^n = 1 + x + x^2 + \dots\) for \(|x| < 1\)
  • Binomial series: \(\displaystyle (1+x)^\alpha = \sum_{n=0}^{\infty} \binom{\alpha}{n} x^n\) for \(|x| < 1\)
  • \(\displaystyle e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots\) for all \(x\)
  • \(\displaystyle \ln(1+x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1}x^n}{n} = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots\) for \(|x| < 1\)
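
Truncating the series for \(e^x\) after finitely many terms gives an increasingly good approximation; a Python sketch (the helper name is ours):

```python
# Partial sums of the Taylor series of e^x about 0 converge to math.exp(x).
import math

def exp_taylor(x, terms):
    return sum(x**n / math.factorial(n) for n in range(terms))

for terms in (2, 5, 10):
    print(terms, exp_taylor(1.0, terms), "vs", math.e)
```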

B.4 Linear Algebra

B.4.1 Vectors and Matrices

A vector is an ordered list of numbers, typically written as a column: \[ \vec{v} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}. \]

A matrix is a rectangular array of numbers: \[ A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m1} & A_{m2} & \cdots & A_{mn} \end{bmatrix}. \]

An \(m \times n\) matrix has \(m\) rows and \(n\) columns. The transpose of a matrix \(A\), denoted \(A^\intercal\), is obtained by interchanging rows and columns: \[ (A^\intercal)_{ij} = A_{ji}. \]

B.4.2 Matrix Operations

Matrix addition: If \(A\) and \(B\) are both \(m \times n\) matrices, then \((A + B)_{ij} = A_{ij} + B_{ij}\).

Scalar multiplication: If \(c\) is a scalar, then \((cA)_{ij} = c \cdot A_{ij}\).

Matrix multiplication: If \(A\) is \(m \times n\) and \(B\) is \(n \times p\), then \(AB\) is the \(m \times p\) matrix with \[ (AB)_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}. \]

Matrix multiplication is not commutative: in general, \(AB \neq BA\).

Matrix-vector multiplication: If \(A\) is \(m \times n\) and \(\vec{v}\) is an \(n\)-vector, then \(A\vec{v}\) is an \(m\)-vector with \[ (A\vec{v})_i = \sum_{j=1}^{n} A_{ij} v_j. \]
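
All of these operations are available in NumPy; the following sketch (with matrices chosen arbitrarily) also illustrates that matrix multiplication is not commutative:

```python
# The matrix operations above, in NumPy.
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
v = np.array([1, -1])

print(A + B)   # matrix addition (entrywise)
print(2 * A)   # scalar multiplication
print(A @ B)   # matrix multiplication
print(B @ A)   # differs from A @ B: multiplication is not commutative
print(A @ v)   # matrix-vector multiplication
print(A.T)     # transpose
```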

B.4.3 Matrix Inverse and Determinant

An \(n \times n\) matrix \(A\) is invertible (or nonsingular) if there exists a matrix \(A^{-1}\) such that \(AA^{-1} = A^{-1}A = I\), where \(I\) is the \(n \times n\) identity matrix (with 1s on the diagonal and 0s elsewhere).

The determinant of a \(2 \times 2\) matrix is \[ \det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc. \tag{B.8}\]

For larger matrices, the determinant can be computed using cofactor expansion or other methods. A matrix is invertible if and only if its determinant is nonzero.
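
A NumPy sketch of these facts for an arbitrary invertible \(2 \times 2\) matrix:

```python
# Determinant and inverse in NumPy (compare Equation B.8 for the 2x2 case).
import numpy as np

A = np.array([[3.0, 1.0], [2.0, 4.0]])
print(np.linalg.det(A))   # 3*4 - 1*2 = 10; nonzero, so A is invertible
A_inv = np.linalg.inv(A)
print(A @ A_inv)          # the identity matrix, up to rounding
```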

B.4.4 Special Matrices

A matrix \(A\) is symmetric if \(A^\intercal = A\).

A matrix \(P\) is a projection matrix (specifically, an orthogonal projection matrix) if \(P^2 = P\) and \(P^\intercal = P\). Projection matrices are used to project vectors onto subspaces.
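
One standard construction of such a matrix (not derived in this appendix) is \(P = A(A^\intercal A)^{-1}A^\intercal\), which projects onto the column space of \(A\). The NumPy sketch below builds one example and verifies both defining properties:

```python
# An orthogonal projection onto the line spanned by (1, 2), via P = A (A^T A)^{-1} A^T.
import numpy as np

A = np.array([[1.0], [2.0]])
P = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(P @ P, P))  # True: P^2 = P
print(np.allclose(P.T, P))    # True: P is symmetric
```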

The identity matrix \(I\) has 1s on the diagonal and 0s elsewhere: \(I_{ij} = 1\) if \(i = j\) and \(0\) otherwise.