Appendix B — Math Review

This appendix provides a concise review of mathematical concepts that are assumed in this book.

B.1 Discrete Mathematics

Summation notation is shorthand for a sum of many terms. We write \[ \sum_{i=a}^{b} f(i) = f(a) + f(a+1) + \cdots + f(b-1) + f(b), \] where \(a\) and \(b\) are integers with \(a \leq b\). The variable \(i\) is called the index of summation.

Useful summation formulas:

  • Sum of consecutive integers: \[ \sum_{i=1}^{n} i = 1 + 2 + \cdots + n = \frac{n(n+1)}{2}. \tag{B.1}\]
  • Sum of consecutive squares: \[ \sum_{i=1}^{n} i^2 = 1^2 + 2^2 + \cdots + n^2 = \frac{n(n+1)(2n+1)}{6}. \tag{B.2}\]
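These formulas are easy to check numerically. A minimal Python sketch that verifies Equations B.1 and B.2 by brute force (the bound of 100 is an arbitrary choice):

```python
# Minimal sketch (plain Python): check Equations B.1 and B.2 by brute force.
for n in range(1, 101):
    assert sum(i for i in range(1, n + 1)) == n * (n + 1) // 2                    # Equation B.1
    assert sum(i**2 for i in range(1, n + 1)) == n * (n + 1) * (2 * n + 1) // 6   # Equation B.2
print("Equations B.1 and B.2 hold for n = 1, ..., 100")
```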

Useful properties of summation:

  • Pulling out constants: \(\displaystyle \sum_{i=a}^b c\,f(i) = c \sum_{i=a}^b f(i)\)
  • Splitting a sum: \(\displaystyle \sum_{i=a}^b \left(f(i) + g(i)\right) = \sum_{i=a}^b f(i) + \sum_{i=a}^b g(i)\)
  • Reindexing: You can change the index variable via substitution. For example, let \(j = i + 1\), then \[ \sum_{i=a}^b f(i) = \sum_{j=a+1}^{b+1} f(j - 1) \]

B.2 Logic and Proofs

Given a statement “If \(P\), then \(Q\)” or “\(Q\) if \(P\)”, we can form related statements:

  1. Converse: “If \(Q\), then \(P\)” or “\(Q\) only if \(P\)” (not logically equivalent to the original statement)
  2. Contrapositive: “If not \(Q\), then not \(P\)” (logically equivalent to the original statement)

As an example, consider the statement “If it is raining, then the ground is wet.” (“The ground is wet if it is raining.”)

  1. Converse: “If the ground is wet, then it is raining.” (“The ground is wet only if it is raining.”) This is not logically equivalent to the original statement. The ground may be wet for other reasons.
  2. Contrapositive: “If the ground is not wet, then it is not raining.” This is logically equivalent to the original statement.
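These relationships can be confirmed mechanically by enumerating all truth values of \(P\) and \(Q\). A minimal Python sketch:

```python
# Minimal sketch (plain Python): enumerate all truth values of P and Q to
# check that "if P then Q" agrees with its contrapositive but not its converse.
def implies(p, q):
    return (not p) or q  # material implication: "if p then q"

for p in (False, True):
    for q in (False, True):
        original = implies(p, q)
        converse = implies(q, p)
        contrapositive = implies(not q, not p)
        assert original == contrapositive   # always agree
        # original == converse fails when p and q differ
        print(p, q, original, converse, contrapositive)
```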

If both a statement and its converse are true, then we say “\(Q\) if and only if \(P\)”. This is really two statements in one: “\(Q\) if \(P\)” and “\(Q\) only if \(P\)”. We usually must establish the truth of both statements separately.

A proof is an argument that establishes the truth of a logical statement, starting from accepted truths and proceeding by logical steps to the desired conclusion. Once a logical statement has been proven, it is called a theorem and may itself be used as an accepted truth. (Lesser theorems may be called propositions or lemmas.)

An accepted truth may either be another theorem or an axiom, which is a statement that is assumed to be true without proof. Axioms are chosen so that they are simple, self-evident, and fundamental. All theorems are ultimately derived from axioms using logical rules. For example, probability theory is derived from the three axioms described in Definition 3.1.

Most logical steps are self-explanatory. However, some logical patterns are less obvious and deserve special attention.

B.2.1 Mathematical Induction

Mathematical induction is a technique for proving that a statement is true for all natural numbers \(n\). It consists of two steps:

  1. Base case: Prove that the statement is true for the smallest value of \(n\) (typically \(n = 1\) or \(n = 0\)).
  2. Inductive step: Assume the statement is true for \(n = k\) (this assumption is called the inductive hypothesis), and then prove that it must also be true for \(n = k+1\).

If both steps are completed, then by the principle of mathematical induction, the statement is true for all natural numbers \(n\).

Theorem B.1 (Principle of Mathematical Induction) Let \(S(n)\) be a statement that depends on a natural number \(n\). If:

  1. \(S(1)\) is true (base case), and
  2. For all \(k \geq 1\), if \(S(k)\) is true, then \(S(k+1)\) is true (inductive step),

then \(S(n)\) is true for all natural numbers \(n\).

As an example of induction, we will prove Equation B.1 for all \(n \geq 1\).

  1. Base case (\(n = 1\)): \(\sum_{i=1}^{1} i = 1 = \frac{1(1+1)}{2}\), so the statement is true.
  2. Inductive step: Assume the statement is true for \(n = k\), i.e., \(\sum_{i=1}^{k} i = \frac{k(k+1)}{2}\). Then \[ \sum_{i=1}^{k+1} i = \sum_{i=1}^{k} i + (k+1) = \frac{k(k+1)}{2} + (k+1) = \frac{(k+1)(k+2)}{2}, \] which shows the statement is true for \(n = k+1\).

By induction, Equation B.1 holds for all \(n \geq 1\).

B.2.2 Proof by Contradiction

Proof by contradiction (also called reductio ad absurdum) establishes a statement by showing that assuming its negation leads to a logical contradiction.

The general strategy:

  1. Assume the statement to be proved is false.
  2. Show that this assumption leads to a contradiction (something that is logically impossible or contradicts a known fact).
  3. Conclude that the original statement must be true.

Theorem B.2 (Proof by Contradiction) To prove a statement \(S\), assume \(\neg S\) (not \(S\)) and derive a contradiction. Then \(S\) must be true.

As an example, we will prove that it is impossible to have a random variable \(X\) with a uniform distribution on the natural numbers (i.e., where every natural number has the same non-zero probability).

  1. Assume, for contradiction, that there is a random variable \(X\) such that \(P(X = k) = c\) for some constant \(c > 0\) and every natural number \(k = 0, 1, 2, \dots\).
  2. Then, \(P(X \geq 0) = \sum_{k=0}^\infty P(X = k) = \sum_{k=0}^\infty c = \infty\), which is greater than \(1\), so it is not a valid probability.
  3. Therefore, it is impossible for a random variable to have a uniform distribution on the natural numbers.
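The divergence in step 2 can be made concrete: for any fixed \(c > 0\), the partial sums exceed \(1\) after finitely many terms. A minimal Python sketch (the value \(c = 0.001\) is an arbitrary choice):

```python
# Minimal sketch (plain Python): for any fixed c > 0, the partial sums of
# P(X = 0) + P(X = 1) + ... eventually exceed 1, exposing the contradiction.
c = 0.001          # any positive constant works
total, k = 0.0, 0
while total <= 1:
    total += c     # add P(X = k) = c
    k += 1
print(f"After {k} terms the total probability is {total:.3f} > 1")
```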

B.3 Calculus

B.3.1 Derivatives of a single-variable function

The derivative of a function \(f(x)\) at a point \(x\) is defined as \[ f'(x) \overset{\text{def}}{=}\lim_{h \to 0} \frac{f(x+h) - f(x)}{h}, \tag{B.3}\] provided this limit exists.

It is also common to write the derivative as \(\frac{dy}{dx}\), where \(y = f(x)\).
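The limit in Equation B.3 suggests a numerical approximation: evaluate the difference quotient with a small nonzero \(h\). A minimal Python sketch (the step size \(h = 10^{-6}\) is an arbitrary choice):

```python
# Minimal sketch (plain Python): approximate f'(x) by the difference
# quotient in Equation B.3 with a small nonzero h.
import math

def derivative(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

# Check against exact derivatives: d/dx[e^x] = e^x and d/dx[ln x] = 1/x.
print(derivative(math.exp, 1.0), math.e)   # both near 2.71828...
print(derivative(math.log, 2.0), 0.5)      # both near 0.5
```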

Derivatives of common functions:

  • Polynomials: \(\displaystyle \frac{d}{dx}[x^n] = nx^{n-1}\) for any real number \(n\)
  • Exponential: \(\displaystyle \frac{d}{dx}[e^x] = e^x\)
  • Logarithm: \(\displaystyle \frac{d}{dx}[\ln x] = \frac{1}{x}\) for \(x > 0\)

Techniques of differentiation:

  • Constant rule: \(\displaystyle\frac{d}{dx} [a f(x)] = af'(x)\)
  • Sum rule: \(\displaystyle\frac{d}{dx} [f(x) + g(x)] = f'(x) + g'(x)\)
  • Chain rule: \(\displaystyle\frac{d}{dx}[f(g(x))] = f'(g(x)) \cdot g'(x)\)
  • Product rule: \(\displaystyle\frac{d}{dx} [f(x)g(x)] = f'(x)g(x) + f(x)g'(x)\)
  • Quotient rule: \(\displaystyle\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x)g(x) - f(x)g'(x)}{[g(x)]^2}\)
  • Inverse function theorem: If \(\frac{dy}{dx} \neq 0\), \(\displaystyle \frac{dx}{dy} = \frac{1}{\frac{dy}{dx}}\).
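For example, the chain rule with \(f(u) = e^u\) and \(g(x) = x^2\) gives \[ \frac{d}{dx}\left[e^{x^2}\right] = e^{x^2} \cdot 2x. \]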

The second derivative is the derivative of the first derivative: \(\displaystyle f''(x) = \frac{d}{dx}[f'(x)]\).

Derivatives can be used to find the maximum or minimum of a function. If \(f'(x^*) = 0\) and

  • \(f''(x^*) > 0\), then \(x^*\) is a local minimum.
  • \(f''(x^*) < 0\), then \(x^*\) is a local maximum.
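For example, consider \(f(x) = x^3 - 3x\), with \(f'(x) = 3x^2 - 3\) and \(f''(x) = 6x\); the critical points are \(x^* = \pm 1\). A minimal Python sketch that applies the second-derivative test at each critical point:

```python
# Minimal sketch (plain Python): apply the second-derivative test to
# f(x) = x**3 - 3*x, whose critical points are x = -1 and x = 1.
def f_prime(x):
    return 3 * x**2 - 3

def f_double_prime(x):
    return 6 * x

for x_star in (-1.0, 1.0):
    assert f_prime(x_star) == 0
    kind = "local minimum" if f_double_prime(x_star) > 0 else "local maximum"
    print(f"x* = {x_star}: {kind}")   # -1 is a local maximum, 1 is a local minimum
```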

B.3.2 Derivatives of a multivariable function

For a function \(f(x_1, x_2, \ldots, x_n)\) of multiple variables, the partial derivative with respect to \(x_i\) is the derivative with respect to \(x_i\) while treating all other variables as constants. It is denoted \[ \frac{\partial f}{\partial x_i}. \]

For example, if \(f(x, y) = x^2y + 3xy^2\), then \[ \frac{\partial f}{\partial x} = 2xy + 3y^2 \quad \text{and} \quad \frac{\partial f}{\partial y} = x^2 + 6xy. \]
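These partial derivatives can be checked numerically by finite differences, holding the other variable fixed. A minimal Python sketch (the step size is an arbitrary choice):

```python
# Minimal sketch (plain Python): check the partial derivatives of
# f(x, y) = x**2 * y + 3 * x * y**2 by finite differences.
def f(x, y):
    return x**2 * y + 3 * x * y**2

def df_dx(x, y, h=1e-6):
    return (f(x + h, y) - f(x, y)) / h   # y held constant

def df_dy(x, y, h=1e-6):
    return (f(x, y + h) - f(x, y)) / h   # x held constant

x, y = 1.0, 2.0
print(df_dx(x, y), 2 * x * y + 3 * y**2)  # both near 16
print(df_dy(x, y), x**2 + 6 * x * y)      # both near 13
```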

The gradient of \(f\) is the vector of all partial derivatives: \[ \nabla f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right). \]

The Hessian matrix is the matrix of second partial derivatives: \[ \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}. \]

Gradients and Hessians can be used to find the maximum or minimum of a function. If \(\nabla f(x_1^*, \dots, x_n^*) = 0\) and

  • \(\nabla^2 f(x_1^*, \dots, x_n^*)\) is positive-definite, then \((x_1^*, \dots, x_n^*)\) is a local minimum.
  • \(\nabla^2 f(x_1^*, \dots, x_n^*)\) is negative-definite, then \((x_1^*, \dots, x_n^*)\) is a local maximum.
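In practice, definiteness is often checked through eigenvalues: a symmetric matrix is positive-definite if and only if all of its eigenvalues are positive (and negative-definite if and only if all are negative). A minimal sketch, assuming NumPy is available, for \(f(x, y) = x^2 + xy + y^2\), whose gradient \((2x + y, x + 2y)\) vanishes at the origin:

```python
# Minimal sketch (Python, assuming NumPy is available): classify the critical
# point of f(x, y) = x**2 + x*y + y**2 at the origin, where the gradient
# vanishes and the Hessian is constant.
import numpy as np

hessian = np.array([[2.0, 1.0],
                    [1.0, 2.0]])    # second partial derivatives of f

# A symmetric matrix is positive-definite iff all its eigenvalues are positive.
eigenvalues = np.linalg.eigvalsh(hessian)
print(eigenvalues)                  # [1. 3.] -- all positive
if np.all(eigenvalues > 0):
    print("Hessian is positive-definite: the origin is a local minimum")
```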

B.3.3 Integrals of a single-variable function

The definite integral of a function \(f(x)\) over an interval \([a, b]\) is defined as the limit of a sum (called a Riemann sum): \[ \int_a^b f(x) \, dx = \lim_{n \to \infty} \sum_{i=1}^{n} f(x_i^*) \Delta x, \] where the interval \([a, b]\) is divided into \(n\) subintervals of width \(\Delta x = \frac{b-a}{n}\) and \(x_i^*\) is a point in the \(i\)th subinterval. The integral represents the signed area under the curve \(y = f(x)\) between \(x = a\) and \(x = b\).
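This definition translates directly into code: partition \([a, b]\), evaluate \(f\) at a point in each subinterval, and sum. A minimal Python sketch using midpoints for the \(x_i^*\) (any choice of point works in the limit):

```python
# Minimal sketch (plain Python): approximate a definite integral by a
# Riemann sum with n subintervals, taking the midpoint of each as x_i*.
def riemann_sum(f, a, b, n=10_000):
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

# Integral of x^2 from 0 to 1; the exact value is 1/3.
print(riemann_sum(lambda x: x**2, 0.0, 1.0))   # close to 0.3333...
```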

If \(a\) is \(-\infty\) or \(b\) is \(\infty\), then the integral is said to be an improper integral. For example, if \(b\) is \(\infty\), then the improper integral is defined as: \[ \int_a^{\infty} f(x) \, dx = \lim_{b \to \infty} \int_a^b f(x) \, dx. \tag{B.4}\]

To compute integrals, we use the following result.

Theorem B.3 (Fundamental Theorem of Calculus) If \(F(x)\) is an antiderivative of \(f(x)\) (i.e., \(F'(x) = f(x)\)), then \[ \int_a^b f(x) \, dx = F(b) - F(a). \]

Antiderivatives of common functions:

  • \(\int x^n \, dx = \frac{x^{n+1}}{n+1} + C\) for \(n \neq -1\)
  • \(\int e^x \, dx = e^x + C\)
  • \(\int \frac{1}{x} \, dx = \ln|x| + C\) for \(x \neq 0\)
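For example, combining Theorem B.3 with the antiderivative of \(\frac{1}{x}\): \[ \int_1^e \frac{1}{x} \, dx = \ln|x| \,\Big|_1^e = \ln e - \ln 1 = 1. \]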

Techniques of integration:

  • Constant rule: \(\displaystyle \int a f(x)\,dx = a \int f(x)\,dx\)
  • Sum rule: \(\displaystyle \int \left[f(x) + g(x)\right]\,dx = \int f(x)\,dx + \int g(x)\,dx\)
  • Substitution: \(\displaystyle \int f(g(x)) g'(x)\,dx = \int f(u) \,du\), where \(u = g(x)\).
  • Integration by parts: \(\displaystyle \int f(x)g'(x)\,dx = f(x)g(x) - \int g(x) f'(x)\,dx\)
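For example, integration by parts with \(f(x) = x\) and \(g'(x) = e^x\) gives \[ \int x e^x \, dx = x e^x - \int e^x \, dx = (x - 1)e^x + C. \]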

B.3.4 Integrals of Multivariable Functions

The definite integral of a multivariable function \(f(x_1, x_2, \dots, x_n)\) over a region \(R\) in \(\mathbb{R}^n\) is written as \[ \int_R f(x_1, x_2, \dots, x_n) \, d x_1\, d x_2\, \cdots\, d x_n. \]

A common way to compute such integrals is by expressing them as iterated integrals over one variable at a time. For example, the integral of a function of three variables can be computed as: \[ \int_{a_1}^{b_1} \int_{a_2(x)}^{b_2(x)} \int_{a_3(x, y)}^{b_3(x, y)} f(x, y, z) \, dz\, dy\, dx. \tag{B.5}\] Note that the limits of integration of each integral may depend on the variables of the outer integrals.

To evaluate Equation B.5, we start with the innermost integral (over \(z\)) and evaluate it, treating the other variables (\(x\) and \(y\)) as constants. We then repeat this process with the next integral (over \(y\)) and work our way outward.
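Iterated integrals with variable limits can also be evaluated numerically. A minimal sketch, assuming SciPy is available; its tplquad routine takes the limit functions in the same inner-to-outer structure as Equation B.5:

```python
# Minimal sketch (Python, assuming SciPy is available): evaluate an iterated
# integral in the form of Equation B.5, where each inner limit may depend on
# the outer variables. Here we integrate f(x, y, z) = 1 over the region
# 0 <= x <= 1, 0 <= y <= x, 0 <= z <= x*y; the exact answer is 1/8.
from scipy.integrate import tplquad

result, error = tplquad(
    lambda z, y, x: 1.0,                    # integrand; SciPy passes (z, y, x)
    0.0, 1.0,                               # limits for x
    lambda x: 0.0, lambda x: x,             # limits for y, which may depend on x
    lambda x, y: 0.0, lambda x, y: x * y,   # limits for z, depending on x and y
)
print(result)                               # close to 0.125
```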

B.3.5 Limits

The limit of a function \(f(x)\) as \(x\) approaches \(a\) is the value that \(f(x)\) approaches as \(x\) gets arbitrarily close to \(a\), denoted \[ \lim_{x \to a} f(x) = L. \]

Theorem B.4 (L’Hôpital’s Rule) If \(\lim_{x \to a} f(x) = 0\) and \(\lim_{x \to a} g(x) = 0\) (or both are \(\pm\infty\)), and the derivatives exist, then \[ \lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}, \tag{B.6}\] provided the limit on the right exists.
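For example, \(\lim_{x \to 0} \frac{e^x - 1}{x}\) has the indeterminate form \(\frac{0}{0}\). Applying Equation B.6: \[ \lim_{x \to 0} \frac{e^x - 1}{x} = \lim_{x \to 0} \frac{e^x}{1} = 1. \]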

B.3.6 Taylor Series

A Taylor series is a representation of a function as an “infinite polynomial”. The Taylor series of a function \(f(x)\) about a point \(a\) is \[ \begin{align*} f(x) &= \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x-a)^n \\ &= f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + \cdots, \end{align*} \tag{B.7}\] where \(f^{(n)}(a)\) denotes the \(n\)th derivative of \(f\) evaluated at \(a\), and \(f^{(0)}(a) = f(a)\).

Taylor series of common functions:

  • Geometric series: \(\displaystyle \frac{1}{1 - x} = \sum_{n=0}^\infty x^n = 1 + x + x^2 + \dots\) for \(|x| < 1\)
  • Binomial series: \(\displaystyle (1+x)^\alpha = \sum_{n=0}^{\infty} \binom{\alpha}{n} x^n\) for \(|x| < 1\)
  • \(\displaystyle e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots\) for all \(x\)
  • \(\displaystyle \ln(1+x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1}x^n}{n} = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots\) for \(|x| < 1\)
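Truncating a Taylor series after finitely many terms gives a polynomial approximation that improves as more terms are kept. A minimal Python sketch for the \(e^x\) series (the cutoff of 20 terms is an arbitrary choice):

```python
# Minimal sketch (plain Python): approximate e^x by truncating its Taylor
# series after a fixed number of terms and compare against math.exp.
import math

def exp_taylor(x, n_terms=20):
    return sum(x**n / math.factorial(n) for n in range(n_terms))

print(exp_taylor(1.0), math.exp(1.0))    # both near 2.71828...
print(exp_taylor(-2.5), math.exp(-2.5))  # both near 0.08208...
```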

B.4 Linear Algebra

B.4.1 Vectors and Matrices

A vector is an ordered list of numbers, typically written as a column: \[ \vec{v} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}. \]

A matrix is a rectangular array of numbers: \[ A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m1} & A_{m2} & \cdots & A_{mn} \end{bmatrix}. \]

An \(m \times n\) matrix has \(m\) rows and \(n\) columns. The transpose of a matrix \(A\), denoted \(A^\intercal\), is obtained by interchanging rows and columns: \[ (A^\intercal)_{ij} = A_{ji}. \]

B.4.2 Matrix Operations

Matrix addition: If \(A\) and \(B\) are both \(m \times n\) matrices, then \((A + B)_{ij} = A_{ij} + B_{ij}\).

Scalar multiplication: If \(c\) is a scalar, then \((cA)_{ij} = c \cdot A_{ij}\).

Matrix multiplication: If \(A\) is \(m \times n\) and \(B\) is \(n \times p\), then \(AB\) is the \(m \times p\) matrix with \[ (AB)_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}. \]

Matrix multiplication is not commutative: in general, \(AB \neq BA\).

Matrix-vector multiplication: If \(A\) is \(m \times n\) and \(\vec{v}\) is an \(n\)-vector, then \(A\vec{v}\) is an \(m\)-vector with \[ (A\vec{v})_i = \sum_{j=1}^{n} A_{ij} v_j. \]
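A minimal sketch of these operations, assuming NumPy is available; it also illustrates that \(AB \neq BA\) in general:

```python
# Minimal sketch (Python, assuming NumPy is available): matrix and
# matrix-vector products, and a demonstration that AB != BA in general.
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])
v = np.array([1.0, -1.0])

print(A @ B)   # matrix product AB
print(B @ A)   # BA differs from AB
print(A @ v)   # matrix-vector product
```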

B.4.3 Matrix Inverse and Determinant

An \(n \times n\) matrix \(A\) is invertible (or nonsingular) if there exists a matrix \(A^{-1}\) such that \(AA^{-1} = A^{-1}A = I\), where \(I\) is the \(n \times n\) identity matrix (with 1s on the diagonal and 0s elsewhere).

The determinant of a \(2 \times 2\) matrix is \[ \det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc. \tag{B.8}\]

For larger matrices, the determinant can be computed using cofactor expansion or other methods. A matrix is invertible if and only if its determinant is nonzero.
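A minimal sketch, assuming NumPy is available, that evaluates Equation B.8 and checks the defining property \(AA^{-1} = I\):

```python
# Minimal sketch (Python, assuming NumPy is available): compute a 2x2
# determinant and inverse, and verify that A A^{-1} = I.
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.linalg.det(A))    # ad - bc = 1*4 - 2*3 = -2
A_inv = np.linalg.inv(A)   # fails if det(A) == 0
print(A @ A_inv)           # close to the 2x2 identity matrix
```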

B.4.4 Special Matrices

A matrix \(A\) is symmetric if \(A^\intercal = A\).

A matrix \(P\) is a projection matrix (specifically, an orthogonal projection matrix) if \(P^2 = P\) and \(P^\intercal = P\). Projection matrices are used to project vectors onto subspaces.
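For example, a standard construction of the orthogonal projection onto the line spanned by a nonzero column vector \(\vec{a}\) is \(P = \vec{a}(\vec{a}^\intercal \vec{a})^{-1}\vec{a}^\intercal\). A minimal sketch, assuming NumPy is available:

```python
# Minimal sketch (Python, assuming NumPy is available): build the orthogonal
# projection onto the line spanned by (1, 1) via P = a (a^T a)^{-1} a^T and
# verify the two defining properties of a projection matrix.
import numpy as np

a = np.array([[1.0], [1.0]])    # column vector spanning the line
P = a @ a.T / (a.T @ a)         # here P = [[0.5, 0.5], [0.5, 0.5]]

assert np.allclose(P @ P, P)    # idempotent: P^2 = P
assert np.allclose(P.T, P)      # symmetric: P^T = P
print(P @ np.array([3.0, 1.0])) # projects to [2., 2.]
```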

The identity matrix \(I\) has 1s on the diagonal and 0s elsewhere: \(I_{ij} = 1\) if \(i = j\) and \(0\) otherwise.