Symbulate Documentation

Multiple Random Variables and Joint Distributions

Be sure to import Symbulate using the following commands.

In [1]:
from symbulate import *
%matplotlib inline

Defining multiple random variables

Many problems involve several random variables defined on the same probability space. Of interest are properties of the joint distribution which describe the relationship between the random variables.

In the context of multiple random variables, the distribution of any single random variable is referred to as a marginal distribution. A joint distribution fully specifies the corresponding marginal distributions; however, the converse is not true (unless the random variables are independent).

Example. Roll two fair six-sided dice. Let $X$ be the sum of the two dice and $Y$ the larger of the two rolls.

In [2]:
die = list(range(1, 6 + 1))
P = BoxModel(die, size=2)
X = RV(P, sum)
Y = RV(P, max)

Simulating pairs (or tuples) of values with &

Joining X and Y with an ampersand & and calling .sim() simulates the pair of (X, Y) values for each simulated outcome of the probability space. The simulated results can be used to approximate the joint distribution of X and Y, which describes the possible pairs of values and their relative likelihoods. More than two random variables can be joined with & in the same way, producing tuples of values. Simulation tools like .sim(), .tabulate(), etc. work as before.

Example. Roll two fair six-sided dice. Let $X$ be the sum of the two dice and $Y$ the larger of the two rolls.

In [3]:
die = list(range(1, 6 + 1))
P = BoxModel(die, size=2)
X = RV(P, sum)
Y = RV(P, max)
(X & Y).sim(10000).tabulate(normalize=True)
Out[3]:
Outcome  Value
(2, 1)   0.0246
(3, 2)   0.0542
(4, 2)   0.0299
(4, 3)   0.0549
(5, 3)   0.0554
(5, 4)   0.0532
(6, 3)   0.0273
(6, 4)   0.0548
(6, 5)   0.0559
(7, 4)   0.0563
(7, 5)   0.0568
(7, 6)   0.0568
(8, 4)   0.0294
(8, 5)   0.0519
(8, 6)   0.0626
(9, 5)   0.0519
(9, 6)   0.0527
(10, 5)  0.0285
(10, 6)  0.0557
...      ...
(12, 6)  0.0287
Total    0.9999999999999998
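
As a stdlib-only cross-check (plain Python, not Symbulate syntax), the exact joint pmf of (X, Y) can be computed by enumerating the 36 equally likely outcomes; the simulated proportions should be close to these exact probabilities.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely rolls of two fair dice and tally (sum, max).
counts = Counter((a + b, max(a, b)) for a, b in product(range(1, 7), repeat=2))
joint_pmf = {pair: Fraction(n, 36) for pair, n in sorted(counts.items())}

for pair, p in joint_pmf.items():
    print(pair, float(p))
# e.g. P(X=2, Y=1) = 1/36 ≈ 0.0278 and P(X=3, Y=2) = 2/36 ≈ 0.0556,
# in line with the simulated values.
```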

Visualizing simulation results with .plot()

Calling .plot() for simulated (X, Y) pairs produces a scatterplot summary of the simulated values. When the variables are discrete, it is recommended to use the jitter=True option to better visualize relative frequencies. The alpha= parameter controls the level of transparency.

Example. Roll two fair six-sided dice. Let $X$ be the sum of the two dice and $Y$ the larger of the two rolls.

In [4]:
die = list(range(1, 6 + 1))
P = BoxModel(die, size=2)
X = RV(P, sum)
Y = RV(P, max)
(X & Y).sim(10000).plot(jitter = True, alpha = 0.01)

See the section on Symbulate graphics for more details on plotting options and functionality.

Commonly used joint distributions

Recall that a RV can be defined by specifying its distribution directly. Similarly, multiple RVs can be defined by specifying the joint distribution directly. Several commonly used joint distributions are built into Symbulate. For example, a multivariate normal distribution is a joint distribution parametrized by a mean vector and a covariance matrix.

In [5]:
covmatrix = [[1, -0.5],
             [-0.5, 4]]
X, Y = RV(MultivariateNormal(mean = [0, 1], cov = covmatrix)) # see below for notes on "unpacking"

Custom joint distributions can be specified using ProbabilitySpace. For example, it is possible to specify a joint distribution via conditional and marginal distributions.

Defining independent random variables

Intuitively, a collection of random variables is independent if knowing the values of some of them does not influence the joint distribution of the others. Random variables $X$ and $Y$ are independent if and only if the joint distribution factors into the product of the corresponding marginal distributions. That is, for independent RVs the joint distribution is fully specified by the marginal distributions.
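
The factorization characterization can be verified exactly for a small discrete example (a plain-Python illustration, not Symbulate syntax): with four independent fair coin flips, let X count heads in the first two flips and Y count heads in the last two.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# 16 equally likely outcomes of four independent fair coin flips.
outcomes = list(product([0, 1], repeat=4))
n = len(outcomes)

joint = Counter((w[0] + w[1], w[2] + w[3]) for w in outcomes)  # (X, Y) pairs
marg_x = Counter(w[0] + w[1] for w in outcomes)                # X alone
marg_y = Counter(w[2] + w[3] for w in outcomes)                # Y alone

# The joint pmf is exactly the product of the marginal pmfs.
for x in marg_x:
    for y in marg_y:
        assert Fraction(joint[(x, y)], n) == Fraction(marg_x[x], n) * Fraction(marg_y[y], n)
print("joint pmf = product of marginals for every (x, y)")
```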

Recall that a RV can be defined by specifying its distribution directly. When dealing with multiple random variables it is common to specify the marginal distribution of each and assume independence. In Symbulate, independence of distributions is represented by an asterisk *. The * syntax reflects that under independence joint objects (e.g. the cdf or pdf) are products of the corresponding marginal objects.

Example. Let $X$, $Y$, and $Z$ be independent, with $X$ having a Binomial(5, 0.5) distribution, $Y$ a Normal(0,1) distribution, and $Z$ a Uniform(0,1) distribution.

In [6]:
X, Y, Z = RV(Binomial(5, 0.5) * Normal(0, 1) * Uniform(0, 1))  # see below for notes on "unpacking"
(X & Y & Z).sim(10000)
Out[6]:
Index  Result
0      (3, 2.049377484384863, 0.01452667745811631)
1      (3, 0.2965996738278222, 0.3874335036588029)
2      (1, 0.12456909837553246, 0.8194704031472549)
3      (1, -1.96786535566716, 0.7164738133837657)
4      (1, 0.2818197487132637, 0.09637422851842081)
5      (3, -1.1838853596161603, 0.26669925495669444)
6      (3, -1.1226557252476137, 0.28602775676606274)
7      (2, -0.5372350753685712, 0.04436664246957134)
8      (1, 1.2210670521121552, 0.559704151633999)
...    ...
9999   (2, 0.9417920674984543, 0.9050446720499228)

The product syntax emphasizes that the random variables are defined on the same probability space (a product space). It is also possible to define each random variable separately and then use the AssumeIndependent command. The following code is equivalent to the above code. Either syntax has the effect of creating an unspecified probability space upon which random variables $X, Y, Z$ are defined via unspecified functions such that $X$, $Y$, and $Z$ are independent and have the specified marginal distributions.

In [7]:
X = RV(Binomial(5, 0.5))
Y = RV(Normal(0, 1))
Z = RV(Uniform(0, 1)) 
X, Y, Z = AssumeIndependent(X, Y, Z)

Random variables are independent and identically distributed (i.i.d.) when they are independent and have a common marginal distribution. For example, if V represents the number of heads in two flips of a penny and W the number of heads in two flips of a dime, then V and W are i.i.d., with a common Binomial(n=2, p=0.5) marginal distribution. For i.i.d. random variables, defining the joint distribution using the "exponentiation" notation ** makes the code a little more compact.

Example. Let $X$ and $Y$ be i.i.d. Normal(0, 1) random variables.

In [8]:
X, Y = RV(Normal(0,1) ** 2)  # see below for notes on "unpacking"
(X & Y).sim(10000).plot(alpha = 0.01)

Random vectors

Recall that a random variable maps an outcome in a probability space to a real number. A random vector maps an outcome in a probability space to a vector of values. In other words, a random vector is a vector of random variables.

Each realization of a random vector is a tuple of values, rather than a single value. For example, a roll of two dice could return the pair of values (sum of the rolls, larger of the rolls). The RV class can be used to define random vectors as well as random variables.

Example. Suppose that a random vector X is formed by drawing two values independently, the first from a Binomial(5, 0.5) distribution and the second from a Normal(0, 1) distribution. Note that calling .sim() on X below generates pairs of values.

In [9]:
X = RV(Binomial(5, 0.5) * Normal(0, 1))
X.sim(3)
Out[9]:
Index  Result
0      (2, -0.07347117950505622)
1      (4, 0.1776824745801523)
2      (2, 1.1530403666034696)

Components of a random vector X can be accessed using brackets []. Note that Python starts the index at 0, so the first entry of a vector X is X[0], the second entry is X[1], etc. Each component of a random vector is a random variable so indexing using brackets produces a random variable which can be manipulated accordingly.

In [10]:
X = RV(Binomial(5, 0.5) * Normal(0, 1))
X[0].sim(10000).plot()

Brackets can be used to access components of the random vector itself, or of the simulated values of a random vector.

In [11]:
X = RV(Binomial(5, 0.5) * Normal(0, 1))
X.sim(10000)[1].plot()

"Unpacking"

Individual components of a random vector X can be accessed using brackets, e.g. X[0], X[1], etc. When a problem involves only a few random variables, it is typical to denote them as e.g. X, Y, Z (rather than X[0], X[1], X[2]). Components of a random vector can be "unpacked" in this way when defining an RV, allowing for more compact syntax.

Example. Let $X$, $Y$, and $Z$ be independent, with $X$ having a Binomial(5, 0.5) distribution, $Y$ a Normal(0,1) distribution, and $Z$ a Uniform(0,1) distribution. The following two cells provide two ways this situation can be defined. The first version is the "unpacked" definition which defines the three random variables. The second defines a random vector and then accesses each of its components with brackets.

In [12]:
# unpacked version
X, Y, Z = RV(Binomial(5, 0.5) * Normal(0,1) * Uniform(0,1))
Y.sim(10000).plot()
(X & Y & Z).sim(4)
Out[12]:
Index  Result
0      (2, -0.200451446170656, 0.057815075694642726)
1      (3, 0.45618319891858616, 0.93834287355371)
2      (2, -0.09791137681833872, 0.012096972502916081)
3      (3, 0.4088040739387722, 0.7107173953405814)
In [13]:
# vector version
XYZ = RV(Binomial(5, 0.5) * Normal(0,1) * Uniform(0,1))
X = XYZ[0]
Y = XYZ[1]
Z = XYZ[2]
Y.sim(10000).plot()
XYZ.sim(4)
Out[13]:
Index  Result
0      (3, 0.5947805259942871, 0.7418898497777495)
1      (1, -0.08903778410915292, 0.8168106262975037)
2      (3, -0.972736806127421, 0.6763869498179407)
3      (4, -1.6829702568612732, 0.7179177479263008)

Marginal distributions

Each component of a random vector is a random variable, so unpacking or indexing using brackets produces random variables which can be manipulated accordingly to describe their marginal distribution.

When multiple random variables are simulated, applying .mean(), .var(), or .sd() returns the marginal means, variances, and standard deviations, respectively, of each of the random variables involved.

Example. A vector of independent random variables.

In [14]:
X = RV(Binomial(5, 0.5) * Normal(0, 1) * Poisson(4))
X[2].sim(10000).plot()
In [15]:
X.sim(10000).mean()
Out[15]:
(2.5095000000000001, -0.0015713476953373871, 4.0102000000000002)
In [16]:
X.sim(10000).sd()
Out[16]:
(1.1212847943319535, 1.0067649707227069, 2.0003689634664052)

Example. A multivariate normal example, with "unpacking".

In [17]:
covmatrix = [[1, -0.5],
             [-0.5, 4]]
X, Y = RV(MultivariateNormal(mean = [0, 1], cov = covmatrix))
xy = (X & Y).sim(10000)
xy.mean()
Out[17]:
(0.005306236792298356, 0.99975848853662774)
In [18]:
xy.var()
Out[18]:
(1.0069917688899082, 3.9194556575620396)

Covariance

The covariance between random variables $X$ and $Y$, defined as $$ Cov(X,Y) = E[(X-E(X))(Y-E(Y))], $$ measures the degree of linear dependence between $X$ and $Y$. Covariance can be approximated by simulating many pairs of values of the random variables and using .cov().
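
As a plain-Python illustration of the definition (not Symbulate syntax), the covariance for the earlier two-dice example, with $X$ the sum and $Y$ the larger roll, can be computed exactly by averaging $(X-E(X))(Y-E(Y))$ over the 36 equally likely outcomes:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely rolls
p = Fraction(1, 36)

EX = sum(p * (a + b) for a, b in outcomes)        # E(X) = 7
EY = sum(p * max(a, b) for a, b in outcomes)      # E(Y) = 161/36
cov = sum(p * ((a + b) - EX) * (max(a, b) - EY) for a, b in outcomes)
print(cov, float(cov))  # 35/12 ≈ 2.9167
```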

Example. Let $X$ be the minimum and $Y$ the maximum of two independent Uniform(0,1) random variables. It can be shown that $Cov(X,Y) = 1/36$ (and the correlation is 1/2).

In [19]:
P = Uniform(a=0, b=1) ** 2
X = RV(P, min)
Y = RV(P, max)
xy = (X & Y).sim(10000)
plot(xy, alpha = 0.01)
xy.cov()
Out[19]:
0.027649786175647135
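
The $1/36 \approx 0.0278$ value can also be cross-checked with a stdlib-only simulation (plain Python, not Symbulate syntax):

```python
import random

random.seed(1)
# Pairs (min, max) of two independent Uniform(0, 1) draws.
pairs = [(min(u, v), max(u, v))
         for u, v in ((random.random(), random.random()) for _ in range(100000))]
n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n             # ≈ E(min) = 1/3
mean_y = sum(y for _, y in pairs) / n             # ≈ E(max) = 2/3
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
print(cov_xy)  # ≈ 1/36 ≈ 0.0278
```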

Example. A multivariate normal example.

In [20]:
covmatrix = [[1, -0.5],
             [-0.5, 4]]
X, Y = RV(MultivariateNormal(mean=[0, 1], cov=covmatrix)) # see below for notes on "unpacking"
xy = (X & Y).sim(10000)
xy.cov()
Out[20]:
-0.50879028805029503

When simulating more than two random variables, applying .cov() returns the covariance matrix, containing the covariance between each pair of variables (with the variances on the diagonal).

In [21]:
(X & Y & X+Y).sim(10000).cov()
Out[21]:
array([[ 1.01248616, -0.54089746,  0.4715887 ],
       [-0.54089746,  4.00041858,  3.45952112],
       [ 0.4715887 ,  3.45952112,  3.93110982]])

Correlation

The correlation coefficient is a standardized measure of linear dependence which takes values in $[-1, 1]$. $$ Corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)}\sqrt{Var(Y)}} = Cov\left(\frac{X - E(X)}{SD(X)},\frac{Y - E(Y)}{SD(Y)}\right) $$ The correlation coefficient can be approximated by simulating many pairs of values and using .corr().
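
The second expression, correlation as the covariance of the standardized variables, can be checked numerically in plain Python (an illustration, not Symbulate syntax; here Y = X + independent noise gives a known correlation of $1/\sqrt{2} \approx 0.707$):

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def sd(v):
    return cov(v, v) ** 0.5

random.seed(2)
x = [random.gauss(0, 1) for _ in range(10000)]
y = [a + random.gauss(0, 1) for a in x]           # Y = X + independent noise

corr_direct = cov(x, y) / (sd(x) * sd(y))
zx = [(a - mean(x)) / sd(x) for a in x]           # standardize X
zy = [(b - mean(y)) / sd(y) for b in y]           # standardize Y
corr_standardized = cov(zx, zy)
print(corr_direct, corr_standardized)             # equal up to floating point
```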

Example. A bivariate normal example.

In [22]:
X, Y = RV(BivariateNormal(mean1=0, mean2=1, sd1=1, sd2=2, corr=-0.25 ))
xy = (X & Y).sim(10000)
xy.corr()
Out[22]:
-0.24459822201203504

When simulating more than two random variables, applying .corr() returns the correlation matrix, containing the correlation between each pair of variables (with 1s on the diagonal, since a variable is perfectly correlated with itself).

In [23]:
(X & Y & X+Y).sim(10000).corr()
Out[23]:
array([[ 1.        , -0.26250257,  0.2460193 ],
       [-0.26250257,  1.        ,  0.87069336],
       [ 0.2460193 ,  0.87069336,  1.        ]])

Transformations of random variables

Random variables are often defined as functions of other random variables. In particular, arithmetic operations like addition, subtraction, multiplication, and division can be applied to random variables defined on the same probability space.

Example. Two soccer teams score goals independently of each other, team A according to a Poisson distribution with mean 2.3 goals per game and team B according to a Poisson distribution with mean 1.7 goals per game. Produce a plot of the approximate distribution of the total number of goals scored in a game.

In [24]:
X, Y = RV(Poisson(lam=2.3) * Poisson(lam=1.7))
Z = X + Y
Z.sim(10000).plot()
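
As a stdlib-only sanity check (Python's random module has no Poisson sampler, so one is sketched here using Knuth's multiplicative algorithm; not Symbulate syntax), the total should have mean 2.3 + 1.7 = 4.0; in fact, the sum of independent Poissons is itself Poisson(4.0).

```python
import math
import random

def poisson(lam, rng):
    """Sample a Poisson(lam) value via Knuth's multiplicative algorithm."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(3)
totals = [poisson(2.3, rng) + poisson(1.7, rng) for _ in range(20000)]
print(sum(totals) / len(totals))  # ≈ 4.0
```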

Example. The coordinates of a "random point" in the $(x, y)$ plane are random variables $X$ and $Y$ chosen independently of each other, each according to a Normal(0, 1) distribution. Produce a plot of the approximate distribution of $Z$, the distance of the $X, Y$ point from the origin.

In [25]:
X, Y = RV(Normal(0, 1) ** 2)
Z = sqrt(X ** 2 + Y ** 2)
Z.sim(10000).plot()
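
Here $Z$ has a Rayleigh distribution with mean $\sqrt{\pi/2} \approx 1.2533$, which a stdlib-only simulation can confirm (plain Python, not Symbulate syntax):

```python
import math
import random

rng = random.Random(4)
# Distance from the origin of a standard bivariate normal point.
z = [math.hypot(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20000)]
print(sum(z) / len(z))  # ≈ sqrt(pi / 2) ≈ 1.2533
```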

Example. Let $X$ and $Y$ be i.i.d. Exponential(1) random variables. Produce a plot of the approximate joint distribution of $W = X+ Y$ and $Z = X / W$.

In [26]:
X, Y = RV(Exponential(1)**2)
W = X + Y
Z = X / W
(W & Z).sim(10000).plot(alpha = 0.05)
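
A known fact worth noting: for i.i.d. Exponential(1) random variables, $Z = X/(X+Y)$ is Uniform(0, 1) (and is independent of $W$). A stdlib-only check of the mean (plain Python, not Symbulate syntax):

```python
import math
import random

def exponential(rng):
    """Exponential(1) sample via the inverse CDF of a uniform."""
    return -math.log(1.0 - rng.random())

rng = random.Random(5)
z_values = []
for _ in range(20000):
    x, y = exponential(rng), exponential(rng)
    z_values.append(x / (x + y))
print(sum(z_values) / len(z_values))  # ≈ 0.5, the mean of Uniform(0, 1)
```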

A caution about working with multiple random variables

In order to manipulate multiple random variables simultaneously, they must be defined on the same probability space. Otherwise, it is not possible to determine the relationship between the random variables. Note that the following code would produce an error because the random variables are not explicitly defined on the same probability space; in particular, Symbulate has no way of determining the joint distribution of $X$ and $Y$. The error can be fixed by adding X, Y = AssumeIndependent(X, Y) before the last line, after which the code is equivalent to the soccer example above.

X = RV(Poisson(2.3))  
Y = RV(Poisson(1.7))  
(X + Y).sim(10000).plot()

To define independent random variables, the independence must be made explicit, either with the product syntax * (or **) or with AssumeIndependent. (The AssumeIndependent command has the effect of defining the random variables involved on the same probability space, a product space formed from the marginal distributions.)