from symbulate import *
%matplotlib inline
Many problems involve several random variables defined on the same probability space. Of interest are properties of the joint distribution which describe the relationship between the random variables.
In the context of multiple random variables, the distribution of any single random variable is referred to as a marginal distribution. A joint distribution fully specifies the corresponding marginal distributions; however, the converse is not true (unless the random variables are independent).
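To see concretely how a joint distribution determines the marginals, consider two fair dice with X the sum and Y the larger roll (the running example below). The following plain-Python enumeration (not Symbulate; the helper names `joint` and `marginal_x` are illustrative) recovers the marginal pmf of X by summing the joint pmf over the values of y:

```python
from fractions import Fraction
from itertools import product

# Joint pmf of (X, Y) for two fair dice: X = sum, Y = larger roll.
# All 36 ordered outcomes (d1, d2) are equally likely.
joint = {}
for d1, d2 in product(range(1, 7), repeat=2):
    x, y = d1 + d2, max(d1, d2)
    joint[(x, y)] = joint.get((x, y), Fraction(0)) + Fraction(1, 36)

# Marginal pmf of X: sum the joint pmf over the values of y.
marginal_x = {}
for (x, y), p in joint.items():
    marginal_x[x] = marginal_x.get(x, Fraction(0)) + p

print(marginal_x[7])  # -> 1/6, since six of the 36 outcomes sum to 7
```

Going the other direction is not possible in general: the marginal pmfs of X and Y alone do not determine which (x, y) pairs occur together.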
Example. Roll two fair six-sided dice. Let $X$ be the sum of the two dice and $Y$ the larger of the two rolls.
die = list(range(1, 6 + 1))
P = BoxModel(die, size=2)
X = RV(P, sum)
Y = RV(P, max)
&
Joining X and Y with an ampersand & and calling .sim() simultaneously simulates the pair of (X, Y) values for each simulated outcome of the probability space. The simulated results can be used to approximate the joint distribution of X and Y, which describes the possible pairs of values and their relative likelihoods. Likewise, tuples of values of more than two random variables can be simulated simultaneously using the ampersand &. Simulation tools like .sim(), .tabulate(), etc. work as before.
Example. Roll two fair six-sided dice. Let $X$ be the sum of the two dice and $Y$ the larger of the two rolls.
die = list(range(1, 6 + 1))
P = BoxModel(die, size=2)
X = RV(P, sum)
Y = RV(P, max)
(X & Y).sim(10000).tabulate(normalize=True)
.plot()
Calling .plot() for simulated (X, Y) pairs produces a scatterplot summary of the simulated values. When the variables are discrete, it is recommended to use the jitter=True option to better visualize relative frequencies. The alpha= parameter controls the level of transparency.
Example. Roll two fair six-sided dice. Let $X$ be the sum of the two dice and $Y$ the larger of the two rolls.
die = list(range(1, 6 + 1))
P = BoxModel(die, size=2)
X = RV(P, sum)
Y = RV(P, max)
(X & Y).sim(10000).plot(jitter = True, alpha = 0.01)
See the section on Symbulate graphics for more details on plotting options and functionality.
Recall that an RV can be defined by specifying its distribution directly. Similarly, multiple RVs can be defined by specifying their joint distribution directly. Several commonly used joint distributions are built into Symbulate. For example, a multivariate normal distribution is a joint distribution parametrized by a mean vector and a covariance matrix.
covmatrix = [[1, -0.5],
[-0.5, 4]]
X, Y = RV(MultivariateNormal(mean = [0, 1], cov = covmatrix)) # see below for notes on "unpacking"
Custom joint distributions can be specified using ProbabilitySpace. For example, it is possible to specify a joint distribution via conditional and marginal distributions.
Intuitively, a collection of random variables is independent if knowing the values of some does not influence the joint distribution of the others. Random variables $X$ and $Y$ are independent if and only if their joint distribution factors into the product of the corresponding marginal distributions. That is, for independent RVs the joint distribution is fully specified by the marginal distributions.
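The factorization criterion can be checked exactly in plain Python (no Symbulate needed) for the simplest case of two independent fair dice rolls:

```python
from fractions import Fraction
from itertools import product

# Marginal pmf of a single fair die roll
marginal = {k: Fraction(1, 6) for k in range(1, 7)}

# Joint pmf of two independent rolls: all 36 ordered pairs equally likely
joint = {(a, b): Fraction(1, 36) for a, b in product(range(1, 7), repeat=2)}

# Independence: the joint pmf is the product of the marginals at every pair
assert all(joint[a, b] == marginal[a] * marginal[b] for (a, b) in joint)

# By contrast, X = sum and Y = max of the two rolls are dependent:
# P(X = 2, Y = 6) = 0, while P(X = 2) * P(Y = 6) = (1/36) * (11/36) > 0.
```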
Recall that an RV can be defined by specifying its distribution directly. When dealing with multiple random variables, it is common to specify the marginal distribution of each and assume independence. In Symbulate, independence of distributions is represented by the asterisk *. The * syntax reflects that under independence joint objects (e.g. the cdf or pdf) are products of the corresponding marginal objects.
Example. Let $X$, $Y$, and $Z$ be independent, with $X$ having a Binomial(5, 0.5) distribution, $Y$ a Normal(0,1) distribution, and $Z$ a Uniform(0,1) distribution.
X, Y, Z = RV(Binomial(5, 0.5) * Normal(0, 1) * Uniform(0, 1)) # see below for notes on "unpacking"
(X & Y & Z).sim(10000)
The product syntax emphasizes that the random variables are defined on the same probability space (a product space). It is also possible to define each random variable separately and then use the AssumeIndependent command. The following code is equivalent to the above code. Either syntax has the effect of creating an unspecified probability space upon which random variables $X$, $Y$, and $Z$ are defined via unspecified functions such that $X$, $Y$, and $Z$ are independent and have the specified marginal distributions.
X = RV(Binomial(5, 0.5))
Y = RV(Normal(0, 1))
Z = RV(Uniform(0, 1))
X, Y, Z = AssumeIndependent(X, Y, Z)
Random variables are independent and identically distributed (i.i.d.) when they are independent and have a common marginal distribution. For example, if V represents the number of heads in two flips of a penny and W the number of heads in two flips of a dime, then V and W are i.i.d., with a common Binomial(n=2, p=0.5) marginal distribution. For i.i.d. random variables, defining the joint distribution using the "exponentiation" notation ** makes the code a little more compact.
Example. Let $X$ and $Y$ be i.i.d. Normal(0, 1) random variables.
X, Y = RV(Normal(0,1) ** 2) # see below for notes on "unpacking"
(X & Y).sim(10000).plot(alpha = 0.01)
Recall that a random variable maps an outcome in a probability space to a real number. A random vector maps an outcome in a probability space to a vector of values. In other words, a random vector is a vector of random variables.
Each realization of a random vector is a tuple of values, rather than a single value. For example, a roll of two dice could return the pair of values (sum of the rolls, larger of the rolls). The RV class can be used to define random vectors as well as random variables.
Example. Suppose that a random vector X is formed by drawing two values independently, the first from a Binomial(5, 0.5) distribution and the second from a Normal(0, 1) distribution. Note that calling .sim() on X below generates pairs of values.
X = RV(Binomial(5, 0.5) * Normal(0, 1))
X.sim(3)
Components of a random vector X can be accessed using brackets []. Note that Python starts the index at 0, so the first entry of a vector X is X[0], the second entry is X[1], etc. Each component of a random vector is a random variable, so indexing using brackets produces a random variable which can be manipulated accordingly.
X = RV(Binomial(5, 0.5) * Normal(0, 1))
X[0].sim(10000).plot()
Brackets can be used to access components of the random vector itself, or of the simulated values of a random vector.
X = RV(Binomial(5, 0.5) * Normal(0, 1))
X.sim(10000)[1].plot()
Individual components of a random vector X can be accessed using brackets, e.g. X[0], X[1], etc. When a problem involves only a few random variables, it is typical to denote them as, e.g., X, Y, Z (rather than X[0], X[1], X[2]). Components of a random vector can be "unpacked" in this way when defining an RV, allowing for more compact syntax.
Example. Let $X$, $Y$, and $Z$ be independent, with $X$ having a Binomial(5, 0.5) distribution, $Y$ a Normal(0,1) distribution, and $Z$ a Uniform(0,1) distribution. The following two cells provide two ways this situation can be defined. The first version is the "unpacked" definition which defines the three random variables. The second defines a random vector and then accesses each of its components with brackets.
# unpacked version
X, Y, Z = RV(Binomial(5, 0.5) * Normal(0,1) * Uniform(0,1))
Y.sim(10000).plot()
(X & Y & Z).sim(4)
# vector version
XYZ = RV(Binomial(5, 0.5) * Normal(0,1) * Uniform(0,1))
X = XYZ[0]
Y = XYZ[1]
Z = XYZ[2]
Y.sim(10000).plot()
XYZ.sim(4)
Each component of a random vector is a random variable, so unpacking or indexing using brackets produces random variables which can be manipulated accordingly to describe their marginal distributions.
When multiple random variables are simulated, applying .mean(), .var(), or .sd() returns the marginal means, variances, and standard deviations, respectively, of each of the random variables involved.
Example. A vector of independent random variables.
X = RV(Binomial(5, 0.5) * Normal(0, 1) * Poisson(4))
X[2].sim(10000).plot()
X.sim(10000).mean()
X.sim(10000).sd()
Example. A multivariate normal example, with "unpacking".
covmatrix = [[1, -0.5],
[-0.5, 4]]
X, Y = RV(MultivariateNormal(mean = [0, 1], cov = covmatrix))
xy = (X & Y).sim(10000)
xy.mean()
xy.var()
The covariance between random variables $X$ and $Y$, defined as
$$
Cov(X,Y) = E[(X-E(X))(Y-E(Y))],
$$
measures the degree of linear dependence between $X$ and $Y$. Covariance can be approximated by simulating many pairs of values of the random variables and using .cov().
Example. Let $X$ be the minimum and $Y$ the maximum of two independent Uniform(0,1) random variables. It can be shown that $Cov(X,Y) = 1/36$ (and the correlation is 1/2).
P = Uniform(a=0, b=1) ** 2
X = RV(P, min)
Y = RV(P, max)
xy = (X & Y).sim(10000)
xy.plot(alpha = 0.01)
xy.cov()
Example. A multivariate normal example.
covmatrix = [[1, -0.5],
[-0.5, 4]]
X, Y = RV(MultivariateNormal(mean=[0, 1], cov=covmatrix)) # see below for notes on "unpacking"
xy = (X & Y).sim(10000)
xy.cov()
When simulating more than two random variables, applying .cov() returns the covariance matrix of covariances between each pair of variables (with the variances on the diagonal).
(X & Y & X+Y).sim(10000).cov()
The correlation coefficient is a standardized measure of linear dependence which takes values in $[-1, 1]$.
$$
Corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)}\sqrt{Var(Y)}} = Cov\left(\frac{X - E(X)}{SD(X)},\frac{Y - E(Y)}{SD(Y)}\right)
$$
The correlation coefficient can be approximated by simulating many pairs of values and using .corr().
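As a plain-Python cross-check (without Symbulate) of the earlier min/max example, where the correlation was stated to be 1/2, the correlation can be estimated directly from this definition; the seed is an arbitrary choice for reproducibility.

```python
import random
from math import sqrt

random.seed(12345)  # arbitrary seed, for reproducibility

# Simulate (X, Y) = (min, max) of two independent Uniform(0, 1) draws
n = 100_000
xs, ys = [], []
for _ in range(n):
    u, v = random.random(), random.random()
    xs.append(min(u, v))
    ys.append(max(u, v))

# Corr(X, Y) = Cov(X, Y) / (SD(X) * SD(Y)), estimated from the sample
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
sdx = sqrt(sum((x - mx) ** 2 for x in xs) / n)
sdy = sqrt(sum((y - my) ** 2 for y in ys) / n)
corr = cov / (sdx * sdy)
print(round(corr, 3))  # close to the exact value 1/2
```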
Example. A bivariate normal example.
X, Y = RV(BivariateNormal(mean1=0, mean2=1, sd1=1, sd2=2, corr=-0.25 ))
xy = (X & Y).sim(10000)
xy.corr()
When simulating more than two random variables, applying .corr() returns the correlation matrix of correlations between each pair of variables (with 1s on the diagonal, since a variable is perfectly correlated with itself).
(X & Y & X+Y).sim(10000).corr()
Random variables are often defined as functions of other random variables. In particular, arithmetic operations like addition, subtraction, multiplication, and division can be applied to random variables defined on the same probability space.
Example. Two soccer teams score goals independently of each other, team A according to a Poisson distribution with mean 2.3 goals per game and team B according to a Poisson distribution with mean 1.7 goals per game. Produce a plot of the approximate distribution of the total number of goals scored in a game.
X, Y = RV(Poisson(lam=2.3) * Poisson(lam=1.7))
Z = X + Y
Z.sim(10000).plot()
Example. The coordinates of a "random point" in the $(x, y)$ plane are random variables $X$ and $Y$ chosen independently of each other, each according to a Normal(0, 1) distribution. Produce a plot of the approximate distribution of $Z$, the distance of the $X, Y$ point from the origin.
X, Y = RV(Normal(0, 1) ** 2)
Z = sqrt(X ** 2 + Y ** 2)
Z.sim(10000).plot()
Example. Let $X$ and $Y$ be i.i.d. Exponential(1) random variables. Produce a plot of the approximate joint distribution of $W = X+ Y$ and $Z = X / W$.
X, Y = RV(Exponential(1)**2)
W = X + Y
Z = X / W
(W & Z).sim(10000).plot(alpha = 0.05)
In order to manipulate multiple random variables simultaneously, they must be defined on the same probability space. Otherwise, it would not be possible to determine the relationship between the random variables. Note that the following code would produce an error because the random variables are not explicitly defined on the same probability space; in particular, Symbulate has no way of determining the joint distribution of $X$ and $Y$. The error can be fixed by adding X, Y = AssumeIndependent(X, Y) before the last line, after which the code would be equivalent to the soccer example above.
X = RV(Poisson(2.3))
Y = RV(Poisson(1.7))
(X + Y).sim(10000).plot()
If it is desired to define independent random variables, the independence must be made explicit, either with the product syntax * (or **) or with AssumeIndependent. (The AssumeIndependent command has the effect of defining the random variables involved on the same probability space, a product space formed from the marginal distributions.)