A random variable is a function that takes as an input an outcome of a probability space and returns a real number. Commonly encountered random variables are either discrete or continuous.
Values of a random variable can be simulated and analyzed much like the outcomes of a probability space. However, several additional tools are available for random variables.
from symbulate import *
%matplotlib inline
The RV
class defines random variables on a probability space. A random variable is a function which maps each outcome of a probability space to a real number; therefore a probability space must be defined first.
Example. Consider the random variable $X$ representing the number of Heads in 5 flips of a fair coin. With 1 representing Heads and 0 Tails, we can count the number of Heads by summing the 0/1 values. We define a probability space corresponding to 5 flips of a fair coin, and then a RV on that space which takes an outcome and sums its values.
P = BoxModel([0, 1], size=5)
X = RV(P, sum)
A random variable can be called as a function to return its value for a particular outcome in the probability space.
outcome = (0, 1, 0, 1, 1)
X(outcome)
If no function is specified in RV
, the default is the identity function, f(x) = x
.
P = Normal(mean=0, sd=1)
X = RV(P)
X(-0.5)
Like the probability space defintions, the RV
command only defines a random variable. Simulation tools like .sim()
, .tabulate()
, .filter()
, and .count()
work for RV
just like for ProbabilitySpace
.
P = BoxModel([0, 1], size=5)
X = RV(P, sum)
values = X.sim(10000)
values
values.tabulate(normalize=True)
values.count_gt(3) / 10000
Random variables can also be specified by user defined functions.
Example. Ten cards labeled 1, 2, $\ldots$ , 10 are shuffled and dealt one at a time. Let $X$ be the number of the cards (out of ten) for which the number on the card matches its position in the deal. (For example, a match occurs if card 3 is the third card dealt.)
n = 10
labels = list(range(n))
# remember, Python starts the index at 0, so the cards are labebeled 0, ..., 9
def number_matches(x):
count = 0
for i in range(n):
if x[i] == labels[i]:
count += 1
return count
P = BoxModel(labels, size=n, replace=False)
X = RV(P, number_matches)
X.sim(10000)
An event is a set of outcomes in the probability space. For example, if the random variable $X$ represents the number of Heads in 5 flips of a fair coin, then the event $\{X>3\}$ is the collection of Heads/Tails sequences with more than 3 Heads. For syntactical reasons, in Symbulate events must be enclosed in parentheses rather than braces, e.g (X > 3)
. Events like (X > 3)
can be simulated; calling .sim()
on an event returns True
for outcomes where the event occurs and False
otherwise. Simulating events provides another method for estimating probabilities.
Example. If $X$ represents the number of Heads in 5 flips of a fair coin, the following is one method for estimating $P(X > 3)$, the probability of more than 3 Heads in 5 flips of a fair coin. (The true probability is 0.1875.)
P = BoxModel([0, 1], size=5)
X = RV(P, sum)
(X > 3).sim(10000).tabulate()
.plot()
¶The .tabulate()
method provides a tabular summary of the simulated values of an RV
. A plot summary can be obtained with .plot()
. The type of plot returned depends on the type of the random variable, discrete or continuous.
For discrete random variables, .plot()
generally returns an "impulse" (or "spike") plot of the relative frequencies for each of the realized values. (Frequencies (counts) can be obtained using the normalize=False
option.)
P = BoxModel([0, 1], size=5)
X = RV(P, sum)
X.sim(10000).plot()
For continuous random variables, .plot()
returns a histogram, which provides the frequencies of simulated values falling in interval "bins". By default, the vertical axis of the histogram is on the density scale, so that areas of the bars correspond to relative frequencies. (Bin frequencies (counts) can be obtained using the normalize=False
option.)
P = Normal(mean=0, sd=1)
X = RV(P)
X.sim(10000).plot()
See the section on Symbulate graphics for more details on plotting options and functionality.
The .tabulate()
and .plot()
methods provide summaries of the distribution of a random variable. The distribution of a random variable is a probability model on the real line which describes the possible values that the random variable can take and their relative likelihoods. The distribution of a random variable is determined by (1) the underlying probability space on which the random variable is defined and (2) the function (mapping the probability space outcomes to real numbers) that defines the random variable. For example, if $X$ represents the number of Heads in 5 flips of a fair coin, we say that $X$ has a Binomial(n=5, p=0.5)
distribution.
It also common to specify the distribution of a random variable directly. For example, a problem may state "let $X$ have a Binomial(5, 0.5) distribution" without mention of the underlying probabilty space or random variable. RV
can also be used to define a random variable by specifying its distribution. (Many common probability distributions are built into Symbulate.)
X = RV(Binomial(n=5, p=0.5))
X.sim(10000).plot()
By definition, a random variable must always be a function defined on a probability space. Specifying a random variable by specifying its distribution, as in X = RV(Binomial(n=5, p=0.5))
, has the effect of defining the probability space to be the distribution of the random variable and the function defined on this space to be the identity (f(x) = x
). However, it is more appropriate to think of such a specification as defining a random variable with the given distribution on an unspecified probability space through an unspecified function.
Random variables and distributions are two distinct concepts. For example, the random variables in each of the following situations has a Binomial(5, 0.5) distribution.
Each of these situations involves a different probability space (coins, dice, births) with a random variable which counts according to different criteria (Heads, Tails, evens, boys). These examples illustrate that knowledge that a random variable has a specific distribution (e.g. Binomial(5, 0.5)) does not necessarily convey any information about the underlying observational units or variable being measured. This is why we say a specification like X = RV(Binomial(n=5, p=0.5))
defines a random variable X
on an unspecified probability space via an unspecified function.
It is important not to confuse a random variable with its distribution. For example, suppose that a coin is flipped 5 times and $X$ represents the number of Heads and $Y$ the number of Tails in the 5 flips (so $Y = 5 - X$). Note that $X$ and $Y$ are two different random variables; they measure different things. For example, if the outcome of the flips is (1, 0, 0, 1, 0)
then $X = 2$ but $Y = 3$. In fact, in this example the values of $X$ and $Y$ are unequal for every outcome in the probability space (that is, $P(X = Y) = 0$).
P = BoxModel([1, 0], size = 5)
X = RV(P, sum)
Y = 5 - X
(X == Y).sim(10000).tabulate()
However, while $X$ and $Y$ are two different random variables, they do have the same distribution over many outcomes.
X.sim(10000).plot()
Y.sim(10000).plot(jitter = True)
A transformation of a random variable is also a random variable. If $X$ is a random variable and $g$ is function, define a new random variable $Y=g(X)$ in Symbulate using the .apply()
method, e.g. Y = X.apply(g)
. An RV
defined via a transformation of another RV
behaves like any Symbulate RV
.
Note that for arithmetic operations and many math functions (such as exp
, log
, sin
) you can simply call g(X)
rather than X.apply(g)
. User-defined functions can also be used to define or transform random variables.
Example. If $X$ has a normal distribution, then $e^X$ has a lognormal distribution.
X = RV(Normal(mean=0, var=1))
Y = exp(X) # same as X.apply(exp)
Y.sim(10000).plot()
Example. If $U$ has a Uniform(0,1) distribution then $-\log(1-U)$ has an Exponential(1) distribution.
U = RV(Uniform(a=0, b=1))
X = -log(1 - U)
X.sim(10000).plot(alpha=0.5)
RV(Exponential(1)).sim(10000).plot(alpha=0.5)
Example. A user defined function with .apply()
def g(x):
return min(abs(x - 1), 2)**2
X = RV(Exponential(rate = 1))
Y = X.apply(g)
Y.sim(10000).plot()
The expected value of a random variable, $E(X)$, is its probability-weighted average value.
The expected value of a discrete random variable $X$ with probability mass function $p_X(x) = P(X=x)$ is
$$
E(X) = \sum_{x} x \, p_X(x)
$$
The expected value $E(X)$ of a continuous random variable $X$ with probability density function $f_X$ is
$$
E(X) = \int\limits_{-\infty}\limits^{+\infty} x \, f_{X}(x)\, dx
$$
An expected value can be approximated by simulating many values of the random variable and finding their sample mean (i.e. average) using .mean()
.
Example. If $X$ has a geometric distribution with $p = 0.1$ its expected value is $1/0.1 = 10$.
RV(Geometric(p=0.1)).sim(10000).mean()
If x denotes simulated values, then either x.mean()
or mean(x)
will return the mean of the values.
x = RV(Geometric(p=0.1)).sim(10000)
mean(x)
Recall that output can be formatted using print statements.
print('An estimate of E(X) is: {:.2f}'.format(mean(x)))
The variance of a random variable is the probability-weighted average of the squared deviation from the expected value:
$$
Var(X) = E\left[\left( X-E(X)\right) ^2\right].
$$
Similar to expected value, variance can be approximated by simulating many values of the random variable and finding their sample variance, using .var()
. (As with mean, both x.var()
or var(x)
are allowed.)
Example. If $X$ has an exponential distribution with rate parameter $\lambda=1/4$ its variance is 16.
X = RV(Exponential(rate=1/4))
x = X.sim(10000)
x.var()
Using .var()
is equivalent to the following.
mean((x - x.mean()) ** 2)
Standard deviation, .sd()
, is the square root of the variance. (As with mean, both x.sd()
or sd(x)
are allowed.)
x.sd()
sqrt(x.var())
A standardized variable is a variable that has been rescaled to have a mean of zero and a standard deviation of one. Standardization is achieved by subtracting the mean from each value and then dividing by the standard deviation. In Symbulate, the .standardize()
method standardizes the simulated values of a random variable.
X = RV(Normal(mean=3, sd=2))
x = X.sim(10000)
x.mean(), x.sd()
z = x.standardize()
x.plot()
z.plot()
z.mean(), z.sd()