In the SciPy stack, we have two means for determining probability: a symbolic setting and a numerical setting. In this brief section, we are going to compare both with a sequence of examples.
For the symbolic treatment of random variables, we employ the module sympy.stats, while for the numerical treatment, we use the module scipy.stats. In both cases, the goal is the same: the instantiation of random variables, followed by three kinds of operations on them: the computation of probabilities (including conditional probabilities), the computation of expected values, moments, and other parameters, and the computation of the associated probability mass/density and cumulative distribution functions.
Let's observe several situations through the lens of these two different settings.

Let's start with discrete random variables. For instance, consider the random variables that describe the process of rolling three 6-sided dice and one 100-sided die, together with the sum of all four throws:
In [1]: from sympy import var
   ...: from sympy.stats import (Die, sample_iter, P, variance,
   ...:                          std, E, moment, cdf, density,
   ...:                          Exponential, skewness)
In [2]: D6_1, D6_2, D6_3 = Die('D6_1', 6), Die('D6_2', 6), Die('D6_3', 6)
   ...: D100 = Die('D100', 100)
   ...: X = D6_1 + D6_2 + D6_3 + D100
We run a simulation, where we throw those four dice 20 times and collect the sum of each throw:
In [3]: for item in sample_iter(X, numsamples=20):
   ...:     print(item, end=' ')
   ...:
45 50 84 43 44 84 102 38 90 94 35 78 67 54 20 64 62 107 59 84
Let's illustrate how easily we can compute probabilities associated with these variables. For instance, the probability that the sum of the three 6-sided dice is smaller than the outcome of the 100-sided die can be obtained as follows:
In [4]: P(D6_1 + D6_2 + D6_3 < D100)
Out[4]: 179/200
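Since the sample space here is small (21,600 equally likely outcomes), this exact value can be double-checked by brute-force enumeration with only the standard library. The following sketch is not part of the original session:

```python
from fractions import Fraction
from itertools import product

# Enumerate every equally likely outcome of three 6-sided dice
# and one 100-sided die, and count the favorable ones.
favorable = total = 0
for a, b, c, d in product(range(1, 7), range(1, 7),
                          range(1, 7), range(1, 101)):
    total += 1
    if a + b + c < d:
        favorable += 1

probability = Fraction(favorable, total)
print(probability)  # 179/200
```

The count confirms the symbolic answer: 19,332 of the 21,600 outcomes are favorable, and 19332/21600 reduces to 179/200.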
Conditional probabilities are also within reach, such as: "What is the probability of obtaining at least 10 when throwing two 6-sided dice, if the first one shows a 5?":
In [5]: from sympy import Eq   # don't use == with symbolic objects!
In [6]: P(D6_1 + D6_2 > 9, Eq(D6_1, 5))
Out[6]: 1/3
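Conditioning on the first die showing a 5 leaves only the six outcomes of the second die, so this value can also be verified by a one-line enumeration (a sanity check, not part of the original session):

```python
from fractions import Fraction

# Given that the first die shows 5, the sum exceeds 9 exactly
# when the second die shows 5 or 6: 2 outcomes out of 6.
favorable = sum(1 for d2 in range(1, 7) if 5 + d2 > 9)
cond_prob = Fraction(favorable, 6)
print(cond_prob)  # 1/3
```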
The computation of parameters of the associated probability distributions is also very simple. In the following session, we obtain the variance, standard deviation, and expected value of X, together with some higher-order moments of this variable around zero:
In [7]: variance(X), std(X), E(X)
Out[7]: (842, sqrt(842), 61)
In [8]: for n in range(2, 10):
   ...:     print("mu_{0} = {1}".format(n, moment(X, n, 0)))
   ...:
mu_2 = 4563
mu_3 = 381067
mu_4 = 339378593/10
mu_5 = 6300603685/2
mu_6 = 1805931466069/6
mu_7 = 176259875749813/6
mu_8 = 29146927913035853/10
mu_9 = 586011570997109973/2
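The mean and variance above can also be obtained by hand, since for independent summands both quantities simply add, and a fair n-sided die has mean (n+1)/2 and variance (n^2-1)/12. A short check with exact fractions (the helper names are ours, not from any library):

```python
from fractions import Fraction

def die_mean(n):
    # Expected value of a fair n-sided die: (n + 1) / 2
    return Fraction(n + 1, 2)

def die_var(n):
    # Variance of a fair n-sided die: (n**2 - 1) / 12
    return Fraction(n**2 - 1, 12)

# X = D6_1 + D6_2 + D6_3 + D100; the four summands are independent,
# so their means and variances add.
mean_X = 3 * die_mean(6) + die_mean(100)
var_X = 3 * die_var(6) + die_var(100)
print(mean_X, var_X)  # 61 842
```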
We can easily compute the cumulative distribution function and the probability mass function too:
In [9]: cdf(X)
Out[9]:
{4: 1/21600,
 5: 1/4320,
 6: 1/1440,
 7: 7/4320,
 8: 7/2160,
 9: 7/1200,
 10: 23/2400,
 11: 7/480,
 12: 1/48,
 13: 61/2160,
 14: 791/21600,
 15: 329/7200,
 16: 1193/21600,
 17: 281/4320,
 18: 3/40,
 ...
 102: 183/200,
 103: 37/40,
 104: 4039/4320,
 105: 20407/21600,
 106: 6871/7200,
 107: 20809/21600,
 108: 2099/2160,
 109: 47/48,
 110: 473/480,
 111: 2377/2400,
 112: 1193/1200,
 113: 2153/2160,
 114: 4313/4320,
 115: 1439/1440,
 116: 4319/4320,
 117: 21599/21600,
 118: 1}
In [10]: density(X)
Out[10]:
{4: 1/21600,
 5: 1/5400,
 6: 1/2160,
 7: 1/1080,
 8: 7/4320,
 9: 7/2700,
 10: 3/800,
 11: 1/200,
 12: 1/160,
 13: 1/135,
 14: 181/21600,
 15: 49/5400,
 16: 103/10800,
 17: 53/5400,
 18: 43/4320,
 ...
 102: 1/100,
 103: 1/100,
 104: 43/4320,
 105: 53/5400,
 106: 103/10800,
 107: 49/5400,
 108: 181/21600,
 109: 1/135,
 110: 1/160,
 111: 1/200,
 112: 3/800,
 113: 7/2700,
 114: 7/4320,
 115: 1/1080,
 116: 1/2160,
 117: 1/5400,
 118: 1/21600}
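Because the four dice are independent, these exact values can be reproduced in plain Python by convolving the individual probability mass functions with exact fractions (a sketch with our own helper names, not library functions):

```python
from fractions import Fraction

def die_pmf(n):
    # probability mass function of a fair n-sided die
    return {k: Fraction(1, n) for k in range(1, n + 1)}

def convolve(p, q):
    # pmf of the sum of two independent discrete random variables
    out = {}
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] = out.get(a + b, 0) + pa * qb
    return out

pmf = die_pmf(6)
for other in (die_pmf(6), die_pmf(6), die_pmf(100)):
    pmf = convolve(pmf, other)

# cumulative distribution function as running sums of the pmf
cdf, running = {}, Fraction(0)
for k in sorted(pmf):
    running += pmf[k]
    cdf[k] = running

print(pmf[4], cdf[12], cdf[118])  # 1/21600 1/48 1
```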
Let's move on to continuous random variables. This short session computes the density and cumulative distribution function, as well as several parameters, of a generic exponential random variable:
In [11]: var('mu', positive=True)
    ...: var('t')
    ...: X = Exponential('X', mu)
In [12]: density(X)(t)
Out[12]: mu*exp(-mu*t)
In [13]: cdf(X)(t)
Out[13]: Piecewise((1 - exp(-mu*t), t >= 0), (0, True))
In [14]: variance(X), skewness(X)
Out[14]: (mu**(-2), 2)
In [15]: [moment(X, n, 0) for n in range(1, 10)]
Out[15]: [1/mu, 2/mu**2, 6/mu**3, 24/mu**4, 120/mu**5, 720/mu**6,
          5040/mu**7, 40320/mu**8, 362880/mu**9]
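The numerical counterpart of this variable lives in scipy.stats as expon, which parametrizes the exponential distribution by scale = 1/mu. For a concrete rate, say mu = 2, the symbolic formulas above (variance mu**(-2), skewness 2, moments n!/mu**n) can be checked numerically:

```python
from scipy.stats import expon

mu = 2.0
X = expon(scale=1/mu)  # scipy uses scale = 1/mu

print(X.var())                           # variance: mu**(-2) = 0.25
print(X.stats(moments='s'))              # skewness: 2
print([X.moment(n) for n in (1, 2, 3)])  # n!/mu**n: 0.5, 0.5, 0.75
```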
For a complete description of the module sympy.stats, with an exhaustive enumeration of all its implemented random variables, a good reference is the official documentation online at http://docs.sympy.org/dev/modules/stats.html.
In the numerical setting, a discrete random variable is represented by an instance of the class rv_discrete from the module scipy.stats. This object has the following methods:

- object.rvs to obtain samples
- object.pmf and object.logpmf to compute the probability mass function and its logarithm, respectively
- object.cdf and object.logcdf to compute the cumulative distribution function and its logarithm, respectively
- object.sf and object.logsf to compute the survival function (1 - cdf) and its logarithm, respectively
- object.ppf and object.isf to compute the percent point function (the inverse of the cdf) and the inverse of the survival function
- object.expect and object.moment to compute the expected value and other moments
- object.entropy to compute entropy
- object.median, object.mean, object.var, and object.std to compute the basic parameters (which can also be accessed with the method object.stats)
- object.interval to compute an interval with a given probability that contains a random realization of the distribution

We can then simulate an experiment with dice, similar to the previous section. In this setting, we represent dice by a uniform distribution on the set of the dice sides:
In [1]: import numpy as np, matplotlib.pyplot as plt
   ...: from scipy.stats import randint, gaussian_kde, rv_discrete
In [2]: D6 = randint(1, 7)
   ...: D100 = randint(1, 101)
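As a quick illustration of a few of the methods listed above, here is a short sketch applying them to the single 6-sided die (exact values shown as comments are the textbook results for a fair die):

```python
from scipy.stats import randint

D6 = randint(1, 7)  # uniform on {1, ..., 6}; the upper bound is exclusive

print(D6.mean(), D6.var())  # 3.5 and 35/12
print(D6.pmf(3))            # 1/6
print(D6.cdf(4))            # 4/6
print(D6.interval(0.5))     # central interval holding probability >= 0.5
```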
Symbolically, it was very simple to construct the sum of these four independent random variables. Numerically, we address the situation in a different way. Assume for a second that we do not know the kind of random variable we are to obtain. Our first step is usually to create a large sample (10,000 throws in this case) and produce a histogram of the results:
In [3]: samples = D6.rvs(10000) + D6.rvs(10000) \
   ...:         + D6.rvs(10000) + D100.rvs(10000)
In [4]: plt.hist(samples, bins=118-4)
   ...: plt.xlabel('Sum of dice')
   ...: plt.ylabel('Frequency of each sum')
   ...: plt.show()
This gives the following plot, which clearly indicates that our new random variable is not uniform:
One way to approach this problem is to approximate the distribution of the variable from this data. For that task, we use the function gaussian_kde from the scipy.stats module, which performs a kernel density estimate with Gaussian kernels:
In [5]: kernel = gaussian_kde(samples)
This gaussian_kde object has methods similar to those of an actual random variable. To estimate the probability of getting a 50, and the probability of obtaining a number no greater than 100 in a throw of these four dice, we would issue, respectively:
In [6]: kernel(50)                        # the actual answer is 1/100
Out[6]: array([ 0.00970843])
In [7]: kernel.integrate_box_1d(0, 100)   # the actual answer is 179/200
Out[7]: 0.88395064140531865
Instead of estimating this sum of random variables, and again assuming we are not familiar with the actual result, we could create an actual random variable by defining its probability mass function in terms of the probability mass functions of the summands. The key? Convolution, of course, since the random variables for these dice are independent. The sample space is the set of integers from 4 to 118 (space_sum in the following command), and the probabilities associated with each element (probs_sum) are computed as the convolution of the corresponding probabilities for each die on its sample space:
In [8]: probs_6dice = D6.pmf(np.linspace(1, 6, 6))
   ...: probs_100dice = D100.pmf(np.linspace(1, 100, 100))
In [9]: probs_sum = np.convolve(np.convolve(probs_6dice, probs_6dice),
   ...:                         np.convolve(probs_6dice, probs_100dice))
   ...: space_sum = np.linspace(4, 118, 115)
In [10]: sum_of_dice = rv_discrete(name="sod",
    ...:                           values=(space_sum, probs_sum))
In [11]: sum_of_dice.pmf(50)
Out[11]: 0.0099999999999999985
In [12]: sum_of_dice.cdf(100)
Out[12]: 0.89500000000000057
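As a final consistency check, the mean and variance of this numerically constructed variable should match the symbolic results from earlier in the section (61 and 842, up to floating-point error). A sketch of that comparison, repeating the construction so it is self-contained:

```python
import numpy as np
from scipy.stats import randint, rv_discrete

D6 = randint(1, 7)
D100 = randint(1, 101)

# pmf of the sum of four independent dice, via convolution
probs_6dice = D6.pmf(np.arange(1, 7))
probs_100dice = D100.pmf(np.arange(1, 101))
probs_sum = np.convolve(np.convolve(probs_6dice, probs_6dice),
                        np.convolve(probs_6dice, probs_100dice))
space_sum = np.arange(4, 119)
sum_of_dice = rv_discrete(name="sod", values=(space_sum, probs_sum))

mean, var = sum_of_dice.stats(moments='mv')
print(mean, var)  # approximately 61 and 842
```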