Probability

In the SciPy stack, we have two means of computing probabilities: a symbolic setting and a numerical setting. In this brief section, we are going to compare both through a sequence of examples.

For the symbolic treatment of random variables, we employ the module sympy.stats, while for the numerical treatment, we use the module scipy.stats. In both cases, the goal is the same: the instantiation of random variables, followed by three kinds of operations on them:

  • Description of the probability distribution of a random variable with numbers (parameters).
  • Description of a random variable in terms of functions.
  • Computation of associated probabilities.

Let's observe several situations through the lens of the two different settings.

Symbolic setting

Let's start with discrete random variables. For instance, let's consider several random variables used to describe the process of rolling three 6-sided dice and one 100-sided die, together with the sum of their outcomes:

In [1]: from sympy import var; 
   ...: from sympy.stats import Die, sample_iter, P, variance, 
   ...:                         std, E, moment, cdf, density, 
   ...:                         Exponential, skewness
In [2]: D6_1, D6_2, D6_3 = Die('D6_1', 6), Die('D6_2', 6), 
   ...:                    Die('D6_3', 6); 
   ...: D100 = Die('D100', 100); 
   ...: X = D6_1 + D6_2 + D6_3 + D100

We run a simulation, where we cast those four dice 20 times, and collect the sum of each throw:

In [3]: for item in sample_iter(X, numsamples=20):
   ...:     print(item, end=' ')
   ...:
45 50 84 43 44 84 102 38 90 94 35 78 67 54 20 64 62 107 59 84

Let's illustrate how easily we can compute probabilities associated with these variables. For instance, the probability that the sum of the three 6-sided dice is smaller than the throw of the 100-sided die can be obtained as follows:

In [4]: P(D6_1 + D6_2 + D6_3 < D100)
Out[4]: 179/200

Conditional probabilities can also be computed, such as, "What is the probability of obtaining at least a 10 when throwing two 6-sided dice, if the first one shows a 5?":

In [5]: from sympy import Eq   # Don't use == with symbolic objects!
In [6]: P(D6_1 + D6_2 > 9, Eq(D6_1, 5))
Out[6]: 1/3
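If desired, we can sanity-check this value with a quick Monte Carlo simulation in NumPy, fixing the first die at 5 and sampling only the second (a rough sketch, independent of sympy.stats):

```python
import numpy as np

rng = np.random.default_rng(0)
d2 = rng.integers(1, 7, size=100_000)  # 100,000 throws of the second die
# With the first die fixed at 5, the event "sum > 9" is simply "d2 > 4"
estimate = np.mean(5 + d2 > 9)         # should be close to 1/3
```

With this many samples, the estimate lands within a fraction of a percent of the exact value 1/3.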

The computation of parameters of the associated probability distributions is also very simple. In the following session, we obtain the variance, standard deviation, and expected value of X, together with some other higher-order moments of this variable around zero:

In [7]: variance(X), std(X), E(X)
Out[7]: (842, sqrt(842), 61)
In [8]: for n in range(2,10):
   ...:     print("mu_{0} = {1}".format(n, moment(X, n, 0)))
   ...:
mu_2 = 4563
mu_3 = 381067
mu_4 = 339378593/10
mu_5 = 6300603685/2
mu_6 = 1805931466069/6
mu_7 = 176259875749813/6
mu_8 = 29146927913035853/10
mu_9 = 586011570997109973/2
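These figures agree with a brute-force enumeration of the 21,600 equally likely outcomes; the following sketch, using only the standard library, reproduces the mean, variance, and second raw moment exactly:

```python
from fractions import Fraction
from itertools import product

faces6, faces100 = range(1, 7), range(1, 101)
# Enumerate every equally likely outcome of the four dice
outcomes = [a + b + c + d
            for a, b, c, d in product(faces6, faces6, faces6, faces100)]
n = len(outcomes)                                 # 21600 outcomes
mean = Fraction(sum(outcomes), n)                 # 61
mu_2 = Fraction(sum(o * o for o in outcomes), n)  # 4563
var = mu_2 - mean**2                              # 842
```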

We can easily compute the probability mass function and the cumulative distribution function too:

In [9]: cdf(X)                         In [10]: density(X)
Out[9]:                                Out[10]:
{4: 1/21600,                           {4: 1/21600,
 5: 1/4320,                             5: 1/5400,
 6: 1/1440,                             6: 1/2160,
 7: 7/4320,                             7: 1/1080,
 8: 7/2160,                             8: 7/4320,
 9: 7/1200,                             9: 7/2700,
 10: 23/2400,                           10: 3/800,
 11: 7/480,                             11: 1/200,
 12: 1/48,                              12: 1/160,
 13: 61/2160,                           13: 1/135,
 14: 791/21600,                         14: 181/21600,
 15: 329/7200,                          15: 49/5400,
 16: 1193/21600,                        16: 103/10800,
 17: 281/4320,                          17: 53/5400,
 18: 3/40,                              18: 43/4320,

...

 102: 183/200,                          102: 1/100,
 103: 37/40,                            103: 1/100,
 104: 4039/4320,                        104: 43/4320,
 105: 20407/21600,                      105: 53/5400,
 106: 6871/7200,                        106: 103/10800,
 107: 20809/21600,                      107: 49/5400,
 108: 2099/2160,                        108: 181/21600,
 109: 47/48,                            109: 1/135,
 110: 473/480,                          110: 1/160,
 111: 2377/2400,                        111: 1/200,
 112: 1193/1200,                        112: 3/800,
 113: 2153/2160,                        113: 7/2700,
 114: 4313/4320,                        114: 7/4320,
 115: 1439/1440,                        115: 1/1080,
 116: 4319/4320,                        116: 1/2160,
 117: 21599/21600,                      117: 1/5400,
 118: 1}                                118: 1/21600}
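Note that the values of cdf(X) are the running sums of the values of density(X); for instance, the entry at 6 can be recovered by hand with the standard library's exact fractions:

```python
from fractions import Fraction

# density(X) at 4, 5, and 6, taken from the output above
density_vals = [Fraction(1, 21600), Fraction(1, 5400), Fraction(1, 2160)]
cdf_at_6 = sum(density_vals)   # Fraction(1, 1440), matching cdf(X)[6]
```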

Let's move on to continuous random variables. This short session computes the density and cumulative distribution function, as well as several parameters, of a generic exponential random variable:

In [11]: var('mu', positive=True); 
   ....: var('t'); 
   ....: X = Exponential('X', mu)
In [12]: density(X)(t)
Out[12]: mu*exp(-mu*t)
In [13]: cdf(X)(t)
Out[13]: Piecewise((1 - exp(-mu*t), t >= 0), (0, True))
In [14]: variance(X), skewness(X)
Out[14]: (mu**(-2), 2)
In [15]: [moment(X, n, 0) for n in range(1,10)]
Out[15]:
[1/mu,
 2/mu**2,
 6/mu**3,
 24/mu**4,
 120/mu**5,
 720/mu**6,
 5040/mu**7,
 40320/mu**8,
 362880/mu**9]
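The pattern E[X**n] = n!/mu**n can be cross-checked numerically against scipy.stats for a concrete rate; here mu = 2.0 is an arbitrary choice for the check (note that SciPy parametrizes the exponential distribution by scale = 1/mu):

```python
from math import factorial
from scipy.stats import expon

mu = 2.0                       # arbitrary positive rate, for the check only
X = expon(scale=1/mu)          # SciPy uses scale = 1/mu
moments = [X.moment(n) for n in range(1, 6)]        # raw moments about zero
exact = [factorial(n) / mu**n for n in range(1, 6)]  # n!/mu**n
```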

Tip

For a complete description of the module sympy.stats with an exhaustive enumeration of all its implemented random variables, a good reference is the official documentation online at http://docs.sympy.org/dev/modules/stats.html.

Numerical setting

The description of a discrete random variable in the numerical setting is performed by instantiating an object of the class rv_discrete from the module scipy.stats. This object has the following methods:

  • object.rvs to obtain samples
  • object.pmf and object.logpmf to compute the probability mass function and its logarithm, respectively
  • object.cdf and object.logcdf to compute the cumulative distribution function and its logarithm, respectively
  • object.sf and object.logsf to compute the survival function (1-cdf) and its logarithm, respectively
  • object.ppf and object.isf to compute the percent point function (the inverse of the CDF) and the inverse of the survival function
  • object.expect and object.moment to compute expected value and other moments
  • object.entropy to compute entropy
  • object.median, object.mean, object.var, and object.std to compute the basic parameters (which can also be accessed with the method object.stats)
  • object.interval to compute an interval with a given probability that contains a random realization of the distribution
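For instance, freezing a uniform distribution on the faces of a 6-sided die (via randint, as we do shortly) exposes all of these methods; a few representative calls:

```python
from scipy.stats import randint

D6 = randint(1, 7)      # uniform on {1,...,6}; the upper bound is exclusive
D6.pmf(3)               # 1/6
D6.cdf(3)               # 1/2
D6.sf(4)                # P(D6 > 4) = 1/3
D6.ppf(0.5)             # smallest k with cdf(k) >= 0.5, that is, 3.0
D6.mean(), D6.var()     # 3.5 and 35/12
D6.interval(0.5)        # central interval holding at least 50% of the mass
```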

We can then simulate an experiment with dice, similar to the one in the previous section. In this setting, we represent each die by a uniform distribution on the set of its sides:

In [1]: import numpy as np, matplotlib.pyplot as plt; 
   ...: from scipy.stats import randint, gaussian_kde, rv_discrete
In [2]: D6 = randint(1, 7); 
   ...: D100 = randint(1, 101)

Symbolically, it was very simple to construct the sum of these four independent random variables. Numerically, we address the situation in a different way. Assume for a second that we do not know the kind of random variable we will obtain. Our first step is usually to create a large sample (10,000 throws, in this case) and produce a histogram with the results:

In [3]: samples = (D6.rvs(10000) + D6.rvs(10000)
   ...:          + D6.rvs(10000) + D100.rvs(10000))

In [4]: plt.hist(samples, bins=118-4); 
   ...: plt.xlabel('Sum of dice'); 
   ...: plt.ylabel('Frequency of each sum'); 
   ...: plt.show()

This gives the following plot, which clearly indicates that our new random variable is not uniform:


One way to approach this problem is to approximate the distribution of the variable from this data. For that task, we use the function gaussian_kde from the scipy.stats module, which performs a kernel-density estimate using Gaussian kernels:

In [5]: kernel = gaussian_kde(samples)

This gaussian_kde object has methods similar to those of an actual random variable. To estimate the probability of getting a 50, and the probability of obtaining a number smaller than 100 in a throw of these four dice, we would issue, respectively:

In [6]: kernel(50)                     # The actual answer is 1/100
Out[6]: array([ 0.00970843])
In [7]: kernel.integrate_box_1d(0,100) # The actual answer is 177/200
Out[7]: 0.88395064140531865
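Both estimates can also be checked directly against sample frequencies, without any smoothing; a quick sketch with a freshly drawn (seeded) sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# Sum of three 6-sided dice and one 100-sided die, thrown n times
samples = (rng.integers(1, 7, n) + rng.integers(1, 7, n)
           + rng.integers(1, 7, n) + rng.integers(1, 101, n))
freq_eq_50 = np.mean(samples == 50)   # empirical P(X = 50), near 1/100
freq_lt_100 = np.mean(samples < 100)  # empirical P(X < 100), near 177/200
```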

Instead of estimating this sum of random variables, and again assuming we are not familiar with the actual result, we could create an actual random variable by defining its probability mass function in terms of the probability mass functions of the summands. The key? Convolution, of course, since the random variables for these dice are independent. The sample space is the set of numbers from 4 to 118 (space_sum in the following command), and the probabilities associated with each element (probs_sum) are computed as the convolution of the corresponding probabilities for each die on its sample space:

In [8]: probs_6dice = D6.pmf(np.linspace(1,6,6)); 
   ...: probs_100dice = D100.pmf(np.linspace(1,100,100))
In [9]: probs_sum = np.convolve(np.convolve(probs_6dice,probs_6dice),
   ...:                    np.convolve(probs_6dice,probs_100dice)); 
   ...: space_sum = np.linspace(4, 118, 115)
In [10]: sum_of_dice = rv_discrete(name="sod",
   ....:                           values=(space_sum, probs_sum))
In [11]: sum_of_dice.pmf(50)
Out[11]: 0.0099999999999999985
In [12]: sum_of_dice.cdf(100)
Out[12]: 0.89500000000000057
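As a final consistency check, the numerically constructed variable should reproduce the parameters we obtained symbolically (expected value 61 and variance 842); a self-contained sketch:

```python
import numpy as np
from scipy.stats import randint, rv_discrete

D6, D100 = randint(1, 7), randint(1, 101)
# Probability mass function of the sum, via convolution of the summands' pmfs
probs_sum = np.convolve(np.convolve(D6.pmf(np.arange(1, 7)),
                                    D6.pmf(np.arange(1, 7))),
                        np.convolve(D6.pmf(np.arange(1, 7)),
                                    D100.pmf(np.arange(1, 101))))
space_sum = np.arange(4, 119)          # possible sums: 4, 5, ..., 118
sum_of_dice = rv_discrete(name="sod", values=(space_sum, probs_sum))
mean, var = sum_of_dice.stats(moments='mv')   # 61 and 842
```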