The techniques of descriptive statistics that we covered in the previous chapter give us a straightforward presentation of facts from the data. The next logical step is inference: the process of making propositions and drawing conclusions about the larger population from which the sample data was drawn.
This chapter will cover the following topics:

- Statistical inference and its three classical problems: estimation of parameters, interval estimation, and hypothesis testing
- The frequentist, Bayesian, and likelihood approaches to these problems
Statistical inference is the process of deducing properties of an underlying distribution by analysis of data. Inferential statistical analysis infers properties about a population; this includes testing hypotheses and deriving estimates.
There are three types of inference:

- Estimation of parameters (point estimation)
- Interval estimation
- Hypothesis testing
There are mainly three approaches to attack these problems:

- The frequentist approach
- The Bayesian approach (for which we will use the PyMC package)
- The likelihood approach
In this section, we briefly illustrate the three approaches to each of the three inference types. We go back to the previous example of the ratios of daily mortgage complaints to daily credit card complaints:
In [1]: import numpy as np, pandas as pd, matplotlib.pyplot as plt
In [2]: data = pd.read_csv("Consumer_Complaints.csv",
   ...:                    low_memory=False, parse_dates=[8,9])
In [3]: df = data.groupby(['Date received', 'Product']).size();
   ...: df = df.unstack();
   ...: ratios = df['Mortgage'] / df['Credit card']
In [4]: ratios.describe()
Out[4]:
count    1001.000000
mean        2.939686
std         1.341827
min         0.203947
25%         1.985507
50%         2.806452
75%         3.729167
max        12.250000
dtype: float64
By visual inspection of the histogram, we could very well assume that this data is a random sample from a normal distribution with parameters mu (average) and sigma (standard deviation). We further assume, for simplicity, that the scale parameter sigma is known, and its value is 1.3.

In this setting, the problem we want to solve is the estimation of the average mu using the data obtained.
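If the histogram has not been drawn yet, a minimal sketch of this visual check (reusing the ratios series computed above) could look as follows:

ratios.hist(bins=50)                  # a rough bell shape suggests normality
plt.xlabel('Ratio of mortgage to credit card complaints')
plt.show()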
This is the simplest setting. The frequentist approach simply uses the computed mean of the data as the estimate:
In [5]: ratios.mean()
Out[5]: 2.9396857495543731
In [6]: from scipy.stats import sem    # Standard error
In [7]: sem(ratios.dropna())
Out[7]: 0.042411109594665049
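The standard error reported by sem is simply the sample standard deviation divided by the square root of the sample size; a quick sanity check:

clean = ratios.dropna()
clean.std() / np.sqrt(len(clean))     # should match sem(clean) above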
For the Bayesian approach, the average mu is regarded as a random variable, for which we need to select a prior distribution. Initially, we assume that its value could be anywhere in the range of the data, so we choose a Uniform prior. The data itself is modeled as Normal around mu, with the known standard deviation 1.3. We then use Bayes' theorem to compute a posterior distribution for mu. Our estimated parameter is then the average of the posterior distribution of mu:
In [8]: import pymc as pm
In [9]: mu = pm.Uniform('mu', lower=ratios.min(), upper=ratios.max())
In [10]: observation = pm.Normal('obs', mu=mu, tau=1./1.3**2,
   ....:                         value=ratios.dropna(), observed=True)
In [11]: model = pm.Model([observation, mu])
Notice how, in PyMC, the definition of a Normal distribution requires an average parameter mu, but instead of a standard deviation or variance, it expects the precision tau = 1/sigma**2.
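For instance, the precision passed in the previous code is obtained from our known sigma as follows:

sigma = 1.3
tau = 1.0 / sigma**2                  # precision: 1/1.3**2, about 0.5917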
The variable observation combines our data with our proposed data-generation scheme, given by the variable mu, through the option value=ratios.dropna(). To make sure that this stays fixed during the analysis, we impose observed=True.
In the learning step, we employ the Markov chain Monte Carlo (MCMC) method to draw a large number of samples from the posterior distribution of mu. In the call below, sample(40000, 10000, 1) requests 40000 iterations, discards the first 10000 as burn-in, and applies no thinning, which leaves the 30000 posterior samples reported in the statistics:
In [12]: mcmc = pm.MCMC(model)
In [13]: mcmc.sample(40000, 10000, 1)
[---------------100%---------------] 40000 of 40000 complete in 4.5 sec
In [14]: mcmc.stats()
Out[14]:
{'mu': {'95% HPD interval': array([ 2.86064764,  3.02292213]),
        'mc error': 0.00028222883254203107,
        'mean': 2.9396811517572554,
        'n': 30000,
        'quantiles': {2.5: 2.8589908555161485,
                      25: 2.9117191652137464,
                      50: 2.9396815504225815,
                      75: 2.9675088640073439,
                      97.5: 3.0216312862055279},
        'standard deviation': 0.041412844137324857}}
In [15]: mcmc.summary()
Out[15]:
mu:

    Mean             SD               MC Error         95% HPD interval
    ------------------------------------------------------------------
    2.94             0.041            0.0              [ 2.861  3.023]

    Posterior quantiles:
    2.5             25              50              75             97.5
     |---------------|===============|===============|---------------|
    2.859           2.912           2.94            2.968          3.022

In [16]: from pymc.Matplot import plot as mcplot
In [17]: mcplot(mcmc);
   ....: plt.show()
Plotting mu
We should get an output similar to the following:
The estimated value of the parameter is 2.93968. The standard deviation of the posterior distribution of mu is 0.0414.
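Both figures can be read directly off the kept trace. Moreover, with an essentially flat prior and known sigma, the posterior of mu is approximately Normal with mean equal to the sample mean and standard deviation sigma/sqrt(n), which gives us an independent check of the MCMC output:

trace = mcmc.trace('mu')[:]           # the 30000 kept posterior samples
trace.mean(), trace.std()             # approximately 2.9397 and 0.0414
1.3 / np.sqrt(ratios.dropna().size)   # theoretical value, about 0.0411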
We have a convenient method to perform the likelihood approach for the estimation of parameters of any distribution represented as a class in the module scipy.stats. In our case, since we are fixing the standard deviation (the scale parameter of the normal distribution for this particular class), we would issue the following command:
In [18]: from scipy.stats import norm as NormalDistribution
In [19]: NormalDistribution.fit(ratios.dropna(), fscale=1.3)
Out[19]: (2.9396857495543736, 1.3)
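Had we not fixed the scale, the same call would estimate both parameters by maximum likelihood; this variant is worth trying for comparison:

NormalDistribution.fit(ratios.dropna())   # returns estimates of (loc, scale)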
This gives us a similar value for the mean. The graph of the negative log-likelihood function for mu can be obtained as follows:
In [20]: nnlf = lambda t: NormalDistribution.nnlf([t, 1.3],
   ....:                                          ratios.dropna());
   ....: nnlf = np.vectorize(nnlf)
In [21]: x = np.linspace(0, 14);
   ....: plt.plot(x, nnlf(x), lw=2, color='r',
   ....:          label=r'Negative log-likelihood function for $\mu$');
   ....: plt.legend();
   ....: plt.annotate('Minimum', xy=(2.9, nnlf(2.9)), xytext=(0, 20),
   ....:              textcoords='offset points', ha='right', va='bottom',
   ....:              bbox=dict(boxstyle='round,pad=0.5', fc='yellow',
   ....:                        color='k', alpha=1),
   ....:              arrowprops=dict(arrowstyle='->', color='k',
   ....:                              connectionstyle='arc3,rad=0'));
   ....: plt.show()
We should get an output similar to the following:
In any case, the result is visually what we would expect:
In [22]: distribution = NormalDistribution(loc=2.9396857495543736,
   ....:                                   scale=1.3)
In [23]: plt.plot(x, distribution.pdf(x), 'r-', lw=2,
   ....:          label='Computed Probability Density Function');
   ....: ratios.hist(bins=50, alpha=0.2, normed=True,
   ....:             label='Histogram of data (normalized)');
   ....: plt.legend();
   ....: plt.show()
We should get an output similar to the following:
We now move on to interval estimation. In this setting, we seek an interval of values for mu that are supported by the data.
In the frequentist approach, we start by providing a small significance level alpha, and proceed to find an interval such that the probability of capturing the parameter mu is 1-alpha. In our example, we set alpha = 0.05 (hence, the confidence we impose is 95%), and compute the interval with the method interval of any class defining a continuous distribution in the module scipy.stats:
In [24]: loc = ratios.mean();
   ....: scale = ratios.sem();
   ....: NormalDistribution.interval(0.95, scale=scale, loc=loc)
Out[24]: (2.8565615022044484, 3.0228099969042979)
According to this method, values of the average mu between 2.8565615022044484 and 3.0228099969042979 are consistent with the data, based on a 95% confidence interval.
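The same endpoints follow from the textbook formula mean ± z * sem, where z is the two-sided critical value of the standard normal; a quick check:

z = NormalDistribution.ppf(0.975)     # two-sided 95% critical value, ~1.96
(loc - z * scale, loc + z * scale)    # reproduces the interval above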
In the Bayesian approach, the equivalent of the confidence interval is called a credible region (or credible interval), and is associated with the highest posterior density region: the set of most probable values of the parameter that, in total, constitutes 100*(1-alpha) percent of the posterior mass.
Recall that, after sampling with MCMC, we already obtained the credible region for alpha = 0.05: it is the 95% HPD interval reported by mcmc.stats().
To obtain credible intervals for other values of alpha, we use the routine hpd in the submodule pymc.utils directly. For example, the highest posterior density region for alpha = 0.01 is computed as follows:
In [25]: pm.utils.hpd(mcmc.trace('mu')[:], 1-.99)
Out[25]: array([ 2.83464531,  3.04706652])
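As a consistency check, the same routine with alpha = 0.05 should reproduce (up to Monte Carlo error) the 95% HPD interval reported earlier by mcmc.stats():

pm.utils.hpd(mcmc.trace('mu')[:], 0.05)   # expect roughly [ 2.861, 3.023]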
In the likelihood approach, this is done with the aid of the method nnlf of any distribution. In this setting, we need to determine the interval of parameter values for which the relative likelihood (the ratio of the likelihood to its maximum value) exceeds 1/k, where k is either 8 (strong evidence) or 32 (very strong evidence).
The estimation of the corresponding interval is then a simple application of optimization. We leave this as an exercise.
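One possible sketch of this exercise: since nnlf is the negative log-likelihood, the relative likelihood exceeds 1/k exactly where nnlf(mu) - nnlf(mu_hat) is below log(k), so the endpoints of the interval are roots of that difference minus log(k), which a standard root finder locates on each side of the maximum likelihood estimate:

from scipy.optimize import brentq

sample = ratios.dropna()
mu_hat = sample.mean()                          # maximum likelihood estimate
k = 8                                           # strong evidence (use 32 for
                                                # very strong evidence)
f = lambda t: (NormalDistribution.nnlf([t, 1.3], sample)
               - NormalDistribution.nnlf([mu_hat, 1.3], sample)
               - np.log(k))
left = brentq(f, sample.min(), mu_hat)          # root below the MLE
right = brentq(f, mu_hat, sample.max())         # root above the MLE
print(left, right)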