Chapter 8. Inference and Data Analysis

The techniques of descriptive statistics covered in the previous chapter give us a straightforward presentation of the facts contained in the data. The next logical step is inference: the process of making propositions and drawing conclusions about the larger population from which the sample data was drawn.

This chapter will cover the following topics:

  • Statistical inference.
  • Data mining and machine learning.

Statistical inference

Statistical inference is the process of deducing properties of an underlying distribution by analysis of data. Inferential statistical analysis infers properties about a population; this includes testing hypotheses and deriving estimates.

There are three types of inference:

  • Estimation of the most appropriate single value of a parameter.
  • Interval estimation to assess what region of parameter values is most consistent with the given data.
  • Hypothesis testing to decide, between two competing hypotheses, which parameter values are most consistent with the data.

There are three main approaches to attacking these problems:

  • Frequentist: Inference is judged by its performance in repeated sampling.
  • Bayesian: Inference is subjective. A prior distribution is chosen for the parameter we seek, and we combine it with the density of the data to obtain a joint distribution. A further application of Bayes' theorem gives us the distribution of the parameter, given the data. To perform computations in this setting, we use the package PyMC.
  • Likelihood: Inference is based on the fact that all information about the parameter can be obtained by inspection of a likelihood function, which is proportional to the probability density function.

In this section, we briefly illustrate the three approaches to each of the three inference types. We return to the example from the previous chapter: the ratio of daily complaints about mortgages to daily complaints about credit cards:

In [1]: import numpy as np, pandas as pd, matplotlib.pyplot as plt
In [2]: data = pd.read_csv("Consumer_Complaints.csv", 
   ...:                     low_memory=False, parse_dates=[8,9])
In [3]: df = data.groupby(['Date received', 'Product']).size(); 
   ...: df = df.unstack(); 
   ...: ratios = df['Mortgage'] / df['Credit card']
In [4]: ratios.describe()
Out[4]:
count    1001.000000
mean        2.939686
std         1.341827
min         0.203947
25%         1.985507
50%         2.806452
75%         3.729167
max        12.250000
dtype: float64

By visual inspection of a histogram of these ratios, we could very well assume that the data is a random sample from a normal distribution with parameters mu (the average) and sigma (the standard deviation). We further assume, for simplicity, that the scale parameter sigma is known and that its value is 1.3. A quick way to produce such a histogram is shown below.
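
The following one-liner reproduces the histogram in question (the choice of 50 bins is ours, purely for illustration):

ratios.hist(bins=50)   # visual check of the bell shape
plt.show()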

Tip

Later in this section, we will actually explore what tools we have in the SciPy stack to determine the distribution of data more precisely.
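
As a small preview of those tools, scipy.stats already provides quick normality checks, such as the D'Agostino-Pearson test (a sketch; a large p-value simply fails to reject normality):

from scipy.stats import normaltest
normaltest(ratios.dropna())   # returns (statistic, p-value)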

Estimation of parameters

In this setting, the problem we want to solve is the estimation of the average, mu, from the data obtained.

Frequentist approach

This is the simplest setting: the frequentist approach takes the sample mean of the data as the estimate of mu:

In [5]: ratios.mean()
Out[5]: 2.9396857495543731
In [6]: from scipy.stats import sem  # Standard error
In [7]: sem(ratios.dropna())
Out[7]: 0.042411109594665049

Tip

A frequentist would then say: The estimated value of the parameter mu is 2.9396857495543731 with standard error 0.042411109594665049.
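
The standard error reported by sem is simply the sample standard deviation divided by the square root of the sample size; a quick sketch to verify this by hand:

# both sem and pandas' std use ddof=1 by default
ratios.dropna().std() / np.sqrt(ratios.dropna().count())   # ~0.0424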

Bayesian approach

For the Bayesian approach, we model the data as a sample from a Normal distribution with known standard deviation 1.3 and unknown average mu. The average mu is regarded as a random variable, and as its prior distribution we assume that its value could be anywhere in the range of the data (a Uniform distribution). We then use Bayes' theorem to compute a posterior distribution for mu. Our estimate of the parameter is then the average of the posterior distribution of mu:

In [8]: import pymc as pm
In [9]: mu = pm.Uniform('mu', lower=ratios.min(), upper=ratios.max())
In [10]: observation = pm.Normal('obs', mu=mu, tau=1./1.3**2,
   ....:                        value=ratios.dropna(), observed=True)
In [11]: model = pm.Model([observation, mu])

Note

Notice how, in PyMC, the definition of a Normal distribution requires an average parameter mu, but instead of standard deviation or variance, it expects the precision tau = 1/sigma**2.
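
For instance, with sigma = 1.3 the precision passed to pm.Normal above is:

tau = 1.0 / 1.3**2   # = 1/1.69, approximately 0.5917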

The variable observation combines our data with our proposed data-generation scheme, given by the variable mu, through the option value=ratios.dropna(). To make sure that this stays fixed during the analysis, we impose observed=True.

In the learning step, we employ the Markov Chain Monte Carlo (MCMC) method to draw a large number of random samples from the posterior distribution of mu:

In [12]: mcmc = pm.MCMC(model)
In [13]: mcmc.sample(40000, 10000, 1)
[---------------100%---------------] 40000 of 40000 complete in 4.5 sec
In [14]: mcmc.stats()
Out[14]:
{'mu': {'95% HPD interval': array([ 2.86064764,  3.02292213]),
  'mc error': 0.00028222883254203107,
  'mean': 2.9396811517572554,
  'n': 30000,
  'quantiles': {2.5: 2.8589908555161485,
   25: 2.9117191652137464,
   50: 2.9396815504225815,
   75: 2.9675088640073439,
   97.5: 3.0216312862055279},
  'standard deviation': 0.041412844137324857}}
In [15]: mcmc.summary()
Out[15]:
mu:
   Mean             SD               MC Error        95% HPD interval
   ------------------------------------------------------------------
   2.94             0.041            0.0              [ 2.861  3.023]
  
   Posterior quantiles:
   2.5             25              50              75            97.5
    |---------------|===============|===============|---------------|
   2.859            2.912           2.94           2.968        3.022
In [16]: from pymc.Matplot import plot as mcplot
In [17]: mcplot(mcmc); 
   ....: plt.show()
Plotting mu

We should get an output similar to the following:

[Figure: MCMC diagnostic plots (trace, autocorrelation, and posterior histogram) for mu]

The estimated value of the parameter is 2.93968. The standard deviation of the posterior distribution of mu is 0.0414.
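
Both values can also be extracted directly from the sampled trace; as a quick sketch:

mcmc.trace('mu')[:].mean()   # posterior mean, ~2.93968
mcmc.trace('mu')[:].std()    # posterior standard deviation, ~0.0414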

Likelihood approach

We have a convenient method to perform the likelihood approach for the estimation of parameters of any distribution represented as a class in the submodule scipy.stats. In our case, since we are fixing the standard deviation (the scale parameter of the normal distribution in this class), we would issue the following command:

In [18]: from scipy.stats import norm as NormalDistribution
In [19]: NormalDistribution.fit(ratios.dropna(), fscale=1.3)
Out[19]: (2.9396857495543736, 1.3)

This gives us essentially the same value for the mean, which is no coincidence: for a normal model with known scale, the maximum-likelihood estimate of the average is precisely the sample mean. The graph of the negative log-likelihood function for mu can be obtained as follows:

In [20]: nnlf = lambda t: NormalDistribution.nnlf([t, 1.3],
   ....:                                  ratios.dropna()); 
   ....: nnlf = np.vectorize(nnlf)
In [21]: x = np.linspace(0, 14); 
   ....: plt.plot(x, nnlf(x), lw=2, color='r',
   ....:       label='Negative log-likelihood function for $\mu$'); 
   ....: plt.legend(); 
   ....: plt.annotate('Minimum', xy=(2.9, nnlf(2.9)), xytext=(0,20),
   ....:         textcoords='offset points', ha='right', va='bottom',
   ....:         bbox=dict(boxstyle='round,pad=0.5', fc='yellow',
   ....:                   color='k', alpha=1),
   ....:         arrowprops=dict(arrowstyle='->', color='k',
   ....:                         connectionstyle='arc3,rad=0')); 
   ....: plt.show()

We should get an output similar to the following:

[Figure: negative log-likelihood function for mu, with the minimum annotated near mu = 2.9]

In any case, the result is visually what we would expect:

In [22]: distribution = NormalDistribution(loc=2.9396857495543736,
   ....:                                   scale=1.3)
In [23]: plt.plot(x, distribution.pdf(x), 'r-', lw=2,
   ....:          label='Computed Probability Density Function'); 
   ....: ratios.hist(bins=50, alpha=0.2, normed=True,
   ....:             label='Histogram of data (normalized)'); 
   ....: plt.legend(); 
   ....: plt.show()

We should get an output similar to the following:

[Figure: computed probability density function overlaid on the normalized histogram of the data]

Interval estimation

In this setting, we seek an interval of values for mu that are supported by the data.

Frequentist approach

In the frequentist approach, we start by choosing a small significance level alpha, and proceed to find an interval such that the probability of capturing the parameter mu is 1 - alpha. In our example, we set alpha = 0.05 (hence, the confidence level is 95%), and compute the interval with the method interval of any class defining a continuous distribution in the module scipy.stats:

In [24]: loc = ratios.mean(); 
   ....: scale = ratios.sem(); 
   ....: NormalDistribution.interval(0.95, scale=scale, loc=loc)
Out[24]: (2.8565615022044484, 3.0228099969042979)

According to this method, values of the average mu between 2.8565615022044484 and 3.0228099969042979 are consistent with the data at the 95% confidence level.
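
The endpoints can also be reproduced by hand from the normal quantiles; a quick sketch using the loc and scale computed above:

z = NormalDistribution.ppf(0.975)    # ~1.96, the 97.5% quantile
(loc - z * scale, loc + z * scale)   # matches the interval above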

Bayesian approach

In the Bayesian approach, the equivalent of the confidence interval is called a credible region (or credible interval), and is associated with the highest posterior density region: the set of most probable values of the parameter that, in total, constitutes 100*(1 - alpha) percent of the posterior mass.

Recall that, after sampling with MCMC, the call to mcmc.stats() already reported this credible region for alpha = 0.05 (the 95% HPD interval).

To obtain credible intervals for other values of alpha, we use the routine hpd in the submodule pymc.utils directly. For example, the highest posterior density region for alpha = 0.01 is computed as follows:

In [25]: pm.utils.hpd(mcmc.trace('mu')[:], 1-.99)
Out[25]: array([ 2.83464531,  3.04706652])

Likelihood approach

This is also done with the aid of the method nnlf of any distribution. In this setting, we need to determine the interval of parameter values for which the likelihood, relative to its maximum, exceeds 1/k, where k is either 8 (strong evidence) or 32 (very strong evidence).

The estimation of the corresponding interval is then a simple application of optimization. We leave this as an exercise; one possible outline is sketched below.
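
As a hint, here is a minimal sketch of one possible approach (the names data, mu_hat, and g are ours, introduced only for illustration). The relative likelihood exceeds 1/k exactly when nnlf(t) - nnlf(mu_hat) < log(k), so the endpoints of the interval are the roots of that difference minus log(k), which a root finder such as scipy.optimize.brentq can locate:

import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm as NormalDistribution

data = np.asarray(ratios.dropna())   # the series from the session above
mu_hat = data.mean()                 # maximum-likelihood estimate of mu
nnlf = lambda t: NormalDistribution.nnlf([t, 1.3], data)

k = 8                                              # strong evidence
g = lambda t: nnlf(t) - nnlf(mu_hat) - np.log(k)   # negative inside interval
lower = brentq(g, data.min(), mu_hat)              # left endpoint
upper = brentq(g, mu_hat, data.max())              # right endpoint
print(lower, upper)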
