Chapter 6. SciPy for Data Mining

This chapter covers those branches of mathematics and statistics that treat the collection, organization, analysis, and interpretation of data. The relevant applications and operations are spread over several modules and submodules: scipy.stats (for purely statistical tools), scipy.ndimage.measurements (for analysis and organization of data), scipy.spatial (for spatial algorithms and data structures), and finally the clustering package scipy.cluster. The scipy.cluster package consists of two submodules: scipy.cluster.vq (vector quantization) and scipy.cluster.hierarchy (for hierarchical and agglomerative clustering).

As in the previous chapters, fluency with the subject matter is assumed. Our emphasis is on showing you some of the SciPy functions available to perform statistical computations, not on teaching the underlying theory. Accordingly, you are welcome to read this chapter alongside your preferred book(s) on the subject so that you can fully explore the examples provided in this chapter on additional datasets.

We should mention, however, that there are other specialized modules in Python that can be used to explore this subject from different perspectives. Some of them (none of which are covered in this book) are the Modular Toolkit for Data Processing (MDP) (http://mdp-toolkit.sourceforge.net/install.html), scikit-learn (http://scikit-learn.org/), and Statsmodels (http://statsmodels.sourceforge.net/).

In this chapter, we will cover the following topics:

  • The standard descriptive statistics measures computed via SciPy
  • The built-in functions in SciPy that deal with statistical distributions
  • The SciPy functionality for interval estimation
  • Performing computations of statistical correlations and some statistical tests, the fitting of distributions, and statistical distances
  • A clustering example

Descriptive statistics

We often require the analysis of data in which certain features are grouped in different regions, each with different sizes, values, shapes, and so on. The scipy.ndimage.measurements submodule has the right tools for this task, and the best way to illustrate its capabilities is by means of examples. For instance, for binary images of zeros and ones, it is possible to label each blob (an area of contiguous pixels with value one) and count the blobs with the label command. If we wish to obtain the center of mass of each blob, we may do so with the center_of_mass command. We will see these operations in action once again in the application to obtain the structural model of oxides in Chapter 7, SciPy for Computational Geometry.
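The labeling and center-of-mass operations described above can be sketched as follows. This is a minimal illustration, not an example from the book: the 6x6 image is made up, and the functions are imported from the top-level scipy.ndimage namespace, which is an assumption about the installed SciPy version (recent releases expose the measurement functions there).

```python
import numpy as np
from scipy import ndimage

# Hypothetical 6x6 binary image containing two separate blobs of ones.
image = np.zeros((6, 6), dtype=int)
image[0:2, 0:2] = 1   # 2x2 blob in the top-left corner
image[3:6, 3:6] = 1   # 3x3 blob in the bottom-right corner

# label assigns a distinct integer to each blob and also returns
# the number of blobs it found.
labels, num_blobs = ndimage.label(image)

# Center of mass of every blob, one (row, column) pair per label.
blob_ids = np.arange(1, num_blobs + 1)
centers = ndimage.center_of_mass(image, labels, blob_ids)

print("number of blobs:", num_blobs)
for blob_id, center in zip(blob_ids, centers):
    print("blob", blob_id, "center of mass:", center)
```

By default, label treats only horizontally and vertically adjacent pixels as connected; a different connectivity can be requested through its structure argument.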

For nonbinary data, the scipy.ndimage.measurements submodule provides the usual basic statistical measurements (value and location of extreme values, mean, standard deviation, sum, variance, histogram, and so on).
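As a rough sketch of these basic measurements, the following computes a few of them on a small array. The data values are invented for illustration, and the functions are again taken from the top-level scipy.ndimage namespace, which is an assumption about the installed SciPy version.

```python
import numpy as np
from scipy import ndimage

# Hypothetical nonbinary data, for illustration only.
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, 8.0, 10.0]])

mean_value = ndimage.mean(data)
std_value = ndimage.standard_deviation(data)
var_value = ndimage.variance(data)

# extrema returns the minimum, the maximum, and their positions.
min_val, max_val, min_pos, max_pos = ndimage.extrema(data)

print("mean:", mean_value)
print("standard deviation:", std_value)
print("variance:", var_value)
print("minimum", min_val, "at", min_pos)
print("maximum", max_val, "at", max_pos)
```

Each of these functions also accepts a labels array and an index, so the same measurement can be restricted to individual labeled regions.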

For more advanced statistical measurements, we must access functions from the scipy.stats module. There we find geometric and harmonic means (gmean, hmean), as well as the median, mode, skewness, moments of various orders, and kurtosis (median, mode, skew, moment, kurtosis). For an overview of the most significant statistical properties of a dataset, we prefer the describe routine. We may also compute item frequencies (itemfreq), percentiles (scoreatpercentile, percentileofscore), histograms (histogram, histogram2), cumulative and relative frequencies (cumfreq, relfreq), the standard error of the mean (sem), and the signal-to-noise ratio (signaltonoise). Some of these names, such as itemfreq, histogram, histogram2, and signaltonoise, belong to older SciPy releases and have been removed from recent ones, so check the documentation of your installed version.
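A short sketch of some of these scipy.stats routines follows. The sample values are invented for illustration, and the selection is limited to functions that, to our knowledge, are available in both older and recent SciPy releases.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of positive values, for illustration only.
sample = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 8.0, 12.0])

# describe gathers the number of observations, the (min, max) pair,
# the mean, the unbiased variance, the skewness, and the kurtosis.
summary = stats.describe(sample)

geometric_mean = stats.gmean(sample)
harmonic_mean = stats.hmean(sample)
skewness = stats.skew(sample)
excess_kurtosis = stats.kurtosis(sample)   # Fisher definition by default
standard_error = stats.sem(sample)         # standard error of the mean

# Score at the 50th percentile, and the percentile rank of the score 5.0.
median_score = stats.scoreatpercentile(sample, 50)
rank_of_five = stats.percentileofscore(sample, 5.0)

print(summary)
print("geometric mean:", geometric_mean)
print("harmonic mean:", harmonic_mean)
print("skewness:", skewness, "excess kurtosis:", excess_kurtosis)
print("standard error:", standard_error)
print("median score:", median_score, "percentile rank of 5.0:", rank_of_five)
```

For positive data, the harmonic mean never exceeds the geometric mean, which in turn never exceeds the arithmetic mean; comparing the three gives a quick sanity check on the output.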
