Predictive analytics and machine learning are hot research fields. They are young compared to other disciplines and, without a doubt, we can expect rapid growth. It has even been predicted that machine learning will accelerate so fast that, within mere decades, human labor will be replaced by intelligent machines (see http://en.wikipedia.org/wiki/Technological_singularity). The current state of the art is far from that utopia; a lot of computing power and data is still needed to make even simple decisions, such as determining whether pictures on the Internet contain dogs or cats. Predictive analytics uses a variety of techniques, including machine learning, to make useful predictions, for instance, determining whether a customer can repay his or her loans, or identifying female customers who are pregnant (see http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/).
To make these predictions, features are extracted from huge volumes of data. We mentioned features before; they are also called predictors. Features are input variables that can be used to make predictions. In essence, we have features found in our data, and we are looking for a function that maps the features to a target, which may or may not be known. Finding the appropriate function can be hard; often, different algorithms and models are grouped together in so-called ensembles. The output of an ensemble can be a majority vote or an average of a group of models, but we can also use a more advanced algorithm to produce the final result. We will not be using ensembles in this chapter, but it is something to keep in mind.
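To make the idea of a majority-vote ensemble concrete, here is a minimal sketch using NumPy. The three sets of predictions are made up for the example and do not come from real classifiers:

```python
import numpy as np

# Toy predictions from three hypothetical classifiers for five samples;
# 0 and 1 are class labels (the classifiers themselves are not shown).
predictions = np.array([
    [1, 0, 1, 1, 0],  # model A
    [1, 1, 1, 0, 0],  # model B
    [0, 0, 1, 1, 1],  # model C
])

# Majority vote: a sample gets label 1 when more than half the models say 1.
votes = predictions.sum(axis=0)
majority = (votes > predictions.shape[0] / 2).astype(int)
print(majority)  # [1 0 1 1 0]
```

An averaging ensemble for regression would work the same way, replacing the vote with `predictions.mean(axis=0)`.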
In the previous chapter, we got a taste of machine learning algorithms with the Naive Bayes classification algorithm. We can divide machine learning into the following categories:

Supervised learning
Unsupervised learning
Reinforcement learning
We will use weather prediction as a running example. In this chapter, we will mostly use the Python scikit-learn library. This library has clustering, regression, and classification algorithms. However, some machine learning algorithms are not covered by scikit-learn, so for those we will be using other APIs. The topics of this chapter are as follows:
In the previous chapter, Chapter 9, Analyzing Textual Data and Social Media, we installed scikit-learn. With the pkg_check.py file in this book's code bundle, we can print the following scikit-learn module descriptions:
sklearn version 0.15.0
sklearn.__check_build DESCRIPTION Module to give helpful messages to the user that did not compile the scikit properly. PACKAGE CONTENTS _check_build setup FUNCTI
sklearn.cluster DESCRIPTION The :mod:`sklearn.cluster` module gathers popular unsupervised clustering algorithms. PACKAGE CONTENTS _feature_agglomeration _h
sklearn.covariance DESCRIPTION The :mod:`sklearn.covariance` module includes methods and algorithms to robustly estimate the covariance of features given a set
sklearn.cross_decomposition
sklearn.datasets DESCRIPTION The :mod:`sklearn.datasets` module includes utilities to load datasets, including methods to load and fetch popular reference da
sklearn.decomposition DESCRIPTION The :mod:`sklearn.decomposition` module includes matrix decomposition algorithms, including among others PCA, NMF or ICA. Most o
sklearn.ensemble DESCRIPTION The :mod:`sklearn.ensemble` module includes ensemble-based methods for classification and regression. PACKAGE CONTENTS _gradient
sklearn.externals
sklearn.feature_extraction DESCRIPTION The :mod:`sklearn.feature_extraction` module deals with feature extraction from raw data. It currently includes methods to extra
sklearn.feature_selection DESCRIPTION The :mod:`sklearn.feature_selection` module implements feature selection algorithms. It currently includes univariate filter sel
sklearn.gaussian_process DESCRIPTION The :mod:`sklearn.gaussian_process` module implements scalar Gaussian Process based predictions. PACKAGE CONTENTS correlation_mo
sklearn.linear_model DESCRIPTION The :mod:`sklearn.linear_model` module implements generalized linear models. It includes Ridge regression, Bayesian Regression,
sklearn.manifold
sklearn.metrics DESCRIPTION The :mod:`sklearn.metrics` module includes score functions, performance metrics and pairwise metrics and distance computations.
sklearn.mixture
sklearn.neighbors DESCRIPTION The :mod:`sklearn.neighbors` module implements the k-nearest neighbors algorithm. PACKAGE CONTENTS ball_tree base classification
sklearn.neural_network DESCRIPTION The :mod:`sklearn.neural_network` module includes models based on neural networks. PACKAGE CONTENTS rbm CLASSES sklearn.base.Bas
sklearn.preprocessing DESCRIPTION The :mod:`sklearn.preprocessing` module includes scaling, centering, normalization, binarization and imputation methods. PACKAGE
sklearn.semi_supervised DESCRIPTION The :mod:`sklearn.semi_supervised` module implements semi-supervised learning algorithms. These algorithms utilized small amount
sklearn.svm
sklearn.tests
sklearn.tree DESCRIPTION The :mod:`sklearn.tree` module includes decision tree-based models for classification and regression. PACKAGE CONTENTS _tree _ut
sklearn.utils
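The pkg_check.py script itself ships with the book's code bundle; a minimal sketch of how such a listing can be produced, using only the standard library's pkgutil and importlib modules, might look as follows. It is demonstrated here on the standard-library json package, but the same approach works for any installed package, including sklearn:

```python
import importlib
import pkgutil

def print_module_docs(package_name):
    """Print each submodule's name and the first line of its
    docstring, similar in spirit to the pkg_check.py script."""
    package = importlib.import_module(package_name)
    for _, name, _ in pkgutil.iter_modules(package.__path__):
        full_name = package_name + '.' + name
        try:
            module = importlib.import_module(full_name)
        except ImportError:
            continue  # skip submodules that fail to import
        doc = (module.__doc__ or '').strip().split('\n')[0]
        print(full_name, '-', doc)

# Replace 'json' with 'sklearn' to inspect scikit-learn instead.
print_module_docs('json')
```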
The neural networks module is not very well supported at this moment, so it is recommended to use another library for neural networks. Note that there is a preprocessing module, which is the topic of the next section.
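As a preview of what the preprocessing module does, the standardization performed by sklearn.preprocessing.scale (by default, zero mean and unit variance per feature) can be sketched directly with NumPy; the feature matrix here is made up for illustration:

```python
import numpy as np

# Small made-up feature matrix: rows are samples, columns are features.
X = np.array([[1.0, -2.0],
              [3.0,  0.0],
              [5.0,  2.0]])

# Standardization: subtract the per-column mean and divide by the
# per-column standard deviation, as sklearn.preprocessing.scale does.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately [0. 0.]
print(X_scaled.std(axis=0))   # approximately [1. 1.]
```

Many estimators perform noticeably better when features are brought to a common scale in this way.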