Chapter 10. Predictive Analytics and Machine Learning

Predictive analytics and machine learning are young, fast-moving research fields, and we can expect rapid growth in both. Some even predict that machine learning will advance so quickly that, within mere decades, human labor will be replaced by intelligent machines (see http://en.wikipedia.org/wiki/Technological_singularity). The current state of the art is far from that utopia. A lot of computing power and data is still needed to make even simple decisions, such as determining whether pictures on the Internet contain dogs or cats. Predictive analytics uses a variety of techniques, including machine learning, to make useful predictions, for instance, to determine whether a customer can repay a loan or to identify female customers who are pregnant (see http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/).

To make these predictions, features are extracted from huge volumes of data. We mentioned features before; they are also called predictors. Features are input variables that can be used to make predictions. In essence, we have features found in our data and we are looking for a function that maps the features to a target, which may or may not be known. Finding the appropriate function can be hard; often, different algorithms and models are grouped together in so-called ensembles. The output of an ensemble can be a majority vote or an average over a group of models, but we can also use a more advanced algorithm to produce the final result. We will not be using ensembles in this chapter, but they are something to keep in mind.
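The two simplest ways of combining an ensemble, a majority vote and an average, can be sketched in a few lines of plain Python. This is only an illustration and not code from this chapter; the function names and the sample predictions are hypothetical.

```python
from collections import Counter
from statistics import mean


def majority_vote(predictions):
    """Return the class label that most ensemble members predicted."""
    # most_common(1) returns a list holding the single (label, count) pair
    # with the highest count.
    return Counter(predictions).most_common(1)[0][0]


def average_prediction(predictions):
    """Return the arithmetic mean of the members' numeric predictions."""
    return mean(predictions)


# Hypothetical outputs of three classifiers for the same day.
class_votes = ["rain", "dry", "rain"]
print(majority_vote(class_votes))

# Hypothetical temperature forecasts (degrees Celsius) of three regressors.
forecasts = [11.0, 12.5, 12.5]
print(average_prediction(forecasts))
```

A majority vote suits classification, where the members output labels, while averaging suits regression, where they output numbers.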

In the previous chapter, we got a first taste of machine learning with the Naive Bayes classification algorithm. We can divide machine learning into the following categories:

  • Supervised learning: This requires us to label training data. For instance, if we want to classify spam, we need to provide examples of spam and normal e-mail messages.
  • Unsupervised learning: This doesn't require labeled training data. This type of learning can discover patterns, such as clusters, in large datasets.
  • Reinforcement learning: This is learning without a tutor, but with some sort of feedback. For example, a computer can learn chess by playing against itself. If you remember the 1983 movie WarGames (see http://en.wikipedia.org/wiki/WarGames), think of tic-tac-toe and thermonuclear warfare.
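To make the difference between the first two categories concrete, here is a minimal sketch in plain Python. It is not taken from this chapter or from scikit-learn: the humidity readings, the labels, and both function names are made up for illustration. The supervised function (a 1-nearest-neighbor classifier) needs the labels, whereas the unsupervised one (a one-dimensional 2-means clustering) sees only the raw values.

```python
def nearest_neighbor_label(sample, training_samples, training_labels):
    """Supervised: return the label of the closest labeled training sample."""
    distances = [abs(sample - x) for x in training_samples]
    return training_labels[distances.index(min(distances))]


def two_means(samples, iterations=10):
    """Unsupervised: split 1-D samples into two clusters without labels."""
    low, high = min(samples), max(samples)  # initial cluster centers
    if low == high:
        # All values are identical, so there is nothing to separate.
        return list(samples), []
    for _ in range(iterations):
        # Assign every sample to the nearer of the two centers.
        near_low = [x for x in samples if abs(x - low) <= abs(x - high)]
        near_high = [x for x in samples if abs(x - low) > abs(x - high)]
        # Move each center to the mean of the samples assigned to it.
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return near_low, near_high


# Hypothetical relative humidity readings (percent).
humidity = [30, 35, 40, 80, 85, 90]
# Labels supplied by a human; only the supervised function uses them.
labels = ["dry", "dry", "dry", "rain", "rain", "rain"]

print(nearest_neighbor_label(78, humidity, labels))
print(two_means(humidity))
```

Note that the clustering function finds the same two groups that the labels describe, but it cannot name them; attaching meaning to the clusters is left to us.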

We will use weather prediction as a running example. In this chapter, we will mostly use the Python scikit-learn library, which has clustering, regression, and classification algorithms. scikit-learn does not cover every machine learning algorithm, however, so for the ones it lacks we will use other APIs. The topics of this chapter are as follows:

  • A tour of scikit-learn
  • Preprocessing
  • Classification with logistic regression
  • Classification with support vector machines
  • Regression with ElasticNetCV
  • Support vector regression
  • Clustering with affinity propagation
  • Mean Shift
  • Genetic algorithms
  • Neural networks
  • Decision trees

A tour of scikit-learn

In Chapter 9, Analyzing Textual Data and Social Media, we installed scikit-learn. Running the pkg_check.py file in this book's code bundle prints the following scikit-learn module descriptions (the descriptions are truncated):
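The pkg_check.py script itself is not reproduced here. As a rough idea of how such a listing can be produced, the following sketch walks the subpackages of a package with the standard pkgutil module and extracts the DESCRIPTION part of the pydoc help text. The function name describe_subpackages and the width of 140 characters are assumptions made for this illustration. The demonstration uses the standard library's xml package so that it runs without any third-party installation; pass "sklearn" instead to inspect scikit-learn.

```python
import importlib
import pkgutil
import pydoc


def describe_subpackages(package_name, width=140):
    """Return (name, description) pairs for each subpackage of a package."""
    package = importlib.import_module(package_name)
    results = []
    for _, short_name, is_package in pkgutil.iter_modules(package.__path__):
        if not is_package:
            continue  # skip plain modules, keep only subpackages
        name = package_name + "." + short_name
        try:
            # Render the same text that help() shows, without bold markup.
            text = pydoc.plain(pydoc.render_doc(name))
        except (ImportError, pydoc.ErrorDuringImport):
            continue  # skip subpackages that cannot be imported
        text = " ".join(text.split())  # collapse whitespace and newlines
        start = text.find("DESCRIPTION")
        # Subpackages without a DESCRIPTION section get an empty description.
        description = text[start:start + width] if start >= 0 else ""
        results.append((name, description))
    return results


for name, description in describe_subpackages("xml"):
    print(name, description)
```

Cutting each description at a fixed width keeps the listing to one entry per line, at the cost of ending most descriptions in the middle of a sentence.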

sklearn version 0.15.0
sklearn.__check_build DESCRIPTION Module to give helpful messages to the user that did not compile the scikit properly. PACKAGE CONTENTS _check_build setup FUNCTI
sklearn.cluster DESCRIPTION The :mod:`sklearn.cluster` module gathers popular unsupervised clustering algorithms. PACKAGE CONTENTS _feature_agglomeration _h
sklearn.covariance DESCRIPTION The :mod:`sklearn.covariance` module includes methods and algorithms to robustly estimate the covariance of features given a set
sklearn.cross_decomposition 
sklearn.datasets DESCRIPTION The :mod:`sklearn.datasets` module includes utilities to load datasets, including methods to load and fetch popular reference da
sklearn.decomposition DESCRIPTION The :mod:`sklearn.decomposition` module includes matrix decomposition algorithms, including among others PCA, NMF or ICA. Most o
sklearn.ensemble DESCRIPTION The :mod:`sklearn.ensemble` module includes ensemble-based methods for classification and regression. PACKAGE CONTENTS _gradient
sklearn.externals 
sklearn.feature_extraction DESCRIPTION The :mod:`sklearn.feature_extraction` module deals with feature extraction from raw data. It currently includes methods to extra
sklearn.feature_selection DESCRIPTION The :mod:`sklearn.feature_selection` module implements feature selection algorithms. It currently includes univariate filter sel
sklearn.gaussian_process DESCRIPTION The :mod:`sklearn.gaussian_process` module implements scalar Gaussian Process based predictions. PACKAGE CONTENTS correlation_mo
sklearn.linear_model DESCRIPTION The :mod:`sklearn.linear_model` module implements generalized linear models. It includes Ridge regression, Bayesian Regression, 
sklearn.manifold 
sklearn.metrics DESCRIPTION The :mod:`sklearn.metrics` module includes score functions, performance metrics and pairwise metrics and distance computations. 
sklearn.mixture 
sklearn.neighbors DESCRIPTION The :mod:`sklearn.neighbors` module implements the k-nearest neighbors algorithm. PACKAGE CONTENTS ball_tree base classification
sklearn.neural_network DESCRIPTION The :mod:`sklearn.neural_network` module includes models based on neural networks. PACKAGE CONTENTS rbm CLASSES sklearn.base.Bas
sklearn.preprocessing DESCRIPTION The :mod:`sklearn.preprocessing` module includes scaling, centering, normalization, binarization and imputation methods. PACKAGE
sklearn.semi_supervised DESCRIPTION The :mod:`sklearn.semi_supervised` module implements semi-supervised learning algorithms. These algorithms utilized small amount
sklearn.svm 
sklearn.tests 
sklearn.tree DESCRIPTION The :mod:`sklearn.tree` module includes decision tree-based models for classification and regression. PACKAGE CONTENTS _tree _ut
sklearn.utils 

The neural network module offers only limited support at the moment, so we recommend using another library for neural networks. Note that there is also a preprocessing module, which is the topic of the next section.
