Chapter 10. Predictive Analytics and Machine Learning

Predictive analytics and machine learning are relatively young research fields, and we can expect rapid growth in both. It has even been predicted that machine learning will advance so fast that within mere decades human labor will be replaced by intelligent machines. The current state of the art is far from that utopia. A lot of computing power and data is still needed to make even simple decisions, such as determining whether pictures on the Internet contain dogs or cats. Predictive analytics uses a variety of techniques, including machine learning, to make useful predictions, for instance, to determine whether a customer can repay his or her loans or to identify female customers who are pregnant.

To make these predictions, features are extracted from huge volumes of data. We mentioned features before; they are also called predictors. Features are input variables that can be used to make predictions. In essence, we have features found in our data and we are looking for a function that maps the features to a target, which may or may not be known. Finding the appropriate function can be hard; often, different algorithms and models are grouped together in so-called ensembles. The output of an ensemble can be a majority vote or an average of a group of models, but we can also use a more advanced algorithm to produce the final result. We will not be using ensembles in this chapter, but it is something to keep in mind.
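To make the idea of combining models concrete, here is a minimal sketch of the two simplest ensemble outputs mentioned above, a majority vote and an average. The three rule-based "models", their thresholds, and the feature names (humidity, pressure, wind_speed) are hypothetical and only serve as an illustration; they are not part of this chapter's examples.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Combine class predictions: the label predicted most often wins."""
    # On a tie, Counter.most_common keeps the label that was seen first.
    return Counter(labels).most_common(1)[0][0]


def average(values):
    """Combine numeric predictions by taking their arithmetic mean."""
    return mean(values)


# Three hypothetical models, each a function that maps features to a target.
def humidity_rule(features):
    return "rain" if features["humidity"] > 80 else "dry"


def pressure_rule(features):
    return "rain" if features["pressure"] < 1000 else "dry"


def wind_rule(features):
    return "rain" if features["wind_speed"] > 30 else "dry"


# Made-up feature values for a single day.
features = {"humidity": 85, "pressure": 995, "wind_speed": 10}
votes = [model(features) for model in (humidity_rule, pressure_rule, wind_rule)]

print(votes)                 # the individual predictions
print(majority_vote(votes))  # the ensemble's class prediction

# For regression, the ensemble output can be the average of the
# individual (here invented) temperature predictions.
print(average([11.5, 12.5, 12.0]))
```

A more advanced ensemble would replace majority_vote or average with another model that learns how to weigh the individual predictions.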

In the previous chapter, we got a first taste of machine learning algorithms with the Naive Bayes classification algorithm. We can divide machine learning into the following categories:

  • Supervised learning: This requires us to label training data. For instance, if we want to classify spam, we need to provide examples of spam and normal e-mail messages.
  • Unsupervised learning: This doesn't require labeled data or other human input. This type of learning can discover patterns such as clusters in large datasets.
  • Reinforcement learning: This is learning without a tutor, but with some sort of feedback. For example, a computer can play chess against itself or, if you remember the movie WarGames from 1983, think of tic-tac-toe and thermonuclear warfare.
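The difference between the first two categories can be sketched in a few lines of plain Python. The temperatures and the "cold"/"warm" labels below are made up, and the two toy functions (a nearest-mean classifier and a two-cluster version of k-means in one dimension) are stand-ins chosen for brevity, not the algorithms used later in this chapter.

```python
from statistics import mean


def fit_nearest_mean(values, labels):
    """Supervised: use the given labels to compute one mean per class."""
    classes = sorted(set(labels))
    return {c: mean(v for v, l in zip(values, labels) if l == c)
            for c in classes}


def predict(class_means, value):
    """Assign a new value to the class whose mean is closest."""
    return min(class_means, key=lambda c: abs(class_means[c] - value))


def two_means(values, iterations=10):
    """Unsupervised: find two cluster centres without using any labels."""
    low, high = min(values), max(values)
    if low == high:
        return low, high  # all values identical, nothing to separate
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(near_low), mean(near_high)
    return low, high


# Hypothetical daily temperatures in degrees Celsius.
temperatures = [2.0, 4.0, 3.0, 21.0, 23.0, 22.0]
labels = ["cold", "cold", "cold", "warm", "warm", "warm"]

class_means = fit_nearest_mean(temperatures, labels)  # needs the labels
print(predict(class_means, 10.0))

print(two_means(temperatures))  # works from the raw values alone
```

The supervised function cannot run without the labels list, whereas two_means discovers the two groups from the values themselves.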

We will use weather prediction as a running example. In this chapter, we will mostly use the Python scikit-learn library. This library has clustering, regression, and classification algorithms. However, some machine learning algorithms are not covered by scikit-learn, so for those we will use other APIs. The topics of this chapter are as follows:

  • A tour of scikit-learn
  • Preprocessing
  • Classification with logistic regression
  • Classification with support vector machines
  • Regression with ElasticNetCV
  • Support vector regression
  • Clustering with affinity propagation
  • Mean Shift
  • Genetic algorithms
  • Neural networks
  • Decision trees

A tour of scikit-learn

In the previous chapter, Chapter 9, Analyzing Textual Data and Social Media, we installed scikit-learn. With the file in this book's code bundle, we can print a short description of each scikit-learn module.

At the time of writing, the neural networks module is not very well supported, so we recommend using another library for neural networks. Note that there is a preprocessing module, which is the topic of the next section.
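As a rough sketch of how such module descriptions can be collected, the following code walks the direct submodules of a package and prints the first line of each module's docstring. This is not necessarily the script from the code bundle; the describe_submodules helper is a hypothetical name introduced here, and the output depends on the installed scikit-learn version.

```python
import importlib
import pkgutil


def describe_submodules(package):
    """Return (name, first docstring line) pairs for a package's submodules."""
    descriptions = []
    for _, name, _ in pkgutil.iter_modules(package.__path__):
        if name.startswith("_"):
            continue  # skip private modules
        try:
            module = importlib.import_module(f"{package.__name__}.{name}")
        except Exception:
            continue  # skip modules that cannot be imported in this setup
        doc = (module.__doc__ or "").strip()
        summary = doc.splitlines()[0] if doc else "(no description)"
        descriptions.append((name, summary))
    return descriptions


if __name__ == "__main__":
    try:
        import sklearn
    except ImportError:
        print("scikit-learn is not installed")
    else:
        for name, summary in describe_submodules(sklearn):
            print(f"{name}: {summary}")
```

The helper takes the package as an argument, so the same function can be pointed at any other installed package.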

