Chapter 12

Creating Basic Prediction Examples

IN THIS CHAPTER

Installing the machine-learning software

Working with a sample dataset

Creating simple predictive models

Visualizing and evaluating your results

This chapter is about installing and setting up the machine-learning software and using the Python programming language to create a few simple predictive models. There are some modules to install and it will take a bit of time, so make sure you have plenty of battery life left if you're working on your laptop. If you already have Python installed prior to reading this book, make sure you're installing the correct versions of the machine-learning modules and dependencies for the Python version you're using.

In this book, Python version 2.7.11 is being used on a Windows 10 machine, but the installation instructions should work on older versions of Windows.

If you're following along in this chapter, you'll dive right in and start working with a sample dataset. Don't worry too much about the coding involved; most of the code will be provided and you can run it directly in the Python interactive interpreter, line by line. For most lines of code, you'll see what the output is. If, for some reason, an error crops up, you'll know exactly which line caused the error. Easy stuff.

Installing the Software Packages

The goal here is to build a couple of predictive models using different classification algorithms. To do that, you'll need to install Python, its machine-learning modules, and its dependencies. The setup process can take from 30 minutes to an hour, depending on your available Internet speed and your experience level in installing projects that require dependencies or multiple other projects.

You can choose from a variety of programming languages and add-on packages to create and run predictive models. Python, together with the scikit-learn module, is an easy and powerful combination of programming language and machine-learning package to use, learn, and get started with quickly.

Python is used widely in production systems, and is a requirement in many data science jobs.

Compared to other programming languages, Python is relatively easy to learn. Its syntax is straightforward and the code can be executed directly in an interactive console. You'll know immediately if you wrote a successful statement, and can learn quickly from trial and error in many cases.
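For example, here's the kind of immediate feedback the interactive console gives you; each statement is evaluated as soon as you press Enter, and the lines without >>> are the interpreter's responses:

>>> 2 + 2
4
>>> 'predictive ' + 'analytics'
'predictive analytics'

If you mistype something, the interpreter responds right away with an error message pointing at the offending statement, so you can correct it and try again.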

Installing Python

Installing Python is an easy process that takes less than thirty minutes and just a few clicks of the mouse. All of the default settings can be accepted during the installation process. You can install Python by downloading the installation program for Windows and other operating systems from the Python website at www.python.org. This chapter guides you through the installation process for the Windows operating system and Python release version 2.7.11. After you get to the Python website, look for the downloads link to get the file. After you've downloaded the file, navigate to the folder where you downloaded it, then double-click the file to begin the installation process.

The Windows x86 MSI installer is available at

https://www.python.org/ftp/python/2.7.11/python-2.7.11.msi

To install Python, launch the installer and follow these steps:

  1. Choose which users you want to install Python for, then click Next.

    You can choose between all users or just yourself. Either choice is fine here. Figure 12-1 shows a prompt to select which user to install Python for. The default is to install for all users of the computer.

  2. Choose the destination directory, then click Next.

    Figure 12-2 shows a prompt to select the location where you want Python to be installed.

  3. Customize Python installation features, then click Next.

    For a new user, the default is fine. Figure 12-3 shows a prompt to select custom installation features and how much disk space is required for those features.

    After a minute or so, the installation should be complete. Figure 12-4 shows that the installation of Python is complete. Click Finish.

FIGURE 12-1: A prompt to choose which user to install Python for.

FIGURE 12-2: A prompt to choose the destination directory.

FIGURE 12-3: A prompt to choose custom installation features.

FIGURE 12-4: A prompt showing the installation is complete.

Installing the machine-learning module

If you're familiar with installing software packages, you can simply download the installer files listed in the following sections, then skip to the “Checking your installation” section.

If you want to follow along with the same installation setup used by this book, start at the scikit-learn website (http://scikit-learn.org) and follow these steps to get the machine-learning package:

  1. Click the link to the installation page.
  2. Look for the information on installing the latest release.

    You can follow the instructions listed on the site, which require that you already have working installations of the dependent modules or that you install a third-party distribution. However, the following steps will install everything you need.

    You can download the latest version of scikit-learn for your operating system from SourceForge, a source code repository website.

    Here is a direct link to the scikit project at SourceForge:

    https://sourceforge.net/projects/scikit-learn/files

    You'll be using the executable file scikit-learn-0.17.win32-py2.7.exe for your installation.

  3. Click the link with the executable filename.

    Within about a minute, the download should be complete. Go to your Downloads folder (or wherever your browser saves downloaded files) and double-click the file to start the installation process.

    warning Depending on the version of Windows and web browser you're using, you may receive a few warning prompts to download and execute the installer for these modules.

    The first prompt is a prompt to download the scikit-learn machine-learning module, as shown in Figure 12-5.

  4. Click the Save File button and wait for the download to finish.
  5. When the download is finished, go to the folder where you saved the file and run the file by double-clicking the filename.

    This may open a series of prompts or warnings (similar to Figure 12-6 and Figure 12-7) that ask whether you want to proceed with running an executable file.

  6. Click the OK / Run button and continue.

    The next screen, shown in Figure 12-8, offers some important and useful information about the scikit-learn project.

  7. After you finish reading the information, click the Next button.

    During the installation process, the scikit installer may ask you to select some custom options. In most cases, accepting the default selections will be sufficient for running the examples.

    tip When a screen asks where you want the module installed (as shown in Figure 12-9), we recommend accepting the default directory. Doing so simplifies the installation process, as there are other dependent modules you need to install. C:\Python27\Lib\site-packages is the default installation directory for third-party modules.

  8. Click the Next button.

    You're now ready to install scikit-learn. Figure 12-10 shows one final prompt that appears before installation begins.

  9. Click the Next button.

    After the status bar is complete, you're notified that your installation is complete (as shown in Figure 12-11).

  10. Click the Finish button.

    You're done installing the main module for scikit. You're now ready to install its dependencies.

FIGURE 12-5: A prompt to download the scikit-learn machine-learning module.

FIGURE 12-6: System warning that you're opening an executable file.

FIGURE 12-7: System warning that this software is from an unknown publisher.

FIGURE 12-8: Important information about the scikit-learn project.

FIGURE 12-9: The directory where the module is to be installed.

FIGURE 12-10: Ready to install.

FIGURE 12-11: Finished installation message.

Installing the dependencies

The scikit-learn module requires (or is dependent on) a few other modules to be installed before you can start using it. The modules that another module requires are called its dependencies. In this case, scikit-learn's dependencies are numpy, scipy, and matplotlib.

remember You need to install the following dependencies:

  • numpy
  • scipy
  • matplotlib

These packages are available from several locations.

remember Choosing the versions that have Windows installers will make the installation process quicker and as simple as possible.

Installing the dependencies is similar to installing scikit-learn. It's a series of prompts and clicks. To stay consistent across all dependencies, choose the default options.

Installing numpy

This section details the steps needed to install numpy. You may download numpy from the SourceForge website.

  1. From the SourceForge website, do a search for numpy in the search form.

    Many listings show up. The needed module is numpy-1.10.2. If you search for it, it should appear as the top listing onscreen. To be sure that you have the same file, check to make sure it has the following description:

    Numerical Python: Numerical Python adds a fast and sophisticated array facility to the Python language.

  2. Click the Numerical Python link, then the Files tab link, then the NumPy folder link, then the 1.10.2 folder link to go to the folder with the latest binary distribution of numpy.

    Here is a direct link to the download page:

    https://sourceforge.net/projects/numpy/files/NumPy/1.10.2

  3. Click the numpy-1.10.2-win32-superpack-python2.7.exe link.

    Within a few seconds, the file numpy-1.10.2-win32-superpack-python2.7.exe should automatically start downloading.

  4. Go to your downloads folder (or wherever you saved the file) and run the file by double-clicking the filename.

    This may open a series of prompts or warnings that will ask whether you want to proceed with running an executable file. They'll be similar to those that show up when you install scikit.

  5. Click the OK/Run/Allow button and continue.

    A screen showing some important and useful information about the numpy project (similar to Figure 12-8) appears.

  6. Click the Next button.

    A screen similar to Figure 12-9 appears, asking where you want the numpy module installed.

  7. Accept the default location of the setup and click Next.

    A screen appears, displaying one final prompt before installation begins, as shown in Figure 12-10.

  8. Click the Next button to begin the installation process.

    When the status bar is finished, you're notified that your installation is complete, as shown in Figure 12-11.

  9. Click the Finish button and then the Close button.

    That's it for this dependency — numpy is installed.

Installing scipy

This section details the steps needed to install scipy. You may download scipy from the SourceForge website. The installation process is pretty much the same as the installation for numpy.

  1. From the SourceForge website, do a search for scipy in the search form.

    The top listing from the search should be

    SciPy: Scientific Library for Python

  2. Click the SciPy link, then the Files tab link, then the scipy folder link, then the 0.16.1 folder link to go to the folder with the latest binary distribution of SciPy.

    Here is a direct link to the download page:

    https://sourceforge.net/projects/scipy/files/scipy/0.16.1

  3. Click the scipy-0.16.1-win32-superpack-python2.7.exe link and wait for the download to finish.

The rest of the installation process is the same as listed for numpy.

Installing matplotlib

The final module to install is matplotlib. To get the executable file, go to the SourceForge website and search for matplotlib. The matplotlib version for this example is matplotlib 1.2.1. Once again, the rest of the installation process is the same as listed for numpy and scipy.

Here is a direct link to the download page:

https://sourceforge.net/projects/matplotlib/files/matplotlib/matplotlib-1.2.1

Here is a direct link to the file:

https://sourceforge.net/projects/matplotlib/files/matplotlib/matplotlib-1.2.1.win32-py2.7.exe/download

Checking your installation

When you've installed scikit-learn and all its dependencies, be sure to confirm that the installation went as expected. You want to avoid running into any problems or unexpected errors later on.

  1. Go to the Python interactive shell by choosing Windows Start button ⇒ Python2.7 ⇒ Python (command line).

    The process is similar if you did a custom installation of Python.

  2. In the interactive shell, try running the following statement to import all the modules that you installed:

    >>> import sklearn, numpy, scipy, matplotlib

    If the Python interpreter returns no errors, then your installation succeeded, as shown in Figure 12-12.

    remember If you get an error like the one shown in Figure 12-13, then something went wrong in the installation process. You'll have to reinstall the module that is listed in the line that begins with ImportError.

    Assuming everything went as planned, then you're ready to begin using scikit-learn to build a predictive model.

FIGURE 12-12: Here's what you see if Python successfully imported the modules.

FIGURE 12-13: An error message states that Python can't import a module.
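If the imports succeed, you can also double-check which versions actually got installed. Here's a quick sketch, assuming you installed the releases described earlier in this chapter; each module reports its own version string:

>>> import sklearn, numpy, scipy, matplotlib
>>> sklearn.__version__      # should report the 0.17 release installed earlier
>>> numpy.__version__        # 1.10.2 if you followed the steps in this chapter
>>> scipy.__version__        # 0.16.1
>>> matplotlib.__version__   # 1.2.1

Each statement echoes the module's version string, so you can confirm that the versions match the installers you downloaded.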

Preparing the Data

When you're learning a new programming language, it's customary to write the “hello world” program. For machine learning and predictive analytics, creating a model to classify the Iris dataset is the “hello world” equivalent. This is a rather simple example, but it's very effective in teaching the basics of machine learning and predictive analytics.

Getting the sample dataset

To create your first predictive model, you need the sample Iris dataset. This dataset is freely available from many sources, especially at academic institutions that have machine-learning departments. Fortunately, the folks at scikit-learn were nice enough to include some sample datasets and data-loading functions along with their package. So, for the purposes of these examples, you only need to run a couple of simple lines of code to load the data.

Labeling your data

Table 12-1 shows one observation and its features from each class of the Iris Flower dataset.

TABLE 12-1 The Iris Flower Dataset

Sepal Length | Sepal Width | Petal Length | Petal Width | Target Class/Label
5.1          | 3.5         | 1.4          | 0.2         | Setosa (0)
7.0          | 3.2         | 4.7          | 1.4         | Versicolor (1)
6.3          | 3.3         | 6.0          | 2.5         | Virginica (2)

The Iris Flower dataset is a real multivariate dataset of three classes of the Iris flower (Iris setosa, Iris virginica, and Iris versicolor) introduced by Ronald Fisher in his 1936 article, “The Use of Multiple Measurements in Taxonomic Problems.” This dataset is best known for its extensive use in academia for machine learning and statistics. The dataset consists of 150 total instances, with 50 instances from each of the 3 classes of the Iris flower. Each instance has 4 features (also commonly called attributes): the length and width measurements of the sepals and petals.

The interesting part of this dataset is that the three classes are somewhat linearly separable. The Setosa class can be separated from the other two classes by drawing a straight line on the graph between them. The Virginica and Versicolor classes can't be perfectly separated using a straight line — although it is close. This makes it a perfect candidate dataset to do classification analysis but not so good for clustering analysis.

The sample data was already labeled. The right column (Label) of Table 12-1 shows the names of each class of the Iris flower. The class name is called a label or a target; it's usually assigned to a variable named y. It's basically the outcome or the result of what is being predicted. In statistics and modeling, it is often referred to as the dependent variable. It depends on the inputs that correspond to sepal length and width and to petal length and width.

You may also want to know what's different about the scikit preprocessed Iris dataset, as compared to the original dataset. To find out, you need to obtain the original data file. You can do a Google search for iris dataset and download it or view it from any one of the academic institutions. The result that usually comes up first is the University of California Irvine's (UCI) machine-learning repository of datasets. Here is a direct link to the Iris dataset in its original state from the UCI machine-learning repository:

http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

If you download it, you should be able to view it with any text editor. Upon viewing the data in the file, you'll notice that there are five columns in each row. The first four columns are the measurements (referred to as the features) and the last column is the label. The label differs between the original and scikit version of the Iris dataset. Another difference is the first row of the scikit data file. It includes a header row used by the scikit data-loading function. It has no effect on the algorithms themselves.

tip Transforming features to numbers rather than keeping them as text makes it easier for the algorithms to process — and it's much more memory-efficient. This is especially evident if you run very large datasets with many features — which is often the case in real scenarios.
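If you ever need to do that conversion yourself, say when you load the original text-labeled file instead of scikit's numeric version, here's a minimal sketch using scikit-learn's LabelEncoder. It isn't needed for the examples in this chapter, because scikit's copy of the Iris data is already numeric; the list of class names passed in is just an illustration:

>>> from sklearn.preprocessing import LabelEncoder
>>> encoder = LabelEncoder()
>>> encoder.fit_transform(['Iris-setosa', 'Iris-versicolor',
        'Iris-virginica', 'Iris-setosa'])
array([0, 1, 2, 0])

The encoder assigns each distinct class name an integer (in alphabetical order here), which is exactly the kind of numeric label that scikit's preprocessed file already contains.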

Table 12-2 shows sample data from both files. All the data columns are the same except for Col5. Note that scikit has class names with numerical labels; the original file has text labels.

TABLE 12-2 Sample Data

Source   | Col1 | Col2 | Col3 | Col4 | Col5
scikit   | 5.1  | 3.5  | 1.4  | 0.2  | 0
original | 5.1  | 3.5  | 1.4  | 0.2  | Iris-setosa
scikit   | 7.0  | 3.2  | 4.7  | 1.4  | 1
original | 7.0  | 3.2  | 4.7  | 1.4  | Iris-versicolor
scikit   | 6.3  | 3.3  | 6.0  | 2.5  | 2
original | 6.3  | 3.3  | 6.0  | 2.5  | Iris-virginica

Making Predictions Using Classification Algorithms

You have all the tools and data necessary to start creating a predictive model. Now the fun begins.

In general, creating a learning model for classification tasks will entail the following steps:

  1. Load the data.
  2. Choose a classifier.
  3. Train the model.
  4. Visualize the model.
  5. Test the model.
  6. Evaluate the model.

Creating a supervised learning model with SVM

Supervised learning is a machine-learning task that learns from data that has been labeled. One way to think about supervised learning is that the labeling of data is done under the supervision of the modeler; unsupervised learning, by contrast, doesn't require labeled data. Supervised learning is commonly performed using a classification algorithm. In this section, you will use the Support Vector Machine classification algorithm to create a supervised learning model.

Loading your data

You need to load the data for your algorithms to use. Loading the Iris dataset in scikit is as simple as issuing a couple of lines of code because scikit has already created a function to load the dataset.

  1. Open a new Python interactive shell session.

    Use a new Python session so there isn't anything left over in memory and you have a clean slate to work with.

  2. Enter the following code in the prompt and observe the output:

    >>> from sklearn.datasets import load_iris
    >>> iris = load_iris()

    After running those two statements, you shouldn't see any messages from the interpreter. The variable iris should contain all the data from the iris.csv file.

Before you create a predictive model, it's important to understand a little about the new variable iris and what you can do with it. It makes the code easier to follow and the process much simpler to grasp. You can inspect the value of iris by typing it in the interpreter.

>>> iris

The output will be all the content from the iris.csv file, along with some other information about the dataset that the load_iris function loaded into the iris variable. The iris variable is a dictionary-like data structure with four main properties. The important properties of iris are listed in Table 12-3.

TABLE 12-3 Main Properties of Iris Variable

Property Name | Description
data          | Contains all the measurements of the observations.
feature_names | Contains the names of the features (attribute names).
target        | Contains all the targets (labels) of the observations.
target_names  | Contains the names of the iris classes.

tip You can print out the values in the interpreter by typing the variable name, followed by a dot, followed by the property name. An example is using iris.data to access the data property of iris, like this:

>>> iris.data

This is a standard way of accessing properties of an object in many programming languages.
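For instance, assuming the iris variable from the earlier load is still in your session, a couple of quick checks you might try are:

>>> iris.data.shape       # 150 observations, 4 features each
(150, 4)
>>> len(iris.target)      # one label per observation
150

The same dot notation works for iris.feature_names and iris.target_names if you want to see the measurement names and the three class names spelled out.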

To create an instance of the SVM classifier, type the following code in the interpreter:

>>> from sklearn.svm import LinearSVC
>>> svmClassifier = LinearSVC(random_state=111)

The first line of code imports the LinearSVC library into the session. The linear Support Vector Classifier (SVC) is an implementation of SVM for linear classification and has multi-class support. The dataset is somewhat linearly separable and has three classes, so it would be a good idea to experiment with LinearSVC to see how it performs. (You can read more about SVM in Chapter 7.)

The second line creates the instance using the variable svmClassifier. This is an important variable to remember; you'll see it used several more times in the chapter. The random_state parameter allows us to reproduce these examples and get the same results. If you didn't put in the random_state parameter, your results may differ from the ones shown here.

Running the training data

Before you can feed the SVM classifier with the data that was loaded, you must split the full dataset into a training set and test set.

Fortunately, scikit-learn has implemented a function that makes it easy to split the full dataset. The train_test_split function takes as input your data arrays (here, the features and the labels) along with a percentage value that determines the size of the test set. For each input array, it returns two pieces: a test portion (of the size you specified) and a training portion (the remaining data).

Typically, you take around 70 to 80 percent of the data to use as a training set and use the remaining data as the test set. But the Iris dataset is very small (only 150 instances), so you can take 90 percent of it to train the model and use the other 10 percent as test data to see how your predictive model will perform.

warning In Python, the left indentation level of each statement is significant. In the interpreter, every new statement will begin with >>>. For the sample code in this book, if you don't see >>> in the beginning of a new line and it's indented, then it means it's a continuation of the preceding line and should be typed in as a single line (don't hit carriage return until the whole statement has been entered). If the next line doesn't have >>> in the beginning and it isn't indented, then it's the output from the interpreter. The code was formatted this way for better readability.

Type in the following code to split your dataset:

>>> from sklearn import cross_validation
>>> X_train, X_test, y_train, y_test =
cross_validation.train_test_split(iris.data,
iris.target, test_size=0.10, random_state=111)

The first line imports the cross_validation library into your session. The second statement splits the dataset, setting aside 10 percent of the sample as the test set.

  • X_train will contain 135 observations and their features.
  • y_train will contain 135 labels in the same order as the 135 X_train observations.
  • X_test will contain 15 observations (10 percent) and their features.
  • y_test will contain 15 labels in the same order as the 15 X_test observations.

The following code verifies that the split is what you expected:

>>> X_train.shape
(135, 4)
>>> y_train.shape
(135,)
>>> X_test.shape
(15, 4)
>>> y_test.shape
(15,)

You can see from the output that there are 135 observations with 4 features and 135 labels in the training set. The test set has 15 observations with 4 features and 15 labels.

warning Many beginners in the field of predictive analytics forget to split the datasets — which introduces a serious design flaw into the project. If the full 150 instances were loaded into the machine as training data, that would leave no unseen data for testing the model. Then you'd have to resort to reusing some of the training instances to test the predictive model. You'll see that in such a situation, the model always predicts the correct class — because you're using the same exact data you used to train the model. The model has already seen this pattern before; it will have no problem just repeating what it's seen. A working predictive model needs to make predictions for data that it hasn't seen yet.

When you have an instance of an SVM classifier, a training dataset, and a test dataset, you're ready to train the model with the training data. Typing the following code into the interpreter will do exactly that:

>>> svmClassifier.fit(X_train,y_train)

This line of code creates a working model to make predictions from. Specifically, it creates a predictive model that will predict which class of Iris a new, unlabeled observation belongs to. The svmClassifier instance has several methods that you can call to do various things. For example, after calling the fit method, the most useful method to call is the predict method. That's the method to which you'll feed new data; in return, it predicts the outcome.

Running the test data

Using 10 percent of the 150 instances from the dataset gives you 15 test-data points to run through the model. Let's see how your predictive model will perform. Type the following code listing into the interpreter:

>>> predicted = svmClassifier.predict(X_test)
>>> predicted
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2])

The predict function in the first line of code is what does the prediction, as you may have guessed. It takes the test data as input and outputs the results into the variable predicted. The second line prints the output. The last line in the code section is the output, or prediction: an array of 15 values, one for each observation in the test set (10 percent of the sample dataset). The numbers in the array represent the Iris Flower classes.
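If you'd rather see flower names than integers, one small trick (assuming the iris variable from the data-loading step is still in your session) is to index the target_names array with the prediction array:

>>> iris.target_names[predicted]

The interpreter echoes an array of class names ('setosa', 'versicolor', or 'virginica') in the same order as the 15 predictions.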

Evaluating the model

To evaluate the accuracy of your model, you can compare the output array with the y_test array. For this small sample dataset, you can easily tell how it performed by seeing that the output array from the predict function is almost the same as the y_test array. The last line in the code is a simple equality check between the two arrays, sufficient for this simple test case. Here's the code:

>>> predicted
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2])
>>> y_test
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> predicted == y_test
array([ True, True, True, True, True, True, True,
True, True, True, True, True, False, True,
True], dtype=bool)

Looking at the output array with all the Boolean (True and False) values, you can see that the model predicted all but one outcome. On the thirteenth data point, it predicted 1 (Versicolor) when it should have been 2 (Virginica). The False value(s) indicate that the model predicted the incorrect Iris class for that data point. The percentage of correct predictions will determine the accuracy of the predictive model. In this case you can simply use basic division and get the accuracy:

correct outcomes / test size => 14 / 15 => 0.9333 or 93.33 percent
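If you'd rather have Python do that arithmetic, a short sketch (assuming predicted and y_test are still defined in your session; the variable name correct is just for illustration) looks like this:

>>> correct = (predicted == y_test).sum()
>>> print 'correct: %d out of %d' % (correct, len(y_test))
correct: 14 out of 15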

It's no surprise that the model confused Virginica with Versicolor; those two classes aren't separable by a straight line. A failure to predict Setosa, however, would be surprising because Setosa is clearly linearly separable. Still, the accuracy was 14 out of 15, or 93.33 percent.

For a test set with more data points, you may want to use the metrics module to do your measurements. The following code will get the accuracy of the model:

>>> from sklearn import metrics
>>> metrics.accuracy_score(y_test, predicted)
0.93333333333333335

In this example, the mean absolute error works out to 1 minus the accuracy score (1 – 0.9333), because the single misclassification is off by exactly one class label. Here is the code to get this value:

>>> metrics.mean_absolute_error(y_test, predicted)
0.066666666666666666

Another useful measurement tool is the confusion matrix. Yes, it's real. It's a matrix (tabular format) that shows the predictions that the model made on the test data. Here is the code that displays the confusion matrix:

>>> metrics.confusion_matrix(y_test, predicted)
array([[5, 0, 0],
[0, 2, 0],
[0, 1, 7]])

The diagonal from the top-left corner to the bottom-right corner holds the number of correct predictions for each class. Each row corresponds to a class of Iris. The first row corresponds to the Setosa class: the model predicted five test data points correctly and made no errors for Setosa. (If it had made an error, a number other than zero would appear in one of the other columns of that row.) The second row corresponds to the Versicolor class: two correct predictions and no errors. The third row corresponds to the Virginica class: the model predicted seven test data points correctly but also made one error, mistakenly predicting one Virginica observation as Versicolor. You can tell that by looking at the column where the error shows up. Column 1 (the second column, because Python arrays start at 0) belongs to Versicolor.
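If you want a per-class breakdown without decoding the matrix by hand, the metrics module also provides a classification report. Here's a minimal sketch, assuming y_test, predicted, and the iris variable are still in your session:

>>> print metrics.classification_report(y_test, predicted,
        target_names=iris.target_names)

The report lists precision, recall, and f1-score for Setosa, Versicolor, and Virginica separately, which makes it easy to see that the only confusion is between the last two classes.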

The accuracy of a predictive model's results will directly affect the decision to deploy that model; the higher the accuracy, the more easily you can gather support for deploying the model.

tip When creating a predictive model, start by building a simple working solution quickly — and then continue to build iteratively until you get the desired outcome. Spending months building a predictive model — and not being able to show your stakeholders any results — is a sure way to lose the attention and support of your stakeholders.

Here is the full listing of the code to create and evaluate a SVM classification model:

>>> from sklearn.datasets import load_iris
>>> from sklearn.svm import LinearSVC
>>> from sklearn import cross_validation
>>> from sklearn import metrics
>>> iris = load_iris()
>>> X_train, X_test, y_train, y_test =
cross_validation.train_test_split(iris.data,
iris.target, test_size=0.10, random_state=111)
>>> svmClassifier = LinearSVC(random_state=111)
>>> svmClassifier.fit(X_train, y_train)
>>> predicted = svmClassifier.predict(X_test)
>>> predicted
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2])
>>> y_test
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> metrics.accuracy_score(y_test, predicted)
0.93333333333333335
>>> predicted == y_test
array([ True, True, True, True, True, True, True,
True, True, True, True, True, False, True,
True], dtype=bool)

Visualizing the classifier

Looking at the decision surface area on the plot, as shown in Figure 12-14, it looks like some tuning has to be done. If you look near the middle of the plot, you can see that many of the data points belonging to the middle area (Versicolor) are lying in the area to the right side (Virginica).

FIGURE 12-14: Classification based on logistic regression with C=1.

Figure 12-15 shows the decision surface with a C value of 150. It visually looks better, so choosing to use this setting for your logistic regression model seems appropriate.

FIGURE 12-15: Classification based on logistic regression with C=150.

Creating a supervised learning model with logistic regression

After you build your first classification predictive model, creating more models like it is a really straightforward task in scikit. The only real difference from one model to the next is that you may have to tune the parameters from algorithm to algorithm. We felt it was important to include a few examples of the classification technique so that you can feel comfortable trying other algorithms from the scikit library.

Loading your data

This code listing will load the iris dataset into your session:

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()

Creating an instance of the classifier

The following two lines of code create an instance of the classifier. The first line imports the logistic regression library. The second line creates an instance of the logistic regression algorithm.

>>> from sklearn import linear_model
>>> logClassifier = linear_model.LogisticRegression(C=1,
random_state=111)

Notice the C parameter (regularization parameter) in the constructor. The regularization parameter is used to prevent overfitting (see Chapter 15 for more about overfitting). The parameter isn't strictly necessary (the constructor will work fine without it because it will default to C=1). Later in this section, however, we create a logistic regression classifier using C=150 because it produces a better-looking plot of the decision surface, so we're just introducing the parameter here. (You can see both plots in the earlier “Visualizing the classifier” section of this chapter, in Figures 12-14 and 12-15.)

FIGURE 12-16: Plotting data elements from the Iris dataset.

FIGURE 12-17: Classification based on Support Vector Machine.

Running the training data

You need to split the dataset into training and test sets before you can train the logistic regression classifier. The following code will accomplish that task:

>>> from sklearn import cross_validation
>>> X_train, X_test, y_train, y_test =
cross_validation.train_test_split(iris.data,
iris.target, test_size=0.10, random_state=111)
>>> logClassifier.fit(X_train, y_train)

  1. Line 1 imports the library that allows us to split the dataset into two parts.
  2. Line 2 calls the function from the library that splits the dataset into two parts and assigns the now-divided datasets to two pairs of variables.
  3. Line 3 takes the instance of the logistic regression classifier you just created and calls the fit method to train the model with the training dataset.

Running the test data

In the following code, the first line feeds the test dataset to the model and the second line displays the output:

>>> predicted = logClassifier.predict(X_test)
>>> predicted
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])

Evaluating the model

You can cross-reference the output from the prediction against the y_test array. As a result, you can see that it predicted all the test data points correctly. Here's the code:

>>> from sklearn import metrics
>>> predicted
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> y_test
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> metrics.accuracy_score(y_test, predicted)
1.0 # 1.0 is 100 percent accuracy
>>> predicted == y_test
array([ True, True, True, True, True, True, True,
True, True, True, True, True, True, True,
True], dtype=bool)

So how does the logistic regression model with parameter C=150 compare to that? It should do better because the visualization looked better, but you can't beat 100 percent. Here is the code to create and evaluate the logistic classifier with C=150:

>>> logClassifier_2 = linear_model.LogisticRegression(
C=150, random_state=111)
>>> logClassifier_2.fit(X_train, y_train)
>>> predicted = logClassifier_2.predict(X_test)
>>> metrics.accuracy_score(y_test, predicted)
0.93333333333333335
>>> metrics.confusion_matrix(y_test, predicted)
array([[5, 0, 0],
[0, 2, 0],
[0, 1, 7]])

We expected better, but it was actually worse. There was one error in the predictions. The result is the same as that of the SVM model built earlier in the chapter.
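If you're curious how sensitive the model is to this parameter, a quick experiment (assuming the imports and the training/test split from earlier in this section are still in your session; clf is just a throwaway variable name) is to loop over a few C values and compare accuracy scores:

>>> for c in [0.1, 1, 10, 100, 150]:
...     clf = linear_model.LogisticRegression(C=c, random_state=111)
...     clf.fit(X_train, y_train)
...     print c, metrics.accuracy_score(y_test, clf.predict(X_test))
...

The exact numbers you see depend on the split, but the loop lets you check at a glance how the choice of C affects this particular test set.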

Here is the full listing of the code to create and evaluate a logistic regression classification model with the default parameters:

>>> from sklearn.datasets import load_iris
>>> from sklearn import linear_model
>>> from sklearn import cross_validation
>>> from sklearn import metrics
>>> iris = load_iris()
>>> X_train, X_test, y_train, y_test =
cross_validation.train_test_split(iris.data,
iris.target, test_size=0.10, random_state=111)
>>> logClassifier = linear_model.LogisticRegression(
random_state=111)
>>> logClassifier.fit(X_train, y_train)
>>> predicted = logClassifier.predict(X_test)
>>> predicted
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> y_test
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> metrics.accuracy_score(y_test, predicted)
1.0 # 1.0 is 100 percent accuracy
>>> predicted == y_test
array([ True, True, True, True, True, True, True,
True, True, True, True, True, True, True,
True], dtype=bool)

Visualizing the classifier

The Iris dataset isn't easy to graph in its original form because you can't plot all four coordinates (from the features) of the dataset onto a two-dimensional screen. Therefore, you can either pick two of the four features for visualization purposes or (often better) reduce the dimensions by applying a dimensionality-reduction algorithm to the features. In this case, the algorithm you'll be using to do the data transformation (reducing the dimensions of the features) is called Principal Component Analysis (PCA).

The PCA algorithm takes all four features (numbers), does some math on them, and outputs two new numbers that you can use to do the plot. Think of PCA as following two general steps:

  1. It takes as input a dataset with many features.
  2. It reduces that input to a smaller set of features (user-defined or algorithm-determined) by transforming the feature set into what it considers the main (principal) components.

This transformation of the feature set is also called feature extraction. The following code does the dimension reduction:

>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2).fit(X_train)
>>> pca_2d = pca.transform(X_train)

tip If you've already imported any libraries or datasets listed in this code section, it isn't necessary to re-import or load them in your current Python session. If you do so, however, it shouldn't affect your program.

After you run the code, you can type the pca_2d variable in the interpreter and see that it outputs arrays with two items instead of four. These two new numbers are mathematical representations of the four old numbers. With the reduced feature set, you can plot the results by using the following code:
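If you're wondering how much information survives the squeeze from four features down to two, a quick check (using the pca object you just fit) is the explained_variance_ratio_ attribute:

>>> pca.explained_variance_ratio_        # variance captured by each of the 2 components
>>> pca.explained_variance_ratio_.sum()  # total fraction retained in the 2-D plot

For the Iris measurements, the two components together retain nearly all of the variance, which is why a two-dimensional plot is still a faithful picture of the four-dimensional data.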

>>> import pylab as pl
>>> for i in range(0, pca_2d.shape[0]):
...     if y_train[i] == 0:
...         c1 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='r', marker='+')
...     elif y_train[i] == 1:
...         c2 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='g', marker='o')
...     elif y_train[i] == 2:
...         c3 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='b', marker='*')
...
>>> pl.legend([c1, c2, c3], ['Setosa', 'Versicolor', 'Virginica'])
>>> pl.title('Iris training dataset with 3 classes and known outcomes')
>>> pl.show()

Figure 12-16 is a scatter plot — a visualization of plotted points representing observations on a graph. This particular scatter plot represents the known outcomes of the Iris training dataset. There are 135 plotted points (observations) from our training dataset. (You can see a similar plot, using all 150 observations, in Chapter 13.) The training dataset consists of

  • 45 pluses that represent the Setosa class.
  • 48 circles that represent the Versicolor class.
  • 42 stars that represent the Virginica class.

You can confirm the stated number of classes by entering the following code:

>>> sum(y_train==0)
45
>>> sum(y_train==1)
48
>>> sum(y_train==2)
42

From this plot you can clearly tell that the Setosa class is linearly separable from the other two classes. While the Versicolor and Virginica classes aren't completely separable by a straight line, they aren't overlapping by very much. From a simple visual perspective, the classifiers should do pretty well.

Figure 12-17 shows a plot of the Support Vector Machine (SVM) model trained with a dataset that has been dimensionally reduced to two features. This isn't the same SVM model that you trained earlier in the preceding section; that SVM model used all four features. Four features is a small feature set; we want to keep all four so that the data can retain most of its useful information. The plot is shown here as a visual aid.

This plot includes the decision surface for the classifier — the area in the graph that represents the decision function that SVM uses to determine the outcome of new data input. The lines separate the areas where the model will predict the particular class that a data point belongs to. The left section of the plot will predict the Setosa class, the middle section will predict the Versicolor class, and the right section will predict the Virginica class.

remember The SVM model that you created didn't use the dimensionally reduced feature set. We only use dimensionality reduction here to generate a plot of the decision surface of the SVM model, as a visual aid. The full listing of the code that creates the plot is provided for reference. It shouldn't be run in sequence with our current example if you're following along, because it may overwrite some of the variables that you already have in the session. The code to produce this plot is based on the sample code provided on the scikit-learn website. You can learn more about creating plots like these at the scikit-learn website.

Here is the full listing of the code that creates the plot:

>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
>>> from sklearn import svm
>>> from sklearn import cross_validation
>>> import pylab as pl
>>> import numpy as np
>>> iris = load_iris()
>>> X_train, X_test, y_train, y_test =
cross_validation.train_test_split(iris.data,
iris.target, test_size=0.10, random_state=111)
>>> pca = PCA(n_components=2).fit(X_train)
>>> pca_2d = pca.transform(X_train)
>>> svmClassifier_2d =
svm.LinearSVC(random_state=111).fit(
pca_2d, y_train)
>>> for i in range(0, pca_2d.shape[0]):
...     if y_train[i] == 0:
...         c1 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='r', s=50, marker='+')
...     elif y_train[i] == 1:
...         c2 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='g', s=50, marker='o')
...     elif y_train[i] == 2:
...         c3 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='b', s=50, marker='*')
...
>>> pl.legend([c1, c2, c3], ['Setosa', 'Versicolor',
'Virginica'])
>>> x_min, x_max = pca_2d[:, 0].min() - 1,
pca_2d[:,0].max() + 1
>>> y_min, y_max = pca_2d[:, 1].min() - 1,
pca_2d[:, 1].max() + 1
>>> xx, yy = np.meshgrid(np.arange(x_min, x_max, .01),
np.arange(y_min, y_max, .01))
>>> Z = svmClassifier_2d.predict(np.c_[xx.ravel(),
yy.ravel()])
>>> Z = Z.reshape(xx.shape)
>>> pl.contour(xx, yy, Z)
>>> pl.title('Support Vector Machine Decision Surface')
>>> pl.axis('off')
>>> pl.show()

Creating a supervised learning model with random forest

The random forest model is an ensemble model: it combines an ensemble (a collection) of decision trees to create its model. The idea is to build many weak learners (decision trees, each trained on a random subset of the training data) and have them vote on the outcome; the majority vote becomes the model's prediction. The random forest model can be used for either classification or regression. In the following example, the random forest model is used to classify the Iris species.

Loading your data

This code listing will load the iris dataset into your session:

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()

Creating an instance of the classifier

The following two lines of code create an instance of the classifier. The first line imports the random forest library. The second line creates an instance of the random forest algorithm:

>>> from sklearn.ensemble import RandomForestClassifier
>>> rf = RandomForestClassifier(n_estimators=15,
random_state=111)

The n_estimators parameter in the constructor is a commonly used tuning parameter for the random forest model. The value sets the number of trees to build in the forest; there's no fixed rule for choosing it, because the best value depends on the data you're using. Here, the value is set at 15 trees. Later, in the “Evaluating the model” section, you'll see that changing the parameter value to 150 produces the same results for this small dataset.

The n_estimators parameter is used to tune model performance. In general, more trees improve performance at the cost of extra computation, while fewer trees train faster but may underperform. There's also a point of diminishing returns, where adding more trees yields little or no improvement in accuracy while dramatically increasing the computational power needed. The parameter defaults to 10 if it's omitted from the constructor.

As with the other classifiers created earlier in this chapter (in the “Creating a supervised learning model with SVM” and “Creating a supervised learning model with logistic regression” sections), the steps to train, test, and evaluate the model are similar.

Running the training data

You need to split the dataset into training and test sets before you can train the random forest classifier. The following code will accomplish that task:

>>> from sklearn import cross_validation
>>> X_train, X_test, y_train, y_test =
cross_validation.train_test_split(iris.data,
iris.target, test_size=0.10, random_state=111)
>>> rf = rf.fit(X_train, y_train)

  1. Line 1 imports the library that allows us to split the dataset into two parts.
  2. Line 2 calls the function from the library that splits the dataset into two parts and assigns the now-divided datasets to two pairs of variables.
  3. Line 3 takes the instance of the random forest classifier you just created, then calls the fit method to train the model with the training dataset.

Running the test data

In the following code, the first line feeds the test dataset to the model and the second line displays the output:

>>> predicted = rf.predict(X_test)
>>> predicted
array([0, 0, 2, 2, 2, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2])

Evaluating the model

You can cross-reference the output from the prediction against the y_test array. As a result, you can see that it predicted two test data points incorrectly. So the accuracy of the random forest model was 86.67 percent.

Here's the code:

>>> from sklearn import metrics
>>> predicted
array([0, 0, 2, 2, 2, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2])
>>> y_test
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> metrics.accuracy_score(y_test, predicted)
0.8666666666666667 # 1.0 is 100 percent accuracy
>>> predicted == y_test
array([ True, True, True, True, False, True, True,
True, True, True, True, True, False, True,
True], dtype=bool)

How does the random forest model perform if we change the n_estimators parameter to 150? It looks like it won’t make a difference for this small dataset. It produces the same result:

>>> rf = RandomForestClassifier(n_estimators=150,
random_state=111)
>>> rf = rf.fit(X_train, y_train)
>>> predicted = rf.predict(X_test)
>>> predicted
array([0, 0, 2, 2, 2, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2])
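If you want to see this for yourself, a small sketch (assuming the imports and the training/test split from earlier in this chapter are still in your session; model is just a throwaway variable name) loops over a few forest sizes and prints the accuracy for each:

>>> for n in [5, 15, 50, 150]:
...     model = RandomForestClassifier(n_estimators=n, random_state=111)
...     model.fit(X_train, y_train)
...     print n, metrics.accuracy_score(y_test, model.predict(X_test))
...

Your exact numbers depend on the split and the random seed, but on a dataset this small you'll typically see the scores level off quickly as the forest grows.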

Comparing the classification models

The logistic regression and SVM classification models perform rather well on the Iris dataset. The logistic regression model with parameter C=1 was perfect in its predictions, while the SVM model and the logistic regression model with C=150 missed only one prediction. The random forest model missed two predictions and was the least accurate of the three models, but obtaining over 86 percent accuracy is still pretty good.

Indeed, the high accuracy of all three models is a result of having a small dataset with data points that are close to linearly separable. At the same time, the dataset is so small that small differences in performance can easily be influenced by subtle randomness.

Interestingly, the logistic regression model with C=150 had a better-looking decision surface plot than the one with C=1, but it didn't perform better. That isn't such a big deal, considering that the test set is so small. If another random split between training set and test set had been selected, the results could have easily been different.

This reveals another source of complexity that crops up in model evaluation: the effect of sampling, and how choosing the training and testing sets can affect the model's output. Cross-validation techniques (see Chapter 15) can help minimize the impact of random sampling on the model's performance.
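As a preview of that idea, scikit-learn can already score a model across several different splits for you. Here's a minimal sketch using ten-fold cross-validation on the logistic regression classifier, assuming logClassifier and the iris data are still in your session:

>>> from sklearn.cross_validation import cross_val_score
>>> scores = cross_val_score(logClassifier, iris.data,
        iris.target, cv=10)
>>> scores.mean()    # average accuracy across the ten folds

Averaging over ten different train/test splits gives a steadier estimate of accuracy than any single 15-point test set can.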

remember For a larger dataset with non-linearly separable data, you would expect the results to deviate even more. In addition, choosing the appropriate model becomes increasingly difficult due to the complexity and size of the data. Be prepared to spend a great deal of time tuning your parameters to get an ideal fit.

tip When creating predictive models, try a few algorithms and exhaustively tune their parameters until you find what works best for your data. Then compare their outputs against each other. Almost all modeling algorithms have these kinds of parameters, and fine tuning them can have a measurable impact on the final result.
