Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 4
Creating Machine Learning Models with Scikit-learn

WHAT'S IN THIS CHAPTER

Introduction to Scikit-learn
Learn to split your training data into training and testing sets
Learn to use k-fold cross validation
Learn to create different types of machine learning models

In Chapter 2, you learned about techniques to explore data and perform feature engineering. In this chapter you will learn to use Scikit-learn to split your training data into training and test sets, and to create different types of machine learning models. This chapter will use the Titanic and Iris datasets to illustrate different types of model-building techniques. A copy of these datasets is included with the files that accompany this chapter.

NOTE To follow along with this chapter ensure you have installed Anaconda Navigator and Jupyter Notebook as described in Appendix A.
You can download the code files for this chapter from www.wiley.com/go/machinelearningawscloud or from GitHub using the following URL:

https://github.com/asmtechnology/awsmlbook-chapter4.git

Introducing Scikit-learn

Scikit-learn is a Python library that provides a number of features that are suitable for machine learning engineers and data scientists. It was developed by David Cournapeau in 2007 and today provides ready-to-use implementations of several popular machine learning algorithms such as linear regression, logistic regression, support vector machines, clustering, and random forests. In addition to providing ready-to-use implementations of popular machine learning algorithms, Scikit-learn also provides tools to split datasets into test-train subsets, implement k-fold cross validation, evaluate model performance using popular metrics, and detect outliers, and includes algorithms to aid with feature selection. Scikit-learn builds upon several other libraries such as NumPy, Pandas, and Matplotlib, and at its core is focused on model-building and -evaluation tasks, and not tasks such as data loading and visualization.

Scikit-learn is one of the reasons for the rise in number of real-world applications of machine learning. Scikit-learn's vast collection of algorithms allow you to get started with machine learning without a significant background in mathematics and statistics.

In order to use your data with Scikit-learn, it must be loaded into NumPy arrays or Pandas dataframes. You can get more information on the capabilities of Scikit-learn at https://scikit-learn.org/stable/.

Creating a Training and Test Dataset

At its heart, building a machine learning model involves creating a computer program that can draw inferences from the features during the training phase and then testing the quality of the model by making predictions. A common practice involves setting aside some of the labeled training data before the model-building phase and testing the model using this data that the model has not previously encountered. The performance of the model on this unseen data is used to determine if the model is good enough, or if improvements are needed. The benefit of having a separate training and testing set is that it ensures the model has not memorized the training examples (a phenomenon known as overfitting). It is important to note that in the case of supervised learning, both the training and test sets are labeled and the original dataset must be evenly shuffled before the subsets are created.

Scikit-learn provides a function called train_test_split() in the model_selection submodule that can be used to split a Pandas dataframe into two dataframes, one for model building and the other for model evaluation. The test_train_split() function has several parameters, most of which have default values. The most commonly used parameters are:

test_size: This value can be an integer or floating-point number. When the value is an integer, it specifies the number of elements that should be retained for the test set. When the value is a floating-point number, it specifies the percentage of the original dataset to include in the test set.
random_state: This is an integer value that is used to seed the random-number generator used to shuffle the samples.

The output of the train_test_split() function is a list of four arrays in the following order:

The first item of the list is an array that contains the training set features.
The second item of the list is an array that contains the test set features.
The third item of the list is an array that contains the training set labels (target variable).
The fourth item of the list is an array that contains the test set labels.

You can find detailed information on the parameters of the train_test_split() function at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html.

The following snippet demonstrates the use of this function to split the Iris flowers dataset into a training and test set, with 25% of the data reserved for the test set:

import numpy as np
import pandas as pd
# load iris data set
from sklearn.datasets import load_iris
iris_dataset = load_iris()
df_iris_features = pd.DataFrame(data = iris_dataset.data, columns=iris_dataset.feature_names)
df_iris_target = pd.DataFrame(data = iris_dataset.target, columns=['class'])
# split iris dataset
iris_split = train_test_split(df_iris_features, df_iris_target, 
                test_size=0.25, random_state=17)
df_iris_features_train = iris_split[0]
df_iris_features_test = iris_split[1]
df_iris_target_train = iris_split[2]
df_iris_target_test = iris_split[3]

You can use the dataframe's shape property to inspect the size of the training and test datasets created by the train_test_split() function:

df_iris_features.shape, df_iris_target.shape
((150, 4), (150, 1))
df_iris_features_train.shape, df_iris_target_train.shape
((112, 4), (112, 1))
df_iris_features_test.shape, df_iris_target_test.shape
((38, 4), (38, 1))

If you use the dataframe's head() method to inspect the first five rows of the original df_iris_features dataset and compare it with the first five rows of the df_iris_features_train dataset, you will notice that the train_test_split() function has automatically shuffled the data before splitting. This is illustrated in Figure 4.1.

Graphs depict Sckit-learn's train_test_split() method automatically shuffles the data prior to splitting. — **FIGURE 4.1** Scikit-learn's `train_test_split()` method automatically shuffles the data prior to splitting.

The default behavior of the train_test_split() function is to shuffle the data, then determine the boundary observation where the training set should end and prepare two datasets by splitting at this boundary position. If the problem you are trying to solve is one of multi-class classification and your original data has a disproportionate number of samples from one category over the other, then it is important to ensure that the split datasets also have similar proportions. The train_test_split() function has a parameter called stratify that can be used to achieve a stratified split, maintaining the proportions of categorical observations before and after the split. The following snippet demonstrates the use of the stratify parameter:

# iris dataset, with stratified sampling
iris_split_strat = train_test_split(df_iris_features, df_iris_target, 
                                    test_size=0.25, random_state=17, stratify=df_iris_target)
df_iris_features_train2 = iris_split_strat[0]
df_iris_features_test2 = iris_split_strat[1]
df_iris_target_train2 = iris_split_strat[2]
df_iris_target_test2 = iris_split_strat[3]

The following snippet uses Pandas' plotting functions to create a bar chart of the distribution of categories in the original dataset, the unstratified training set, and the stratified training set. Note that the distribution in the stratified set is closer to the original, though not identical. The resulting plots are depicted in Figure 4.2.

Graphs depict comparison of the distribution of target variables in the original and split datasets, with and without stratified sampling. — **FIGURE 4.2** Comparison of the distribution of target variables in the original and split datasets, with and without stratified sampling

# visualize the distribution of target values in the
# original dataset and the training sets created by the train_test_split 
# function, with and without stratification
# use Pandas dataframe functions to plot a bar chart of the 'Embarked' attribute
fig, axes = plt.subplots(1, 3, figsize=(15,5))
axes[0].set_title('df_iris_target')
df_iris_target['class'].value_counts(dropna=False).plot.bar(grid=True, ax=axes[0])
axes[1].set_title('df_iris_target_train')
df_iris_target_train['class'].value_counts(dropna=False).plot.bar(grid=True, ax=axes[1])
axes[2].set_title('df_iris_target_train2')
df_iris_target_train2['class'].value_counts(dropna=False).plot.bar(grid=True, ax=axes[2])

The distribution of target values between the three categories of flowers is identical in the Iris flowers dataset. This can be seen in the first histogram in Figure 4.2, with each category having 50 values. To better illustrate the use of stratified sampling, the following snippet loads the toy version of the UCI ML wines dataset and plots the difference between stratified and unstratified splits. The UI ML wines dataset is another popular dataset used by beginners for multi-class classification problems. It contains a number of numeric features that contain the results of chemical analysis on wines grown in four different regions of Italy, and a categorical target that indicates the overall quality of the wine. You can find information on the attributes of this dataset at https://scikit-learn.org/stable/datasets/index.html. The resulting plots are depicted in Figure 4.3.

Graphs depict comparison of the distribution of target variables in the original and split versions of the UCI ML wines dataset, with and without stratified sampling. — **FIGURE 4.3** Comparison of the distribution of target variables in the original and split versions of the UCI ML wines dataset, with and without stratified sampling

# Load the UCI ML Wines dataset
from sklearn.datasets import load_wine
wine_dataset = load_wine()
df_wine_features = pd.DataFrame(data = wine_dataset.data, columns=wine_dataset.feature_names)
df_wine_target = pd.DataFrame(data = wine_dataset.target, columns=['class'])
#wines dataset
wines_split = train_test_split(df_wine_features, df_wine_target, 
                               test_size=0.25, random_state=17)
df_wine_features_train = wines_split[0]
df_wine_features_test = wines_split[1]
df_wine_target_train = wines_split[2]
df_wine_target_test = wines_split[3]
# wines dataset, with stratified sampling
wines_split_strat = train_test_split(df_wine_features, df_wine_target, 
                                     test_size=0.25, random_state=17, stratify=df_wine_target)
df_wine_features_train2 = wines_split_strat[0]
df_wine_features_test2 = wines_split_strat[1]
df_wine_target_train2 = wines_split_strat[2]
df_wine_target_test2 = wines_split_strat[3]
 
# visualize the distribution of target values in the
# original wines dataset and the training sets created by the train_test_split 
# function, with and without stratification
# use Pandas dataframe functions to plot a bar chart of the 'Embarked' attribute
fig, axes = plt.subplots(1, 3, figsize=(15,5))
axes[0].set_title('df_wine_target')
df_wine_target['class'].value_counts(dropna=False).plot.bar(grid=True, ax=axes[0])
axes[1].set_title('df_wine_target_train')
df_wine_target_train['class'].value_counts(dropna=False).plot.bar(grid=True, ax=axes[1])
axes[2].set_title('df_wine_target_train2')
df_wine_target_train2['class'].value_counts(dropna=False).plot.bar(grid=True, ax=axes[2])

K-Fold Cross Validation

The main drawback with the idea of splitting the training dataset into a training and validation set is that it is possible for the samples in the training set to exhibit characteristics that may not be found in any of the samples in the test set. While shuffling the data can help mitigate this, the extent of the mitigation depends on various factors such as the size of the original dataset, and the proportion of samples that exhibit a particular characteristic. The solution to avoid creating a model that is susceptible to characteristics only found in the training set is embodied in a technique called k-fold cross validation. At a very high level, the k-fold cross validation technique works as follows:

Choose a value of k.
Shuffle the data.
Split the data into k equal subsets.
For each value of k:
1. Train a model that uses the kth subset as the test set and the samples of the k-1 subsets as the training set.
2. Record the performance of the model when making predictions on the kth subset.
Compute the mean performance of the individual models to work out the overall performance.

K-fold cross validation can help minimize the possibility of the model picking up on unexpected bias in the training set. The idea behind k-fold cross validation is to shuffle the entire dataset randomly and divide it into a number of smaller sets (known as folds) and train multiple models (or the same model multiple times). During each training and evaluation cycle, one of the folds will be held out as the test set and the remaining will make the training set. This is illustrated in Figure 4.4.

Chart depicts cross-validation using k-folds for both test set and training set. — **FIGURE 4.4** Cross-validation using k-folds

If k=1, then the k-fold cross validation approach becomes similar to the train/test split method discussed earlier in this chapter. If k=n, the number of samples in the training set, then in effect the test set contains only one sample, and each sample will get to be part of the test set during one of the iterations. This technique is also known as leave-one-out cross validation. Many academic research papers use k=5 or k=10; however, there is no hard-and-fast rule governing the value of k.

Scikit-learn provides a class called KFold as part of the model-building module that can be used to create the folds and enumerate through the folds. The constructor for the KFold class takes three parameters:

n_splits: An integer that represents the number of folds required.
shuffle: An optional Boolean value that indicates whether the data should be shuffled before the folds are created.
random_state: An optional integer that is used to seed the random-number generator used to shuffle the data.

The KFold class provides two methods:

get_n_splits(): Returns the number of folds
split(): Gets the indices of the training and test set members for each fold.

You can find more information on the KFold class at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html. The following snippet demonstrates the use of the KFold class to split the contents of the Iris dataset into 10 folds and generate the test and training sets:

# perform 10-fold split on the Iris dataset
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True)
fold_number = 1
for train_indices, test_indices in kf.split(df_iris_features):
    
    print("Fold number:", fold_number)
    print("Training indices:", train_indices)
    print("Testing indices::", test_indices)
    
    fold_number = fold_number + 1
    
    df_iris_features_train = df_iris_features.iloc[train_index]
    df_iris_target_train = df_iris_target.iloc[train_index]
    
    df_iris_features_test = df_iris_features.iloc[test_index]
    df_iris_target_test = df_iris_target.iloc[test_index]

You can inspect the index positions that constitute the training and test sets for each iteration. The indices for the first two iterations are presented here:

Fold number: 1
Training indices: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  18
  19  20  21  22  23  26  27  28  29  31  32  33  34  35  36  38  39  41
  42  43  44  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
  61  62  63  64  65  66  68  69  70  71  72  73  74  75  76  77  78  79
  80  82  83  84  85  86  87  88  89  90  91  94  95  96  97  98  99 100
 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
 120 121 122 123 124 125 127 128 129 130 131 132 133 134 135 136 137 138
 139 140 141 142 143 145 146 147 148]
Testing indices:: [ 17  24  25  30  37  40  45  67  81  92  93 119 126 144 149]
 
Fold number: 2
Training indices: [  1   2   3   4   5   6   7   8   9  10  12  13  14  16  17  18  19  20
  21  22  23  24  25  26  27  28  29  30  31  32  34  35  36  37  39  40
  41  42  43  44  45  46  47  48  50  51  52  53  55  56  57  58  59  60
  61  62  63  64  65  67  68  69  70  71  73  74  75  76  77  78  80  81
  83  84  85  86  87  88  89  91  92  93  94  95  96  97  98  99 100 101
 102 103 104 105 106 107 109 110 111 112 113 114 115 116 117 118 119 120
 121 122 123 124 125 126 128 130 131 132 133 134 135 136 137 138 139 140
 141 142 143 144 145 146 147 148 149]
Testing indices:: [  0  11  15  33  38  49  54  66  72  79  82  90 108 127 129]

Creating Machine Learning Models

In this section you will learn to create models that can be used to both predict the value of the target variable from the feature variables and classify data. A selection of models will be looked at, some of which assume a linear relationship between the target and the features, as well as models that can be used when the relationship is nonlinear.

Linear Regression

Linear regression is a statistical technique that aims to find the equation of a line (or hyperplane) that is closest to all the points in the dataset. To understand how linear regression works, let's assume you have a training dataset of 100 rows, and each row consists of three features, Y1, Y2, and Y3, and a known target value X. Linear regression will assume that the relationship between the target variable X and input features Y1, Y2, and Y3 is linear, and can be expressed by this equation:

X_i = αY1_i + βY2_i + γY3_i + ε.

where:

X_i is the predicted value of the i^th target variable.
Y1_i is the i^th value of feature Y1.
Y2_i is the i^th value of feature Y2.
Y3_i is the i^th value of feature Y3.
α, β, γ are the coefficients of the features Y1, Y2, Y3.
ε is a constant term, also known as the bias term or intercept.

The training process will iterate over the entire training set multiple times and calculate the best values of α, β, γ, and ε. A set of values is considered better if they minimize an error function. An error function is a mathematical function that captures the difference between the predicted and actual values of X_i. Root mean square error (RMSE) is a commonly used error function and is expressed mathematically as:

In effect, linear regression attempts to find the best line (or hyperplane in higher dimensions) that fits all the data points. The output of linear regression is a continuous, unbounded value. It can be a positive number or a negative number, and it can have any value, depending on the inputs with which the model was trained. Therefore, linear regression models are commonly used to predict a continuous numeric value.

Scikit-learn implements linear regression in a class called LinearRegression, which is part of the linear_model module. We will now use this class to implement a linear regression model on the popular Boston housing dataset. The dataset consists of 506 rows, and each row consists of 13 continuous numeric features that contain information such as the per-capita crime rate, average number of rooms per house, rate of property tax, and pupil-teacher ratio. The target value contains the median house price of owner-occupied homes in various parts of Boston.

The dataset does not contain any missing values, and Scikit-learn includes the entire dataset as part of its datasets module. You can find more information on the attributes of this dataset at https://scikit-learn.org/stable/datasets/index.html. Recall from Chapter 2 that Scikit-learn provides a function called DESCR that can be used to print the description of a toy dataset. The following snippet loads the Boston housing dataset and uses the DESCR function to print the description of the dataset:

# load boston house dataset
from sklearn.datasets import load_boston
boston_dataset = load_boston()
df_boston_features = pd.DataFrame(data = boston_dataset.data, columns=boston_dataset.feature_names)
df_boston_target = pd.DataFrame(data = boston_dataset.target, columns=['price'])
# print a description of the dataset.
print(boston_dataset.DESCR)
Boston house prices dataset
---------------------------
**Data Set Characteristics:**  
    :Number of Instances: 506 
    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
    :Missing Attribute Values: None
    :Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
 
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
…', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References
   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Creating the linear model involves splitting the 506 rows of the Boston housing dataset into a training set and a validation set, and using the training set to train the model. The following code snippet creates a 75/25 split of the 506 rows and uses 75% of the original data to train a linear regression model:

# create a training dataset and a test dataset using a 75/25 split.
from sklearn.model_selection import train_test_split
boston_split = train_test_split(df_boston_features, df_boston_target, 
                              test_size=0.25, random_state=17)
df_boston_features_train = boston_split[0]
df_boston_features_test = boston_split[1]
df_boston_target_train = boston_split[2]
df_boston_target_test = boston_split[3]
# train a linear model
from sklearn.linear_model import LinearRegression
linear_regression_model = LinearRegression(fit_intercept=True)
linear_regression_model.fit(df_boston_features_train, df_boston_target_train)

You can instantiate a linear regression model by using the class constructor. The constructor has four parameters, all of which are optional. In most cases, you will instantiate a LinearRegression instance using the default zero-parameter constructor:

linear_regression_model = LinearRegression()

In the preceding snippet, the fit_intercept constructor parameter is set to True. The fit_intercept parameter is used to indicate that the samples are not zero-centered and that the model should calculate the intercept term. If fit_intercept is False, then the model will assume the y-axis intercept is 0 and will only attempt to fit lines (or hyperplanes) that satisfy this constraint. You can find more information on the constructor parameters at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html.

Once you have created a LinearRegression instance, training the model is a simple matter of calling the fit() method, and passing in a dataframe that contains the features and a dataframe that contains the known target values. Once training is complete, you can access the coefficients and intercept terms of the linear model using the coef_ and intercept_ attributes of the LinearRegression instance. The following snippet lists the coefficients and intercept terms after training a linear regression model using the Boston house prices dataset. Note there are 13 coefficients, corresponding to the 13 feature variables:

print (linear_regression_model.coef_)
[[-1.12960344e-01  5.48578928e-02  6.71605489e-02  3.26195457e+00
  -1.70702665e+01  3.49123817e+00  7.03121906e-05 -1.37355630e+00
   3.12880217e-01 -1.32867294e-02 -9.57749225e-01  7.70369247e-03
  -5.59461017e-01]]
print (linear_regression_model.intercept_)
[38.51522467]

Once you have the trained model, you can use the model to make predictions. The following snippet uses the linear_regression_model object to make predictions on the test set (25% of the original 506 samples):

# use the linear model to create predictions on the test set.
predicted_median_house_prices = linear_regression_model.predict(df_boston_features_test)

You will learn about ways to evaluate machine learning models in Chapter 5, but in this case you could get an idea of the quality of predictions created by this model by creating a scatter plot of the predictions made by the model on the test set against the actual house prices in the test set. The following snippet uses functions from Matplotlib's pyplot module to create a scatter plot. The resulting plot is depicted in Figure 4.5.

Graph depicts scatter plot of expected vs. predicted house prices, with expected prices on x axis and predicted prices on y axis. — **FIGURE 4.5** Scatter plot of expected vs. predicted house prices

%matplotlib inline
import matplotlib.pyplot as plt
# use pyplot module to create a scatter plot of predicted vs expected values
fig, axes = plt.subplots(1, 1, figsize=(9,9))
plt.scatter(df_boston_target_test, predicted_median_house_prices)
plt.xlabel("Expected Prices")
plt.ylabel("Predicted prices")
plt.title("Expected vs Predicted prices")

In the ideal case, the scatter plot of the expected house prices against the values predicted by your model should be close to a straight line.

To better understand the result of changing the fit_intercept parameter while creating the LinearRegression instance, the following snippet creates a synthetic dataset of 50 random two-dimensional points and attempts to create two linear regression models on the data. The first model is created with fit_intercept = False, and the second model is created with fit_intercept = True. Both models are presented the X and Y coordinates of the 50 points as training data, with the X coordinate values representing the feature variable and the Y coordinate values representing the target.

In effect, after training, the models will be able to predict the Y coordinate from the X coordinate. Since the X coordinate is the only input feature of this model, the model will contain only one coefficient term. The snippet then creates a scatter plot of the 50 points and overlays the regression line generated by each model. The resulting plot is depicted in Figure 4.6.

Graph depicts scatter plot of synthetic dataset along with regression lines. — **FIGURE 4.6** Scatter plot of synthetic dataset along with regression lines

# create a synthetic regression dataset of X, Y values.
from sklearn.datasets import make_regression
from sklearn.preprocessing import MinMaxScaler
SyntheticX, SyntheticY = make_regression(n_samples=50, n_features=1, noise=35.0, random_state=17)
x_scaler = MinMaxScaler()
x_scaler.fit(SyntheticX.reshape(-1,1))
SyntheticX = x_scaler.transform(SyntheticX.reshape(-1,1))
y_scaler = MinMaxScaler()
y_scaler.fit(SyntheticY.reshape(-1,1))
SyntheticY = y_scaler.transform(SyntheticY.reshape(-1,1))
        
# demonstrate effect of fit_intercept parameter on a simple synthetic dataset.
linear_regression_model_synthetic1 = LinearRegression(fit_intercept=True)
linear_regression_model_synthetic1.fit(SyntheticX, SyntheticY)
linear_regression_model_synthetic2 = LinearRegression(fit_intercept=False)
linear_regression_model_synthetic2.fit(SyntheticX, SyntheticY)
c1 = linear_regression_model_synthetic1.coef_
i1 = linear_regression_model_synthetic1.intercept_
YPredicted1 = np.dot(SyntheticX, c1) + i1
c2 = linear_regression_model_synthetic2.coef_
i2 = linear_regression_model_synthetic2.intercept_
YPredicted2 = np.dot(SyntheticX, c2) + i2
# use pyplot module to create a scatter plot of synthetic dataset
# and overlay the regression line from the two models.
fig, axes = plt.subplots(1, 1, figsize=(9,9))
axes.axhline(y=0, color='k')
axes.axvline(x=0, color='k')
plt.scatter(SyntheticX, SyntheticY)
plt.plot(SyntheticX, YPredicted1, color='#042fed', label='fit_intercept=True')
plt.plot(SyntheticX, YPredicted2, color='#d02fed', label='fit_intercept=False')
plt.legend()
plt.xlabel("X")
plt.ylabel("Y")

In Figure 4.6, you can see that the regression line generated by the model with fit_intercept = False is anchored at Y = 0. The model is therefore constrained in terms of the lines it can generate. On the other hand, the line generated by the model with fit_intercept = True is not anchored at Y = 0, and therefore the model is able to determine the best value for the Y intercept as a result of the training process.

Support Vector Machines

A support vector machine (SVM) is a versatile model that can be used for a variety of tasks, including classification, regression, and outlier detection. The original algorithm was invented in 1963 by Vladimir Vapnik and Alexey Chervonenkis as a binary classification algorithm. During the training process, support vector machine models aim to create a decision boundary that can partition the data points into classes. If the dataset has just two features, then this decision boundary is two-dimensional and can be conveniently represented in a scatter plot. If the decision boundary is linear, it will take the form of a straight line in two dimensions, a plane in three dimensions, and a hyperplane in n-dimensions. As humans, we cannot visualize more than three dimensions, which is why in order to understand how SVMs work, we'll consider a two-dimensional example with a fictional dataset with two features, and each point belonging to one of two classes. Let's also assume that the data is linearly separable—that is, you can draw a line that can separate them. Figure 4.7 depicts a scatter plot of feature values of points from this fictional dataset, with one class of observations represented as circles and the other as stars. The figure also presents three possible linear decision boundaries, each capable of separating the observations into two different sets.

Graphs depict three potential decision boundaries. — **FIGURE 4.7** Three potential decision boundaries

The decision boundary to the left is too close to the first set of observations and there is a risk that a model with that decision boundary could misclassify real-world observations were they only slightly different from the training set. The decision boundary to the right has a similar problem in that it is too close to the second set of observations. The decision boundary in the middle is optimal because it is as far away as possible from both classes and at the same time clearly separates both classes. SVM models aim to find this middle (optimal), or to put it in another way, aim to find the decision boundary that maximizes the distance between the two classes of observations on either side. The half-width of the margin is denoted by the Greek letter ε (epsilon). The points on the edges of the margin are called support vectors (the vector is assumed to originate at the origin and terminate at these points). In effect one could say these vectors are supporting the margins—hence the name support vectors.

Most real-world data is not linearly separable as the dataset depicted in Figure 4.7, and therefore linear decision boundaries are unable to clearly separate the classes. Furthermore, even when the data is linearly separable, there is a possibility that the margin of separation is not as wide as the fictional example in Figure 4.7. Data points are often too close together to allow for wide, clear margins between the decision boundary and the support vectors on either side, and to handle this, SVM implementations include the concept of a tolerance parameter that controls the number of support vectors that can be inside the margins, which in turn has an impact on the width of the margin. Setting a large tolerance value results in a wider margin with more samples in the margin, whereas setting a small tolerance value will result in a narrow margin. Having a wide margin is not necessarily a bad thing, as long as most of the points in the margin are on the correct side of the decision boundary.

Scikit-learn provides an implementation of support vector machine–based classifiers in the SVC class, which is part of the sklearn.svm module. We will now use this class to implement an SVM-based classification model on the popular Pima Indians diabetes dataset. The database consists of eight feature variables that represent various medical measurements such as blood pressure, plasma glucose concentration, BMI, and insulin levels, and contains a binary target variable called Outcome, which indicates whether the individual in question has diabetes. The dataset was originally created by the National Institute of Diabetes and Digestive and Kidney Diseases, and you can find the Kaggle version of the dataset at https://www.kaggle.com/uciml/pima-indians-diabetes-database. A copy of the dataset has been included with the resources that accompany this chapter.

The following snippet can be used to load the dataset from a CSV file, create Pandas dataframes with the feature and target data, normalize the feature data, and create a 75/25 test-train split:

# load Pima Indians Diabetes dataset
diabetes_dataset_file = './datasets/diabetes_dataset/diabetes.csv'
df_diabetes = pd.read_csv(diabetes_dataset_file)
df_diabetes_target = df_diabetes.loc[:,['Outcome']]
df_diabetes_features = df_diabetes.drop(['Outcome'], axis=1)
# normalize attribute values
from sklearn.preprocessing import MinMaxScaler
diabetes_scaler = MinMaxScaler()
diabetes_scaler.fit(df_diabetes_features)
nd_diabetes_features = diabetes_scaler.transform(df_diabetes_features)
df_diabetes_features_normalized = pd.DataFrame(data=nd_diabetes_features, columns=df_diabetes_features.columns)
# create a training dataset and a test dataset using a 75/25 split.
diabetes_split = train_test_split(df_diabetes_features_normalized, df_diabetes_target, 
                              test_size=0.25, random_state=17)
df_diabetes_features_train = diabetes_split[0]
df_diabetes_features_test = diabetes_split[1]
df_diabetes_target_train = diabetes_split[2]
df_diabetes_target_test = diabetes_split[3]

The following snippet creates an instance of the SVC class that attempts to find a linear decision boundary, trains the SVC instance using the training set (75% of the samples), and uses the predict() method to make predictions on the test set. You can learn more about instantiating an SVC instance at https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html.

# create an SVM classifier for the features of the diabetes dataset using a linear kernel
from sklearn.svm import SVC
svc_model = SVC(kernel='linear', C=1)
svc_model.fit(df_diabetes_features_train, df_diabetes_target_train)
# use the SVC model to create predictions on the test set.
predicted_diabetes = svc_model.predict(df_diabetes_features_test)

Chapter 5 covers techniques to evaluate the performance of classification models, but for now you can examine the predictions themselves using the Python print() function:

print (predicted_diabetes)
[0 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 1 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0
 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0
 0 0 0 0 1 0 0]

The real power of SVM-based classifiers is their ability to create nonlinear decision boundaries. Support vector models use a mathematical function called a kernel that is used to transform each input point into a higher-dimensional space where a linear decision boundary can be found. This will be easier to understand with an example. Figure 4.8 shows a scatter plot of another fictional dataset with two features per data point, and each data point belonging to one of two classes. In this case, it is quite clear that there is no linear decision boundary (straight line) that can classify all data points correctly—no matter which way you draw a straight line you will always end up with some samples on the wrong side of the line.

Chart depicts that data that cannot be classified using a linear decision boundary in two-dimensional space. — **FIGURE 4.8** Data that cannot be classified using a linear decision boundary in two-dimensional space

If, however, you add an extra dimension (z-axis) to the data, and compute z values for each point using the equation z = x² + y², then the data becomes linearly separable along the z-axis using a plane at z = 0.3. This is depicted in Figure 4.9.

Chart depicts data that cannot be classified using a linear decision boundary in two-dimensional space can be classified in three-dimensional space. — **FIGURE 4.9** Data that cannot be classified using a linear decision boundary in two-dimensional space can be classified in three-dimensional space.

Since z was computed as x² + y², the decision plane at z = 0.3 implies x² + y² = 0.3, which is nothing but the equation of a circle in two dimensions. Therefore, the linear decision boundary in three-dimensional space has become a nonlinear decision boundary in two-dimensional space. This is illustrated in Figure 4.10.

Chart depicts the nonlinear decision boundary in two-dimensional space. — **FIGURE 4.10** Nonlinear decision boundary in two-dimensional space

This is an oversimplification of how kernels work, and if you are interested in learning more about SVM kernels you should read An Introduction to Support Vector Machines and Other Kernel-based Learning Methods by Nello Cristianini and John Shawe-Taylor (https://www.cambridge.org/core/books/an-introduction-to-support-vector-machines-and-other-kernelbased-learning-methods/A6A6F4084056A4B23F88648DDBFDD6FC).

Scikit-learn allows you to choose from a number of common kernels when creating the SVC instance, including linear, polynomial, radial basis functions, and custom kernels. You can find out more about the different types of kernel functions at https://scikit-learn.org/stable/modules/svm.html#svm-kernels. In order to visualize the effect of different kernels, let's train multiple SVM classifiers with different kernel functions on a dataset. The following snippet trains multiple SVM-based classifiers on a synthetic binary classification dataset with two features. The first classifier uses a linear kernel, the second classifier uses a 2^nd-degree polynomial kernel, the third classifier uses a 15^th-degree polynomial kernel, and the fourth kernel uses a radial basis function (RBF) kernel:

# create a synthetic binary classification dataset with 2 features.
from sklearn.datasets import make_classification
Synthetic_BinaryClassX, Synthetic_BinaryClassY = make_classification(n_samples=50, n_features=2, n_redundant=0, n_classes=2)
 
# scale synthetic dataset between -3, 3
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(-3,3))
scaler.fit(Synthetic_BinaryClassX.reshape(-1,1))
Synthetic_BinaryClassX = scaler.transform(Synthetic_BinaryClassX)
# create multiple SVM classifiers
from sklearn.svm import SVC
svc_model_linear = SVC(kernel='linear', C=1, gamma='auto')
svc_model_polynomial2 = SVC(kernel='poly', degree=2, C=1, gamma='auto')
svc_model_polynomial15 = SVC(kernel='poly', degree=15, C=1, gamma='auto')
svc_model_rbf = SVC(kernel='rbf', C=1, gamma='auto')
svc_model_linear.fit(Synthetic_BinaryClassX, Synthetic_BinaryClassY)
svc_model_polynomial2.fit(Synthetic_BinaryClassX, Synthetic_BinaryClassY)
svc_model_polynomial15.fit(Synthetic_BinaryClassX, Synthetic_BinaryClassY)
svc_model_rbf.fit(Synthetic_BinaryClassX, Synthetic_BinaryClassY)

With the models created, we can use Matplotlib functions to plot the data points as well as decision boundaries of each classifier. The following snippet shows how to visualize the decision boundary of an SVM classifier (portions of the code are taken from https://scikit-learn.org/stable/auto_examples/exercises/plot_iris_exercise.html). The resulting plots are depicted in Figure 4.11.

Illustration of effect of kernel choice on decision boundaries. — **FIGURE 4.11** Effect of kernel choice on decision boundaries

#
# portions of this code are from
# source: https://scikit-learn.org/stable/auto_examples/svm/plot_iris.html
#
%matplotlib inline
import matplotlib.pyplot as plt
def plot_contours(ax, clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.
    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out
def make_meshgrid(x, y, h=.02):      """Create a mesh of points to plot in
    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional
    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy
# pick out 2 features X0 and X1
X0 = Synthetic_BinaryClassX[:,0]
X1 = Synthetic_BinaryClassX[:,1]
xx, yy = make_meshgrid(X0, X1, 0.02)#np.meshgrid(np.arange(-3, 3, 0.002),  np.arange(-3, 3, 0.002))
 
fig, axes = plt.subplots(2, 2, figsize=(16,16))
# plot linear kernel
plot_contours(axes[0,0], svc_model_linear, 
              xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)   
axes[0,0].scatter(X0, X1, s=30,  c=Synthetic_BinaryClassY)
axes[0,0].set_xlim(xx.min(), xx.max())
axes[0,0].set_ylim(yy.min(), yy.max())
axes[0,0].set_title('Linear Kernel')
    
# plot 2nd degree polynomial kernel    
plot_contours(axes[0,1], svc_model_polynomial2, xx, yy, 
              cmap=plt.cm.coolwarm, alpha=0.8)
axes[0,1].scatter(X0, X1, s=30, c=Synthetic_BinaryClassY)
axes[0,1].set_xlim(xx.min(), xx.max())
axes[0,1].set_ylim(yy.min(), yy.max())
axes[0,1].set_title('2nd Degree Polynomial Kernel')
# plot 15 degree polynomial kernel
plot_contours(axes[1,0], svc_model_polynomial15, xx, yy, 
              cmap=plt.cm.coolwarm, alpha=0.8)
axes[1,0].scatter(X0, X1, s=30, c=Synthetic_BinaryClassY)
axes[1,0].set_xlim(xx.min(), xx.max())
axes[1,0].set_ylim(yy.min(), yy.max())
axes[1,0].set_title('5th Degree Polynomial Kernel')
# plot RBF kernel
plot_contours(axes[1,1], svc_model_rbf, xx, yy, 
              cmap=plt.cm.coolwarm, alpha=0.8)  axes[1,1].scatter(X0, X1, s=30, c=Synthetic_BinaryClassY)
axes[1,1].set_xlim(xx.min(), xx.max())
axes[1,1].set_ylim(yy.min(), yy.max())
axes[1,1].set_title('RBF Kernel')
plt.show()

Support vector machine–based models can also be used for linear regression tasks. When used for linear regression tasks, you are no longer looking for a decision boundary that can spilt the points between classes, but instead the line (or hyperplane) that best fits the samples. Support vector regression (SVR) is a technique that attempts to find the best line, or hyperplane, that fits the training variables. The difference between linear regression and SVR is the manner in which this hyperplane is determined. Linear regression in two dimensions fundamentally attempts to find the line that minimizes the sum of the distances of the data points from the line. SVR, on the other hand, attempts to find the line that contains the largest number of data points within a fixed distance from the hyperplane. Figure 4.12 illustrates the difference between linear regression and SVR in a two-dimensional scenario.

Graphs depict the linear regression vs. support vector regression. — **FIGURE 4.12** Linear regression vs. support vector regression

Scikit-learn provides an implementation of support vector machine–based regressors in the SVR class, which is part of the sklearn.svm module. The following snippet will use the SVR class to implement an SVM-based regression model on the Boston housing prices dataset to predict the median house price. A scatter plot of the predicted vs. actual house prices is presented in Figure 4.13.

Graph depicts SVR predictions on Boston housing dataset, with expected prices in x axis and predicted prices in y axis. — **FIGURE 4.13** SVR predictions on Boston housing dataset

# train a linear model on the Boston house prices dataset.
from sklearn.svm import SVR
svr_model = SVR(kernel='linear', C=1.5, gamma='auto', epsilon=1.5)
svr_model.fit(df_boston_features_train, df_boston_target_train.values.ravel())
# use the SVR model to create predictions on the test set.
svr_predicted_prices = svr_model.predict(df_boston_features_test)
%matplotlib inline
import matplotlib.pyplot as plt
# use pyplot module to create a scatter plot of predicted vs expected values
fig, axes = plt.subplots(1, 1, figsize=(9,9))
plt.scatter(df_boston_target_test, predicted_median_house_prices)
plt.xlabel("Expected Prices")
plt.ylabel("Predicted prices")
plt.title("Expected vs Predicted prices")

Logistic Regression

Logistic regression, despite having the word regression in its name, is a technique that can be used to build binary and multi-class classifiers. Logistic regression (also known as logit regression) builds upon the output of linear regression and returns a probability that the data point is of one class or another. Recall that the output of linear regression is a continuous unbounded value, whereas probabilities are continuous bounded values—bounded between 0.0 and 1.0.

In order to use a continuous value for binary classification, logistic regression converts it into a probability value between 0.0 and 1.0 by feeding the output of linear regression into a logistic function. In statistics a logistic function is a type of function that converts values from [–infinity, + infinity] to [0, 1]. In the case of logistic regression, the logistic function is the sigmoid function, which is defined as:

The graph of the sigmoid function is presented in Figure 4.14. The output of the sigmoid function will never go below 0.0 or above 1.0, regardless of the value of the input.

Graph depicts the sigmoid function for both positive and negative values. — **FIGURE 4.14** The sigmoid function

The output of the sigmoid function can be used for binary classification by setting a threshold value and treating all values below that as class A and everything above the threshold as class B (Figure 4.15).

Graph depicts using the sigmoid function for binary classification, for both positive and negative values. — **FIGURE 4.15** Using the sigmoid function for binary classification

Scikit-learn provides the LogisticRegression class as part of the linear_model module. We will now use this class to implement a logistic regression–based binary classification model on the popular Pima Indians diabetes dataset:

# load Pima Indians Diabetes dataset
diabetes_dataset_file = './datasets/diabetes_dataset/diabetes.csv'
df_diabetes = pd.read_csv(diabetes_dataset_file)
df_diabetes_target = df_diabetes.loc[:,['Outcome']]
df_diabetes_features = df_diabetes.drop(['Outcome'], axis=1)
# normalize attribute values
from sklearn.preprocessing import MinMaxScaler
diabetes_scaler = MinMaxScaler()
diabetes_scaler.fit(df_diabetes_features)
nd_diabetes_features = diabetes_scaler.transform(df_diabetes_features)
df_diabetes_features_normalized = pd.DataFrame(data=nd_diabetes_features, columns=df_diabetes_features.columns)
# create a training dataset and a test dataset using a 75/25 split.
diabetes_split = train_test_split(df_diabetes_features_normalized, df_diabetes_target, 
                              test_size=0.25, random_state=17)
df_diabetes_features_train = diabetes_split[0]
df_diabetes_features_test = diabetes_split[1]
df_diabetes_target_train = diabetes_split[2]
df_diabetes_target_test = diabetes_split[3]
# train a logistic regression model on the diabetes dataset.
from sklearn.linear_model import LogisticRegression
logistic_regression_model = LogisticRegression(penalty='l2', fit_intercept=True, solver='liblinear')
logistic_regression_model.fit(df_diabetes_features_train, df_diabetes_target_train.values.ravel())
# use the  model to create predictions on the test set, with a threshold of 0.5
logistic_regression_predictions = logistic_regression_model.predict(df_diabetes_features_test)

The constructor of the LogisticRegression class takes several parameters, some of which are common with the LinearRegression class. You can find out more about these parameters at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. The predictions made by the model can be examined using the Python print() function:

print (logistic_regression_predictions)
[0 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1
 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 1 0 0 1 0
 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0
 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1
 0 0 0 0 1 0 0]

The binary predictions are made using a probability cut-off of 0.5. If the probability value estimated by the underlying linear regression model is > 0.5, the output class will be 0. Scikit-learn does not allow you to change this probability cut-off; however, you can use the predict_proba() method of the LogisticRegression instance to access the prediction probabilities before the thresholding operation was applied:

# access class-wise probabilities
logistic_regression_probabilities = logistic_regression_model.predict_proba(df_diabetes_features_test)

Since there are two output classes, the predict_proba() method will give you two probabilities per data point. The first column contains the probability that the point will be labeled 0, and the second column contains the probability that the point will be labeled 1:

print (logistic_regression_probabilities)
[[0.85694005 0.14305995]
 [0.37165061 0.62834939]
 [0.73695232 0.26304768]
 [0.880803   0.119197  ]
…
…
[0.41292724 0.58707276]
 [0.63547121 0.36452879]
 [0.52728275 0.47271725]]

Because these numbers represent probabilities, the sum of the prediction probabilities for any data point will be 1.0. Furthermore, since there are only two classes, you can use the information in any one column to work out the value of the other column by subtracting from 1.0. The following snippet uses the information in the first column (probability that the output class is 0) and implements custom thresholding logic at 0.8. Any probabilities greater than 0.8 will be labeled 0:

# implement custom thresholding logic
dfProbabilities = pd.DataFrame(logistic_regression_probabilities[:,0])
predictions = dfProbabilities.applymap(lambda x: 0 if x > 0.8 else 1)

You can examine the predictions with this new threshold of 0.8 by printing the contents of predictions. Compare these predictions with the predictions made by the model with Scikit-learn's default cut-off threshold of 0.5:

print (predictions.values.ravel())
[0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 0 1 0 1
 1 0 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1 0
 0 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0
 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 1
 1 1 1 1 0 1 0 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1
 1 1 1 1 1 1 1]

As mentioned earlier, logistic regression builds upon the output of linear regression. You can inspect the coefficients and intercept of the underlying linear model through the coef_ and intercept_ attributes of the model:

print (logistic_regression_model.coef_)
[[ 1.48972976  3.4891602  -0.7344297  -0.07461329  0.16776565  1.81409369
   1.39383873  1.03554067]]
print (logistic_regression_model.intercept_)
[-4.06714158]

Logistic regression is inherently a binary classifier, but it can be used as a multi-class classifier for datasets where the target variable can belong to more than two classes. There are two fundamental approaches that can be used to use a binary classifier for multi-class problems:

One-versus-rest approach: This is also known as the OVR approach, and it involves creating a number of binary classification models, with each model predicting the probability that the output is one of the subclasses. This approach will create N models for N classes, and the final class output by the multi-class classifier corresponds to the model that predicted the highest probability. Consider the popular Iris flowers dataset, where the target variable can have one of three values [0, 1, 2], corresponding to the type of Iris flower. In this case, the OVR approach would involve training three logistic regression models. The first model would predict the probability that the output class is 0 or not 0. Likewise, the second model would only predict the probability that the output class is 1 or not 1, and so on. The one-versus-rest approach is sometimes also referred to as the one-versus-all (OVA) approach
One-versus-one approach: This is known as the OVO approach, and it also involves creating a number of binary classification models and picking the class that corresponds to the model that outputs the largest probability value. The difference between the OVO approach and the OVR approach is in the number of models created. The OVO approach creates one model for each pairwise combination of output classes. In the case of the Iris flowers dataset, the OVO approach would also create three models:
- Logistic regression model that predicts output class as 0 or 1
- Logistic regression model that predicts output class as 0 or 2
- Logistic regression model that predicts output class as 1 or 2

As you can see, the number of models generated increases with the number of features.

Scikit-learn provides the OneVsOneClassifier and the OneVsRestClassifier classes in the multiclass module that encapsulate the complexity of creating multiple binary classification models and training them. The constructor for these classes takes a binary classification model as input. Some model classes within Scikit-learn are inherently capable of multi-class classification, and you may be surprised to learn that LogisticRegression is one of them. However, before we discuss the implementation of inherent multi-class classification in the LogisticRegression class, let's examine how we can use the OneVsRestClassifier to create an ensemble of binary LogisticRegression models and use the ensemble as a multi-class classifier. You can learn more about the classes in the multiclass package at https://scikit-learn.org/stable/modules/multiclass.html.

The following snippet demonstrates using the OneVsRestClassifier class to create a multi-class classifier from an ensemble of binary LogisticRegression models on the Iris flowers dataset:

# load Iris flowers dataset
from sklearn.datasets import load_iris
iris_dataset = load_iris()
df_iris_features = pd.DataFrame(data = iris_dataset.data, columns=iris_dataset.feature_names)
df_iris_target = pd.DataFrame(data = iris_dataset.target, columns=['class'])
# normalize attribute values
from sklearn.preprocessing import MinMaxScaler
iris_scaler = MinMaxScaler()
iris_scaler.fit(df_iris_features)
nd_iris_features = iris_scaler.transform(df_iris_features)
df_iris_features_normalized = pd.DataFrame(data=nd_iris_features, columns=df_iris_features.columns)
# create a training dataset and a test dataset using a 75/25 split.
from sklearn.model_selection import train_test_split
iris_split = train_test_split(df_iris_features_normalized, df_iris_target, 
                              test_size=0.25, random_state=17)
df_iris_features_train = iris_split[0]
df_iris_features_test = iris_split[1]
df_iris_target_train = iris_split[2]
df_iris_target_test = iris_split[3]
# implement multi-class classification using 
# OVA (a.ka. OVR) approach and LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
logit_model = LogisticRegression(penalty='l2', fit_intercept=True, solver='liblinear')
ovr_logit_model = OneVsRestClassifier(logit_model)
ovr_logit_model.fit(df_iris_features_train, df_iris_target_train.values.ravel())
# use the  model to create predictions on the test set, with a threshold of 0.5
ovr_logit_predictions = ovr_logit_model.predict(df_iris_features_test)

You can inspect the classes predicted by the OVR logistic regression model by using the Python print() function. As you can see, the output classes predicted for the members of the test set are one of three values—0, 1, or 2:

You can inspect the class-wise probabilities from the OneVsRestClassifier instance by using the predict_proba() method, just as you did earlier with the LogisticRegression instance. This time, though, the result will have three values for each member of the test set. The first number is the probability that the output class is 0, the second number is the probability that the output class is 1, and so on:

# access class-wise probabilities
ovr_logit_probs = ovr_logit_model.predict_proba(df_iris_features_test)
print(ovr_logit_probs)
[[0.82272514 0.12785864 0.04941622]
 [0.12044579 0.40056122 0.47899299]
 [0.02542865 0.32329645 0.6512749 ]
 [0.18305903 0.42111625 0.39582472]
 [0.05944138 0.38763397 0.55292465]
 [0.07236737 0.36312485 0.56450777]
 [0.16344427 0.37963956 0.45691617]
 [0.01998424 0.24601841 0.73399734]
 [0.18950936 0.48395363 0.32653701]
 [0.03663432 0.40209894 0.56126674]
 [0.02062532 0.27783051 0.70154417]
 [0.73577162 0.22066942 0.04355896]
 [0.15270279 0.42746281 0.41983439]
 [0.77216659 0.18251154 0.04532187]
 [0.05309898 0.32231709 0.62458393]
 [0.815817   0.13825926 0.04592374]
 [0.73489217 0.22191513 0.0431927 ]
 [0.04491288 0.36458749 0.59049964]
 [0.02065056 0.27871118 0.70063826]
 [0.02127991 0.35388486 0.62483523]
 [0.07152985 0.41695375 0.5115164 ]
 [0.7706894  0.18349734 0.04581325]
 [0.07040028 0.36307885 0.56652087]
 [0.19267192 0.4727485  0.33457958]
 [0.15280003 0.38212573 0.46507424]
 [0.17395557 0.31901921 0.50702523]
 [0.12736739 0.48820204 0.38443056]
 [0.13568065 0.44198711 0.42233224]
 [0.7867313  0.16963785 0.04363084]
 [0.17115366 0.45770086 0.37114548]
 [0.74540203 0.20735953 0.04723843]
 [0.31041971 0.43132172 0.25825857]
 [0.80839308 0.15516489 0.03644203]
 [0.80848648 0.13549109 0.05602242]
 [0.21762134 0.48521286 0.29716579]
 [0.15584948 0.41625218 0.42789834]
 [0.19201639 0.40706352 0.4009201 ]
 [0.03199536 0.34175085 0.62625378]]

While training an ensemble of binary models is one way to build models capable of multi-class classification, some algorithms like logistic regression can be modified to inherently support multi-class classification. In the case of logistic regression, the modification involves training multiple linear regression models internally, and replacing the sigmoid function with another function—the softmax function. The softmax function is capable of receiving inputs from multiple linear regression models and outputting class-wise probabilities. The softmax function is also known as the normalized exponential function, and its equation is illustrated in Figure 4.16.

Chart depicts Softmax logistic regression. — **FIGURE 4.16** Softmax logistic regression

To understand how this function works, let's consider the Iris flowers dataset. Each row of the dataset contains four continuous numeric attributes and a multi-class target with three possible output classes: 0, 1, 2. When a softmax logistic regression model is trained on this dataset, it will contain three linear regression models within it, one for each target class. When the model is used for making predictions, each linear regression model will output a continuous numeric value that will be fed into the softmax function, which will in turn output three class-wise probabilities.

Scikit-learn's implementation of the LogisticRegression class is inherently capable of multinomial classification; all you need to do is include the multi_class = 'multinomial' and solver = 'lbfgs' constructor arguments while instantiating the class. The following snippet uses Scikit-learn's LogisticRegression class on a multi-class classification problem with softmax regression:

# implement multi-class classification using 
# softmax (a.k.a multinomial regression) classifier
from sklearn.linear_model import LogisticRegression
softmax_logit_model = LogisticRegression(penalty='l2', fit_intercept=True, solver='lbfgs', multi_class='multinomial')
softmax_logit_model.fit(df_iris_features_train, df_iris_target_train.values.ravel())
# use the  model to create predictions on the test set
softmax_logit_predictions = softmax_logit_model.predict(df_iris_features_test)

You can inspect the classes predicted by the softmax logistic regression model by using the Python print() function. As you can see, the output classes predicted for the members of the test set are one of three values—0, 1, or 2:

You can inspect the class-wise probabilities by calling the predict_proba() method on the LogisticRegression instance. As you would expect, the result has three values for each member of the test set. The first number is the probability that the output class is 0, the second number is the probability that the output class is 1, and so on (see the code example on page 109).

# access class-wise probabilities
softmax_logit_probs = softmax_logit_model.predict_proba(df_iris_features_test)
print(softmax_logit_probs)
[[0.89582633 0.09444564 0.00972803]
 [0.09889138 0.50828121 0.39282741]
 [0.01311439 0.23998685 0.74689876]
 [0.1645445  0.56290434 0.27255115]
 [0.04525701 0.42873174 0.52601125]
 [0.05219811 0.40090084 0.54690105]
 [0.1412285  0.49318467 0.36558684]
 [0.00512255 0.1226479  0.87222954]
 [0.18517246 0.6381944  0.17663314]
 [0.0301524  0.40920296 0.56064464]
 [0.00767857 0.15588852 0.83643291]
 [0.85228205 0.14095824 0.00675971]
 [0.1344201  0.56001268 0.30556722]
 [0.87716333 0.11550045 0.00733622]
 [0.03095013 0.30494959 0.66410028]
 [0.89925442 0.09260716 0.00813842]
 [0.86906457 0.12521241 0.00572302]
 [0.0286017  0.35588746 0.61551083]
 [0.00777372 0.15176958 0.84045669]
 [0.01219112 0.26741715 0.72039173]
 [0.05557848 0.49114599 0.45327553]
 [0.86733298 0.12475638 0.00791064]
 [0.05107527 0.39427714 0.55464759]
 [0.18367433 0.62661521 0.18971046]
 [0.1288656  0.49409071 0.37704369]
 [0.12588161 0.41550633 0.45861205]
 [0.12692498 0.64014023 0.23293479]
 [0.11348037 0.57595155 0.31056808]
 [0.85890116 0.13346798 0.00763086]
 [0.32189697 0.55802372 0.12007931]
 [0.91075124 0.08432501 0.00492375]
 [0.87466417 0.11230055 0.01303528]
 [0.21216432 0.63742445 0.15041124]
 [0.12872766 0.55202598 0.31924635]
 [0.16895463 0.54601673 0.28502864]
 [0.01604996 0.28654283 0.69740721]]

Decision Trees

Decision trees are, as their name suggests, tree-like structures where each parent node represents a decision boundary and child nodes represent outcomes of the decision. The topmost node of the tree is known as the root node. Building a decision tree model involves picking a suitable attribute for the decision at the root node, and then recursively partitioning the tree into nodes until some optimal criteria are met.

Decision trees are very versatile and can be used for both classification and regression tasks. When used for classification tasks, they are inherently capable of handling multi-class problems and are not affected by the scale of individual features. Predictions made by decision trees also have the advantage of being easy to explain—all you need to do is traverse the nodes of the decision tree and you will be able to explain the prediction. This is not the case for models such as neural networks, where it is not possible to explain why the model predicts something. Models such as decision trees that allow you to easily understand the reasoning behind a prediction are called white-box models, whereas models such as neural networks that do not provide the ability to explain a prediction are called black-box models.

Scikit-learn provides the DecisionTreeClassifier class as part of the tree package. We will now use this class to implement a decision tree–based multi-class classification model on the popular Iris flowers dataset:

from sklearn.datasets import load_iris
iris_dataset = load_iris()
df_iris_features = pd.DataFrame(data = iris_dataset.data, columns=iris_dataset.feature_names)
df_iris_target = pd.DataFrame(data = iris_dataset.target, columns=['class'])
# normalize attribute values
from sklearn.preprocessing import MinMaxScaler
iris_scaler = MinMaxScaler()
iris_scaler.fit(df_iris_features)
nd_iris_features = iris_scaler.transform(df_iris_features)
df_iris_features_normalized = pd.DataFrame(data=nd_iris_features, columns=df_iris_features.columns)
# create a training dataset and a test dataset using a 75/25 split.
from sklearn.model_selection import train_test_split
iris_split = train_test_split(df_iris_features_normalized, df_iris_target, 
                              test_size=0.25, random_state=17)
df_iris_features_train = iris_split[0]
df_iris_features_test = iris_split[1]
df_iris_target_train = iris_split[2]
df_iris_target_test = iris_split[3]
# create a decision tree based multi-class classifier.
from sklearn.tree import DecisionTreeClassifier
dtree_model = DecisionTreeClassifier(max_depth=4)
dtree_model.fit(df_iris_features_train, df_iris_target_train.values.ravel())
# use the  model to create predictions on the test set
dtree_predictions = dtree_model.predict(df_iris_features_test)

The constructor of the DecisionTreeClassifier class takes several parameters, most of which begin with max_ or min_ and are used to enforce constraints on the decision tree. Unlike other model types, decision trees do not have any inherent form of regularization and will aim to fit the training data near-perfectly. The problem with this is that decision tree models are likely to overfit the training data, and the way to prevent overfitting is to enforce constraints on the tree-building process such as the maximum depth of the tree, minimum number of samples in a leaf node, etc. You can find out more about the constructor parameters at https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html. The predictions made by the model can be examined using the Python print() function:

To visualize the decision tree you will first need to use the sklearn.tree.export_graphviz() function to export the nodes of the tree into the Graphviz DOT file format, and then use Graphviz functions to convert the DOT file into an image. You can learn more about Graphviz at https://pydotplus.readthedocs.io/reference.html. The following snippet can be used to create a graph from a decision tree classifier. The resulting graph is depicted in Figure 4.17.

Flowchart depicts decision tree visualization. — **FIGURE 4.17** Decision tree visualization

from sklearn.externals.six import StringIO  
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(dtree_model, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

To make predictions with this decision tree, you start with the condition on the root node: petal length <= 0.271. There are two branches from this node—the branch on the left should be traversed if the condition is met, and the branch on the right is traversed if the condition is not met. You then repeat this process until you reach a leaf node, and the class associated with the leaf node will be the prediction.

The root node also contains some additional information: samples=112 implies that there are 112 samples in the dataset that this condition applies to. The value array implies that out of 112 samples, 40 belong to the first class, 35 to the second class, and 37 to the third class. The gini=0.666 value represents the Gini score associated with the node. A Gini score is a measure of impurity and is one of two impurity measures that Scikit-learn's implementation of decision trees provides, the other one being Entropy. A pure node is one which has elements that all belong to the same class, and a Gini score of 0.0. During the model-building process, the decisions that form the nodes (such as petal_length <= 0.271) are chosen so as to create the purest subsets on both child nodes. Tree building is a recursive process and stops when either the Gini score associated with a node is 0.0 or a criterion such as maximum permissible depth of the tree has been reached. You can learn more about Gini scores in The Gini Methodology: A Primer on Statistical Methodology by Shlomo Yitzhaki and Edna Schechtman (https://www.springer.com/gb/book/9781461447191).

A decision tree can also be used for regression problems, and Scikit-learn provides the DecisionTreeRegressor class to create decision trees for regression. A decision tree for regression is very similar to a tree used for classification, with the key difference being that each node predicts a numeric value instead of a class. The following snippet uses the DecisionTreeRegressor class to create a decision tree on the Boston housing dataset and uses the model to predict median house prices for the members of the test set. Figure 4.18 contains the decision tree generated by the model.

Flowchart depicts decision tree for regression. — **FIGURE 4.18** Decision tree for regression

# create a decision tree based regressor on the Boston housing dataset.
from sklearn.tree import DecisionTreeRegressor
dtree_reg_model = DecisionTreeRegressor(max_depth=4)
dtree_reg_model.fit(df_boston_features_train, df_boston_target_train.values.ravel())
# use the  model to create predictions on the test set
dtree_reg_predictions = dtree_reg_model.predict(df_boston_features_test)

NOTE To follow along with this chapter ensure you have installed Anaconda Navigator and Jupyter Notebook as described in Appendix A.
You can download the code files for this chapter from www.wiley.com/go/machinelearningawscloud or from GitHub using the following URL:

https://github.com/asmtechnology/awsmlbook-chapter4.git

Summary

Scikit-learn is a Python library that provides a number of features that are suitable for machine learning engineers and data scientists.
Scikit-learn provides a function called train_test_split() in the model_selection module that can be used to split a Pandas dataframe into two dataframes, one for model building and the other for model evaluation.
K-fold cross validation can help minimize the possibility of the model picking up on unexpected bias in the training set.
Linear regression is a statistical technique that aims to find the equation of a line (or hyperplane) that is closest to all the points in the dataset.
Scikit-learn implements linear regression in a class called LinearRegression, which is part of the linear_model module.
A support vector machine (SVM) is a versatile model that can be used for a variety of tasks, including classification, regression, and outlier detection.
Scikit-learn provides an implementation of support vector machine–based classifiers in the, SVC class, which is part of the sklearn.svm module.
Support vector models use a mathematical function called a kernel that is used to transform each input point into a higher-dimensional space where a linear decision boundary can be found.
Logistic regression, despite having the word regression in its name, is a technique that can be used to build binary and multi-class classifiers.
Scikit-learn provides the LogisticRegression class as part of the linear_model module.
Logistic regression is inherently a binary classifier, but it can be used as a multi-class classifier for datasets where the target variable can belong to more than two classes.
Decision trees are very versatile and can be used for both classification and regression tasks. When used for classification tasks they are inherently capable of handling multi-class problems and are not affected by the scale of individual features.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.