Chapter 5
Evaluating Machine Learning Models

WHAT'S IN THIS CHAPTER

  • Learn how to evaluate the performance of regression models
  • Learn how to evaluate the performance of classification models
  • Learn to use the grid-search technique to choose the optimal set of hyperparameters for your model

In the previous chapter, you learned how to use Scikit-learn to create different types of machine learning models. In this chapter you will learn to use Scikit-learn to evaluate the performance of the models you have trained and techniques to select the values of hyperparameters that will result in an optimal model.

Since the purpose of a machine learning model is to predict something correctly, you will want to ensure that the predictive accuracy of your model is good enough for you to deploy it into production. It is therefore important to evaluate the performance of the model on data that the model has not seen previously, so as to get an accurate picture of how the model is likely to perform on real-world data (which it will also not have seen previously). Techniques such as creating test-train splits and k-fold cross validation, both of which have been discussed in Chapter 4, allow you to keep aside some of the training data for evaluation.

Evaluating Regression Models

The purpose of a linear regression model is to predict a continuous numeric value, such as a house price. Two techniques are commonly used to evaluate the predictive accuracy of a regression model: creating a scatter plot of the true and predicted values, and computing a statistical metric that captures the total prediction error across the members of the test set. The visual results obtained from a 2D scatter plot are simple to understand: the x-axis contains the actual values, and the y-axis contains the predicted values. The closer the points lie to a 45-degree line anchored at the origin, the better the model's predictions are.

The following snippet uses Scikit-learn to load the Boston housing prices dataset and train a linear regression model and a decision tree–based model on the data. A scatter plot of the actual vs. predicted value for both models is presented side-by-side in Figure 5.1.


FIGURE 5.1 Comparison of predictive accuracies of a linear regression model and decision tree model on the Boston housing data

import numpy as np
import pandas as pd
 
# load boston house prices dataset
from sklearn.datasets import load_boston
boston_dataset = load_boston()
df_boston_features = pd.DataFrame(data = boston_dataset.data, columns=boston_dataset.feature_names)
df_boston_target = pd.DataFrame(data = boston_dataset.target, columns=['price'])
 
# create a training dataset and a test dataset using a 75/25 split.
from sklearn.model_selection import train_test_split
 
boston_split = train_test_split(df_boston_features, df_boston_target,
                              test_size=0.25, random_state=17)
df_boston_features_train = boston_split[0]
df_boston_features_test = boston_split[1]
df_boston_target_train = boston_split[2]
df_boston_target_test = boston_split[3]
 
# train a linear model on the Boston house prices dataset.
from sklearn.linear_model import LinearRegression
linear_reg_model = LinearRegression(fit_intercept=True)
linear_reg_model.fit(df_boston_features_train, df_boston_target_train)
 
# create a decision tree based regressor on the Boston house prices dataset.
from sklearn.tree import DecisionTreeRegressor
dtree_reg_model = DecisionTreeRegressor(max_depth=10)
dtree_reg_model.fit(df_boston_features_train, df_boston_target_train.values.ravel())
 
# use the  models to create predictions on the test set
linear_reg_predictions = linear_reg_model.predict(df_boston_features_test)
dtree_reg_predictions = dtree_reg_model.predict(df_boston_features_test)
 
# create a scatter plot of predicted vs actual values
 
%matplotlib inline
import matplotlib.pyplot as plt
 
fig, axes = plt.subplots(1, 2, figsize=(18,9))
axes[0].scatter(df_boston_target_test, linear_reg_predictions) 
axes[0].set_xlabel("Expected Prices")
axes[0].set_ylabel("Predicted prices")
axes[0].set_title("Linear Regression Model")
 
axes[1].scatter(df_boston_target_test, dtree_reg_predictions)
axes[1].set_xlabel("Expected Prices")
axes[1].set_ylabel("Predicted prices")
axes[1].set_title("Decision Tree Model")
 
# plot the ideal prediction line
IdealPrices = np.linspace(0.0, df_boston_target_test.values.max(), 50)
IdealPredictions = IdealPrices
axes[0].plot(IdealPrices, IdealPredictions, color='#ff0000', label='ideal prediction line')
axes[1].plot(IdealPrices, IdealPredictions, color='#ff0000', label='ideal prediction line')
 
axes[0].legend()
axes[1].legend() 

In addition to obtaining visual measures of accuracy, you can also use statistical techniques to evaluate the quality of a regression model. We will look at some of the commonly used metrics next.

RMSE Metric

The root mean squared error (RMSE) metric is popular when it comes to evaluating the performance of a regression model. As the name suggests, the RMSE is the square root of the mean squared error (MSE). For a given item in the test set, the prediction error is the difference between the predicted value and the actual value; this error can be either positive or negative, and squaring it ensures that the direction of the error does not matter (squared numbers are always positive). The mean squared error is the mean of the squared prediction errors of all the items in the test set. This is illustrated in Figure 5.2.


FIGURE 5.2 Mean squared error and root mean squared error
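
Before looking at Scikit-learn's helper function, it may help to see the computation spelled out. The following minimal sketch computes the MSE and RMSE directly with NumPy, reusing the linear_reg_predictions array created in the previous snippet:

# illustrative sketch: compute the MSE and RMSE by hand with NumPy
# (assumes df_boston_target_test and linear_reg_predictions from the earlier snippet)
prediction_errors = df_boston_target_test.values.ravel() - linear_reg_predictions.ravel()
manual_mse = np.mean(prediction_errors ** 2)
manual_rmse = np.sqrt(manual_mse)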

Scikit-learn encapsulates the computation of the mean squared error in the sklearn.metrics.mean_squared_error() function. The following snippet uses this function to compute the MSE and RMSE of the predictions made by the linear regression model and the decision tree regression model on the Boston housing prices dataset:

# compute MSE , RMSE, using Scikit-learn
 
from sklearn.metrics import mean_squared_error
mse_linear_reg_model = mean_squared_error(df_boston_target_test, linear_reg_predictions)
mse_dtree_reg_model = mean_squared_error(df_boston_target_test, dtree_reg_predictions)
 
from math import sqrt
rmse_linear_reg_model = sqrt(mse_linear_reg_model)
rmse_dtree_reg_model = sqrt(mse_dtree_reg_model)

You can examine the RMSE values using the Python print() function. In this case, the RMSE for the decision tree–based model is lower than the RMSE for the linear regression model, which implies the decision tree–based model is the better of the two:
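
A minimal sketch of such print statements follows; the numeric output you see will depend on your train/test split:

print('MSE (linear regression): ' + str(mse_linear_reg_model))
print('MSE (decision tree): ' + str(mse_dtree_reg_model))
print('RMSE (linear regression): ' + str(rmse_linear_reg_model))
print('RMSE (decision tree): ' + str(rmse_dtree_reg_model))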

The benefit of the root mean squared error is that its value is in the same units as the variable you are trying to predict. For example, the house prices in the Boston housing prices dataset are expressed in thousands of dollars. An RMSE of 3.502 implies that, on average, the house prices predicted by the model will be off from the true prices by about $3,502. Because the value of the RMSE is in the units of the target variable, it is easy to interpret.

Unfortunately, the drawback of RMSE is that its value is sensitive to the magnitude of the target variable. If the target variables are larger numbers, the value of the RMSE will be a larger number, and therefore the RMSE cannot be used to compare models trained on different datasets.

R2 Metric

The R2 metric is another statistical metric that can be used to get an idea of the quality of the model. Unlike the RMSE, the R2 metric does not depend on the scale of the target variable: its value is typically between 0.0 and 1.0, although it can be negative for a model that performs worse than simply predicting the mean of the target. The R2 metric is also known as the coefficient of determination, and it measures the proportion of the variance in the target variable that is explained by the model.
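
As a point of reference, the coefficient of determination can be computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The following minimal sketch illustrates the idea with NumPy, reusing the linear regression predictions from earlier in this chapter:

# illustrative sketch: compute the R2 score by hand
# R2 = 1 - (residual sum of squares / total sum of squares)
y_true = df_boston_target_test.values.ravel()
y_pred = linear_reg_predictions.ravel()
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual_r2 = 1.0 - (ss_res / ss_tot)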

Scikit-learn encapsulates the computation of the coefficient of determination in the sklearn.metrics.r2_score() function. The following snippet uses this function to compute the R2 score of the predictions made by the linear regression model and the decision tree regression model on the Boston housing prices dataset:

# compute coefficient of determination (r2 score)
 
from sklearn.metrics import r2_score
r2_linear_reg_model = r2_score(df_boston_target_test, linear_reg_predictions)
r2_dtree_reg_model = r2_score(df_boston_target_test, dtree_reg_predictions)

You can examine the R2 values using the Python print() function. As you can see, the R2 for the decision tree–based model is higher than the R2 score for the linear regression model, which implies the decision tree–based model is, once again, the better of the two:
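
A minimal sketch of such print statements follows; as before, the exact numbers will depend on your train/test split:

print('R2 score (linear regression): ' + str(r2_linear_reg_model))
print('R2 score (decision tree): ' + str(r2_dtree_reg_model))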

Evaluating Classification Models

There are two types of classification models: binary and multi-class. Binary classification models are used when the target attribute can have only two discrete values (or classes). Multi-class classification models are used when the target attribute can have more than two discrete values. Let's consider binary classification models first.

Binary Classification Models

One of the simplest metrics that can be used to gauge the quality of a classification model is the number of times the model predicts the correct class. Whether this value is meaningful depends on the proportion of samples that belong to each class, and on the significance of the classes themselves. For instance, if a test set contains 100 samples, 50 from class A and 50 from class B, with neither class more significant to the problem domain than the other, then a model that predicts the correct class 80% of the time can reasonably be considered a good model. If, however, 95 items in the test set were from class A and only 5 were from class B, a model that predicts the correct class 80% of the time is not so good after all: a trivial model that always predicted class A would be correct 95% of the time. The problem can be significantly worse if the model was meant to predict whether an individual has a deadly illness, and the five samples from class B were the only ones that indicated the presence of the illness. This is not an unlikely situation; a random sample of the general population will typically contain very few individuals with any specific illness.
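
To make this concrete, the following hypothetical sketch builds an imbalanced test set of 95 class A samples and 5 class B samples and shows that a trivial model that always predicts class A achieves 95% accuracy without being useful:

# hypothetical illustration: accuracy alone can be misleading on imbalanced data
import numpy as np

# a test set with 95 samples of class A (labeled 0) and 5 samples of class B (labeled 1)
y_true = np.array([0] * 95 + [1] * 5)

# a trivial model that always predicts class A
y_always_a = np.zeros(100, dtype=int)

baseline_accuracy = np.mean(y_always_a == y_true)
print('Accuracy of the always-predict-A model: ' + str(baseline_accuracy))   # 0.95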

Before we look at better measures of a binary classification model's performance, let's first see how we can compute this simple metric. The following snippet trains a logistic regression, support vector machine (SVM), and decision tree classifier on the Pima Indians diabetes dataset and computes the percentage of correct predictions:

import numpy as np
import pandas as pd
import os
 
# load Pima Indians Diabetes dataset
diabetes_dataset_file = './datasets/diabetes_dataset/diabetes.csv'
df_diabetes = pd.read_csv(diabetes_dataset_file)
df_diabetes_target = df_diabetes.loc[:,['Outcome']]
df_diabetes_features = df_diabetes.drop(['Outcome'], axis=1)
 
# normalize attribute values
from sklearn.preprocessing import MinMaxScaler
 
diabetes_scaler = MinMaxScaler()
diabetes_scaler.fit(df_diabetes_features)
nd_diabetes_features = diabetes_scaler.transform(df_diabetes_features)
df_diabetes_features_normalized = pd.DataFrame(data=nd_diabetes_features, columns=df_diabetes_features.columns)
 
# create a training dataset and a test dataset using a 75/25 split.
from sklearn.model_selection import train_test_split
 
diabetes_split = train_test_split(df_diabetes_features_normalized, df_diabetes_target,
                              test_size=0.25, random_state=17)
df_diabetes_features_train = diabetes_split[0]
df_diabetes_features_test = diabetes_split[1]
df_diabetes_target_train = diabetes_split[2]
df_diabetes_target_test = diabetes_split[3]
 
# train an SVM classifier for the features of the diabetes dataset using an RBF kernel
from sklearn.svm import SVC
svc_model = SVC(kernel='rbf', C=1, gamma='auto', probability=True)  # probability=True is required to call predict_proba() later in this chapter
svc_model.fit(df_diabetes_features_train, df_diabetes_target_train.values.ravel())
 
# train a logistic regression model on the diabetes dataset
from sklearn.linear_model import LogisticRegression
logit_model = LogisticRegression(penalty='l2', fit_intercept=True, solver='liblinear')
logit_model.fit(df_diabetes_features_train, df_diabetes_target_train.values.ravel())
 
# train a decision tree based binary classifier.
from sklearn.tree import DecisionTreeClassifier
 
dtree_model = DecisionTreeClassifier(max_depth=4)
dtree_model.fit(df_diabetes_features_train, df_diabetes_target_train.values.ravel())
 
# use the models to create predictions on the diabetes test set
svc_predictions = svc_model.predict(df_diabetes_features_test)
logit_predictions = logit_model.predict(df_diabetes_features_test)
dtree_predictions = dtree_model.predict(df_diabetes_features_test)
 
# simplistic metric - the percentage of correct predictions
svc_correct = svc_predictions == df_diabetes_target_test.values.ravel()
svc_correct_percent = np.count_nonzero(svc_correct) / svc_predictions.size * 100
 
logit_correct = logit_predictions == df_diabetes_target_test.values.ravel()
logit_correct_percent = np.count_nonzero(logit_correct) / logit_predictions.size * 100
 
dtree_correct = dtree_predictions == df_diabetes_target_test.values.ravel()
dtree_correct_percent = np.count_nonzero(dtree_correct) / dtree_predictions.size * 100

Using the Python print() function, the results indicate that the logistic regression model and the decision tree model perform about the same, but the logistic regression model has the edge:
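
A minimal sketch of such print statements follows; the exact percentages will depend on the random_state used for the split:

print('SVC: ' + str(svc_correct_percent))
print('Logistic regression: ' + str(logit_correct_percent))
print('Decision tree: ' + str(dtree_correct_percent))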

Now let's look at other performance metrics that could be used in binary classification problems. Some of the problems with simply counting the number of times a model predicts the correct answer were introduced earlier in this chapter. To get a better idea of the model's performance, what you need is a set of metrics that capture the class-wise performance of the model. The most commonly used primary metrics for binary classification are listed next. These metrics assume one of the two target classes represents a positive outcome, whereas the other represents a negative outcome:

  • True positive (TP) count: The number of times the model predicted a positive outcome, and the prediction was correct.
  • False positive (FP) count: The number of times the model predicted a positive outcome, and the prediction was incorrect.
  • True negative (TN) count: The number of times the model predicted a negative outcome, and the prediction was correct.
  • False negative (FN) count: The number of times the model predicted a negative outcome, and the prediction was incorrect.

This class-wise prediction accuracy is also sometimes referred to as the class-wise confusion, and when these four primary metrics are placed in a 2 × 2 matrix, the resulting matrix is called the confusion matrix. Figure 5.3 depicts a confusion matrix and explains what the individual numbers mean.


FIGURE 5.3 A class-wise confusion matrix

Scikit-learn provides a function called confusion_matrix() in the metrics submodule that can be used to compute the confusion matrix for a classification model. You can find detailed information on the parameters of the confusion_matrix() function at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html.

The following snippet demonstrates the use of this function to compute the confusion matrix for the three binary classification models created earlier in this section:

# compute confusion matrix
from sklearn.metrics import confusion_matrix
 
cm_svc = confusion_matrix(df_diabetes_target_test.values.ravel(), svc_predictions)
cm_logit = confusion_matrix(df_diabetes_target_test.values.ravel(), logit_predictions)
cm_dtree = confusion_matrix(df_diabetes_target_test.values.ravel(), dtree_predictions)
 
# extract true negative, false positive, false negative, true positive.
#
# for binary classification, the sklearn confusion_matrix() function
# returns the counts in the following layout:
#
#       TN     FP
#       FN     TP
 
tn_svc, fp_svc, fn_svc, tp_svc = cm_svc.ravel()
tn_logit, fp_logit, fn_logit, tp_logit = cm_logit.ravel()
tn_dtree, fp_dtree, fn_dtree, tp_dtree = cm_dtree.ravel()

You can now examine the values of the primary metrics for the three models using the Python print() function:
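
A minimal sketch of such print statements follows:

print('SVC:   TN=%d FP=%d FN=%d TP=%d' % (tn_svc, fp_svc, fn_svc, tp_svc))
print('Logit: TN=%d FP=%d FN=%d TP=%d' % (tn_logit, fp_logit, fn_logit, tp_logit))
print('Dtree: TN=%d FP=%d FN=%d TP=%d' % (tn_dtree, fp_dtree, fn_dtree, tp_dtree))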

In addition to these primary statistical metrics, data scientists use the following secondary metrics. The values of these metrics are computed from the primary metrics:

  • Accuracy: This is defined as (TP + TN) / (Total number of predictions). This is essentially the same as the simple percentage-of-correct-predictions metric computed earlier.
  • Precision: This is defined as TP / (TP + FP). If you look at Figure 5.3, you will notice that the denominator is the total number of positive predictions made by the model. Therefore, precision can also be written as TP / (Total number of positive predictions), and it measures how precise the model's positive predictions are. Precision is a good measure to use when there is a high cost associated with a false positive. The closer the value is to 1.0, the more precise the positive predictions.
  • Recall: This is defined as TP / (TP + FN). Looking at Figure 5.3, you will notice that the denominator of this expression is the actual number of positive samples in the dataset. Therefore, the expression to compute recall can be rewritten as TP / (Total number of positive samples in the evaluation set). Recall is a good measure to use when there is a high cost associated with false negatives.

The following snippet computes the accuracy, precision, and recall values for the three models, using the information obtained from the confusion matrix:

# compute accuracy, precision, and recall
 
accuracy_svc = (tp_svc + tn_svc) / (tn_svc + fp_svc + fn_svc + tp_svc)
accuracy_logit = (tp_logit + tn_logit) / (tn_logit + fp_logit + fn_logit + tp_logit)
accuracy_dtree = (tp_dtree + tn_dtree) / (tn_dtree + fp_dtree + fn_dtree + tp_dtree)
 
precision_svc = tp_svc / (tp_svc + fp_svc)
precision_logit = tp_logit / (tp_logit + fp_logit)
precision_dtree = tp_dtree / (tp_dtree + fp_dtree)
 
recall_svc = tp_svc / (tp_svc + fn_svc)
recall_logit = tp_logit / (tp_logit + fn_logit)
recall_dtree = tp_dtree / (tp_dtree + fn_dtree)

Using the Python print() function, you can examine the accuracy, precision, and recall:

The logistic regression model has the highest precision of 80.95%, whereas the decision tree has the highest recall of 59.14%. Your choice of model will be influenced by the problem domain. What matters more: precision or recall?
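
Incidentally, you do not have to compute these secondary metrics by hand: Scikit-learn's metrics submodule provides the accuracy_score(), precision_score(), and recall_score() functions, which compute the same values directly from the true labels and the predictions. A minimal sketch for the logistic regression model:

# compute accuracy, precision, and recall using Scikit-learn's built-in functions
from sklearn.metrics import accuracy_score, precision_score, recall_score

accuracy_logit_skl = accuracy_score(df_diabetes_target_test.values.ravel(), logit_predictions)
precision_logit_skl = precision_score(df_diabetes_target_test.values.ravel(), logit_predictions)
recall_logit_skl = recall_score(df_diabetes_target_test.values.ravel(), logit_predictions)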

It is important to note that all three classification models trained in this section compute class-wise prediction probabilities, and then threshold the probabilities to arrive at the binary output class. Scikit-learn uses a threshold of 0.5, and while it does not allow you to change this threshold, it does provide access to the underlying probabilities so that you can implement your own threshold if you want to. This means that the performance metrics discussed in this section will change for different values of the threshold. One possibility is to compute the confusion matrix for 100 thresholds between 0.0 and 1.0 and pick the threshold value that provides the best balance between accuracy, precision, and recall.
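
As an illustration of this idea, the following sketch applies a custom threshold of 0.3 (an arbitrary value chosen purely for demonstration) to the probabilities produced by the logistic regression model and recomputes its confusion matrix. The predict_proba() method returns one column of probabilities per class; column 1 holds the probability of the positive class:

# illustrative sketch: apply a custom classification threshold
from sklearn.metrics import confusion_matrix

custom_threshold = 0.3   # arbitrary value, for demonstration only
logit_probabilities = logit_model.predict_proba(df_diabetes_features_test)
logit_custom_predictions = (logit_probabilities[:, 1] >= custom_threshold).astype(int)
cm_logit_custom = confusion_matrix(df_diabetes_target_test.values.ravel(), logit_custom_predictions)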

Data scientists often use a visualization tool called the ROC (receiver operating characteristic) curve, and a metric called the AUC (area under the ROC curve), to evaluate the quality of a classification model. The ROC curve is created by plotting the true positive rate on the y-axis against the false positive rate on the x-axis for a number of different confusion matrices, computed for threshold values between 0.0 and 1.0.

Scikit-learn provides a function called roc_curve() in the metrics submodule that can be used to compute the true and false positive rates (TPR and FPR) for a number of different threshold values. You can find detailed information on the parameters of the roc_curve() function at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html.

The following snippet demonstrates the use of the roc_curve() function to compute the true and false positive rates for the three binary classification models created earlier in this section, and plots the ROC curves using Matplotlib's pyplot module. The resulting ROC curves are depicted in Figure 5.4.


FIGURE 5.4 ROC curves for three binary classification models

# plot ROC curves for the three classifiers.
 
# compute prediction probabilities
svc_probabilities = svc_model.predict_proba(df_diabetes_features_test)
logit_probabilities = logit_model.predict_proba(df_diabetes_features_test)
dtree_probabilities = dtree_model.predict_proba(df_diabetes_features_test)
 
# calculate the FPR and TPR for all thresholds of the SVC model
import sklearn.metrics as metrics
svc_fpr, svc_tpr, svc_thresholds = metrics.roc_curve(df_diabetes_target_test.values.ravel(),
                                                     svc_probabilities[:,1],
                                                     pos_label=1,
                                                     drop_intermediate=False)
 
# calculate the FPR and TPR for all thresholds of the logistic regression model
logit_fpr, logit_tpr, logit_thresholds = metrics.roc_curve(df_diabetes_target_test.values.ravel(),
                                                           logit_probabilities[:,1],
                                                           pos_label=1,
                                                           drop_intermediate=False)
 
# calculate the FPR and TPR for all thresholds of the decision tree model
dtree_fpr, dtree_tpr, dtree_thresholds = metrics.roc_curve(df_diabetes_target_test.values.ravel(),
                                                           dtree_probabilities[:,1],
                                                           pos_label=1,
                                                           drop_intermediate=False)
 
 
 
# plot ROC curves
%matplotlib inline
import matplotlib.pyplot as plt
 
fig, axes = plt.subplots(1, 3, figsize=(18,6))
 
axes[0].set_title('ROC curve: SVC model')
axes[0].set_xlabel("False Positive Rate")
axes[0].set_ylabel("True Positive Rate")
axes[0].plot(svc_fpr, svc_tpr)
axes[0].axhline(y=0, color='k')
axes[0].axvline(x=0, color='k')
 
axes[1].set_title('ROC curve: Logit model')
axes[1].set_xlabel("False Positive Rate")
axes[1].set_ylabel("True Positive Rate")
axes[1].plot(logit_fpr, logit_tpr)
axes[1].axhline(y=0, color='k')
axes[1].axvline(x=0, color='k')
 
axes[2].set_title('ROC curve: Tree model')
axes[2].set_xlabel("False Positive Rate")
axes[2].set_ylabel("True Positive Rate")
axes[2].plot(dtree_fpr, dtree_tpr)
axes[2].axhline(y=0, color='k')
axes[2].axvline(x=0, color='k')

Scikit-learn also provides the auc() function in the metrics module that can compute the area under the ROC curve from the true and false positive rates returned by the roc_curve() function. The following snippet computes the AUC metric for the three classification models:

# compute AUC metrics for the three models
svc_auc = metrics.auc(svc_fpr, svc_tpr)
logit_auc = metrics.auc(logit_fpr, logit_tpr)
dtree_auc = metrics.auc(dtree_fpr, dtree_tpr)

Using the Python print() function to examine the values of the AUC metrics for the three models, the logistic regression model provides the highest AUC value:
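
A minimal sketch of such print statements follows. Note that Scikit-learn also provides the roc_auc_score() function in the metrics module, which computes the same value directly from the true labels and the predicted probabilities of the positive class:

print('AUC (SVC): ' + str(svc_auc))
print('AUC (logistic regression): ' + str(logit_auc))
print('AUC (decision tree): ' + str(dtree_auc))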

Multi-Class Classification Models

Multi-class classification models are used when the target variable can belong to more than two classes. Fortunately, most of the techniques used to evaluate the performance of binary classification models can be applied to their multi-class counterparts. The concept of the confusion matrix can easily be extended to multi-class classification. The confusion matrix for an n-class classification would be represented as an n × n matrix where the rows represent the actual classes, and the columns represent the predicted classes. Figure 5.5 depicts a multi-class confusion matrix for a five-class problem.


FIGURE 5.5 Multi-class confusion matrix for a five-class dataset

Scikit-learn's confusion_matrix() function can also be used to compute the confusion matrix for multi-class classification models. The following code snippet trains a softmax (multinomial) logistic regression model and a decision tree model on the Iris flowers training dataset, uses the models to make predictions, and computes the confusion matrices:

# load Iris flowers dataset
from sklearn.datasets import load_iris
iris_dataset = load_iris()
df_iris_features = pd.DataFrame(data = iris_dataset.data, columns=iris_dataset.feature_names)
df_iris_target = pd.DataFrame(data = iris_dataset.target, columns=['class'])
 
# normalize attribute values
from sklearn.preprocessing import MinMaxScaler
iris_scaler = MinMaxScaler()
iris_scaler.fit(df_iris_features)
nd_iris_features = iris_scaler.transform(df_iris_features)
df_iris_features_normalized = pd.DataFrame(data=nd_iris_features, columns=df_iris_features.columns)
 
# create a training dataset and a test dataset using a 75/25 split.
from sklearn.model_selection import train_test_split
iris_split = train_test_split(df_iris_features_normalized, df_iris_target,
                              test_size=0.25, random_state=17)
df_iris_features_train = iris_split[0]
df_iris_features_test = iris_split[1]
df_iris_target_train = iris_split[2]
df_iris_target_test = iris_split[3]
 
# softmax (a.k.a multinomial regression) classifier
from sklearn.linear_model import LogisticRegression
softmax_logit_model = LogisticRegression(penalty='l2', fit_intercept=True, solver='lbfgs', multi_class='multinomial')
softmax_logit_model.fit(df_iris_features_train, df_iris_target_train.values.ravel())
 
# create a decision tree based multi-class classifier.
from sklearn.tree import DecisionTreeClassifier
mc_dtree_model = DecisionTreeClassifier(max_depth=4)
mc_dtree_model.fit(df_iris_features_train, df_iris_target_train.values.ravel())
 
# use the  model to create predictions on the test set
softmax_logit_predictions = softmax_logit_model.predict(df_iris_features_test)
mc_predictions = mc_dtree_model.predict(df_iris_features_test)
 
# compute confusion matrix
from sklearn.metrics import confusion_matrix
 
cm_softmax = confusion_matrix(df_iris_target_test.values.ravel(), softmax_logit_predictions)
cm_mc_dtree = confusion_matrix(df_iris_target_test.values.ravel(), mc_predictions)

You can inspect the confusion matrices generated by the two classifiers by using the Python print() function:
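
A minimal sketch of such print statements follows:

print('Softmax regression confusion matrix:')
print(cm_softmax)
print('Decision tree confusion matrix:')
print(cm_mc_dtree)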

Multi-class confusion matrices are often visualized as heat maps. The following snippet visualizes the two confusion matrices as heat maps side by side. Portions of the snippet have been adapted from https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html. The resulting heatmaps are depicted in Figure 5.6.


FIGURE 5.6 Multi-class confusion matrix for two models trained on the Iris flowers dataset

# plot confusion matrices as heatmaps
%matplotlib inline
import matplotlib.pyplot as plt
 
# plot the confusion matrix as a heat map using Matplotlib functions.
#
# portions of this code have been adapted from
# https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
def plot_confusion_matrix(cmatrix, class_labels, axes, title, cmap):
 
    heatmap_image = axes.imshow(cmatrix, interpolation='nearest', cmap=cmap)
    axes.figure.colorbar(heatmap_image, ax=axes)
 
    num_rows = cmatrix.shape[0]
    num_cols = cmatrix.shape[1]
 
    axes.set_title(title)
    axes.set_xlabel('Predicted')
    axes.set_ylabel('True')
 
    axes.set_xticks(np.arange(num_cols))
    axes.set_yticks(np.arange(num_rows))
    axes.set_xticklabels(class_labels)
    axes.set_yticklabels(class_labels)
 
    # Loop over data dimensions and create text annotations.
    #fmt = '.2f' if normalize else 'd' 
    thresh = cmatrix.max() / 2.
    for y in range(num_rows):
        for x in range(num_cols):
            axes.text(x, y, format(cmatrix[y, x], '.0f'),
                    ha="center", va="center",
                    color="white" if cmatrix[y, x] > thresh else "black")
 
fig, axes = plt.subplots(1, 2, figsize=(18,6))
 
plot_confusion_matrix(cm_softmax,
                      iris_dataset.target_names, axes[0],
                      "Softmax Regression",
                      plt.cm.Greens)
 
plot_confusion_matrix(cm_mc_dtree,
                      iris_dataset.target_names, axes[1],
                      "Decision Tree Regression",
                      plt.cm.Greens) 

For a multi-class classification model, you can compute the overall accuracy metric as well as class-wise precision and recall values. These metrics are defined as follows:

  • Accuracy: In a multi-class classification model, the overall accuracy of the model is the fraction of predictions that are correct. Using a confusion matrix, the accuracy can be computed as the Sum of the diagonal elements / Total number of predictions. In the case of the confusion matrix for the decision tree model depicted in Figure 5.6, the overall accuracy is 37/38.
  • Precision: In a multi-class classification problem, the class-wise precision for a given class is defined as the Number of times the model predicted the class correctly / Total number of predictions made for that class. On a confusion matrix, precision corresponds to the ratio of the diagonal element to its column total. In the case of the confusion matrix for the decision tree model depicted in Figure 5.6, the class-wise precision for the Setosa class is 10/10, Versicolor is 15/16, and Virginica is 12/12.
  • Recall: In a multi-class classification problem, the class-wise recall for a given class is defined as the Number of times the model predicted the class correctly / Total number of elements of that class. On a confusion matrix, recall corresponds to the ratio of the diagonal element to its row total. In the case of the confusion matrix for the decision tree model depicted in Figure 5.6, the class-wise recall for the Setosa class is 10/10, the Versicolor class is 15/15, and Virginica is 12/13 (these calculations are demonstrated in the snippet that follows this list).
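
The class-wise metrics can be computed directly from the confusion matrix with NumPy, as the following minimal sketch shows for the decision tree model. Scikit-learn's classification_report() function in the metrics submodule produces the same class-wise precision and recall values in a formatted table:

# compute overall accuracy and class-wise precision/recall from the multi-class confusion matrix
overall_accuracy = np.trace(cm_mc_dtree) / cm_mc_dtree.sum()
class_precision = np.diag(cm_mc_dtree) / cm_mc_dtree.sum(axis=0)   # diagonal / column totals
class_recall = np.diag(cm_mc_dtree) / cm_mc_dtree.sum(axis=1)      # diagonal / row totals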

Choosing Hyperparameter Values

During a machine learning project, creating a single model and computing metrics is not enough. As you have learned in this chapter and the previous ones, there are a number of factors that could influence the performance of the model, from feature engineering to the choice of model and hyperparameters used during the model building. In order to automate the process of finding the optimal model, data scientists often use the grid search technique to try various combinations of hyperparameters and pick the model that appears to perform best. Scikit-learn provides the GridSearchCV class in the model_selection module that can be used to perform an exhaustive search over a set of parameter values. You can learn more about the GridSearchCV class at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.

The following code snippet uses the GridSearchCV class to try different hyperparameter combinations for a multi-class decision tree classifier on the Iris flowers dataset and returns the hyperparameters that result in the best accuracy score:

# use grid search to find the hyperparameters that result
# in the best accuracy score for a decision tree
# based classifier on the Iris Flowers dataset
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
 
grid_params = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'max_features': ['auto', 'sqrt', 'log2'],
    'presort': [True, False]
}
 
grid_search = GridSearchCV(estimator=DecisionTreeClassifier(),
                           param_grid=grid_params, scoring='accuracy',
                           cv=10, n_jobs=-1)
 
grid_search.fit(df_iris_features.values, df_iris_target.values.ravel())

In this snippet, grid_params is a dictionary of hyperparameters and the values of each parameter that you want to try. The elements of the dictionary will depend on the model that you want to train. This example uses a decision tree classification model, and the dictionary has six entries that correspond to some of the hyperparameters of the DecisionTreeClassifier class. The grid search process will build a model with each combination of hyperparameter values, which in this case is 2 × 2 × 11 × 11 × 3 × 2 = 2,904 combinations.

The cv=10 parameter in the constructor of the GridSearchCV class controls the number of cross-validation folds, which in this case is set to 10; each of the 2,904 hyperparameter combinations will therefore be evaluated using 10-fold cross validation. Depending on the speed of your computer, this code can take a significant amount of time to execute. Once the grid search is complete, you can inspect the values of the hyperparameters that resulted in the best model, and the accuracy of that model, using the following statements:
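
A minimal sketch of these statements, using the best_params_ and best_score_ attributes that GridSearchCV exposes after fitting:

# inspect the best hyperparameter combination and its cross-validated accuracy
print('Best hyperparameters: ' + str(grid_search.best_params_))
print('Best cross-validated accuracy: ' + str(grid_search.best_score_))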

Summary

  • There are two types of techniques that can be used to evaluate the predictive accuracy of a regression model: creating a scatter plot of the true and predicted values, and computing a statistical metric that captures the total prediction error across members of the test set.
  • The root mean squared error (RMSE) metric is popular when it comes to evaluating the performance of a regression model. As the name suggests, root mean square is the square root of the mean squared error (MSE).
  • The R2 metric is another statistical metric that can be used to get an idea of the quality of the model.
  • The R2 metric is also known as the coefficient of determination. It measures the proportion of the variance in the target variable that is explained by the model.
  • The true positive count for a binary classification model is defined as the number of times the model predicted a positive outcome, and the prediction was correct.
  • The false positive count for a binary classification model is defined as the number of times the model predicted a positive outcome, and the prediction was incorrect.
  • The true negative count for a binary classification model is defined as the number of times the model predicted a negative outcome, and the prediction was correct.
  • The false negative count for a binary classification model is defined as the number of times the model predicted a negative outcome, and the prediction was incorrect.
  • Accuracy, precision, and recall are additional metrics that can be used to evaluate a binary classification model.