We can now test the methods discussed in this chapter to solve a regression problem and a classification problem. To avoid overfitting, the dataset is typically split into two sets: the training set, on which the model parameters are fitted, and a test set, on which the accuracy of the model is evaluated. However, it may be necessary to use a third set, the validation set, on which the hyperparameters (for example, C and γ for SVM, or α in ridge regression) can be optimized. The original dataset may be too small to allow splitting into three sets, and the results may also be affected by the particular choice of data points for the training, validation, and test sets. A common way to solve this issue is to evaluate the model following the so-called cross-validation procedure: the dataset is split into k subsets (called folds), and the model is trained k times; each time, a different fold is held out for evaluation while the remaining k-1 folds are used for training, and the k scores are then averaged.
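This procedure can be sketched with scikit-learn's KFold class (a minimal illustration on synthetic arrays; in recent scikit-learn versions KFold lives in sklearn.model_selection):

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic data standing in for a real dataset (hypothetical values).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Split the data into k=5 folds; each fold serves once as the test set
# while the remaining k-1 folds form the training set.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # ... fit the model on (X_train, y_train), evaluate on (X_test, y_test)
```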
We are using the housing dataset of Boston's suburbs stored at http://archive.ics.uci.edu/ml/datasets/Housing and in the author's repository (https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_3/), where the code used in this section is also available. The dataset has 13 features (CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, and LSTAT) plus the target value, MEDV, the median home value.
To evaluate the quality of the models, the mean squared error defined in the introduction and the coefficient of determination, R2, are calculated. R2 is given by:

R2 = 1 - [Σi (yi - yipred)^2] / [Σi (yi - ȳ)^2]

Here, yipred indicates the label predicted by the model, and ȳ is the mean of the true labels yi.
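Both metrics are available in sklearn.metrics; a small sketch with hypothetical labels and predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 2.5, 4.0, 5.1])
y_pred = np.array([2.8, 2.7, 4.2, 5.0])  # hypothetical model output

# MSE: mean of the squared residuals.
mse = mean_squared_error(y_true, y_pred)

# R2: 1 - (residual sum of squares) / (total sum of squares).
r2 = r2_score(y_true, y_pred)
```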
The best result is R2=1, which means the model perfectly fits the data, while R2=0 corresponds to a model that always predicts the mean of the data (negative values indicate an increasingly worse fit). The code to train the linear regression, ridge regression, lasso regression, and SVM regression models using the sklearn library is as follows (IPython notebook at https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_3/):
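The chapter's notebook contains the full script; the following condensed sketch reproduces its structure on a synthetic regression problem (make_regression stands in for the housing data, and the model parameters mirror the values quoted below):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (13 features, as in the dataset).
X, y = make_regression(n_samples=200, n_features=13, noise=10.0,
                       random_state=0)

models = {
    'linear':        LinearRegression(),
    'ridge':         Ridge(alpha=1.0),
    'lasso':         Lasso(alpha=0.1),
    'svm_linear':    SVR(kernel='linear', C=1.0),
    'svm_rbf':       SVR(kernel='rbf', C=1.0),
    'tree':          DecisionTreeRegressor(random_state=0),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=0),
}

# 10-fold cross-validated R2 for each model.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring='r2')
    print('%-13s R2 = %.3f' % (name, scores.mean()))
```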
The housing data is loaded using the pandas library and reshuffled by applying the function df.iloc[np.random.permutation(len(df))], which randomizes the data assigned to the cross-validation folds (10 folds have been used).
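A minimal sketch of the loading-and-shuffling step (the file path is illustrative, and a small synthetic frame demonstrates the permutation idiom):

```python
import numpy as np
import pandas as pd

# The book loads the housing file with pandas, for example:
# df = pd.read_csv('housing.data', delim_whitespace=True, header=None)
# Here a small synthetic frame stands in for the real data.
np.random.seed(0)
df = pd.DataFrame({'feature': range(10), 'target': range(10)})

# Randomize the row order so the cross-validation folds are not biased
# by any ordering present in the original file.
df = df.iloc[np.random.permutation(len(df))]
```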
The output of this script is as follows:
The best model fit is obtained using a random forest (with 50 trees); it returns an average coefficient of determination of 0.86 and MSE=11.5. As expected, the decision tree regressor has a lower R2 and a higher MSE than the random forest (0.67 and 25, respectively). The support vector machine with the rbf kernel (C=1) is the worst model, with a huge MSE of 83.9 and an R2 of 0.0, while the SVM with the linear kernel (C=1) returns a decent model (0.69 R2 and 25.8 MSE). The lasso and ridge regressors give comparable results, around 0.7 R2 and 24 MSE. An important procedure for improving model results is feature selection. It often happens that only a subset of the features is relevant for training the model, while the other features may not contribute to the model's R2 at all. Feature selection may improve R2, because misleading data is disregarded, and it reduces training time (there are fewer features to consider).
There are many techniques for extracting the best features for a given model, but in this context we explore the so-called recursive feature elimination (RFE) method, which repeatedly discards the features with the smallest absolute weights until the desired number of features remains. In the case of the SVM algorithm, the weights are just the values of w, while for regression they are the model parameters θ. We use the sklearn built-in function RFE, specifying that only the best four attributes (best_features) are kept:
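A minimal sketch of the RFE call (a synthetic problem stands in for the housing data; best_features is the variable name used in the text):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data with 13 features, of which only 4 are informative.
X, y = make_regression(n_samples=200, n_features=13, n_informative=4,
                       noise=1.0, random_state=0)

best_features = 4  # number of attributes to keep, as in the text
selector = RFE(LinearRegression(), n_features_to_select=best_features)
selector.fit(X, y)

# support_ is a Boolean mask of the selected features.
print(selector.support_)
X_selected = X[:, selector.support_]
```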
The output is:
The RFE function returns a list of Booleans (the support_ attribute) indicating which features are selected (True values) and which are not (False values). The selected features are then used to evaluate the model as we have done before.
Even using only four features, the best model remains the random forest with 50 trees, and its R2 is only marginally lower than that of the model trained with the full set of features (0.82 against 0.86). The other models (lasso, ridge, decision tree, and linear SVM regressors) show a more significant R2 drop, but the results are still comparable with those of their corresponding fully trained models. Note that the KNN algorithm does not provide weights on the features, so the RFE method cannot be applied to it.
To test the classifiers learned in this chapter, we use a dataset about car evaluation quality (inaccurate, accurate, good, and very good) based on six features that describe the main characteristics of a car (buying price, maintenance cost, number of doors, number of persons to carry, size of luggage boot, and safety). The dataset can be found at http://archive.ics.uci.edu/ml/datasets/Car+Evaluation or on my GitHub account, together with the code discussed here (https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_3/). To evaluate the accuracy of the classification, we will use the precision, recall, and f-measure. Given a dataset with only two classes (positive and negative), we call true positives (tp) the points correctly labeled as positive, false positives (fp) the negative points wrongly labeled as positive, and false negatives (fn) the positive points erroneously assigned to the negative class. Using these definitions, the precision, recall, and f-measure are calculated as:

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f-measure = 2 * precision * recall / (precision + recall)
In a classification problem, a perfect precision (1.0) for a given class C means that every point assigned to class C actually belongs to class C (but it says nothing about the points of class C that were assigned to other classes), whereas a recall equal to 1.0 means that every point from class C was labeled as belonging to class C (but it says nothing about the points from other classes wrongly assigned to class C).
Note that in the case of multiple classes, these metrics are usually calculated as many times as there are labels, each time treating one class as positive and all the others as negative. Different averages over the per-class metrics are then used to estimate the overall precision, recall, and f-measure.
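The per-class and averaged metrics can be computed with scikit-learn's precision_recall_fscore_support function; the labels below are hypothetical:

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical true and predicted labels for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]

# Per-class metrics: each class in turn is treated as "positive".
prec, rec, f1, support = precision_recall_fscore_support(y_true, y_pred)

# Averaged over classes (a macro average weighs every class equally).
prec_m, rec_m, f1_m, _ = precision_recall_fscore_support(
    y_true, y_pred, average='macro')
```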
The code to classify the cars dataset is as follows. First, we load all the libraries and the data into a pandas data frame.
The following are the feature values, which are categorical:

buying (0): v-high, high, med, low
maintenance (1): v-high, high, med, low
doors (2): 2, 3, 4, 5-more
persons (3): 2, 4, more
lug_boot (4): small, med, big
safety (5): low, med, high
car evaluation (6, the label): unacc, acc, good, vgood
These are mapped into numbers to be used in the classification algorithms:
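The mapping can be done with the pandas map method; the dictionaries below are illustrative (the actual mappings are defined in the chapter's code):

```python
import pandas as pd

# A small frame standing in for two of the car features.
df = pd.DataFrame({'buying': ['v-high', 'med', 'low'],
                   'safety': ['low', 'high', 'med']})

# Illustrative category-to-integer dictionaries.
buying_map = {'v-high': 3, 'high': 2, 'med': 1, 'low': 0}
safety_map = {'low': 0, 'med': 1, 'high': 2}

df['buying'] = df['buying'].map(buying_map)
df['safety'] = df['safety'].map(safety_map)
```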
Since we need to calculate and save the measures for all the methods, we write a helper function, CalcMeasures, and separate the label vector Y from the feature matrix X:
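The exact body of CalcMeasures is in the chapter's notebook; a plausible sketch using scikit-learn's metrics is:

```python
from sklearn.metrics import precision_recall_fscore_support

def CalcMeasures(y_true, y_pred, precision, recall, fmeasure):
    # A plausible reconstruction of the book's helper: compute the
    # per-class measures for one fold (four car evaluation classes)
    # and append them to running lists, to be averaged over the folds.
    p, r, f, _ = precision_recall_fscore_support(y_true, y_pred,
                                                 labels=[0, 1, 2, 3])
    precision.append(p)
    recall.append(r)
    fmeasure.append(f)

# Splitting labels from features, assuming the last column of the
# mapped data frame holds the car evaluation class:
# X = df.values[:, :-1]
# Y = df.values[:, -1]
```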
A 10-fold cross-validation has been used, and the code is:
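A condensed sketch of the 10-fold loop for one of the classifiers (a synthetic four-class problem stands in for the car data; C=50 matches the value quoted below):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Synthetic stand-in for the mapped car data (6 features, 4 classes).
X, Y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=4, random_state=0)

clf = SVC(kernel='rbf', C=50)  # the chapter's best performer
accuracies = []
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    # Fit on 9 folds, evaluate on the held-out fold.
    clf.fit(X[train_idx], Y[train_idx])
    accuracies.append(clf.score(X[test_idx], Y[test_idx]))
mean_acc = np.mean(accuracies)
```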
The measures' values are stored in the data frames:
Each measure has been evaluated four times (once per car evaluation class), and the arrays are filled according to the index mapping:
'acc': 0, 'unacc': 2, 'good': 1, 'vgood': 3
The best model is SVM with the rbf kernel (C=50), but the random forest (50 trees) and the decision tree also return excellent results (measures above 0.9 for all four classes). Naive Bayes, logistic regression, and SVM with the linear kernel (C=50) produce poor models, especially for the accurate, good, and very good classes, because there are few points with those labels:
In percentage terms, the good and very good (vgood) classes account for only 3.993% and 3.762% of the points respectively, compared to 70.023% for inaccurate and 22.222% for accurate. So we can conclude that these algorithms are not well suited to predicting classes that are scarcely represented in a dataset.
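These frequencies can be checked with the pandas value_counts method; the counts below are the ones reported for the UCI car evaluation dataset (1,728 points in total):

```python
import pandas as pd

# Label column mirroring the car evaluation class distribution.
labels = ['unacc'] * 1210 + ['acc'] * 384 + ['good'] * 69 + ['vgood'] * 65
s = pd.Series(labels)

# Relative class frequencies, in percent.
freq = s.value_counts(normalize=True) * 100
print(freq)
```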