A comparison of methods

We can now test the methods discussed in this chapter on a regression problem and a classification problem. To avoid overfitting, the dataset is typically split into two sets: the training set, on which the model parameters are fitted, and a test set, on which the accuracy of the model is evaluated. However, it may be necessary to use a third set, the validation set, on which the hyperparameters (for example, C and the kernel parameters for SVM, or α in ridge regression) can be optimized. The original dataset may be too small to allow splitting into three sets, and the results may also be affected by the particular choice of data points in the training, validation, and test sets. A common way to solve this issue is to evaluate the model following the so-called cross-validation procedure: the dataset is split into k subsets (called folds) and the model is trained as follows:

  • A model is trained using k-1 of the folds as the training data.
  • The resulting model is tested on the remaining part of the data.
  • This procedure is repeated as many times as the number of folds decided at the beginning, each time training on a different set of k-1 folds (and consequently testing on a different fold). The final accuracy is obtained as the average of the accuracies over the k iterations (see the sketch after this list).
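
As a minimal sketch of this procedure (using sklearn's cross_val_score helper rather than the chapter's own code, and toy placeholder data), each of the k scores below comes from training on k-1 folds and testing on the remaining one:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, just to make the example self-contained
X = np.random.rand(100, 5)
y = np.random.rand(100)

# cross_val_score trains on k-1 folds and scores on the held-out fold,
# repeating the process k times (here k=10)
scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring='r2')
print('average R2:', scores.mean())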

Regression problem

We are using the dataset of housing values in Boston's suburbs stored at http://archive.ics.uci.edu/ml/datasets/Housing and in the author's repository (https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_3/), where the code used in this section is also available. The dataset has 13 features:

  • CRIM: Per capita crime rate by town
  • ZN: Proportion of residential land zoned for lots over 25,000 sqft
  • INDUS: Proportion of non-retail business acres per town
  • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX: Nitric oxides concentration (parts per 10 million)
  • RM: Average number of rooms per dwelling
  • AGE: Proportion of owner-occupied units built prior to 1940
  • DIS: Weighted distances from five Boston employment centers
  • RAD: Index of accessibility to radial highways
  • TAX: Full-value property tax rate per $10,000
  • PTRATIO: Pupil-teacher ratio by town
  • B: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
  • LSTAT: Percentage of lower status of the population

The labels that we want to predict are the MEDV values, which represent the median house values (in $1,000s).

To evaluate the quality of the models, the mean squared error defined in the introduction and the coefficient of determination, R2, are calculated. R2 is given by:

$$R^2 = 1 - \frac{\sum_i \left(y_i - y_i^{pred}\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}$$

Here, $y_i^{pred}$ indicates the label predicted by the model and $\bar{y}$ is the mean of the true labels.

The best result is R2 = 1, which means the model fits the data perfectly, while R2 = 0 corresponds to a model that always predicts the mean of the data (negative values indicate an increasingly worse fit). The code to train the linear regression, ridge regression, lasso regression, decision tree, random forest, KNN, and SVM regression models using the sklearn library is as follows (the full IPython notebook is at https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_3/):

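Since the original listing is not reproduced here, what follows is a sketch of what that script plausibly does. The file name housing.data (whitespace-separated, no header), the alpha values, and every setting not quoted in the text are assumptions:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
        'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
df = pd.read_csv('housing.data', sep=r'\s+', names=cols)
# Reshuffle the rows so that the cross-validation folds are randomized
df = df.iloc[np.random.permutation(len(df))]
X = df[cols[:-1]].values
y = df['MEDV'].values

models = {
    'linear':        LinearRegression(),
    'ridge':         Ridge(alpha=1.0),          # alpha is an assumption
    'lasso':         Lasso(alpha=1.0),          # alpha is an assumption
    'decision tree': DecisionTreeRegressor(),
    'random forest': RandomForestRegressor(n_estimators=50),
    'knn':           KNeighborsRegressor(),
    'svm (linear)':  SVR(kernel='linear', C=1),
    'svm (rbf)':     SVR(kernel='rbf', C=1),
}
for name, model in models.items():
    # Average R2 and MSE over the 10 cross-validation folds
    r2 = cross_val_score(model, X, y, cv=10, scoring='r2').mean()
    mse = -cross_val_score(model, X, y, cv=10,
                           scoring='neg_mean_squared_error').mean()
    print('%s: R2 = %.2f, MSE = %.1f' % (name, r2, mse))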

The housing data is loaded using the pandas library and reshuffled, by applying df.iloc[np.random.permutation(len(df))], so that the cross-validation fold subsets are randomized (10 folds have been used). The output of this script is as follows:

[Output: the average cross-validated R2 and MSE for each regression model; the values are discussed below]

The best model fit is obtained using a random forest (with 50 trees); it returns an average coefficient of determination of 0.86 and MSE = 11.5. As expected, the decision tree regressor has a lower R2 and a higher MSE than the random forest (0.67 and 25, respectively). The support vector machine with the rbf kernel (C=1) is the worst model, with a huge MSE of 83.9 and an R2 of 0.0, while the SVM with the linear kernel (C=1) returns a decent model (0.69 R2 and 25.8 MSE). The lasso and ridge regressors have comparable results, around 0.7 R2 and 24 MSE.

An important procedure for improving model results is feature selection. It often happens that only a subset of the features is relevant to the model training, while the other features may not contribute to the model's R2 at all. Feature selection may improve R2, because misleading data is disregarded, and it also reduces the training time (there are fewer features to consider).

There are many techniques for extracting the best features for a certain model, but in this context we explore the so-called recursive feature elimination (RFE) method, which recursively discards the attributes associated with the smallest absolute weights until the desired number of features is selected. In the case of the SVM algorithm, the weights are just the values of w, while for regression they are the model parameters θ. We use the sklearn built-in function RFE, specifying that only the best four attributes (best_features) are retained:

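A sketch of this step, reusing X and y from the sketch above; the linear SVM regressor as the base estimator is an assumption (RFE requires a model exposing coef_ or feature_importances_):

from sklearn.feature_selection import RFE
from sklearn.svm import SVR

best_features = 4
selector = RFE(SVR(kernel='linear', C=1),
               n_features_to_select=best_features)
selector.fit(X, y)
print(selector.support_)   # Boolean mask over the 13 features
print(selector.ranking_)   # rank 1 marks the selected attributes
X_reduced = X[:, selector.support_]   # keep only the selected columns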

The output is:

[Output: the support_ Boolean mask and the feature ranking returned by RFE]

The RFE function returns a list of Booleans (the support_ attribute) indicating which features are selected (True) and which are not (False). The selected features are then used to evaluate the models as we have done before.

Even using only four features, the best model remains the random forest with 50 trees, and its R2 is just marginally lower than that of the model trained with the full set of features (0.82 against 0.86). The other models (the lasso, ridge, decision tree, and linear SVM regressors) show a more significant drop in R2, but the results are still comparable with those of their corresponding fully trained models. Note that the KNN algorithm does not provide weights on the features, so the RFE method cannot be applied to it.

Classification problem

To test the classifiers learned in this chapter, we use the dataset about car evaluation quality (inaccurate, accurate, good, and very good) based on six features that describe the main characteristics of a car (buying price, maintenance cost, number of doors, number of persons to carry, size of luggage boot, and safety). The dataset can be found at http://archive.ics.uci.edu/ml/datasets/Car+Evaluation or on my GitHub account, together with the code discussed here (https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_3/). To evaluate the accuracy of the classification, we will use the precision, recall, and f-measure. Given a dataset with only two classes (positive and negative), we define the number of true positives (tp) as the points correctly labeled as positive, the number of false positives (fp) as the negative points wrongly labeled as positive, and the number of false negatives (fn) as the positive points erroneously assigned to the negative class. Using these definitions, the precision, recall, and f-measure can be calculated as:

$$precision = \frac{tp}{tp + fp}$$

$$recall = \frac{tp}{tp + fn}$$

$$F = 2\,\frac{precision \cdot recall}{precision + recall}$$

In a classification problem, a perfect precision (1.0) for a given class C means that every point assigned to class C actually belongs to class C (but it says nothing about the points from class C that were labeled with other classes), whereas a recall equal to 1.0 means that every point from class C was labeled as belonging to class C (but it says nothing about the points from other classes that were wrongly assigned to class C).

Note that in the case of multiple classes, these metrics are usually calculated once per label, each time treating one class as the positive class and all the others as the negative class. Different averages over the per-class metrics are then used to estimate the overall precision, recall, and f-measure.
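
For reference, sklearn computes both the per-class and the averaged metrics directly; the labels below are placeholders:

from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 2, 2, 1, 3, 2, 0, 2]   # hypothetical true labels
y_pred = [0, 2, 1, 1, 3, 2, 2, 2]   # hypothetical predictions
# One value per class, treating each class as positive in turn
p, r, f, support = precision_recall_fscore_support(y_true, y_pred)
print(p, r, f)
# 'macro' takes the unweighted mean of the per-class values;
# 'weighted' weights each class by its number of points
p, r, f, _ = precision_recall_fscore_support(y_true, y_pred,
                                             average='macro')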

The code to classify the cars dataset is as follows. First, we load all the libraries and the data into a pandas data frame.

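A sketch of this loading step, assuming the raw car.data file (comma-separated, with no header row) and the column names used in the text:

import numpy as np
import pandas as pd

cols = ['buying', 'maintenance', 'doors', 'persons',
        'lug_boot', 'safety', 'car evaluation']
df = pd.read_csv('car.data', names=cols)
print(df.head())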

The feature values are categorical (the number after each name is its column index):

  buying (0):         v-high, high, med, low
  maintenance (1):    v-high, high, med, low
  doors (2):          2, 3, 4, 5-more
  persons (3):        2, 4, more
  lug_boot (4):       small, med, big
  safety (5):         low, med, high
  car evaluation (6): unacc, acc, good, vgood

These are mapped into numbers to be used in the classification algorithms:

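A sketch of the mapping step; the class codes match the index mapping quoted later in the text, while the integer codes for the features are an assumption (any consistent encoding works). The value spellings must match the raw file exactly:

maps = {
    'buying':         {'v-high': 3, 'high': 2, 'med': 1, 'low': 0},
    'maintenance':    {'v-high': 3, 'high': 2, 'med': 1, 'low': 0},
    'doors':          {'2': 0, '3': 1, '4': 2, '5-more': 3},
    'persons':        {'2': 0, '4': 1, 'more': 2},
    'lug_boot':       {'small': 0, 'med': 1, 'big': 2},
    'safety':         {'low': 0, 'med': 1, 'high': 2},
    'car evaluation': {'acc': 0, 'good': 1, 'unacc': 2, 'vgood': 3},
}
# Replace each categorical value with its integer code
for col, mapping in maps.items():
    df[col] = df[col].map(mapping)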

Since we need to calculate and save the measures for all the methods, we write a standard function, CalcMeasures, and separate the labels' vector Y from the features X:

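A sketch of the helper, assuming CalcMeasures wraps sklearn's per-class precision, recall, and f-measure (its exact signature in the notebook is an assumption):

from sklearn.metrics import precision_recall_fscore_support

def CalcMeasures(model, X_train, y_train, X_test, y_test):
    # Fit the model and return the per-class precision, recall, f-measure
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    p, r, f, _ = precision_recall_fscore_support(y_test, y_pred,
                                                 labels=[0, 1, 2, 3])
    return p, r, f

# Separate the labels' vector Y from the features X
Y = df['car evaluation'].values
X = df.drop('car evaluation', axis=1).values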

Ten cross-validation folds have been used, and the code is as follows:

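A sketch of the loop, assuming the classifiers and C values quoted in the text (all remaining hyperparameters are sklearn defaults); the per-fold measures are accumulated and averaged over the 10 folds:

from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

classifiers = {
    'naive bayes':   MultinomialNB(),
    'logistic':      LogisticRegression(),
    'decision tree': DecisionTreeClassifier(),
    'random forest': RandomForestClassifier(n_estimators=50),
    'svm (linear)':  SVC(kernel='linear', C=50),
    'svm (rbf)':     SVC(kernel='rbf', C=50),
}
# For each classifier, a 3x4 array: precision, recall, f-measure per class
results = {name: np.zeros((3, 4)) for name in classifiers}
kf = KFold(n_splits=10, shuffle=True)
for train_idx, test_idx in kf.split(X):
    for name, clf in classifiers.items():
        p, r, f = CalcMeasures(clf, X[train_idx], Y[train_idx],
                               X[test_idx], Y[test_idx])
        results[name] += np.vstack([p, r, f]) / 10.0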

The measures' values are stored in the data frames:

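A sketch of how the averaged measures might be collected, one data frame per metric, with the classifiers as rows and the four classes as columns (this layout is an assumption):

class_names = ['acc', 'good', 'unacc', 'vgood']
df_precision = pd.DataFrame({n: res[0] for n, res in results.items()},
                            index=class_names).T
df_recall = pd.DataFrame({n: res[1] for n, res in results.items()},
                         index=class_names).T
df_fmeasure = pd.DataFrame({n: res[2] for n, res in results.items()},
                           index=class_names).T
print(df_fmeasure)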

Each measure has been evaluated four times, once for each car evaluation class; the arrays are filled according to the index mapping:

'acc': 0, 'unacc': 2, 'good': 1, 'vgood': 3

The best model is the SVM with the rbf kernel (C=50), but the random forest (50 trees) and the decision tree also return excellent results (measures over 0.9 for all four classes). Naive Bayes, logistic regression, and the SVM with the linear kernel (C=50) return poor models, especially for the accurate, good, and very good classes, because there are few points with those labels:

[Output: the distribution of the points among the four car evaluation classes]

In percentage terms, the very good (vgood) and good classes represent 3.993% and 3.762% of the points respectively, compared to 70.0223% for inaccurate (unacc) and 22.222% for accurate (acc). So, we can conclude that these algorithms are not suitable for predicting classes that are scarcely represented in a dataset.
