Example of random forest with Scikit-Learn

In this example, we are going to use the famous Wine dataset (178 13-dimensional samples split into three classes), which is directly available in Scikit-Learn. Unfortunately, it's not easy to find good, simple datasets for ensemble learning algorithms, as they are normally employed with large and complex sets that require very long computation times. As the Wine dataset is not particularly complex, the first step is to assess the performance of different classifiers (logistic regression, decision tree, and polynomial SVM) using k-fold cross-validation:

import numpy as np

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, Y = load_wine(return_X_y=True)

lr = LogisticRegression(max_iter=1000, random_state=1000)
print(np.mean(cross_val_score(lr, X, Y, cv=10)))
0.956432748538

dt = DecisionTreeClassifier(criterion='entropy', random_state=1000)
print(np.mean(cross_val_score(dt, X, Y, cv=10)))
0.933298933609

svm = SVC(kernel='poly', random_state=1000)
print(np.mean(cross_val_score(svm, X, Y, cv=10)))
0.961403508772

As expected, the performance is quite good, with the highest average cross-validation accuracy, about 96%, achieved by the polynomial SVM (the default degree is 3). A very interesting element is the performance of the decision tree, which is the worst of the set (and even lower with Gini impurity). Even if the term is not strictly correct, we can regard this model as the weakest of the group, which makes it a perfect candidate for our bagging test. We can now fit a random forest by instantiating the RandomForestClassifier class and selecting n_estimators=50 (I invite the reader to try different values):

from multiprocessing import cpu_count

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=50, n_jobs=cpu_count(), random_state=1000)
print(np.mean(cross_val_score(rf, X, Y, cv=10)))
0.983333333333

As expected, the average cross-validation accuracy is the highest, about 98.3%. The random forest has therefore successfully found a global configuration of decision trees that specializes them in almost every region of the sample space. The parameter n_jobs=cpu_count() tells Scikit-Learn to parallelize the training process using all the CPU cores available on the machine.

To better understand the dynamics of this model, it's useful to plot the cross-validation accuracy as a function of the number of trees:

Cross-validation accuracy of a random forest as a function of the number of trees

It's not surprising to observe some oscillations and a plateau when the number of trees grows beyond about 320. The randomness involved can cause a performance loss even when the number of learners is increased: even if the training accuracy grows, the validation accuracy on different folds can be affected by over-specialization. Moreover, in this case, it's very interesting to notice that the top accuracy is achievable with 50 trees instead of 400 or more. For this reason, I always suggest performing at least a grid search, in order not only to achieve the best accuracy but also to minimize the complexity of the model.
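As a rough sketch (the grid of forest sizes and the plotting details are illustrative choices, not necessarily those used for the figure above), such a curve can be produced as follows:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, Y = load_wine(return_X_y=True)

# Illustrative grid of forest sizes
tree_counts = np.arange(10, 410, 10)
scores = []

for n in tree_counts:
    rf_n = RandomForestClassifier(n_estimators=int(n), n_jobs=-1, random_state=1000)
    # Average 10-fold cross-validation accuracy for this forest size
    scores.append(np.mean(cross_val_score(rf_n, X, Y, cv=10)))

plt.plot(tree_counts, scores)
plt.xlabel('Number of trees')
plt.ylabel('Average 10-fold CV accuracy')
plt.show()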

Another important element to consider when working with decision trees and random forests is feature importance (also called Gini importance when this criterion is chosen), which is a measure proportional to the impurity reduction that a particular feature allows us to achieve. For a decision tree, it is defined as follows:
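A standard way of writing this definition (assuming N is the total number of training samples and that the sum runs over all the nodes j where feature i is used for splitting) is:

\[ Importance(i) = \sum_{j} \frac{n(j)}{N}\,\Delta I_{i}(j) \]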

In the previous formula, n(j) denotes the number of samples reaching node j (the sum extends over all the nodes where feature i is chosen for splitting), and ΔI_i(j) is the impurity reduction achieved at node j after splitting using feature i. In a random forest, the importance must be computed by averaging over all the trees:
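Assuming the forest contains M trees indexed by t, this average can be written as:

\[ Importance_{RF}(i) = \frac{1}{M} \sum_{t=1}^{M} Importance_{t}(i) \]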

After fitting a model (decision tree or random forest), Scikit-Learn exposes the feature importance vector through the feature_importances_ instance variable. The following graph shows the importance of each feature (the labels can be obtained with the command load_wine()['feature_names']) in descending order:

Feature importances for Wine dataset
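As a reference, a minimal sketch for printing the features in decreasing order of importance is shown below (the forest is fitted explicitly on the whole dataset here, because cross_val_score only trains internal clones and never fits the original rf instance):

import numpy as np

from sklearn.datasets import load_wine

# rf is the RandomForestClassifier instantiated above; it must be fitted
# explicitly before feature_importances_ becomes available
rf.fit(X, Y)

feature_names = load_wine()['feature_names']

# Print the features sorted by decreasing importance
for idx in np.argsort(rf.feature_importances_)[::-1]:
    print('{}: {:.3f}'.format(feature_names[idx], rf.feature_importances_[idx]))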

We don't want to analyze the chemical meaning of each element, but it's clear that, for example, proline and color intensity are much more important than non-flavonoid phenols. As the model works with features that are semantically independent (this is not the case, for example, for the pixels of an image), it's possible to reduce the dimensionality of a dataset by removing all the features whose importance doesn't have a high impact on the final accuracy. This process, called feature selection, should be performed using more complex statistical techniques, such as the chi-squared test, but when a classifier is able to produce an importance index, it's also possible to use the Scikit-Learn class SelectFromModel. By passing an estimator (fitted or not) and a threshold, it's possible to transform the dataset, filtering out all the features whose importance is below the threshold. Applying it to our model and setting a minimum importance equal to 0.02, we get the following:

from sklearn.feature_selection import SelectFromModel

# With prefit=True, the estimator must already be fitted explicitly
rf.fit(X, Y)

sfm = SelectFromModel(estimator=rf, prefit=True, threshold=0.02)
X_sfm = sfm.transform(X)

print(X_sfm.shape)
(178, 10)

The new dataset now contains 10 features instead of the 13 of the original Wine dataset (for example, it's easy to verify that ash and non-flavonoid phenols have been removed). Of course, as for any other dimensionality reduction method, it's always advisable to verify the final accuracy with a cross-validation and make a decision only if the trade-off between loss of accuracy and complexity reduction is reasonable.
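For instance, a quick check along these lines (reusing the rf, X_sfm, and Y objects defined above) re-evaluates the forest on the reduced dataset, so that the score can be compared with the one obtained on the full feature set:

import numpy as np

from sklearn.model_selection import cross_val_score

# Average 10-fold cross-validation accuracy on the reduced dataset
print(np.mean(cross_val_score(rf, X_sfm, Y, cv=10)))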
