Using a random forest model

Decision trees, introduced in Chapter 13, Training a Machine Learning Model, and which we have been using so far, are fast and easy to interpret. Their weak point, however, is overfitting: many features might seem to be great predictors on the training dataset, but turn out to mislead the model on external data. In other words, they don't represent the general population. The problem is that decision trees don't have any internal mechanism to detect and ignore such features.

A suite of more sophisticated models was developed on top of decision trees to fight overfitting. These models are usually called tree ensembles, as all of them train multiple decision trees and aggregate their predictions. There are a few models in this family, namely Adaptive Boosting, Extra-Trees, and random forest. The last one is, arguably, the simplest of them all to understand: a random forest is essentially a flat collection of decision trees, each of which is trained on a subset of the records and a subset of the features of the given training data. The predictions are then aggregated as a majority vote (for classification) or an average (for regression). Because each tree gets its own subset of features and records, all of the trees are trained differently. The sheer number and diversity of the trees make the overall model robust and tolerant to overfitting, while at the same time enabling it to capture more nuanced dependencies.
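To make the mechanism concrete, here is a minimal, illustrative sketch of the idea in plain Python (the toy_forest and toy_predict names are made up for this example; it assumes NumPy arrays and non-negative integer class labels):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def toy_forest(X, y, n_trees=100, seed=2019):
    # Each tree sees a bootstrap sample of the rows and a random
    # subset of the columns, so every tree is trained differently
    rs = np.random.RandomState(seed)
    n_rows, n_cols = X.shape
    k = max(1, int(np.sqrt(n_cols)))  # sqrt(n_features) is a common default
    trees = []
    for _ in range(n_trees):
        rows = rs.choice(n_rows, size=n_rows, replace=True)
        cols = rs.choice(n_cols, size=k, replace=False)
        tree = DecisionTreeClassifier(random_state=seed)
        trees.append((tree.fit(X[rows][:, cols], y[rows]), cols))
    return trees

def toy_predict(trees, X):
    # Aggregate the individual predictions as a majority vote
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in trees])
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

Note that this is a simplification: sklearn's RandomForestClassifier actually draws a random feature subset at every split of every tree (controlled by the max_features parameter), rather than once per tree.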

Let's see if we can squeeze more performance out of our WWII data using the random forest model! Luckily, all sklearn models share the same interface, so we won't need to change much in the code. Let's run a model with the default parameters first:

  1. In the following code, we initialize the model with the default values and a fixed random state, and run cross-validation on it:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

rf = RandomForestClassifier(random_state=2019, n_estimators=10)

scores = cross_val_score(rf,
                         data2[cols2],
                         data2['result_num'],
                         cv=4)

>>> np.mean(scores)
0.5346861471861473

As you can see, it didn't outperform the decision tree just yet.

  2. Let's now tune the hyperparameters. As all of the parameters are the same, except for the number of estimators, we can reuse our old parameter distribution, as follows (a stand-in sketch of param_dist itself is shown after the code):
from sklearn.model_selection import RandomizedSearchCV

param_dist['n_estimators'] = sp_randint(50, 2000)

rs2 = RandomizedSearchCV(
    rf,
    param_distributions=param_dist,
    cv=4, iid=False,  # iid was removed in newer sklearn versions; drop it there
    random_state=2019,
    n_iter=50
)
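Note that param_dist itself was defined back when we tuned the decision tree. If you are running this section in isolation, a stand-in along the following lines would work; the specific keys and ranges here are assumptions for illustration, not the exact values from the earlier chapter:

from scipy.stats import randint as sp_randint

param_dist = {
    'max_depth': sp_randint(2, 20),         # assumed range
    'min_samples_leaf': sp_randint(1, 20),  # assumed range
}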
  3. Now, let's run the search and check the results:
rs2.fit(data2[cols2], data2['result_num'])

>>> rs2.best_score_
0.5812229437229437
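It is also worth peeking at which combination of parameters won. The best_params_ and best_estimator_ attributes are part of the standard RandomizedSearchCV interface (by default, the search refits the best model on the full dataset):

rs2.best_params_                # the winning combination of hyperparameters
best_rf = rs2.best_estimator_   # the forest refitted on the full dataset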

Indeed, now the model outperformed decision trees by ~2%. Great!

In this section, we replaced the decision tree model we were using before with another, more complex algorithm: random forest. This swap, together with a round of hyperparameter optimization, allowed us to boost the accuracy of the prediction by about 2%. Random forest is far less prone to overfitting, can learn more complex dependencies, and generally performs better than a single decision tree. There is no free lunch, though; a random forest is too complex to be directly interpretable (although there are tools, such as SHAP (https://github.com/slundberg/shap), that can help with that) and takes longer to train and run.
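As a quick illustration of that interpretability tooling, here is a minimal sketch, assuming the shap package is installed and reusing the fitted forest from the search above:

import shap

# Impurity-based importances come built into the forest, for a cheap first look
print(rs2.best_estimator_.feature_importances_)

# SHAP's TreeExplainer supports tree ensembles such as random forest
explainer = shap.TreeExplainer(rs2.best_estimator_)
shap_values = explainer.shap_values(data2[cols2])
shap.summary_plot(shap_values, data2[cols2])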

As you can see by now, there are plenty of directions to explore and experiment with. With all of this feature engineering, different models, hyperparameter optimization, and whatnot, it is easy to lose track of your work. While Git and GitHub are great for versioning code, they are not as useful for experimentation: you can't store your data, features, models, and metrics there. To help track your progress and control different versions of your data and models, let's introduce DVC.
