How to train and tune a random forest

The key configuration parameters include the various hyperparameters of the individual decision trees introduced in the section How to tune the hyperparameters. The following table lists additional options for the two RandomForest classes, RandomForestClassifier and RandomForestRegressor:

Keyword        Default   Description
bootstrap      True      Bootstrap samples during training.
n_estimators   10        Number of trees in the forest.
oob_score      False     Uses out-of-bag samples to estimate the generalization error (accuracy for the classifier, R2 for the regressor) on unseen data.

The bootstrap parameter activates the bagging algorithm outlined in the preceding section, which in turn enables the computation of the out-of-bag score (oob_score) that estimates the generalization error using the samples not included in the bootstrap sample used to train a given tree (see the next section for details).
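The following minimal sketch shows how bootstrap and oob_score work together; the synthetic make_classification data is purely illustrative, and the oob_score_ attribute holds the resulting accuracy estimate:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# illustrative synthetic data, not from the book's dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
rf = RandomForestClassifier(n_estimators=100,
                            bootstrap=True,   # required for OOB estimation
                            oob_score=True,   # score trees on their out-of-bag samples
                            random_state=42).fit(X, y)
print(f'OOB accuracy estimate: {rf.oob_score_:.3f}')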

The n_estimators parameter defines the number of trees to be grown as part of the forest. Larger forests perform better, but also take more time to build. It is important to monitor the cross-validation error as a function of the number of base learners to identify when the marginal reduction of the prediction error declines and the cost of additional training begins to outweigh the benefits.
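One way to monitor this trade-off without refitting from scratch is to grow the forest incrementally with warm_start and track the out-of-bag error as trees are added; the following sketch (again on hypothetical make_classification data) illustrates the idea:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
rf = RandomForestClassifier(warm_start=True,   # reuse existing trees, add new ones
                            bootstrap=True,
                            oob_score=True,
                            random_state=42)
for n in [50, 100, 200, 400, 800]:
    rf.set_params(n_estimators=n)
    rf.fit(X, y)  # only the additional trees are fitted
    print(f'{n:>4} trees | OOB error: {1 - rf.oob_score_:.4f}')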

The max_features parameter controls the size of the randomly selected feature subsets available when learning a new decision rule to split a node. A lower value reduces the correlation of the trees and, thus, the ensemble's variance, but may also increase the bias. Good starting values are n_features (the number of training features) for regression problems and sqrt(n_features) for classification problems, but the best value depends on the relationships among features and should be optimized using cross-validation.
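A simple way to run this comparison is to score a few candidate settings with cross_val_score; the sketch below assumes the same synthetic make_classification data as in the earlier examples:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
for mf in ['sqrt', 'log2', None]:  # None makes all features available at each split
    scores = cross_val_score(RandomForestClassifier(n_estimators=200,
                                                    max_features=mf,
                                                    random_state=42),
                             X, y, cv=5, scoring='roc_auc')
    print(f'max_features={mf}: mean AUC {scores.mean():.4f}')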

Random forests are designed to contain deep, fully-grown trees, which can be created using max_depth=None and min_samples_split=2. However, these values are not necessarily optimal, especially for high-dimensional data with many samples and, consequently, potentially very deep trees that can become very computationally and memory intensive.
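To get a sense of how deep fully-grown trees actually become on a given dataset, you can inspect the fitted base learners; this sketch again uses the illustrative synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
rf = RandomForestClassifier(n_estimators=100,
                            max_depth=None,        # grow until leaves are pure...
                            min_samples_split=2,   # ...or contain too few samples to split
                            random_state=42).fit(X, y)
depths = [tree.get_depth() for tree in rf.estimators_]
print(f'tree depths: min={min(depths)}, max={max(depths)}')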

The RandomForest classes provided by sklearn support parallel training and prediction by setting the n_jobs parameter to the number k of jobs to run on different cores. The value -1 uses all available cores. The overhead of interprocess communication may prevent the speedup from being linear, so that k jobs may take more than 1/k of the time of a single job. Nonetheless, the speedup is often quite significant for large forests or deep individual trees that may take a meaningful amount of time to train when the data is large and split evaluation becomes costly.
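A rough way to measure the actual speedup on your machine is to time the same fit under different n_jobs settings, as in this sketch on hypothetical synthetic data:

from time import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10000, n_features=50, random_state=42)
for n_jobs in [1, -1]:  # -1 uses all available cores
    start = time()
    RandomForestClassifier(n_estimators=200,
                           n_jobs=n_jobs,
                           random_state=42).fit(X, y)
    print(f'n_jobs={n_jobs:>2}: {time() - start:.1f}s')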

As always, the best parameter configuration should be identified using cross-validation. The following steps illustrate the process:

  1. We will use GridSearchCV to identify an optimal set of parameters for an ensemble of classification trees:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=10,
                                criterion='gini',
                                max_depth=None,
                                min_samples_split=2,
                                min_samples_leaf=1,
                                min_weight_fraction_leaf=0.0,
                                max_features='auto',
                                max_leaf_nodes=None,
                                min_impurity_decrease=0.0,
                                min_impurity_split=None,
                                bootstrap=True,
                                oob_score=False,
                                n_jobs=-1,
                                random_state=42)
  2. We will use custom 10-fold cross-validation based on the OneStepTimeSeriesSplit class and populate the parameter grid with values for the key configuration settings (a sketch of such a splitter follows this step's code):
cv = OneStepTimeSeriesSplit(n_splits=10)
clf = RandomForestClassifier(random_state=42, n_jobs=-1)
param_grid = {'n_estimators': [200, 400],
              'max_depth': [10, 15, 20],
              'min_samples_leaf': [50, 100]}
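OneStepTimeSeriesSplit is a custom splitter defined elsewhere in this book; since GridSearchCV accepts any object that implements split and get_n_splits, the following is only a hypothetical sketch of what such an expanding-window, one-step-ahead splitter could look like, assuming the rows are in temporal order:

import numpy as np

class OneStepTimeSeriesSplit:
    """Hypothetical sketch: n_splits consecutive test folds, each predicted
    using only the observations that precede it (expanding window)."""
    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        fold_size = n_samples // (self.n_splits + 1)
        indices = np.arange(n_samples)
        for i in range(1, self.n_splits + 1):
            train_end = i * fold_size
            # train on everything before the fold, test on the fold itself
            yield indices[:train_end], indices[train_end:train_end + fold_size]

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits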
  3. Configure GridSearchCV using the preceding input:
from sklearn.model_selection import GridSearchCV

gridsearch_clf = GridSearchCV(estimator=clf,
                              param_grid=param_grid,
                              scoring='roc_auc',
                              n_jobs=-1,
                              cv=cv,
                              refit=True,
                              return_train_score=True,
                              verbose=1)
  4. Train the multiple ensemble models defined by the parameter grid:
gridsearch_clf.fit(X=X, y=y_binary)
  5. Obtain the best parameters as follows:
gridsearch_clf.best_params_
{'max_depth': 15, 'min_samples_leaf': 100, 'n_estimators': 400}
  6. The best score is a small but significant improvement over the single-tree baseline:
gridsearch_clf.best_score_
0.6013
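Since we set return_train_score=True, we can also compare train and validation scores across the grid to gauge overfitting; a minimal sketch using pandas:

import pandas as pd

# cv_results_ holds per-candidate metrics; the train scores are available
# because return_train_score=True was set above
results = pd.DataFrame(gridsearch_clf.cv_results_)
print(results[['params', 'mean_train_score', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())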