Ensembling decision trees – random forest

The ensemble technique bagging (which stands for bootstrap aggregating), which we briefly mentioned in Chapter 1, Getting Started with Machine Learning and Python, can effectively overcome overfitting. To recap, different sets of training samples are randomly drawn with replacement from the original training data; each resulting set is used to fit an individual classification model. The results of these separately trained models are then combined through a majority vote to make the final decision.
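To make this concrete, here is a minimal sketch of tree bagging using scikit-learn's BaggingClassifier wrapped around a decision tree; the parameter values are illustrative assumptions rather than the settings used in this project, and fitting on X_train and y_train (used later in this section) is just for illustration:

>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> # Each of the 100 trees is fitted on a bootstrap sample drawn with replacement;
>>> # predictions are then aggregated across trees to make the final decision
>>> bagged_trees = BaggingClassifier(DecisionTreeClassifier(min_samples_split=30),
...                                  n_estimators=100, n_jobs=-1)
>>> bagged_trees.fit(X_train, y_train)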

Tree bagging, described above, reduces the high variance that a decision tree model suffers from and hence, in general, performs better than a single tree. However, in some cases where one or more features are strong indicators, individual trees are constructed largely based on these features and, as a result, become highly correlated. Aggregating multiple correlated trees will not make much of a difference. To force each tree to become uncorrelated, a random forest only considers a random subset of the features when searching for the best splitting point at each node. Individual trees are now trained on different sets of features, which guarantees more diversity and better performance. Random forest is a variant of the tree bagging model with additional feature-based bagging.
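Conceptually, the feature-based bagging performed at each node can be pictured with a quick sketch; the feature count below is an arbitrary assumption and this snippet is not part of the project code:

>>> import numpy as np
>>> # At each node split, a random forest searches only over a random subset of
>>> # features, commonly of size sqrt(m) for an m-dimensional dataset
>>> m = 100                                      # illustrative total feature count
>>> rng = np.random.default_rng(0)
>>> candidate_features = rng.choice(m, size=int(np.sqrt(m)), replace=False)
>>> print(len(candidate_features))
10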

To employ random forest in our click-through prediction project, we use the RandomForestClassifier class from scikit-learn's ensemble module. Similar to the way we implemented the decision tree in the preceding section, we only tweak the max_depth parameter:

>>> from sklearn.ensemble import RandomForestClassifier
>>> random_forest = RandomForestClassifier(n_estimators=100, criterion='gini',
...                                        min_samples_split=30, n_jobs=-1)

Besides max_depth, min_samples_split, and class_weight, which are important hyperparameters related to a single decision tree, it is also highly recommended that we tune hyperparameters related to a random forest (a set of trees), such as n_estimators.
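The parameters grid passed to GridSearchCV is not shown in this excerpt; the following is a minimal sketch, assuming we search over max_depth values as in the preceding section (the exact candidate values are an assumption):

>>> from sklearn.model_selection import GridSearchCV
>>> # Hypothetical search grid -- the original parameters dict is not reproduced here
>>> parameters = {'max_depth': [3, 10, None]}

We then run a cross-validated grid search with ROC AUC as the scoring metric: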

>>> grid_search = GridSearchCV(random_forest, parameters,
...                            n_jobs=-1, cv=3, scoring='roc_auc')
>>> grid_search.fit(X_train, y_train)
>>> print(grid_search.best_params_)
{'max_depth': None}

Use the model with the optimal parameter None for max_depth (nodes are expanded until all leaves are pure or another stopping criterion, such as min_samples_split, is met) to predict future unseen cases:

>>> random_forest_best = grid_search.best_estimator_
>>> pos_prob = random_forest_best.predict_proba(X_test)[:, 1]

>>> print('The ROC AUC on testing set is: {0:.3f}'.format(
...     roc_auc_score(y_test, pos_prob)))
The ROC AUC on testing set is: 0.759

It turns out that the random forest model gives a substantial lift in performance.

Let's summarize several critical hyperparameters to tune in random forest:

  • max_depth: This is the maximum depth of an individual tree. A tree tends to overfit if it grows too deep, or to underfit if it is too shallow.
  • min_samples_split: This hyperparameter represents the minimum number of samples required for further splitting at a node. Too small a value tends to cause overfitting, while too large a value is likely to introduce underfitting. 10, 30, and 50 might be good options to start with.

The preceding two hyperparameters are generally related to individual decision trees. The following two parameters are more related to a random forest, a collection of trees:

  • max_features: This hyperparameter represents the number of features considered at each best-splitting-point search. Typically, for an m-dimensional dataset, √m (rounded) is a recommended value for max_features. This can be specified as max_features="sqrt" in scikit-learn. Other options include log2(m), and 20% to 50% of the original features.
  • n_estimators: This hyperparameter represents the number of trees considered for majority voting. Generally speaking, the more trees, the better the performance, but the longer the computation time. It is usually set to 100, 200, 500, and so on. A combined search over all of these hyperparameters is sketched after this list.
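The following is a hypothetical combined grid over both the tree-level and the forest-level hyperparameters; the candidate values are illustrative assumptions rather than the settings used in this project, and searching over all of them at once can be computationally expensive:

>>> # Hypothetical combined grid -- the value lists below are illustrative assumptions
>>> parameters = {'max_depth': [10, 20, None],
...               'min_samples_split': [10, 30, 50],
...               'max_features': ['sqrt', 'log2', 0.2],
...               'n_estimators': [100, 200, 500]}
>>> grid_search = GridSearchCV(RandomForestClassifier(n_jobs=-1),
...                            parameters, n_jobs=-1, cv=3, scoring='roc_auc')
>>> grid_search.fit(X_train, y_train)
>>> print(grid_search.best_params_)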