Out-of-bag testing

Random forests offer a built-in form of cross-validation because each tree is trained on a bootstrapped version of the training data. As a result, each tree uses, on average, only about two-thirds of the available observations. To see why, note that a bootstrap sample has the same size, n, as the original sample, and each draw selects any given observation with probability 1/n. Hence, the probability that an observation never enters a particular bootstrap sample is (1 - 1/n)^n, which converges quickly to 1/e, or roughly one-third.
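To make the convergence concrete, the following minimal sketch (assuming NumPy) evaluates (1 - 1/n)^n for increasing sample sizes and compares it with the limit 1/e:

    import numpy as np

    # Probability that a given observation never enters a bootstrap
    # sample of size n, i.e. (1 - 1/n) ** n, versus the limit 1/e.
    for n in (10, 100, 1_000, 10_000):
        p_oob = (1 - 1 / n) ** n
        print(f"n = {n:>6}: P(out-of-bag) = {p_oob:.4f}")
    print(f"limit 1/e:   {1 / np.e:.4f}")

Already at n = 100, the probability is 0.3660, very close to the limit of roughly 0.3679.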

The remaining roughly one-third of observations that are not included in the training set used to grow a given bagged tree are called out-of-bag (OOB) observations, and they can serve as a validation set. Just as with cross-validation, we predict the response for each observation using only the trees that were built without it, and then average those predictions (if regression is the goal) or take a majority vote or average predicted probability (if classification is the goal) to obtain a single ensemble prediction per OOB observation. These predictions produce an unbiased estimate of the generalization error, conveniently computed during training.
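This bookkeeping can be sketched directly for bagged regression trees. The snippet below is a minimal illustration on synthetic data (assuming scikit-learn; it bags plain decision trees rather than growing a full random forest with feature subsampling):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=500, n_features=10, random_state=0)
    n, n_trees = len(X), 100
    rng = np.random.default_rng(0)

    oob_sum = np.zeros(n)    # running sum of OOB predictions per observation
    oob_count = np.zeros(n)  # number of trees for which each observation was OOB

    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)         # bootstrap sample, drawn with replacement
        oob_mask = ~np.isin(np.arange(n), idx)   # observations left out of this sample
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        oob_sum[oob_mask] += tree.predict(X[oob_mask])
        oob_count[oob_mask] += 1

    valid = oob_count > 0                         # OOB for at least one tree
    oob_pred = oob_sum[valid] / oob_count[valid]  # average over trees that missed them
    oob_mse = np.mean((y[valid] - oob_pred) ** 2)
    print(f"OOB MSE: {oob_mse:.2f}")

Each observation's OOB prediction averages only the trees whose bootstrap sample excluded it, which is exactly what makes the resulting error an honest validation estimate.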

The resulting OOB error is a valid estimate of the generalization error for each observation because the prediction is produced using only decision rules learned in its absence. Once the random forest contains sufficiently many trees, the OOB error closely approximates the leave-one-out cross-validation error. This makes the OOB approach to estimating the test error very efficient for large datasets, where cross-validation can be computationally costly.
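In scikit-learn, this estimate is available by passing oob_score=True, which exposes the OOB accuracy as the oob_score_ attribute after fitting. The comparison with 5-fold cross-validation below is a minimal sketch on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

    # oob_score=True computes the OOB accuracy during fitting, at little
    # extra cost beyond growing the forest itself (one fit).
    rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
    rf.fit(X, y)
    print(f"OOB accuracy:       {rf.oob_score_:.3f}")

    # For comparison, 5-fold cross-validation refits the forest five times.
    cv_scores = cross_val_score(
        RandomForestClassifier(n_estimators=500, random_state=0), X, y, cv=5)
    print(f"5-fold CV accuracy: {cv_scores.mean():.3f}")

The two estimates are typically close, but the OOB score requires fitting the forest only once, which is where the efficiency advantage on large datasets comes from.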
