Decision tree pruning

Recursive binary splitting will likely produce good predictions on the training set, but it tends to overfit the data and generalize poorly because it leads to overly complex trees, reflected in a large number of leaf nodes, or equivalently, a fine-grained partitioning of the feature space. Fewer splits and leaf nodes imply a smaller tree overall and often lead to better predictive performance as well as better interpretability.

One approach to limit the number of leaf nodes is to avoid further splits unless they yield a significant improvement in the objective metric. The downside of this strategy, however, is that splits producing only a small improvement sometimes enable more valuable splits later on, as the composition of the samples in each node keeps changing.
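As a minimal sketch of this early-stopping idea, scikit-learn's tree estimators expose such stopping criteria as hyperparameters; the dataset and threshold values below are illustrative assumptions, not part of the text:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # illustrative toy dataset

# Stop splitting early instead of growing the full tree:
# - max_leaf_nodes caps the number of leaves directly
# - min_impurity_decrease skips splits whose improvement falls below a threshold
shallow_tree = DecisionTreeClassifier(max_leaf_nodes=20,
                                      min_impurity_decrease=0.001,
                                      random_state=0)
shallow_tree.fit(X, y)
print(shallow_tree.get_n_leaves())
```

The specific values 20 and 0.001 are arbitrary; in practice they would themselves be tuned by cross-validation.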

Tree pruning, in contrast, starts by growing a very large tree and then removes, or prunes, nodes to reduce it to a smaller, less overfit subtree. Cost-complexity pruning generates a sequence of subtrees by adding to the tree's objective a penalty for each leaf node, modulated by a regularization parameter similar to the one used in the lasso and ridge linear-regression models. Applied to the large tree, an increasing penalty automatically produces a sequence of nested subtrees. Cross-validation of the regularization parameter can then be used to identify the optimal pruned subtree.
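To make the criterion concrete, here is a sketch of the cost-complexity objective for the regression case, using standard notation that is not spelled out in the text: $T$ is a subtree of the full tree $T_0$, $|T|$ its number of leaves, $R_m$ the region belonging to leaf $m$, $\hat{y}_{R_m}$ the mean response in that region, and $\alpha \ge 0$ the regularization parameter. For each value of $\alpha$, pruning selects the subtree that minimizes

$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left(y_i - \hat{y}_{R_m}\right)^2 \;+\; \alpha\,|T|.$$

As $\alpha$ increases from zero, the minimizing subtrees form a nested sequence, which is why a single pass over the large tree yields the entire pruning path.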

Earlier versions of sklearn did not support pruning directly; since version 0.22, cost-complexity pruning is available through the ccp_alpha parameter and the cost_complexity_pruning_path method of the tree estimators. See the references on GitHub for further details and ways to implement pruning manually with older versions.
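A minimal sketch of cross-validating the pruning parameter, assuming scikit-learn 0.22 or later and using a toy dataset purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # illustrative toy dataset

# Grow the large tree and compute the effective alphas of its pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas = np.clip(path.ccp_alphas, 0, None)  # guard against tiny negative values

# Cross-validate the regularization parameter to pick the pruned subtree
cv = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                  param_grid={'ccp_alpha': list(ccp_alphas)},
                  cv=5)
cv.fit(X, y)
print(f"best alpha: {cv.best_params_['ccp_alpha']:.5f}, "
      f"leaves in pruned tree: {cv.best_estimator_.get_n_leaves()}")
```

Each candidate alpha corresponds to one subtree in the pruning sequence, so searching over the path's alphas is equivalent to cross-validating the subtrees themselves.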
