If you continue to grow a tree until all leaves are pure, you will typically arrive at a tree that is too complex to interpret. The presence of pure leaves means that the tree is 100 percent correct on the training data. As a result, the tree is likely to perform very poorly on the test dataset, as was the case with our tree shown earlier. We say the tree overfits the training data.
There are two common ways to avoid overfitting:
- Pre-pruning: This is the process of stopping the creation of the tree early.
- Post-pruning (or just pruning): This is the process of first building the tree but then removing or collapsing nodes that contain only a little information.
There are several ways to pre-prune a tree, all of which can be achieved by passing optional arguments to the DecisionTreeClassifier constructor:
- Limiting the maximum depth of the tree via the max_depth parameter
- Limiting the maximum number of leaf nodes via max_leaf_nodes
- Requiring a minimum number of points in a node to keep splitting it via min_samples_split
Often pre-pruning is sufficient to control overfitting.
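A minimal sketch of pre-pruning in practice, using a synthetic two-class dataset (`make_moons`, standing in for the toy dataset mentioned above) to compare an unpruned tree against one limited by `max_depth`:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy dataset (assumption: stands in for the dataset in the text)
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Unpruned tree: grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruned tree: stop splitting once depth 4 is reached
pruned_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(
    X_train, y_train)

# The unpruned tree is perfect on the training data (pure leaves)...
print("Unpruned training accuracy:", full_tree.score(X_train, y_train))
# ...while the pruned tree trades a little training accuracy for
# better generalization on held-out data
print("Pruned test accuracy:", pruned_tree.score(X_test, y_test))
```

The same pattern applies to `max_leaf_nodes` and `min_samples_split`; each caps tree growth along a different axis, and the best setting for a given dataset is usually found by trying a few values.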
Try it out on our toy dataset! Can you improve the score on the test set? How does the tree layout change as you start playing with the parameters described above?