If you continue to grow a tree until all leaves are pure, you will typically arrive at a tree that is too complex to interpret. The presence of pure leaves means that the tree is 100 percent correct on the training data. As a result, the tree is likely to perform very poorly on the test dataset, as was the case with our tree shown earlier. We say the tree overfits the training data.
There are two common ways to avoid overfitting:
- Pre-pruning: This is the process of stopping the creation of the tree early.
- Post-pruning (or just pruning): This is the process of first building the tree but then removing or collapsing nodes that contain only a little information.
There are several ways to pre-prune a tree, all of which can be achieved by passing optional arguments to the DecisionTreeClassifier constructor:
- Limiting the maximum depth of the tree via the max_depth parameter
- Limiting the maximum number of leaf nodes via max_leaf_nodes
- Requiring a minimum number of points in a node to keep splitting it via min_samples_split
Often pre-pruning is sufficient to control overfitting.
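A minimal sketch of pre-pruning in practice, using a synthetic two-class dataset (`make_moons`, standing in for the toy dataset mentioned above) to compare an unpruned tree against one limited by `max_depth`:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy dataset (assumption: stands in for the dataset in the text)
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Unpruned tree: grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruned tree: stop splitting once depth 4 is reached
pruned_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(
    X_train, y_train)

# The unpruned tree is perfect on the training data (pure leaves)...
print("Unpruned training accuracy:", full_tree.score(X_train, y_train))
# ...while the pruned tree trades a little training accuracy for
# better generalization on held-out data
print("Pruned test accuracy:", pruned_tree.score(X_test, y_test))
```

The same pattern applies to `max_leaf_nodes` and `min_samples_split`; each caps tree growth along a different axis, and the best setting for a given dataset is usually found by trying a few values.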
Try it out on our toy dataset! Can you improve the score on the test set? How does the tree layout change as you start playing with the parameters described above?