Overfitting

Overfitting is arguably one of the most treacherous risks you face when building predictive models. It occurs when performance on the training data looks great (perhaps too good), but performance on the test data is much worse. We will have more to say about diagnosing overfitting in the next chapter, but we need to discuss ways of avoiding it now. In earlier chapters, we saw how to reclassify variables with a large number of categories into variables with fewer categories. Inputs with many categories are dangerous because, despite large sample sizes, there may be only a handful of cases with a rare combination of traits. Let's explore just two examples of this phenomenon in the dataset.

There are 622 PhDs in the Train dataset. Of that number, the most common occupation is Professional Specialty, with 396 PhDs in that specialty. A rare occupation within this group is Transportation and Moving Materials: there are just two such cases, neither of whom earns more than $50,000. In contrast, in the Test data, 50% of this combination meet that threshold. Importantly, there are also only two of them! Watch out for rare categories with 100% or 0% splits. They distract the algorithm: the tree will want to split at those extremes, and when the model is applied to new data, that part of the tree will fail.
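
You can surface these hazardous cells yourself before modeling. The following is a minimal pandas sketch, not part of the Modeler stream; the file names and the column names education, occupation, and income_category (with a ">50K" value) are assumptions about how the data is laid out:

```python
import pandas as pd

# Assumed file names; substitute your own partitioned files.
train = pd.read_csv("census_train.csv")
test = pd.read_csv("census_test.csv")

def cell_summary(df):
    """Count and higher-income rate for each education x occupation cell."""
    grouped = df.groupby(["education", "occupation"])["income_category"]
    return pd.DataFrame({
        "n": grouped.size(),
        # ">50K" is an assumed label for the higher income category.
        "rate_over_50k": grouped.apply(lambda s: (s == ">50K").mean()),
    })

summary = cell_summary(train).join(
    cell_summary(test), lsuffix="_train", rsuffix="_test"
)

# Flag sparse cells whose Train split is a perfect 0% or 100% --
# exactly the rare combinations that invite overfitting.
suspicious = summary[
    (summary["n_train"] <= 10)
    & summary["rate_over_50k_train"].isin([0.0, 1.0])
]
print(suspicious)
```

Run against the census data described here, a report like this would list the two Transportation and Moving Materials PhDs among the flagged cells.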

Let's consider just one more rare combination. Professional degrees are not common in the dataset, but they are not rare either: there are 870 of them in the Train dataset. However, only 8 of those 870 have Precision Craft and Repair as their occupation. Twenty-five percent of those eight are in the higher income target category; in the Test data, the percentage is 75%.

What may be tricky to understand at first is that we are not especially interested in the Test data, per se. We are interested in the danger to which it alerts us. We are not going to deploy our model on the Test data. We are going to deploy our model on new data that we don't have yet, and that possibly does not exist: data from the next year, and the years after that. The Test dataset is acting like a canary in a coal mine, warning us that we are about to overfit the model. This kind of detailed analysis of each and every rare combination would be a very clumsy way to detect overfitting; we will learn a more efficient approach in the next chapter using the Analysis node. For now, the important lesson is to watch out for data that might be too granular and might hide small sample sizes in rare combinations of variables:

  1. Place the CHAID node from the Modeling palette on the stream canvas.
  2. Connect the Partition node to the CHAID node.

The name of the CHAID node should immediately change to Income_category, since this is the target.
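
Modeler's CHAID node is configured through the GUI, so there is no code to show for it directly. For readers who want a scripted analogue, the sketch below uses scikit-learn's CART-style DecisionTreeClassifier rather than CHAID, with assumed file and column names; the point is that a minimum leaf size plays the same protective role against splits on rare combinations that we have been discussing:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("census_train.csv")                  # assumed file name
X = pd.get_dummies(df[["education", "occupation"]])   # assumed inputs
y = df["income_category"]                             # assumed target

# Mimic the Partition node: hold out data the model never trains on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

# min_samples_leaf forbids splits that isolate only a handful of cases,
# guarding against the rare-combination hazard. (CART here, not CHAID.)
tree = DecisionTreeClassifier(min_samples_leaf=30, random_state=42)
tree.fit(X_train, y_train)

print("Train accuracy:", tree.score(X_train, y_train))
print("Test accuracy: ", tree.score(X_test, y_test))
```

A large gap between the two accuracy figures is the same warning the Test partition gives us in Modeler: the tree has memorized quirks of the training data.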
