The problem of overfitting data – the consequences explained

A common issue in machine learning is the problem of overfitting data. Overfitting generally refers to the phenomenon where a model performs better on the data used to train it than on data it was not trained on (holdout data, future real-world use, and so on). Overfitting occurs when a model fits what is essentially noise in the training data. The model appears to become more accurate as it accounts for the noise, but because the noise changes from one dataset to the next, this accuracy applies only to the training data; it does not generalize.

Overfitting can occur at any time, but it tends to become more severe as the ratio of parameters to information increases. Usually, this can be thought of as the ratio of parameters to observations, but not always. For example, suppose the outcome is a rare event that occurs in 1 in 5 million people; a sample of 15 million people may still contain only 3 who experience the event, and it would not support a complex model at all, because the information is low even though the sample size is large. To consider a simple but extreme case, imagine fitting a straight line to two data points. The fit will be perfect, and on those two training points the linear regression model will appear to have fully accounted for all variation in the data. However, if we then applied that line to another 1,000 cases, we should not expect it to fit very well at all.
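
To see this concretely in R, the following is a minimal sketch (the simulated data and object names are purely illustrative and not part of the book's examples) that fits a line to two noisy points and then evaluates it on 1,000 new cases drawn from the same process:

set.seed(1234)
# two training points from a noisy linear process (true line: y = 2x + noise)
train <- data.frame(x = c(1, 2))
train$y <- 2 * train$x + rnorm(2)

fit <- lm(y ~ x, data = train)
# in-sample: two parameters and two points, so the residuals are zero and
# R^2 is 1 (R may even warn about an "essentially perfect fit")
summary(fit)$r.squared

# out-of-sample: 1,000 new cases from the same process
test <- data.frame(x = runif(1000, 0, 3))
test$y <- 2 * test$x + rnorm(1000)
# the out-of-sample R^2 is much lower because the fitted line chased the noise
1 - sum((test$y - predict(fit, newdata = test))^2) /
    sum((test$y - mean(test$y))^2)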

In the previous section, we generated out-of-sample predictions for the RSNNS model we trained. We know that, in-sample, the accuracy was 87.6%. How good is that estimate? We can examine how well the model generalizes by checking the accuracy of the out-of-sample predictions, using code that is by now quite familiar. In the following output, we can see that the model still does fairly well, but its accuracy drops to 83.6% on the holdout data. That is a loss of approximately 4 percentage points; put differently, using the training data to evaluate model performance gave an overly optimistic estimate of the accuracy, and that overestimate was about 4%:

# cross-tabulate the true digit labels against the predicted classes
caret::confusionMatrix(xtabs(~ digits.train[i2, 1] +
  I(encodeClassLabels(digits.yhat4.pred) - 1)))
Confusion Matrix and Statistics

                   I(encodeClassLabels(digits.yhat4.pred) - 1)
digits.train[i2, 1]   0   1   2   3   4   5   6   7   8   9
                  0 429   0  13  16   4   9   8   4   9   5
                  1   0 515   9   3   0   2   2   2   4   0
                  2   4   7 427  17   2   3  12  10  12   6
                  3   0   2  20 416   2  28   5  11  40   5
                  4   0   6   6   8 392   7  13   2  19  37
                  5   8   2   4  24  15 335  11   7  21  10
                  6   2   1   8   1   1   9 460   0   3   2
                  7   1  14  22   9   8   2   2 459   3  13
                  8   4  23  19  11  16  27   8   5 348  12
                  9   1   0   3  13  36  20   1  33   9 401

Overall Statistics
                                        
               Accuracy : 0.836         
                 95% CI : (0.826, 0.847)
    No Information Rate : 0.114         
    P-Value [Acc > NIR] : <2e-16        
                                        
                  Kappa : 0.818         
 Mcnemar's Test P-Value : NA            

Statistics by Class:

                     Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity            0.9555    0.904   0.8041   0.8031   0.8235
Specificity            0.9851    0.995   0.9837   0.9748   0.9783
Pos Pred Value         0.8632    0.959   0.8540   0.7864   0.8000
Neg Pred Value         0.9956    0.988   0.9769   0.9772   0.9814
Prevalence             0.0898    0.114   0.1062   0.1036   0.0952
Detection Rate         0.0858    0.103   0.0854   0.0832   0.0784
Detection Prevalence   0.0994    0.107   0.1000   0.1058   0.0980
Balanced Accuracy      0.9703    0.949   0.8939   0.8889   0.9009
                     Class: 5 Class: 6 Class: 7 Class: 8 Class: 9
Sensitivity            0.7579   0.8812   0.8612   0.7436   0.8167
Specificity            0.9776   0.9940   0.9834   0.9724   0.9743
Pos Pred Value         0.7666   0.9446   0.8612   0.7357   0.7756
Neg Pred Value         0.9766   0.9863   0.9834   0.9735   0.9799
Prevalence             0.0884   0.1044   0.1066   0.0936   0.0982
Detection Rate         0.0670   0.0920   0.0918   0.0696   0.0802
Detection Prevalence   0.0874   0.0974   0.1066   0.0946   0.1034
Balanced Accuracy      0.8678   0.9376   0.9223   0.8580   0.8955
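
If you want this gap as a number rather than reading it off the printed output, the overall accuracy can be extracted from the object returned by caret::confusionMatrix(). The following is a small sketch using the same objects as before; the cm.out name is just for illustration, and 0.876 is the in-sample accuracy quoted earlier:

# store the confusion matrix instead of only printing it
cm.out <- caret::confusionMatrix(xtabs(~ digits.train[i2, 1] +
  I(encodeClassLabels(digits.yhat4.pred) - 1)))

# holdout accuracy, and its drop relative to the 87.6% in-sample estimate
cm.out$overall["Accuracy"]
0.876 - cm.out$overall["Accuracy"]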

Since we fitted several models of varying complexity earlier, we can examine the degree of overfitting, that is, how overly optimistic the in-sample performance measures are relative to the out-of-sample ones, across those models. The code is not shown, as it is just a repetition of what we have already done, but it is available in the code bundle provided with the book. The results are shown in Figure 2.6:

Figure 2.6