Too many parameters lead to overfitting

From Figure 5.5, we can see that increasing the complexity of the model is accompanied by increasing accuracy, as reflected in the coefficient of determination, R²; in fact, we can see that the polynomial of order 5 fits the data perfectly! You may remember that we briefly discussed this behavior of polynomials in Chapter 3, Modeling with Linear Regression, where we also discussed that, in general, it is not a very good idea to use polynomials for real problems.

Why is the polynomial of order 5 able to capture the data without missing a single data point? The reason is that the model has the same number of parameters, 6, as the number of data points, also 6; hence, the model is just encoding the data in a different way. The model is not really learning anything from the data, it is just memorizing stuff! From this example, we can see that a model with higher accuracy is not always what we really want.
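To make this concrete, here is a minimal sketch using NumPy. The six (x, y) values below are illustrative placeholders, not the exact dataset behind the figure; the point is only that a degree-5 polynomial has 6 coefficients, so it interpolates 6 points exactly and its R² reaches 1:

```python
import numpy as np

# Six illustrative (x, y) points -- placeholders standing in for the
# dataset in the figure, not the book's exact numbers.
x = np.array([4.0, 5.0, 6.0, 9.0, 12.0, 14.0])
y = np.array([4.2, 6.0, 6.0, 9.0, 10.0, 10.0])

def r2(y_true, y_pred):
    """Coefficient of determination, R^2."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

for order in (1, 2, 5):
    coeffs = np.polyfit(x, y, deg=order)   # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)          # predictions on the training points
    print(f"order {order}: R^2 = {r2(y, y_hat):.3f}")

# With 6 coefficients for 6 points, the order-5 polynomial interpolates
# the data exactly, so its R^2 comes out as 1 (up to floating-point error).
```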

Imagine that we get more money or time and hence collect more data points to add to the previous dataset. For example, we collect the points [(10, 9), (7, 7)] (see Figure 5.6). How well does the order 5 model explain these points compared to the order 1 or 2 models? Not very well, right? The order 5 model did not learn any interesting pattern in the data; instead, it just memorized stuff (sorry for persisting with this idea), and hence it does a very bad job at generalizing to future, unobserved, but potentially observable, data:

Figure 5.6
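We can check this numerically with the same kind of sketch as before (again with placeholder training data; only the two new points come from the text). The degree-5 fit typically swings wildly between its training points, so its predictions for the new x values can land far from the observed y values, while the lower-order fits stay close:

```python
import numpy as np

# Placeholder training data, as in the previous sketch.
x = np.array([4.0, 5.0, 6.0, 9.0, 12.0, 14.0])
y = np.array([4.2, 6.0, 6.0, 9.0, 10.0, 10.0])

# The two newly collected points mentioned in the text.
x_new = np.array([10.0, 7.0])
y_new = np.array([9.0, 7.0])

for order in (1, 2, 5):
    coeffs = np.polyfit(x, y, deg=order)
    y_pred = np.polyval(coeffs, x_new)     # predictions at the new points
    print(f"order {order}: predicted {np.round(y_pred, 2)} vs observed {y_new}")

# The lower-order models generalize reasonably; the order-5 model, having
# memorized the training points, can be badly off at the new x values.
```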

When a model fits the dataset that was used to learn its parameters very well, but is very bad at fitting other datasets, we say that we have overfitting. This is a very general problem in statistics and machine learning. A very useful way of picturing the problem of overfitting is by thinking that datasets have two components: the signal and the noise. The signal is whatever we want to learn from the data. If we use a dataset, it is because we think there is a signal in there; otherwise, it would be an exercise in futility. The noise, on the other hand, is not useful and is the product of measurement errors, limitations in the way the data was generated or captured, corrupted data, and so on. A model overfits when it is so flexible that it learns the noise, effectively hiding the signal. This is a practical justification for Occam's razor. At least in principle, we can always create a model so complex that it explains everything in detail, as in the Empire described by Borges, where cartographers attained such a level of sophistication that they created a map of the Empire whose size was that of the Empire, and which coincided point for point with it.
