Occam's razor – simplicity and accuracy

When choosing among competing explanations, we can rely on a guiding principle known as Occam's razor, which loosely states the following:

If we have two or more equivalent explanations for the same phenomenon, we should choose the simpler one.

There are many justifications for this heuristic; one of them is related to the falsifiability criterion introduced by Popper. Another takes a pragmatic perspective: since simpler models are easier to understand than more complex ones, we should keep the simpler one. Yet another justification is based on Bayesian statistics, as we will see when we discuss Bayes factors. Without getting into the details of these justifications, for the moment we are going to accept this criterion as a useful rule of thumb, just something that sounds reasonable.

Another factor we generally should take into account when comparing models is their accuracy, that is, how well the model fits the data. We have already seen some measures of accuracy, such as the coefficient of determination, $R^2$, which we can interpret as the proportion of explained variance in a linear regression; we have also seen that posterior predictive checks are based on the idea of how accurately a model reproduces the data. If we have two models and one of them explains the data better than the other, we should prefer that model; that is, we want the model with the higher accuracy, right?
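As a reminder, in the code below $R^2$ is computed as the ratio of the explained (regression) sum of squares to the total sum of squares:

$$R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}$$

where $\hat{y}_i$ are the values predicted by the model and $\bar{y}$ is the mean of the observed data.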

Intuitively, it seems that when comparing models, we tend to prefer those that are both accurate and simple. So far, so good, but what should we do if the simpler model has the lower accuracy? And, more generally, how can we balance both contributions?

During the rest of this chapter, we are going to discuss this idea of balancing these two contributions. This chapter is more theoretical than the previous ones (even if we are just scratching the surface of the topic). However, we are going to use code, figures, and examples that will help us move from the (correct) intuition of balancing accuracy against complexity to a more theoretically (or at least empirically) grounded justification.

We are going to begin by fitting increasingly complex polynomials to a very simple dataset. Instead of using the Bayesian machinery, we are going to use the least squares approximation for fitting linear models. Remember that the latter can be interpreted from a Bayesian perspective as a model with flat priors. So, in a sense, we are still being Bayesian here, only taking a bit of a shortcut:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([4., 5., 6., 9., 12, 14.])
y = np.array([4.2, 6., 6., 9., 10, 10.])

plt.figure(figsize=(10, 5))
order = [0, 1, 2, 5]
plt.plot(x, y, 'o')
for i in order:
    x_n = np.linspace(x.min(), x.max(), 100)
    coeffs = np.polyfit(x, y, deg=i)
    ffit = np.polyval(coeffs, x_n)

    # compute R² as the ratio of the explained to the total sum of squares
    p = np.poly1d(coeffs)
    yhat = p(x)
    ybar = np.mean(y)
    ssreg = np.sum((yhat - ybar)**2)
    sstot = np.sum((y - ybar)**2)
    r2 = ssreg / sstot

    plt.plot(x_n, ffit, label=f'order {i}, $R^2$= {r2:.2f}')

plt.legend(loc=2)
plt.xlabel('x')
plt.ylabel('y', rotation=0)

Figure 5.5: The data and the polynomial fits of order 0, 1, 2, and 5, each labeled with its $R^2$ value
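As mentioned above, the least squares fit can be read as a Bayesian model with flat priors. If you prefer to stay within the Bayesian machinery, the following is a minimal sketch of the equivalent model for the order 1 (straight line) case. It assumes PyMC3 imported as pm; the model name model_flat and the HalfNormal prior on σ are illustrative choices made here, not part of the original example:

import pymc3 as pm  # assuming PyMC3, as used elsewhere in the book

with pm.Model() as model_flat:
    # flat (improper uniform) priors on the intercept and slope
    α = pm.Flat('α')
    β = pm.Flat('β')
    # weakly informative prior on the noise; an illustrative choice, not flat
    σ = pm.HalfNormal('σ', 5)
    μ = α + β * x
    y_pred = pm.Normal('y_pred', mu=μ, sd=σ, observed=y)
    trace_flat = pm.sample(2000)

With flat priors on α and β, the posterior means should be close to the least squares estimates returned by np.polyfit(x, y, deg=1), which is why we can take the shortcut in the code above.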