9. More Regression Methods

In [1]:

#setup
from mlwpy import *
%matplotlib inline

diabetes = datasets.load_diabetes()

d_tts = skms.train_test_split(diabetes.data,
                              diabetes.target,
                              test_size=.25,
                              random_state=42)

(diabetes_train_ftrs, diabetes_test_ftrs,
 diabetes_train_tgt, diabetes_test_tgt) = d_tts

We are going to dive into a few additional techniques for regression. All of these are variations on techniques we’ve seen before. Two are direct variations on linear regression, one splices a support vector classifier with linear regression to create a Support Vector Regressor, and one uses decision trees for regression instead of classification. As such, much of what we’ll talk about will be somewhat familiar. We’ll also discuss how to build a learner of our own that plugs directly into sklearn’s usage patterns. Onward!

9.1 Linear Regression in the Penalty Box: Regularization

As we briefly discussed in Section 5.4, we can conceptually define the goodness of a model as a cost that has two parts: (1) what we lose, or spend, when we make a mistake and (2) what we invest, or spend, into the complexity of our model. The math is pretty easy here: cost = loss(errors) + complexity. Keeping mistakes low keeps us accurate. If we keep the complexity low, we also keep the model simple. In turn, we also improve our ability to generalize. Overfitting has a high complexity and low training loss. Underfitting has low complexity and high training loss. At our sweet spot, we use just the right amount of complexity to get a low loss on training and testing.

Controlling the complexity term—keeping the complexity low and the model simple—is called regularization. When we discussed overfitting, we talked about some wiggly graphs being too wiggly: they overfit and follow noise instead of pattern. If we reduce some of the wiggliness, we do a better job of following the signal—the interesting pattern—and ignoring the noise.

When we say we want to regularize, or smooth, our model, we are really putting together a few ideas. Data is noisy: it combines noisy distractions with the real useful signal. Some models are powerful enough to capture both signal and noise. We hope that the pattern in the signal is reasonably smooth and regular. If the features of two examples are fairly close to each other, we hope they have similar target values. Excessive jumpiness in the target between two close examples is, hopefully, noise. We don’t want to capture noise. So, when we see our model function getting too jagged, we want to force it back to something smooth.

So, how can we reduce roughness? Let’s restrict ourselves to talking about one type of model: linear regression. I see a raised hand in the back of the classroom. “Yes?” “But Mark, if we are choosing between different straight lines, they all seem to be equally rough!” That’s a fair point. Let’s talk about what it might mean for one line to be simpler than another. You may want to review some of the geometry and algebra of lines from Section 2.6.1. Remember the basic form of a line: y = mx + b. Well, if we get rid of mx, we can have something that is even simpler, but still a line: y = b. A concrete example is y = 3. That is, for any value of x we (1) ignore that value of x—it plays no role now—and (2) just take the value on the right-hand side, 3. There are two ways in which y = b is simpler than y = mx + b:

  • y = b can only be a 100% correct predictor for a single data point, unless the other target values are cooperative. If an adversary is given control over the target values, they can easily break y = 3 by choosing a second point that has any value besides 3, say 42.

  • To fully specify a model y = b, I need one value: b. To fully specify a model y = mx + b, I need two values: m and b.

A quick point: if either (1) we set m = 0, getting back to y = b, or (2) we set b = 0 and we get y = mx, we have reduced our capacity to follow true patterns in the data. We’ve simplified the models we are willing to consider. If m = 0, we are literally back to the y = b case: we can only capture one adversarial point. If b = 0, we are in a slightly different scenario, but again, we can only capture one adversarial point. The difference in the second case is that we have an implicit y-intercept—usually specified by an explicit value for b—of zero. Specifically, the line will start at (0, 0) instead of (0, b). In both cases, if we take the zero as a given, we only need to remember one other value. Here’s the key idea: setting weights to zero simplifies a linear model by reducing the number of points it can follow and reducing the number of weights it has to estimate.

Now, what happens when we have more than one input feature? Our linear regression now looks like y = w2x2 + w1x1 + w0. This equation describes a plane in 3D instead of a line in 2D. One way to reduce the model is to pick some ws—say, w1 and w0—and set them to zero. That gives us y = w2x2 which effectively says that we don’t care about the value of x1 and we are content with an intercept of zero. Completely blanking out values seems a bit heavy-handed. Are there more gradual alternatives? Yes, I’m glad you asked.
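Here's a minimal NumPy sketch of that idea (a toy example, not part of the chapter's notebook): with two features, zeroing out one weight makes the corresponding feature irrelevant to the predictions.

import numpy as np

xs = np.array([[1.0, 2.0],
               [3.0, 4.0],
               [5.0, 6.0]])          # columns play the roles of x2 and x1
w_full    = np.array([2.0, 1.5])     # w2 and w1 (with an implicit w0 = 0)
w_reduced = np.array([2.0, 0.0])     # same model with w1 zeroed out

print(xs @ w_full)      # predictions use both features
print(xs @ w_reduced)   # predictions ignore the second column entirely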

Instead of introducing zeros, we can ask that the total size of the weights be relatively small. Of course, this constraint brings up problems—or, perhaps, opportunities. We must define total size and relatively small. Fortunately, total hints that we need to add several things up with sum. Unfortunately, we have to pick what we feed to sum.

We want the sum to represent an amount and we want things far away from zero to be counted equally—just as we have done with errors. We’d like 9 and –9 to be counted equally. We’ve dealt with this by (1) squaring values or (2) taking absolute values. As it turns out, we can reasonably use either here.

In [2]:

weights = np.array([3.5, -2.1, .7])
print(np.sum(np.abs(weights)),
      np.sum(weights**2))
6.3 17.15

Now we have to define some criteria for relatively small. Let’s return to our goal: we want to simplify our model to move from overfitting and towards just right. Our just-right values cannot operate on their own, in a vacuum. They must be connected to how well we are fitting the data—we just want to tone down the over part of overfitting. Let’s return to the way we account for the quality of our fit. To investigate, let’s create a bit of data where we specifically control the errors in it:

In [3]:

x_1 = np.arange(10)
m, b = 3, 2
w = np.array([m,b])

x = np.c_[x_1, np.repeat(1.0, 10)] # the plus-one trick

errors = np.tile(np.array([0.0, 1.0, 1.0, .5, .5]), 2)

print(errors * errors)
print(np.dot(errors, errors))

y_true = rdot(w,x)
y_msr  = y_true + errors

D = (x,y_msr)
[0.  1.  1.  0.25  0.25  0.  1.  1. 0.25 0.25]
5.0

Here’s how the truth compares to our noisy data points:

In [4]:

fig, ax = plt.subplots(1,1,figsize=(4,3))
ax.plot(x_1, y_true, 'r', label='true')
ax.plot(x_1, y_msr , 'b', label='noisy')
ax.legend();
A graph compares true data points and noisy data points

Now, I’m going to take a shortcut. I’m not going to go through the process of fitting and finding good parameters from the data. We’re not going to rerun a linear regression to pull out ws (or {m, b}). But imagine that I did and I got back the perfect parameter values—the ones we used to create the data above. Here’s what our sum of squared errors would look like:

In [5]:

def sq_diff(a,b):
        return (a-b)**2

In [6]:

predictions = rdot(w,x)
np.sum(sq_diff(predictions, y_msr))

Out[6]:

5.0

As a reminder, that comes from this equation:

$\textrm{loss} = \sum_{(x,y) \in D} (w \cdot x - y)^2$

We need to account for our ideas about keeping the weights small. In turn, this constraint is a surrogate for simplifying—regularizing—the model. Let’s say that instead of just making predictions we’ll have a total cost associated with our model. Here goes:

In [7]:

predictions = rdot(w,x)

loss = np.sum(sq_diff(predictions, y_msr))

complexity_1 = np.sum(np.abs(weights))
complexity_2 = np.sum(weights**2) # == np.dot(weights, weights)

cost_1 = loss + complexity_1
cost_2 = loss + complexity_2

print("Sum(abs) complexity:", cost_1)
print("Sum(sqr) complexity:", cost_2)
Sum(abs) complexity: 11.3
Sum(sqr) complexity: 22.15

Remember, we’ve pulled two fast ones. First, we didn’t actually work from the data back to the weights. Instead, we just used the same weights to make the data and to make our not-quite-predictions. Our second fast one is that we used the same weights in both cases, so we have the same losses contributing to both costs. We can’t compare these two: normally, we use one or the other to compute the cost which helps us find a good set of weights under one calculation for loss and complexity.

There’s one last piece I want to introduce before showing a fundamental relationship of learning. Once we define a cost and say that we want to have a low cost, we are really making another tradeoff. I can lower the cost by making fewer errors—leading to a lower loss—or by having less complexity. Am I willing to do both equally or do I consider one more important? Is halving complexity worth doubling the errors I make? Our real goal is trying to make few errors in the future. That happens by making few errors now—on training data—and by keeping model complexity down.

To this end, it is valuable to add a way to trade off towards fewer errors or towards lower complexity. We do that like this:

In [8]:

predictions = rdot(w,x)
errors = np.sum(sq_diff(predictions, y_msr))
complexity_1 = np.sum(np.abs(weights))

C = .5

cost = errors + C * complexity_1
cost

Out[8]:

8.15

Here, I’m saying that as far as cost is concerned, one point of increase in complexity is only worth ½ of a point increase in loss. That is, losses are twice as costly, or twice as important, as complexity. If we use C to represent that tradeoff, we get the following math-heavy equations:

$\textrm{Cost}_1 = \sum_{(x,y) \in D} (w \cdot x - y)^2 + C \sum_j |w_j|$

$\textrm{Cost}_2 = \sum_{(x,y) \in D} (w \cdot x - y)^2 + C \sum_j w_j^2$

Finding the best line with Cost1 is called L1-regularized regression, or the lasso. Finding the best line with Cost2 is called the L2-regularized regression, or ridge regression.

9.1.1 Performing Regularized Regression

Performing regularized linear regression is no more difficult than the good old-fashioned (GOF) linear regression.

The default value for our C, the total weight of the complexity penalty, is 1.0 for both Lasso and Ridge. In sklearn, the C value is set by the parameter alpha. So, for C = 2, we would call linear_model.Lasso(alpha=2.0). You’ll see λ, α, and C in discussions of regularization; they are all serving a similar role, but you do have to pay attention for authors using slight variations in meaning. For our purposes, we can basically consider them all the same.

In [9]:

models = [linear_model.Lasso(),            # L1 regularized; C=1.0
          linear_model.Ridge()]            # L2 regularized; C=1.0

for model in models:
    model.fit(diabetes_train_ftrs, diabetes_train_tgt)
    train_preds = model.predict(diabetes_train_ftrs)
    test_preds  = model.predict(diabetes_test_ftrs)
    print(get_model_name(model),
          "\nTrain MSE:", metrics.mean_squared_error(diabetes_train_tgt,
                                                     train_preds),
          "\n Test MSE:", metrics.mean_squared_error(diabetes_test_tgt,
                                                     test_preds))
Lasso
Train MSE: 3947.899897977698
 Test MSE: 3433.1524588051197
Ridge
Train MSE: 3461.739515097773
 Test MSE: 3105.468750907886

Using these is so easy, we might move on without thinking about when and why to use them. Since the default of linear regression is to operate without regularization, we can easily switch to a regularized version and see if it improves matters. We might try that when we train on some data and see an overfitting failure. Then, we can try different amounts of regularization — different values for C — and see what works best on cross-validated runs. We’ll discuss tools to help us pick a C in Section 11.2.
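As a rough sketch of doing that by hand (the more convenient tooling comes in Section 11.2), we could loop over a few candidate values, here applied to the lasso; the particular alpha grid is just for illustration.

# a quick by-hand sweep over regularization strengths (illustrative values)
for alpha in [.01, .1, 1.0, 10.0]:
    model  = linear_model.Lasso(alpha=alpha)
    scores = skms.cross_val_score(model,
                                  diabetes.data, diabetes.target,
                                  cv=5, scoring='neg_mean_squared_error')
    print("alpha: {:5.2f}  CV MSE: {:8.2f}".format(alpha, -scores.mean()))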

With very noisy data, we might be tempted to make a complex model that is likely to overfit. We have to tolerate a fair bit of error in the model to reduce our complexity. That is, we pay less for complexity and pay a bit more for errors. With noiseless data that captures a linear pattern, we have little need for complexity control and we should see good results with GOF linear regression.

9.2 Support Vector Regression

We’ve introduced the idea of a fundamental tradeoff in learning: our cost comes from both the mistakes and the complexity of our learner. Support Vector Regression (SVR) makes use of this in a same-but-different manner. We regularized GOF linear regression by adding a complexity factor to its loss. We can continue down this path and tweak linear regression even further. We can modify its loss also. Remember: cost = loss + complexity. The standard loss in linear regression is the sum of squared errors called the squared error loss. As we just saw, that form is also called L2—in this case, the L2 loss. So, you might think, “Aha! Let’s use L1, the sum of absolute values.” You are close. We’re going to use L1, but we’re going to tweak it. Our tweak is to use L1 and ignore small errors.

9.2.1 Hinge Loss

In our first look at Support Vector Classifiers (SVCs) from Section 8.3, we didn’t discuss the underlying magic that lets a SVC do its tricks. It turns out that the magic is very similar for the SVC and SVR: both make use of a slightly different loss than linear regression (either kind). In essence, we want to measure what happens when we are wrong and we want to ignore small errors. Putting these two pieces together gives us the idea of the hinge loss. For starters, here’s what the absolute values of the errors look like:

In [10]:

# here, we don't ignore small errors
error = np.linspace(-4, 4, 100)

loss = np.abs(error)

fig, ax = plt.subplots(1,1,figsize=(4,3))
ax.plot(error, loss)

ax.set_xlabel('Raw Error')
ax.set_ylabel('Abs Loss');
A graph compares raw error against Abs loss.

How can we ignore errors up to a certain threshold? For example, let’s write code that ignores absolute errors that are less than 1.0:

In [11]:

an_error = .75
abs_error = abs(an_error)
if abs_error < 1.0:
    the_loss = 0.0
else:
    the_loss = abs_error
print(the_loss)
0.0

Now we are going to get fancy. Pay close attention. We can rewrite that with some clever mathematics. Here’s our strategy:

  1. Subtract the threshold value from the absolute error.

  2. If the result is bigger than zero, keep it. Otherwise, take zero instead.

In [12]:

an_error = 0.75
adj_error = abs(an_error) - 1.0
if adj_error < 0.0:
    the_loss = 0.0
else:
    the_loss = adj_error
print(the_loss)
0.0

Can we summarize that with a single mathematical expression? Look at what happens based on adjusted error. If adj_error < 0.0, we get 0. If adj_error >= 0.0, we get adj_error. Together, those two outcomes are equivalent to taking the bigger value of adj_error or 0. We can do that like:

In [13]:

error = np.linspace(-4, 4, 100)

# here, we ignore errors up to 1.0 by taking bigger value
loss = np.maximum(np.abs(error) - 1.0,
                  np.zeros_like(error))

fig, ax = plt.subplots(1,1,figsize=(4,3))
ax.plot(error, loss)

ax.set_xlabel("Raw Error")
ax.set_ylabel("Hinge Loss");
A graph compares raw error against hinge loss.

Mathematically, we encode it like this: loss = max(|error| – threshold, 0). Let’s take one more second to deconstruct it. First, we subtract the amount of error we are willing to ignore from the raw error. If it takes us to a negative value, then maxing against zero will throw it out. For example, if we have an absolute error of .5 (from a raw error of .5 or –.5) and we subtract 1, we end up with an adjusted error of –.5. Maxing with 0, we take the 0. We’re left with zero—no cost. If you squint your eyes a bit, and turn your head sideways, you might be able to see a pair of doors with hinges at the kinks in the prior graph. That’s where the name for the hinge loss comes from. When we apply the hinge loss around a known target, we get a band where we don’t care about small differences.

In [14]:

threshold = 2.5

xs = np.linspace(-5,5,100)
ys_true = 3 * xs + 2

fig, ax = plt.subplots(1,1,figsize=(4,3))
ax.plot(xs, ys_true)
ax.fill_between(xs, ys_true-threshold, ys_true+threshold,
                color=(1.0,0,0,.25))

ax.set_xlabel('Input Feature')
ax.set_ylabel('Output Target');
A graph compares the input feature against the output target.

Now, imagine that instead of knowing the relationship (the blue line above), we only have some data that came from noisy measurements around that true line:

In [15]:

threshold = 2.5

xs = np.linspace(-5,5,100)
ys = 3 * xs + 2 + np.random.normal(0, 1.5, 100)
fig, ax = plt.subplots(1,1,figsize=(4,3))
ax.plot(xs, ys, 'o',  color=(0,0,1.0,.5))
ax.fill_between(xs, ys_true - threshold, ys_true + threshold,
                color=(1.0,0,0,.25));
A graph compares the input feature against the output target.

We might consider many potential lines to fit this data. However, the band around the central line captures most of the noise in the data and throws all of those small errors out. The coverage of the band is close to perfect—there are only a few points out of 100 that have enough of a mistake to matter. Admittedly, we cheated when we drew this band: we based it on information (the true line) that we normally don’t have available. The central line may remind you of the maximum margin separator in SVCs.

9.2.2 From Linear Regression to Regularized Regression to Support Vector Regression

We can develop a progression from GOF linear regression to regularized regression to support vector regression. At each of the steps, we add or tweak some part of the basic recipe. Using the xs and ys from above, we can define a few terms and see the progression. We’ll imagine that we estimated the w1 parameter to be 1.3 and we’ll take C = 1.0. Our band of ignorance—the threshold on the amount of error we tolerate—is ε = .25. That’s a baby Greek epsilon.

In [16]:

# hyperparameters for the scenario
C, epsilon = 1.0, .25

# parameters
weights = np.array([1.3])

We can make our predictions from that w and look at the three losses—squared-error, absolute, and hinge losses. I won’t use the absolute loss any further—it’s just there for comparison’s sake.

In [17]:

# prediction, error, loss
predictions = rdot(weights, xs.reshape(-1, 1))
errors = ys - predictions

loss_sse   = np.sum(errors ** 2)
loss_sae   = np.sum(np.abs(errors))
loss_hinge = np.sum(np.maximum(np.abs(errors) - epsilon, 0.0))  # elementwise max with 0

We can also compute the two complexity penalties we need for L1 and L2 regularization. Note the close similarity to the calculations for the losses, except we calculate the complexity from the weights instead of the errors.

In [18]:

# complexity penalty for regularization
complexity_saw = np.sum(np.abs(weights))
complexity_ssw = np.sum(weights**2)

Finally, we have our total costs:

In [19]:

# cost
cost_gof_regression   = loss_sse   + 0.0
cost_L1pen_regression = loss_sse   + C * complexity_saw
cost_L2pen_regression = loss_sse   + C * complexity_ssw
cost_sv_regression    = loss_hinge + C * complexity_ssw

Now, as that code is written, it only calculates a cost for each type of regression for one set of weights. We would have to run that code over and over—with different sets of weights—to find a good, better, or best set of weights.

Table 9.1 shows the mathematical forms that are buried in that code. If we let $L_1 = \sum_i |v_i|$ and $L_2 = \sum_i v_i^2$ be the sum of absolute values and the sum of squared values, respectively, then we can summarize these in the smaller, more readable form of Table 9.2. Remember, the loss applies to the raw errors. The penalty applies to the parameters or weights.

I want to congratulate you. By reading Table 9.2, you can see the fundamental differences between these four regression methods. In reality, we don’t know the right assumptions to choose between learning methods—for example, the discriminant methods or these varied regression methods—so we don’t know whether the noise and underlying complexity of the data is going to be best suited to one or another of these techniques. But we skirt around this problem by cross-validating to pick a preferred method for a given dataset. As a practical matter, the choice between these models may depend on outside constraints. For some statistical goals—like significance testing, which we aren’t discussing in this book—we might prefer not using SVR. On very complex problems, we probably don’t even bother with GOF linear regression—although it might be a good baseline comparison. If we have lots of features, we may choose the lasso—L1-penalized—to eliminate features entirely. Ridge has a more moderate approach and reduces—but doesn’t eliminate—features. The short check after Table 9.2 shows this difference in action.

Table 9.1 The different mathematical forms for penalized regression and SVR.

Name                    Penalty   Math
GOF Linear Regression   None      $\sum_i (y_i - w \cdot x_i)^2$
Lasso                   L1        $\sum_i (y_i - w \cdot x_i)^2 + C \sum_j |w_j|$
Ridge                   L2        $\sum_i (y_i - w \cdot x_i)^2 + C \sum_j w_j^2$
SVR                     L2        $\sum_i \max(|y_i - w \cdot x_i| - \varepsilon, 0) + C \sum_j w_j^2$

Table 9.2 Common losses and penalties in regression models.

Name                    Loss    Penalty
GOF LR                  L2      0
Lasso (L1-Penalty LR)   L2      L1
Ridge (L2-Penalty LR)   L2      L2
SVR                     Hinge   L2
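Here is the short check promised above. We fit the default lasso and ridge on the diabetes training data and compare their learned coefficients; with the default amount of regularization, the lasso typically drives several of the ten weights exactly to zero, while ridge only shrinks them.

# compare learned coefficients under the two penalties (default alpha=1.0)
lasso = linear_model.Lasso().fit(diabetes_train_ftrs, diabetes_train_tgt)
ridge = linear_model.Ridge().fit(diabetes_train_ftrs, diabetes_train_tgt)

print("lasso coefficients:", lasso.coef_)
print("ridge coefficients:", ridge.coef_)
print("weights zeroed out by the lasso:", np.sum(lasso.coef_ == 0.0))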

9.2.3 Just Do It—SVR Style

There are two main options for SVR in sklearn. We won’t dive into the theory, but you can control different aspects with these two different regressors:

  • ε-SVR (Greek letter epsilon-SVR): you set ε, the error band tolerance. ν is determined implicitly by this choice. This is what you get with the default parameters to SVR in sklearn.

  • ν-SVR (Greek letter nu-SVR): you set ν, the proportion of kept support vectors with respect to the total number of examples. ε, the error band tolerance, is determined implicitly by this choice.

In [20]:

svrs = [svm.SVR(),   # default epsilon=0.1
        svm.NuSVR()] # default nu=0.5

for model in svrs:
    preds = (model.fit(diabetes_train_ftrs, diabetes_train_tgt)
                  .predict(diabetes_test_ftrs))
    print(metrics.mean_squared_error(diabetes_test_tgt, preds))
5516.346206774444
5527.520141195904
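If you want to see where the knobs go, here is a small sketch with non-default settings; the particular values of ε and ν are arbitrary and only meant to illustrate the parameters, not to recommend them.

# illustrative, non-default settings for the two SVR flavors
tweaked_svrs = [svm.SVR(epsilon=1.0),   # a wider error band than the default .1
                svm.NuSVR(nu=.25)]      # a smaller proportion of support vectors than the default .5

for model in tweaked_svrs:
    preds = (model.fit(diabetes_train_ftrs, diabetes_train_tgt)
                  .predict(diabetes_test_ftrs))
    print(get_model_name(model),
          metrics.mean_squared_error(diabetes_test_tgt, preds))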

9.3 Piecewise Constant Regression

All of the linear regression techniques we’ve looked at have one common theme: they assume that a suitably small variation of the inputs will lead to a small variation in the output. You can phrase this in several different ways, but they are all related to the concept of smoothness. The output never jumps around. In classification, we expect the outputs to make leaps: at some critical point we move from predicting dog to predicting cat. Methods like logistic regression have smooth transitions from one class to the next; decision tree classifiers are distinctly dog-or-cat. In mathese, if we have a numerical output value that is not sufficiently smooth, we say we have a discontinuous target.

Let’s draw up a concrete example. Suppose there is a hotdog stand that wants to be cash-only but doesn’t want to deal with coins. They want to be literally paper-cash only. So, when a customer gets a bill of $2.75, they simply round up and charge the customer $3.00. We won’t discuss how the customers feel about this. Here’s a graph that converts the raw bill to a collected bill:

In [21]:

raw_bill = np.linspace(.5, 10.0, 100)
collected = np.round(raw_bill)

fig, ax = plt.subplots(1,1,figsize=(4,3))
ax.plot(raw_bill, collected, '.')
ax.set_xlabel("raw cost")
ax.set_ylabel("bill");
A step plot of the collected bill versus the raw cost; both axes range from about 0 to 10.

The graph looks a bit like a set of stairs. It simply isn’t possible for our smooth regression lines to capture the pattern in this relationship. A single line through this dataset will have problems. We could connect the fronts of the steps, the backs of the steps, the middles of the steps: none of these will be quite right. We could connect the front of the bottom step to the back of the top step. Again, we still can’t make it work perfectly. If we use GOF linear regression, we’ll capture the middle points fairly well, but the ends of each step will be missed. To put it bluntly, the bias of linear regression puts a fundamental limit on our ability to follow the pattern.
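To make that concrete, here's a quick sketch that fits GOF linear regression to the raw_bill and collected values from above; the fitted line tracks the middles of the steps but leaves systematic errors at their edges.

# fit a single line to the staircase data from above
lr = linear_model.LinearRegression()
lr.fit(raw_bill.reshape(-1, 1), collected)
step_preds = lr.predict(raw_bill.reshape(-1, 1))

print("slope and intercept:", lr.coef_[0], lr.intercept_)
print("MSE on the staircase:",
      metrics.mean_squared_error(collected, step_preds))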

You might be thinking that a decision tree could break down the inputs into little buckets. You’d be completely right—we’ll talk about that in the next section. But I’d like to take a middle ground between using linear regression and modifying decision trees to perform regression. We’re going to do piecewise linear regression in the simplest possible way: with piecewise constant regression. For each region of the input features, we’re going to predict a simple horizontal line. That means we’re predicting a single constant value for that region.

Our raw cost-versus-bill graph was one example where this is an appropriate model. Here’s another example that has a less consistent relationship. Instead of constantly working our way up, we move both up and down as x increases. We start by defining some split points. If we wanted four lines, we’d define three split points (remember, when you cut a rope, you go from one piece of rope to two). Imagine we split at a, b, and c. Then, we’ll fit four lines on the data: (1) from very small up to a, (2) from a to b, (3) from b to c, and (4) from c onwards to very big. Making that slightly more concrete, imagine a,b,c=3,8,12. The result might look like

In [22]:

fig, ax = plt.subplots(1,1,figsize=(4,3))
ax.plot([0,3],   [0,0],
        [3,8],   [5,5],
        [8,12],  [2,2],
        [12,15], [9,9])
ax.set_xticks([3,8,12]);
A figure shows a graph with four lines.

You might be able to make some educated guesses about how the number of splits relates to bias. If there are no splits, we are simply performing linear regression without a slope—predicting a constant value everywhere using the mean. If we go to the other extreme, we’ll have n data points and make n – 1 mini-lines. Would that model exhibit good training performance? What about applying it to predict on a test set? This model would be great in training, but it would be very bad in testing: it overfits the data.

In some respects, piecewise constant regression is like k-Nearest Neighbors Regression (k-NN-R). However, k-NN-R considers examples based on relative distance instead of raw numeric value. We claim a new testing example is like training examples 3, 17, or 21 because they are close to the new example, regardless of the raw numeric value. Piecewise constant regression gives examples the same target value based on a preset range of splits: I’m like the examples with values between 5 and 10 because my value is 6, even if I’m closer to an example with value 4.

In data science, we use the phrase let the data speak for itself. Here, that idea is front and center: with k-NN-R the data determines the boundary between predictions and those boundaries move with the density of the data. Denser regions get more potential boundaries. With piecewise constant regression, our splits are predetermined. There’s no wiggle room for areas with high or low densities of data. The net effect is that we may do very well when choosing splits for piecewise regression if we have some strong background information about the data. For example, in the US, tax brackets are set by predetermined break points. If we don’t have the right information, our splits may do much worse than k-NN-R.

9.3.1 Implementing a Piecewise Constant Regressor

So far, we’ve only made use of prebuilt models. In the spirit of full disclosure, I snuck in a custom-made diagonal LDA model at the end of the previous chapter, but we didn’t discuss the implementation at all. Since sklearn doesn’t have a built-in piecewise regression capability, we’ll take this as an opportunity to implement a learner of our own. The code to do so is not too bad: it has about 40 lines and it makes use of two main ideas. The first idea is relatively straightforward: to perform each constant regression for the individual pieces, we’ll simply reuse sklearn’s built-in linear regression on a rewritten form of the data. The second idea is that we need to map from the input feature—we’ll limit ourselves to just one for now—to the appropriate section of our rope. This mapping is a bit more tricky, since it has two steps.

Step one involves using np.searchsorted. searchsorted is itself sort of tricky, but we can summarize what it does by saying that it finds an insertion point for a new element into a sorted sequence of values. In other words, it says where someone should enter a line of people ordered by height to maintain the ordering. We want to be able to translate a feature value of 60 to segment, or rope piece, 3. We need to do this in both the training and testing phases.
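Here's a tiny demonstration of searchsorted with some made-up cut points; the values are arbitrary.

cut_points = np.array([10, 20, 50])
new_values = np.array([5, 15, 60])

# which piece does each value fall into?
print(np.searchsorted(cut_points, new_values))   # -> [0 1 3]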

Step two is that we convert the rope pieces to true/false indicators. So, instead of Piece = 3, we have Piece1 = False, Piece2 = False, Piece3 = True. Then, we learn a regression model from the piece indicators to the output target. When we want to predict a new example, we simply run it through the mapping process to get the right piece indicator and then pipe it to the constant linear regression. The remapping is all wrapped up in the _recode member function in the code below.
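Continuing the same made-up example, the piece numbers become indicators by picking rows out of an identity matrix; this is the same trick used inside _recode below.

cp = np.array([10, 20, 50])                          # same made-up cut points
pieces = np.searchsorted(cp, np.array([5, 15, 60]))  # -> [0 1 3]
n_pieces = len(cp) + 1                               # 3 cuts give 4 pieces

# each row turns on exactly one piece indicator
print(np.eye(n_pieces)[pieces])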

9.3.2 General Notes on Implementing Models

The sklearn docs discuss how to implement custom models. We’re going to ignore some of the possible alternatives and lay out a simplified process.

  • Since we’re implementing a regressor, we’ll inherit from BaseEstimator and RegressorMixin.

  • We will not do anything with the model arguments in __init__.

  • We’ll define fit(X,y) and predict(X) methods.

  • We can make use of check_X_y and check_array to verify arguments are valid in the sklearn sense.

Two quick comments on the code:

  • The following code only works for a single feature. Extending it to multiple features would be a fun project. We’ll see some techniques in Chapter 10 that might help.

  • If no cut points are specified, we use one cut point for every ten examples and we choose the cut points at evenly spaced percentiles of the data.

    If we have two regions, the one split would be at 50%, the median. With three or four regions, the splits would be at 33–67% or 25–50–75%. Recall that the xth percentile is the value that x% of the data falls below. For example, if 50% of our data is less than 5’11”, then 5’11” is the 50th percentile value. Ironically, a single split at 50% might be particularly bad if the data is concentrated in the middle, as might be the case with heights.
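A rough sketch of computing those evenly spaced percentiles, with a made-up number of regions and some random data, purely for illustration:

data = np.random.uniform(0, 10, 100)              # made-up values

n_regions = 4                                     # purely for illustration
# interior break points at 25%, 50%, 75% (np.percentile expects 0-100)
qtiles = np.linspace(0, 100, n_regions + 1)[1:-1]
print(qtiles)
print(np.percentile(data, qtiles))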

In [23]:

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import (check_X_y,
                                      check_array,
                                      check_is_fitted)

class PiecewiseConstantRegression(BaseEstimator, RegressorMixin):
    def __init__(self, cut_points=None):
        self.cut_points = cut_points

    def fit(self, X, y):
        X, y = check_X_y(X,y)
        assert X.shape[1] == 1 # one variable only

        if self.cut_points is None:
            n = (len(X) // 10) + 1
            # evenly spaced interior percentiles (np.percentile expects 0-100)
            qtiles = np.linspace(0.0, 100.0, n+2)[1:-1]
            self.cut_points = np.percentile(X, qtiles)
        else:
            # ensure cutpoints in order and in range of X
            assert np.all(self.cut_points[:-1] < self.cut_points[1:])
            assert (X.min() < self.cut_points[0] and
                    self.cut_points[-1] < X.max())

        recoded_X = self._recode(X)
        # even though the _inner_ model is fit without an intercept,
        # our piecewise model *does* have a constant term (but see notes)
        self.coeffs_ = (linear_model.LinearRegression(fit_intercept=False)
                                    .fit(recoded_X, y).coef_)
        return self  # follow sklearn's convention: fit returns the estimator

    def _recode(self, X):
        cp = self.cut_points
        n_pieces = len(cp) + 1
        recoded_X = np.eye(n_pieces)[np.searchsorted(cp, X.flat)]
        return recoded_X

    def predict(self, X):
        check_is_fitted(self, 'coeffs_')
        X = check_array(X)
        recoded_X = self._recode(X)
        return rdot(self.coeffs_, recoded_X)

To test and demonstrate that code, let’s generate a simple example dataset we can train on.

In [24]:

ftr = np.random.randint(0,10,(100,1)).astype(np.float64)
cp = np.array([3,7])
tgt = np.searchsorted(cp, ftr.flat) + 1

fig, ax = plt.subplots(1,1,figsize=(4,3))
ax.plot(ftr, tgt, '.');
A figure shows a graph with three sets of dots.

Since we played by the rules and wrote a learner that plugs directly into the sklearn usage pattern (that’s also called an API or application programming interface), our use of the code will be quite familiar.

In [25]:

# here, we're giving ourselves all the help we can by using
# the same cut points as our data were generated with
model = PiecewiseConstantRegression(cut_points=np.array([3, 7]))
model.fit(ftr, tgt)
preds = model.predict(ftr)
print("Predictions equal target?", np.allclose(preds, tgt))
Predictions equal target? True

As written, PiecewiseConstantRegression is defined by some hyperparameters (the cut points) and some parameters (the constants associated with each piece). The constants are computed when we call fit. The overall fit of the model is very, very sensitive to the number and location of the split points. If we are thinking about using piecewise methods, we either (1) hope to have background knowledge about where the jumps are or (2) are willing to spend time trying different hyperparameters and cross-validating the results to get a good end product.
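To see that sensitivity, here's a quick sketch that refits the same model with deliberately misplaced cut points (the values are arbitrary); since the model above recovered the targets exactly, any increase in error comes from the poorly chosen splits.

# deliberately misplaced cut points (arbitrary values for illustration)
bad_model = PiecewiseConstantRegression(cut_points=np.array([2, 5]))
bad_model.fit(ftr, tgt)
bad_preds = bad_model.predict(ftr)

print("true cuts MSE:", metrics.mean_squared_error(tgt, preds))
print("bad cuts  MSE:", metrics.mean_squared_error(tgt, bad_preds))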

Now, even though I’m taking the easy way out and talking about piecewise constants—the individual lines have a b but no mx—we could extend this to piecewise lines, piecewise parabolas, etc. We could also require that the end points meet. This requirement would get us a degree of continuity, but not necessarily smoothness. Phrased differently, we could connect the piecewise segments but there might still be sharp turns. We could enforce an even higher degree of smoothness where the turns have to be somewhat gentle. Allowing more bends in the piecewise components reduces our bias. Enforcing constraints on the meeting points regularizes—smooths—the model.

9.4 Regression Trees

One of the great aspects of decision trees is their flexibility. Since they are conceptually straightforward—find regions with similar outputs, label everything in that region in some way—they can be easily adapted to other tasks. In the case of regression, if we can find regions where a single numerical value is a good representation of the whole region, we’re golden. So, at the leaves of the tree, instead of saying cat or dog, we say 27.5.

9.4.1 Performing Regression with Trees

Moving from piecewise constant regression to decision trees is a straightforward conceptual step. Why? Because you’ve already done the heavy lifting. Decision trees give us a way to zoom in on regions that are sufficiently similar and, having selected a region, we predict a constant. That zoom-in happens as we segment off regions of space.

Eventually, we get to a small enough region that behaves in a nicely uniform way. Basically, using decision trees for regression gives us an automatic way of picking the number and location of split points. These splits are determined by computing the loss when a split breaks the current set of data at a node into two subsets. The split that leads to the immediate lowest squared error is the chosen breakpoint for that node. Remember that tree building is a greedy process and it is not guaranteed to be a globally best step—but a sequence of greedy steps is often good enough for day-to-day use.
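Here's a bare-bones sketch of that split selection on a handful of made-up target values: for each candidate split position, predict each side with its mean and total up the squared errors; the winning split is the one with the smallest total.

tgt_vals = np.array([1.0, 1.5, 1.0, 8.0, 8.5, 9.0])   # made-up targets

def split_sse(vals, idx):
    ' total squared error when each side is predicted by its mean '
    left, right = vals[:idx], vals[idx:]
    return (np.sum((left  - left.mean())  ** 2) +
            np.sum((right - right.mean()) ** 2))

sses = [split_sse(tgt_vals, i) for i in range(1, len(tgt_vals))]
print(sses)
print("best split after position:", np.argmin(sses) + 1)   # between the 1s and the 8s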

In [26]:

dtrees = [tree.DecisionTreeRegressor(max_depth=md) for md in [1, 3, 5, 10]]

for model in dtrees:
    preds = (model.fit(diabetes_train_ftrs, diabetes_train_tgt)
                  .predict(diabetes_test_ftrs))
    mse = metrics.mean_squared_error(diabetes_test_tgt, preds)
    fmt = "{} {:2d} {:4.0f}"
    print(fmt.format(get_model_name(model),
                     model.get_params()['max_depth'],
                     mse))
DecisionTreeRegressor  1 4341
DecisionTreeRegressor  3 3593
DecisionTreeRegressor  5 4312
DecisionTreeRegressor 10 5190

Notice how adding depth helps and then hurts. Let’s all say it together: overfitting! If we allow too much depth, we split the data into too many parts. If the data is split too finely, we make unneeded distinctions, we overfit, and our test error creeps up.
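One way to see that directly is to compare training and testing MSE side by side, in a small sketch reusing the same depths as above; training error keeps dropping as depth grows while test error eventually climbs back up.

# compare train and test MSE as the tree depth grows
for md in [1, 3, 5, 10]:
    model = tree.DecisionTreeRegressor(max_depth=md)
    model.fit(diabetes_train_ftrs, diabetes_train_tgt)
    train_mse = metrics.mean_squared_error(diabetes_train_tgt,
                                           model.predict(diabetes_train_ftrs))
    test_mse  = metrics.mean_squared_error(diabetes_test_tgt,
                                           model.predict(diabetes_test_ftrs))
    print("depth {:2d}  train MSE {:6.0f}  test MSE {:6.0f}".format(md,
                                                                    train_mse,
                                                                    test_mse))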

9.5 Comparison of Regressors: Take Three

We’ll return to the student dataset and apply some of our fancier learners to it.

In [27]:

student_df = pd.read_csv('data/portugese_student_numeric.csv')
student_ftrs = student_df[student_df.columns[:-1]]
student_tgt  = student_df['G3']

In [28]:

student_tts = skms.train_test_split(student_ftrs, student_tgt)

(student_train_ftrs, student_test_ftrs,
 student_train_tgt,  student_test_tgt) = student_tts

We’ll pull in the regression methods we introduced in Chapter 7:

In [29]:

old_school = [linear_model.LinearRegression(),
              neighbors.KNeighborsRegressor(n_neighbors=3),
              neighbors.KNeighborsRegressor(n_neighbors=10)]

and add some new regressors from this chapter:

In [30]:

# L1, L2 penalized (abs, sqr), C=1.0 for both
penalized_lr = [linear_model.Lasso(),
                linear_model.Ridge()]

# defaults are epsilon=.1 and nu=.5, respectively
svrs = [svm.SVR(), svm.NuSVR()]

dtrees = [tree.DecisionTreeRegressor(max_depth=md) for md in [1, 3, 5, 10]]

reg_models = old_school + penalized_lr + svrs + dtrees

We’ll compare based on root mean squared error (RMSE):

In [31]:

def rms_error(actual, predicted):
    ' root-mean-squared-error function '
    # lesser values are better (a<b means a is better)
    mse = metrics.mean_squared_error(actual, predicted)
    return np.sqrt(mse)
rms_scorer = metrics.make_scorer(rms_error)

and we’ll standardize the data before we apply our models:

In [32]:

scaler = skpre.StandardScaler()

scores = {}
for model in reg_models:
    pipe = pipeline.make_pipeline(scaler, model)
    preds = skms.cross_val_predict(pipe,
                                   student_ftrs, student_tgt,
                                   cv=10)
    key = (get_model_name(model) +
           str(model.get_params().get('max_depth', "")) +
           str(model.get_params().get('n_neighbors', "")))
    scores[key] = rms_error(student_tgt, preds)

df = pd.DataFrame.from_dict(scores, orient='index').sort_values(0)
df.columns=['RMSE']
display(df)

 

                          RMSE
DecisionTreeRegressor1    4.3192
Ridge                     4.3646
LinearRegression          4.3653
NuSVR                     4.3896
SVR                       4.4062
DecisionTreeRegressor3    4.4298
Lasso                     4.4375
KNeighborsRegressor10     4.4873
DecisionTreeRegressor5    4.7410
KNeighborsRegressor3      4.8915
DecisionTreeRegressor10   5.3526

For the top four models, let’s see some details about performance on a fold-by-fold basis.

In [33]:

better_models = [tree.DecisionTreeRegressor(max_depth=1),
                 linear_model.Ridge(),
                 linear_model.LinearRegression(),
                 svm.NuSVR()]
fig, ax = plt.subplots(1, 1, figsize=(8,4))
for model in better_models:
    pipe = pipeline.make_pipeline(scaler, model)
    cv_results = skms.cross_val_score(pipe,
                                      student_ftrs, student_tgt,
                                      scoring = rms_scorer,
                                      cv=10)

    my_lbl = "{:s} ({:5.3f}$\pm${:.2f})".format(get_model_name(model),
                                                cv_results.mean(),
                                                cv_results.std())
    ax.plot(cv_results, 'o--', label=my_lbl)
    ax.set_xlabel('CV-Fold #')
    ax.set_ylabel("RMSE")
    ax.legend()
A line graph is shown.

Each of these goes back and forth a bit. They are all very close in learning performance. The range of the means (4.23, 4.29) is not very wide and it’s also a bit less than the standard deviation. There’s still work that remains to be done. We didn’t actually work through different values for regularization. We can do that by hand, in a manner similar to the complexity curves we saw in Section 5.7.2. However, we’ll approach that in a much more convenient fashion in Section 11.2.

9.6 EOC

9.6.1 Summary

At this point, we’ve added a number of distinctly different regressors to our quiver of models. Decision trees are highly flexible; regularized regression and SVR update the linear regression to compete with the cool kids.

9.6.2 Notes

We talked about L1 and L2 regularization. There is a method in sklearn called ElasticNet that allows us to blend the two together.

While we were satisfied with a constant-value prediction for each region at the leaf of a tree, we could—standard methods don’t—make mini-regression lines (or curves) at each of the leaves.

Trees are one example of an additive model. Classification trees work by (1) picking a region and (2) assigning a class to examples from that region. If we pick one and only one region for each example, we can write that as a peculiar sum, as we did in Section 8.2.1. For regression, it is a sum over region selection and target value. There are many types of models that fall under the additive model umbrella—such as the piecewise regression line we constructed, a more general version of piecewise regression called splines, and more complicated friends. Splines are interesting because advanced variations of splines can move the choice of decision points from a hyperparameter to a parameter: the decision points become part of the best solution instead of an input. Splines also have nice ways of incorporating regularization—there it is generally called smoothness.

Speaking of smoothness, continuity, and discontinuity: these topics go very, very deep. I’ve mostly used smoothness in an everyday sense. However, in math, we can define smoothness in many different ways. As a result, in mathematics, some continuous functions might not be smooth. Smoothness can be an additional constraint, beyond continuity, to make sure functions are very well behaved.

For more details on implementing your own sklearn-compatible learners, see http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator. Be aware that the class parameters they discuss there are not the model parameters we have discussed. Instead, they are the model hyperparameters. Part of the problem is the double meaning of parameter in computer science (for information passed into a function) and mathematics (for knobs a machine-model adjusts). For more details, see Section 11.2.1.

After we get through Chapter 10—coming up next, stay tuned—we’ll have some additional techniques that would let us do more general piecewise linear regression that works on multiple features.

9.6.3 Exercises

  1. Build a good old-fashioned, ridge, and lasso regression on the same dataset. Examine the coefficients. Do you notice any patterns? What if you vary the amount of regularization to the ridge and lasso?

  2. Create data that represents a step function: from zero to one, it has value of zero; from one to two, it has a value of one; from two to three, it has the value two, and so on. What line is estimated when you fit a good old-fashioned linear regression to it? Conceptually, what happens if you draw a line through the fronts of the steps? What about the backs? Are any of these lines distinctly better from an error or residual perspective?

  3. Here’s a conceptual question. If you’re making a piecewise linear regression, what would it mean for split points to underfit or overfit the data? How would you assess if either is happening? Follow up: what would a complexity curve for piecewise regression look like? What would it assess?

  4. Evaluate the resource usage for the different learners on the student dataset.

  5. Compare the performance of our regression methods on some other datasets. You might look at the digits via datasets.load_digits.

  6. Make a simple, synthetic regression dataset. Examine the effect of different values of ν (nu) when you build a NuSVR. Try to make some systematic changes to the regression data. Again, vary ν. Can you spot any patterns? (Warning: patterns that arise in 1, 2, or just a few dimensions—that is, with just a few features—may not hold in higher dimensions.)

  7. Conceptual question. What would piecewise parabolas look like? What would happen if we require them to connect at the ends of their intervals?
