When the goal of using regression is simply predictive modeling, we often don't care about which particular predictors go into our model, so long as the final model yields the best possible predictions.
A naïve (and awful) approach is to use every independent variable available to model the dependent variable. Let's try this approach by predicting mpg from every other variable in the mtcars dataset:
> # the period after the squiggly denotes all other variables
> model <- lm(mpg ~ ., data=mtcars)
> summary(model)

Call:
lm(formula = mpg ~ ., data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4506 -1.6044 -0.1196  1.2193  4.6271

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337   18.71788   0.657   0.5181
cyl         -0.11144    1.04502  -0.107   0.9161
disp         0.01334    0.01786   0.747   0.4635
hp          -0.02148    0.02177  -0.987   0.3350
drat         0.78711    1.63537   0.481   0.6353
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739
vs           0.31776    2.10451   0.151   0.8814
am           2.52023    2.05665   1.225   0.2340
gear         0.65541    1.49326   0.439   0.6652
carb        -0.19942    0.82875  -0.241   0.8122
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
Hey, check out our R-squared value! It looks like our model explains 87% of the variance in the dependent variable. This is really good; it's certainly better than our simple regression models that used weight (wt) and transmission (am), which had R-squared values of 0.753 and 0.36, respectively.
Maybe there's something to just including everything we have in our linear models. In fact, if our only goal is to maximize R-squared, we can always achieve this by throwing every variable we have into the mix, since each additional variable can only increase the amount of variance explained. Even if a newly introduced variable has absolutely no predictive power, the worst it can do is fail to explain any additional variance in the dependent variable; it can never make the model explain less variance.
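If you'd like to convince yourself of this, here is a minimal sketch: we tack a column of pure random noise onto mtcars (the column name noise is made up for this illustration) and compare the R-squared of a model with and without it. The model with the useless extra predictor can never come out lower.

  > # a sketch: add a useless random predictor and watch R-squared not drop
  > set.seed(42)
  > mtcars.noise <- transform(mtcars, noise=runif(nrow(mtcars)))
  >
  > summary(lm(mpg ~ wt, data=mtcars))$r.squared                 # roughly 0.753
  > summary(lm(mpg ~ wt + noise, data=mtcars.noise))$r.squared   # never lower than the line above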
This approach to regression analysis is often (non-affectionately) called kitchen-sink regression, and is akin to throwing all of your variables against a wall to see what sticks. If you have a hunch that this approach to predictive modeling is crummy, your instinct is correct on this one.
To develop your intuition about why this approach backfires, consider building a linear model to predict a dependent variable with only 32 observations, using 200 explanatory variables whose values are drawn uniformly at random. Just by chance, some of those variables will very likely correlate strongly with the dependent variable. A linear regression that includes some of these lucky variables will yield a model that is surprisingly (sometimes astoundingly) predictive.
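Here is a rough sketch of that thought experiment; the object names (random.preds, lucky, lucky.model, and so on) are just made up for this illustration:

  > # a sketch of the thought experiment: 32 observations, 200 random predictors
  > set.seed(1)
  > n <- 32; p <- 200
  > random.preds <- as.data.frame(matrix(runif(n*p), nrow=n))
  > random.y <- runif(n)
  >
  > # pick out the handful of "lucky" predictors that happen to correlate with y
  > correlations <- sapply(random.preds, function(column) abs(cor(column, random.y)))
  > lucky <- names(sort(correlations, decreasing=TRUE))[1:5]
  >
  > lucky.model <- lm(random.y ~ ., data=random.preds[lucky])
  > summary(lucky.model)$r.squared   # surprisingly high, even though it's all noise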
Remember that when we are creating predictive models, we rarely (if ever) care about how well we can predict the data we already have. The whole point of predictive analytics is to be able to predict the behavior of data we don't have yet. For example, memorizing the answer key to last year's Social Studies final won't help you on this year's final if the questions change; it'll only prove that you could have gotten an A+ on last year's test.
Imagine generating a new random dataset of 200 explanatory variables and one dependent variable, and then predicting the new dependent variable using the coefficients from the linear model fit to the first random dataset. How well do you think the model will perform?
The model will, of course, perform very poorly, because the coefficients in the model were informed solely by random noise. The model captured chance patterns in the data that it was built with and not a larger, more general pattern—mostly because there was no larger pattern to model!
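Continuing the sketch above (and reusing lucky.model, lucky, n, and p from it), we can generate a second batch of random data and see how poorly the predictions hold up:

  > # continuing the sketch: score the "lucky" model on brand-new random data
  > new.preds <- as.data.frame(matrix(runif(n*p), nrow=n))
  > new.y <- runif(n)
  >
  > predictions <- predict(lucky.model, newdata=new.preds[lucky])
  > cor(predictions, new.y)^2   # should be close to zero; the original fit was just noise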
In statistical learning parlance, this phenomenon is called overfitting, and it happens often when there are many predictors in a model. It is particularly common when the number of observations is less than (or not much larger than) the number of predictor variables (as in mtcars), because there is a greater probability that some of the many predictors will have a spurious relationship with the dependent variable.
This general occurrence (a model performing well on the data it was built with but poorly on subsequent data) perfectly illustrates perhaps the most common complication in statistical learning and predictive analytics: the bias-variance tradeoff.