When the goal of using regression is simply predictive modeling, we often don't care about which particular predictors go into our model, so long as the final model yields the best possible predictions.
A naïve (and awful) approach is to use every independent variable available to model the dependent variable. Let's try this approach by predicting mpg from every other variable in the mtcars dataset:
> # the period after the squiggly denotes all other variables
> model <- lm(mpg ~ ., data=mtcars)
> summary(model)

Call:
lm(formula = mpg ~ ., data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4506 -1.6044 -0.1196  1.2193  4.6271

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337   18.71788   0.657   0.5181
cyl         -0.11144    1.04502  -0.107   0.9161
disp         0.01334    0.01786   0.747   0.4635
hp          -0.02148    0.02177  -0.987   0.3350
drat         0.78711    1.63537   0.481   0.6353
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739
vs           0.31776    2.10451   0.151   0.8814
am           2.52023    2.05665   1.225   0.2340
gear         0.65541    1.49326   0.439   0.6652
carb        -0.19942    0.82875  -0.241   0.8122
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
Hey, check out our R-squared value! It looks like our model explains 87% of the variance in the dependent variable. This is really good; it's certainly better than our simple regression models that used weight (wt) and transmission (am), which had R-squared values of 0.753 and 0.36, respectively.
Maybe there's something to just including everything we have in our linear models. In fact, if our only goal is to maximize R-squared, we can always achieve this by throwing every variable we have into the mix, since each additional variable can only increase the amount of variance explained. Even if a newly introduced variable has absolutely no predictive power, the worst it can do is fail to explain any additional variance in the dependent variable; it can never make the model explain less variance.
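If you'd like to convince yourself of this, here is a minimal sketch: we tack a column of pure random noise onto mtcars (the column name noise is made up for this illustration) and compare the R-squared of a model with and without it. The model with the useless extra predictor can never come out lower.

  > # a sketch: add a useless random predictor and watch R-squared not drop
  > set.seed(42)
  > mtcars.noise <- transform(mtcars, noise=runif(nrow(mtcars)))
  >
  > summary(lm(mpg ~ wt, data=mtcars))$r.squared                 # roughly 0.753
  > summary(lm(mpg ~ wt + noise, data=mtcars.noise))$r.squared   # never lower than the line above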
This approach to regression analysis is often (non-affectionately) called kitchen-sink regression, and is akin to throwing all of your variables against a wall to see what sticks. If you have a hunch that this approach to predictive modeling is crummy, your instinct is correct on this one.
To develop your intuition about why this approach backfires, consider building a linear model to predict a dependent variable with only 32 observations, using 200 explanatory variables whose values are drawn uniformly at random. Just by chance, some of those variables will very likely correlate strongly with the dependent variable. A linear regression that includes some of these lucky variables will yield a model that is surprisingly (sometimes astoundingly) predictive.
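Here is a rough sketch of that thought experiment; the object names (random.preds, lucky, lucky.model, and so on) are just made up for this illustration:

  > # a sketch of the thought experiment: 32 observations, 200 random predictors
  > set.seed(1)
  > n <- 32; p <- 200
  > random.preds <- as.data.frame(matrix(runif(n*p), nrow=n))
  > random.y <- runif(n)
  >
  > # pick out the handful of "lucky" predictors that happen to correlate with y
  > correlations <- sapply(random.preds, function(column) abs(cor(column, random.y)))
  > lucky <- names(sort(correlations, decreasing=TRUE))[1:5]
  >
  > lucky.model <- lm(random.y ~ ., data=random.preds[lucky])
  > summary(lucky.model)$r.squared   # surprisingly high, even though it's all noise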
Remember that when we are creating predictive models, we rarely (if ever) care about how well we can predict the data we already have. The whole point of predictive analytics is to be able to predict the behavior of data we don't have yet. For example, memorizing the answer key to last year's Social Studies final won't help you on this year's final if the questions change; it'll only prove that you could have gotten an A+ on last year's test.
Imagine generating a new random dataset of 200 explanatory variables and one dependent variable, and then predicting the new dependent variable using the coefficients from the linear model fit to the first random dataset. How well do you think the model will perform?
The model will, of course, perform very poorly, because the coefficients in the model were informed solely by random noise. The model captured chance patterns in the data that it was built with and not a larger, more general pattern—mostly because there was no larger pattern to model!
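Continuing the sketch above (and reusing lucky.model, lucky, n, and p from it), we can generate a second batch of random data and see how poorly the predictions hold up:

  > # continuing the sketch: score the "lucky" model on brand-new random data
  > new.preds <- as.data.frame(matrix(runif(n*p), nrow=n))
  > new.y <- runif(n)
  >
  > predictions <- predict(lucky.model, newdata=new.preds[lucky])
  > cor(predictions, new.y)^2   # should be close to zero; the original fit was just noise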
In statistical learning parlance, this phenomenon is called overfitting, and it happens often when there are many predictors in a model. It is particularly common when the number of observations is less than (or not much larger than) the number of predictor variables (as in mtcars), because there is a greater probability that some of the many predictors will have a spurious relationship with the dependent variable.
This general occurrence (a model performing well on the data it was built with but poorly on subsequent data) perfectly illustrates perhaps the most common complication in statistical learning and predictive analytics: the bias-variance tradeoff.