In statistical learning, the bias of a model refers to the error of the model introduced by attempting to model a complicated real-life relationship with an approximation. A model with no bias will never make any errors in prediction (like the cookie-area prediction problem). A model with high bias will fail to accurately predict its dependent variable.
The variance of a model refers to how sensitive a model is to changes in the data that built the model. A model with low variance would change very little when built with new data. A linear model with high variance is very sensitive to changes to the data that it was built with, and the estimated coefficients will be unstable.
The term bias-variance tradeoff illustrates that it is easy to decrease bias at the expense of increasing variance, and vice-versa. Good models will try to minimize both.
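To get a rough feel for what high variance looks like in practice, we could, for example, refit the kitchen-sink model of mpg on bootstrap resamples of mtcars and watch how much a single coefficient jumps around. Something like the following sketch (my own illustration, not part of the analysis proper) would do:

# A rough illustration of variance: refit the kitchen-sink model
# (mpg regressed on everything else) on bootstrap resamples of mtcars
# and see how much the estimated coefficient on wt moves around.
set.seed(2)
wt.coefs <- replicate(100, {
  rows <- sample(1:nrow(mtcars), replace=TRUE)
  coef(lm(mpg ~ ., data=mtcars[rows, ]))["wt"]
})
sd(wt.coefs)   # a large standard deviation means an unstable estimate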
Figure 8.9 depicts two extremes of the bias-variance tradeoff. The left-most model depicts a complicated and highly convoluted model that passes through all the data points. This model has essentially no bias, as it has no error when predicting the data that it was built with. However, the model is clearly picking up on random noise in the data set, and if the model were used to predict new data, there would be significant error. If the same general model were rebuilt with new data, the model would change significantly (high variance).
As a result, the model is not generalizable to new data. Models like this suffer from overfitting, which often occurs when overly complicated or overly flexible models are fitted to data—especially when sample size is lacking.
In contrast, the model on the right panel of Figure 8.9 is a simple model (the simplest, actually). It is just a horizontal line at the mean of the dependent variable, mpg. This does a pretty terrible job modeling the variance in the dependent variable, and exhibits high bias. This model does have one attractive property though—the model will barely change at all if fit to new data; the horizontal line will just move up or down slightly based on the mean of the mpg column of the new data.
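In lm terms, this simplest of models is just an intercept-only regression; a quick sketch:

# The simplest possible model: a horizontal line at the mean of mpg.
# An intercept-only formula (mpg ~ 1) fits exactly that.
simplest.model <- lm(mpg ~ 1, data=mtcars)
coef(simplest.model)          # the single coefficient...
mean(mtcars$mpg)              # ...is just the mean of mpg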
To demonstrate that our kitchen sink regression puts us on the wrong side of the optimal point in the bias-variance tradeoff, we will use a model validation and assessment technique called cross-validation.
Given that the goal of predictive analytics is to build generalizable models that predict well for data yet unobserved, we should ideally be testing our models on data unseen, and check our predictions against the observed outcomes. The problem with that, of course, is that we don't know the outcomes of data unseen—that's why we want a predictive model. We do, however, have a trick up our sleeve, called the validation set approach.
The validation set approach is a technique to evaluate a model's ability to perform well on an independent dataset. But instead of waiting to get our hands on a completely new dataset, we simulate a new dataset with the one we already have.
The main idea is that we can split our dataset into two subsets; one of these subsets (called the training set) is used to fit our model, and then the other (the testing set) is used to test the accuracy of that model. Since the model was built before ever touching the testing set, the testing set serves as an independent source of prediction accuracy estimates, untainted by whatever accuracy the model owes to fitting idiosyncratic noise.
To get at our predictive accuracy by performing our own validation set approach, let's use the sample function to divide the row indices of mtcars into two equal groups, create the subsets, and train a model on the training set:
> set.seed(1)
> train.indices <- sample(1:nrow(mtcars), nrow(mtcars)/2)
> training <- mtcars[train.indices,]
> testing <- mtcars[-train.indices,]
> model <- lm(mpg ~ ., data=training)
> summary(model)
..... (output truncated)
Residual standard error: 1.188 on 5 degrees of freedom
Multiple R-squared:  0.988,    Adjusted R-squared:  0.9639
F-statistic: 41.06 on 10 and 5 DF,  p-value: 0.0003599
Before we go on, note that the model now explains a whopping 99% of the variance in mpg. Any R-squared this high should be a red flag; I've never seen a legitimate model with an R-squared this high on a non-contrived dataset. The increase in R-squared is attributable primarily to the decrease in observations (from 32 to 16) and the resultant increased opportunity to model spurious correlations.
Let's calculate the MSE of the model on the training dataset. To do this, we will be using the predict function without the newdata argument, which means the predictions are made on the same data the model was trained with (these predictions are referred to as the fitted values):
> mean((predict(model) - training$mpg) ^ 2)
[1] 0.4408109
> # Cool, but how does it perform on the validation set?
> mean((predict(model, newdata=testing) - testing$mpg) ^ 2)
[1] 337.9995
My word!
In practice, the error on the training data is almost always a little lower than the error on the testing data. However, a discrepancy in the MSE between the training and testing set as large as this is a clear-as-day indication that our model doesn't generalize.
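Since we will be computing this quantity a few more times, you could wrap the pattern in a small helper function. A sketch might look like this (the mse function here is my own convenience, not a built-in):

# Hypothetical helper: mean squared error of a model's predictions
# on an arbitrary dataset.
mse <- function(model, data, response) {
  mean((predict(model, newdata=data) - data[[response]])^2)
}

mse(model, training, "mpg")   # training MSE (same value as above)
mse(model, testing,  "mpg")   # validation set MSE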
Let's compare this model's validation set performance to a simpler model with a lower R-squared, which only uses am and wt as predictors:
> simpler.model <- lm(mpg ~ am + wt, data=training)
> mean((predict(simpler.model) - training$mpg) ^ 2)
[1] 9.396091
> mean((predict(simpler.model, newdata=testing) - testing$mpg) ^ 2)
[1] 12.70338
Notice that the MSE on the training data is much higher, but our validation set MSE is much lower.
If the goal were simply to maximize the R-squared, then the more predictors, the better. If the goal is a generalizable and useful predictive model, however, we should aim to minimize the testing set MSE.
The validation set approach outlined in the previous paragraph has two important drawbacks. For one, the model was only built using half of the available data. Secondly, we only tested the model's performance on one testing set; as if by a magician's sleight of hand, our testing set could have contained some bizarre, hard-to-predict examples that would make the validation set MSE misleadingly large.
Consider the following change to the approach: we divide the data up, just as before, into set a and set b. Then, we train the model on set a, test it on set b, then train it on b and test it on a. This approach has a clear advantage over our previous approach, because it averages the out-of-sample MSE of two testing sets. Additionally, the model will now be informed by all the data. This is called two-fold cross validation, and the general technique is called k-fold cross validation.
To see how k-fold cross validation works in a more general sense, consider the procedure to perform k-fold cross validation where k=5. First, we divide the data into five equal groups (sets a, b, c, d, and e), and we train the model on the data from sets a, b, c, and d. Then we record the MSE of the model against unseen data in set e. We repeat this four more times—leaving out a different set and testing the model with it. Finally, the average of our five out-of-sample MSEs is our five-fold cross validated MSE.
Your goal, now, should be to select a model that minimizes the k-fold cross validation MSE. Common choices of k are 5 and 10.
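Before we hand this off to a packaged function, it may help to see what the procedure looks like spelled out by hand. Here is a rough sketch of what five-fold cross validation might look like if we rolled it ourselves (using the simpler am + wt model from above):

# A hand-rolled 5-fold cross-validation sketch, for illustration only.
set.seed(1)
k <- 5
# randomly assign each row of mtcars to one of k folds
folds <- sample(rep(1:k, length.out=nrow(mtcars)))

fold.mses <- sapply(1:k, function(i) {
  train.fold <- mtcars[folds != i, ]      # train on all folds but fold i
  test.fold  <- mtcars[folds == i, ]      # hold out fold i for testing
  fit <- lm(mpg ~ am + wt, data=train.fold)
  mean((predict(fit, newdata=test.fold) - test.fold$mpg)^2)
})

mean(fold.mses)   # the 5-fold cross-validated MSE estimate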
To perform k-fold cross validation, we will be using the cv.glm function from the boot package. This will also require us to build our models using the glm function (this stands for generalized linear models, which we'll learn about in the next chapter) instead of lm. For current purposes, it is a drop-in replacement:
> library(boot)
> bad.model <- glm(mpg ~ ., data=mtcars)
> better.model <- glm(mpg ~ am + wt + qsec, data=mtcars)
>
> bad.cv.err <- cv.glm(mtcars, bad.model, K=5)
> # the cross-validated MSE estimate we will be using
> # is a bias-corrected one stored as the second element
> # in the 'delta' vector of the cv.err object
> bad.cv.err$delta[2]
[1] 14.92426
>
> better.cv.err <- cv.glm(mtcars, better.model, K=5)
> better.cv.err$delta[2]
[1] 7.944148
The use of k-fold cross validation over the simple validation set approach has illustrated that the kitchen-sink model is not as bad as we previously thought (because we trained it using more data), but it is still outperformed by the far simpler model that includes only am, wt, and qsec as predictors.
This out-performance by a simple model is no idiosyncrasy of this dataset; it is a well-observed phenomenon in predictive analytics. Simpler models often outperform overly complicated models because of the resistance of a simpler model to overfitting. Further, simpler models are easier to interpret, to understand, and to use. The idea that, given the same level of predictive power, we should prefer simpler models to complicated ones is expressed in a famous principle called Occam's Razor.
Finally, we have enough background information to discuss the only piece of the lm summary output we haven't touched upon yet: adjusted R-squared. Adjusted R-squared attempts to take into account the fact that extraneous variables thrown into a linear model will always increase its R-squared. Adjusted R-squared, therefore, takes the number of predictors into account. As such, it penalizes complex models. Adjusted R-squared will always be equal to or lower than non-adjusted R-squared (it can even go negative!). The addition of each marginal predictor will only cause an increase in adjusted R-squared if it contributes significantly to the predictive power of the model, that is, more than would be dictated by chance. If it doesn't, the adjusted R-squared will decrease. Adjusted R-squared has some great properties, and as a result, many will try to select models that maximize the adjusted R-squared, but I prefer the minimization of cross-validated MSE as my main model selection criterion.
Compare for yourself the adjusted R-squared of the kitchen-sink model and a model using am, wt, and qsec.
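If you want to check your work programmatically, the value is stored in the object returned by summary; something like the following sketch will do the comparison:

# Adjusted R-squared is available directly from the summary object.
# For reference: adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n is the number of observations and p the number of predictors.
summary(lm(mpg ~ .,              data=mtcars))$adj.r.squared
summary(lm(mpg ~ am + wt + qsec, data=mtcars))$adj.r.squared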
As Figure 8.10 depicts, as a model becomes more complicated/flexible—as it starts to include more and more predictors—the bias of the model continues to decrease. Along the complexity axis, as the model begins to fit the data better and better, the cross-validation error decreases as well. At a certain point, the model becomes overly complex, and begins to fit idiosyncratic noise in the training data set—it overfits! The cross-validation error begins to climb again, even as the bias of the model approaches its theoretical minimum!
The very left of the plot depicts models with too much bias, but little variance. The right side of the plot depicts models that have very low bias, but very high variance, and thus, are useless predictive models.
The ideal point in this bias-variance tradeoff is at the point where the cross-validation error (not the training error) is minimized.
Okay, so how do we get there?
Although there are more advanced methods that we'll touch on in the section called Advanced Topics, at this stage of the game, our primary recourse for finding our bias-variance tradeoff sweet spot is careful feature selection.
In statistical learning parlance, feature selection refers to selecting which predictor variables to include in our model (for some reason, they call predictor variables features).
I emphasized the word careful, because there are plenty of dangerous ways to do this. One such method—and perhaps the most intuitive—is to simply build models containing every possible subset of the available predictors, and choose the best one as measured by adjusted R-squared or the minimization of cross-validated error. Probably the biggest problem with this approach is that it's computationally very expensive—to build a model for every possible subset of predictors in mtcars, you would need to build (and cross validate) 1,023 different models. The number of possible models rises exponentially with the number of predictors. Because of this, for many real-world modeling scenarios, this method is out of the question.
There is another approach that, for the most part, solves the problem of the computational intractability of the all-possible-subsets approach: step-wise regression.
Stepwise regression is a technique that programmatically tests different predictor combinations by adding predictors in (forward stepwise), or taking predictors out (backward stepwise), according to the value that each predictor adds to the model as measured by its influence on the adjusted R-squared. Therefore, like the all-possible-subsets approach, stepwise regression automates the process of feature selection.
There are numerous problems with this approach. The least of these is that it is not guaranteed to find the best possible model.
One of the primary issues that people cite is that it results in lazy science by absolving us of the need to think out the problem, because we let an automated procedure make decisions for us. This school of thought usually holds that models should be informed, at least partially, by some amount of theory and domain expertise.
It is for these reasons that stepwise regression has fallen out of favor among many statisticians, and why I'm choosing not to recommend using it.
Stepwise regression is like alcohol: some people can use it without incident, but some can't use it safely. It is also like alcohol in that if you think you need to use it, you've got a big problem. Finally, neither can be advertised to children.
At this stage of the game, I suggest that your main approach to balancing bias and variance should be informed theory-driven feature selection, and paying close attention to k-fold cross validation results. In cases where you have absolutely no theory, I suggest using regularization, a technique that is, unfortunately, beyond the scope of this text. The section Advanced topics briefly extols the virtues of regularization, if you want more information.