I would be negligent if I failed to mention the boring but very critical topic of the assumptions of linear models, and how to detect violations of those assumptions. Just like the hypothesis tests in Chapter 6, Testing Hypotheses, linear regression has its own set of assumptions, the violation of which jeopardizes the accuracy of our model—and any inferences derived from it—to varying degrees. The checks and tests that ensure these assumptions are met are called diagnostics.
There are five major assumptions of linear regression:
- The relationship between the predictors and the target is linear
- The errors (residuals) are independent of each other
- The errors are normally distributed with a mean of zero
- The errors have a constant variance (homoscedasticity)
- The predictors are not highly correlated with each other (no multicollinearity)
We'll briefly touch on these assumptions, and how to check for them, in this section. To do this, we will be using a residual-fitted plot, since it allows us, with some skill, to verify most of these assumptions. To view a residual-fitted plot, just call the plot function on your linear model object:
> my.model <- lm(mpg ~ wt, data=mtcars)
> plot(my.model)
This will show you a series of four diagnostic plots—the residual-fitted plot is the first. You can also opt to view just the residual-fitted plot with this related incantation:
> plot(my.model, which=1)
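Incidentally, if you would rather see all four diagnostic plots at once instead of cycling through them, a common base R idiom (not shown in the text above) is to split the plotting device into a two-by-two grid first:

> par(mfrow=c(2, 2))   # split the plotting device into a 2x2 grid
> plot(my.model)       # now all four diagnostic plots appear together
> par(mfrow=c(1, 1))   # restore the default single-plot layout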
We will also be going back to Anscombe's Quartet, since the quartet's aberrant relationships collectively illustrate the problems and assumption violations that you might run into when fitting regression models. To re-familiarize yourself with the quartet, look back at Figure 8.6.
The first relationship in Anscombe's Quartet (y1 ~ x1) is the only one that can appropriately be modeled with linear regression as is. In contrast, the second relationship (y2 ~ x2) violates the requirement of linearity. It also subtly violates the assumption of normally distributed residuals with a mean of zero. To see why, refer to Figure 8.11, which depicts its residual-fitted plot:
A non-pathological residual-fitted plot will have data points randomly distributed along the invisible horizontal line where y equals 0. By default, this plot also contains a smooth curve that attempts to fit the residuals. In a non-pathological sample, this smooth curve should be approximately straight and straddle the line at y=0.
As you can see, the first Anscombe relationship does this well. In contrast, the smooth curve of the second relationship is a parabola. These residuals could have been drawn from a normal distribution with a mean of zero, but it is highly unlikely. Instead, it looks like these residuals were drawn from a distribution—perhaps from a normal distribution—whose mean changed as a function of the x-axis. Specifically, it appears as if the residuals at the two ends were drawn from a distribution whose mean was negative, and the middle residuals had a positive mean.
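If you would like to reproduce this plot yourself, here is a quick sketch using the anscombe data frame that ships with base R (its columns are named x1 through x4 and y1 through y4):

> second.model <- lm(y2 ~ x2, data=anscombe)   # the second Anscombe relationship
> plot(second.model, which=1)                  # its residual-fitted plot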
We already dug deeper into the third relationship (y3 ~ x3) when we spoke of robust regression earlier in the chapter. We saw that a robust fit of that relationship more or less ignored the clear outlier. Indeed, the robust fit is almost identical to the non-robust linear fit after the outlier is removed.
On occasion, a data point that is an outlier in the y-axis but not the x-axis (like this one) doesn't influence the regression line much—meaning that its omission wouldn't cause a substantial change in the estimated intercept and coefficients.
A data point that is an outlier in the x-axis (or axes) is said to have high leverage. Sometimes, points with high leverage don't influence the regression line much, either. However, data points that have high leverage and are outliers very often exert high influence on the regression fit, and must be handled appropriately.
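Though the text above doesn't walk through it, base R will quantify leverage and influence for you; here is a minimal sketch using the my.model object we fit earlier:

> hatvalues(my.model)        # leverage (hat value) of each observation
> cooks.distance(my.model)   # Cook's distance, a common measure of influence
> plot(my.model, which=5)    # the residuals vs. leverage diagnostic plot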
Refer to the upper-right panel of Figure 8.12. The aberrant data point in the fourth relationship of Anscombe's quartet has very high leverage and high influence. Note that the slope of the regression line is completely determined by the y-position of that point.
The following image depicts some of the linear regression diagnostic plots of the fourth Anscombe relationship:
Although it's difficult to say for sure, this is probably in violation of the assumption of constant variance of residuals (also called homogeneity of variance or homoscedasticity if you're a fancy-pants).
A more illustrative example of the violation of homoscedasticity (that is, heteroscedasticity) is shown in Figure 8.13:
The preceding plot depicts the characteristic funnel shape symptomatic of the residual-fitted plots of offending regression models. Notice how, on the left, the residuals vary very little, but their variance grows as you move along the x-axis.
Bear in mind that the residual-fitted plot need not resemble a funnel—any residual-fitted plot that very clearly shows the variance changing as a function of the x-axis violates this assumption.
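To see what this looks like in practice, here is a small sketch that simulates heteroscedastic errors (the data and variable names are made up purely for illustration) and reproduces the funnel shape:

> set.seed(1)
> x <- runif(100, 0, 10)
> y <- 2*x + rnorm(100, mean=0, sd=x)   # the error spread grows with x
> het.model <- lm(y ~ x)
> plot(het.model, which=1)              # note the funnel shape

If you have the car package (which we use below) installed, car::ncvTest(het.model) performs a formal test for non-constant error variance.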
Looking back on Anscombe's Quartet, you may think that the three relationships' unsuitability for linear modeling was obvious, and you may not immediately see the benefit of diagnostic plots. But before you write off the art (not science) of linear regression diagnostics, consider that these were all relationships with a single predictor. In multiple regression, with tens of predictors (or more), it is very difficult to diagnose problems by just plotting different cuts of the data. It is in this domain where linear regression diagnostics really shine.
Finally, the last hazard to be mindful of when linearly regressing is the problem of collinearity or multicollinearity. Collinearity occurs when two (or more) predictors are very highly correlated. This causes multiple problems for regression models, including highly uncertain and unstable coefficient estimates. An extreme example of this would be if we were trying to predict weight from height, and we had both height in feet and height in meters as predictors. In its simplest case, collinearity can be checked for by looking at the correlation matrix of all the regressors (using the cor function); any cell that has a high correlation coefficient implicates two predictors that are highly correlated and, therefore, hold redundant information in the model. In theory, one of these predictors should be removed.
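As a quick sketch of this simple check (the choice of mtcars columns here is just for illustration), look for off-diagonal cells close to 1 or -1; wt and disp, for instance, are strongly correlated:

> round(cor(mtcars[, c("wt", "disp", "hp", "drat")]), 2)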
A sneakier issue presents itself when no two individual predictors are highly correlated, but there are multiple predictors that are collectively correlated. This is multicollinearity. This would occur to a small extent if, for example, instead of predicting mpg from the other variables in the mtcars data set, we were trying to predict a (non-existent) new variable using mpg and the other predictors. Since we know that mpg can be fairly reliably estimated from some of the other variables in mtcars, when it is a predictor in a regression modeling another variable, it would be difficult to tell whether the target's variance is truly explained by mpg, or whether it is explained by mpg's predictors.
The most common technique to detect multicollinearity is to calculate each predictor variable's Variance Inflation Factor (VIF). The VIF measures how much larger the variance of a coefficient is because of its collinearity. Mathematically, the VIF of the $i$th predictor, $x_i$, is:

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$

where $R_i^2$ is the $R^2$ of a linear model predicting $x_i$ from all the other predictors ($x_j$ for $j \neq i$).
As such, the VIF has a lower bound of one (in the case that the predictor cannot be predicted at all from the other predictors), and it has no finite upper bound. In general, most view VIFs of more than four as cause for concern, and VIFs of 10 or above as indicative of a very high degree of multicollinearity. You can calculate VIFs for a model, post hoc, with the vif function from the car package:
> model <- lm(mpg ~ am + wt + qsec, data=mtcars)
> library(car)
> vif(model)
      am       wt     qsec
2.541437 2.482952 1.364339
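As a quick check of the formula given above (a sketch, not from the original text), you can reproduce the VIF of wt by hand by regressing it on the other two predictors:

> r2.wt <- summary(lm(wt ~ am + qsec, data=mtcars))$r.squared
> 1 / (1 - r2.wt)   # should match the VIF reported for wt above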