Linear regression diagnostics

I would be negligent if I failed to mention the boring but very critical topic of the assumptions of linear models, and how to detect violations of those assumptions. Just like the hypothesis tests in Chapter 6, Testing Hypotheses, linear regression has its own set of assumptions, the violation of which jeopardizes the accuracy of our model—and any inferences derived from it—to varying degrees. The checks and tests that ensure these assumptions are met are called diagnostics.

There are five major assumptions of linear regression:

  • That the errors (residuals) are normally distributed with a mean of 0
  • That the error terms are uncorrelated
  • That the errors have a constant variance
  • That the effects of the independent variables on the dependent variable are linear and additive
  • That multicollinearity is at a minimum

We'll briefly touch on these assumptions, and how to check for them, in this section. To do this, we will be using a residual-fitted plot, since it allows us, with some skill, to verify most of these assumptions. To view a residual-fitted plot, just call the plot function on your linear model object:

  > my.model <- lm(mpg ~ wt, data=mtcars)
  > plot(my.model)

This will show you a series of four diagnostic plots—the residual-fitted plot is the first. You can also opt to view just the residual-fitted plot with this related incantation:

  > plot(my.model, which=1)

We are also going back to Anscombe's Quartet, since the quartet's aberrant relationships collectively illustrate the problems you might run into when fitting regression models whose assumptions are violated. To re-familiarize yourself with the quartet, look back to Figure 8.6.

Second Anscombe relationship

The first relationship in Anscombe's Quartet (y1 ~ x1) is the only one that can appropriately be modeled with linear regression as is. In contrast, the second relationship (y2 ~ x2) depicts a relationship that violates the requirement of a linear relationship. It also subtly violates the assumption of normally distributed residuals with a mean of zero. To see why, refer to Figure 8.11, which depicts its residual-fitted plot:


Figure 8.11: The top two panels show the first and second relationships of Anscombe's quartet, respectively. The bottom two panels depict each top panel's respective residual-fitted plot

A non-pathological residual-fitted plot will have data points randomly distributed along the invisible horizontal line, where the y-axis equals 0. By default, this plot also contains a smooth curve that attempts to fit the residuals. In a non-pathological sample, this smooth curve should be approximately straight, and straddle the line at y=0.

As you can see, the first Anscombe relationship does this well. In contrast, the smooth curve of the second relationship is a parabola. These residuals could have been drawn from a normal distribution with a mean of zero, but it is highly unlikely. Instead, it looks like these residuals were drawn from a distribution—perhaps a normal distribution—whose mean changed as a function of the x-axis. Specifically, it appears as if the residuals at the two ends were drawn from a distribution whose mean was negative, while the middle residuals came from one whose mean was positive.
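If you'd like to reproduce these residual-fitted plots yourself, a minimal sketch using R's built-in anscombe data set might look like the following (the side-by-side panel layout is just one way of arranging the plots):

  > # fit the first and second Anscombe relationships
  > first.model <- lm(y1 ~ x1, data=anscombe)
  > second.model <- lm(y2 ~ x2, data=anscombe)
  > # draw the two residual-fitted plots side by side
  > par(mfrow=c(1, 2))
  > plot(first.model, which=1)
  > plot(second.model, which=1)
  > par(mfrow=c(1, 1))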

Third Anscombe relationship

We already dug deeper into this relationship when we spoke of robust regression earlier in the chapter. We saw that a robust fit of this relationship more or less ignored the clear outlier. Indeed, the robust fit is almost identical to the non-robust linear fit after the outlier is removed.

On occasion, a data point that is an outlier in the y-axis but not the x-axis (like this one) doesn't influence the regression line much—meaning that its omission wouldn't cause a substantial change in the estimated intercept and coefficients.

A data point that is an outlier in the x-axis (or axes) is said to have high leverage. Sometimes, points with high leverage don't influence the regression line much, either. However, data points that have high leverage and are outliers very often exert high influence on the regression fit, and must be handled appropriately.
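As a quick aside, base R can report each data point's leverage (its hat value) and a common measure of its influence (Cook's distance) directly; the following sketch does this for the third Anscombe relationship, though the same calls work for any lm object:

  > third.model <- lm(y3 ~ x3, data=anscombe)
  > # hat values measure leverage
  > hatvalues(third.model)
  > # Cook's distance measures each point's influence on the fit
  > cooks.distance(third.model)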

Refer to the upper-right panel of Figure 8.12. The aberrant data point in the fourth relationship of Anscombe's quartet has very high leverage and high influence. Note that the slope of the regression line is completely determined by the y-position of that point.

Fourth Anscombe relationship

The following image depicts some of the linear regression diagnostic plots of the fourth Anscombe relationship:


Figure 8.12: The first and the fourth Anscombe relationships and their respective residual-fitted plots

Although it's difficult to say for sure, this is probably in violation of the assumption of constant variance of residuals (also called homogeneity of variance or homoscedasticity if you're a fancy-pants).

A more illustrative example of the violation of homoscedasticity (or heteroscedasticity) is shown in Figure 8.13:


Figure 8.13: A paradigmatic depiction of the residual-fitted plot of a regression model for which the assumption of homogeneity of variance is violated

The preceding plot depicts the characteristic funnel shape symptomatic of residual-fitted plots of offending regression models. Notice how, on the left, the residuals vary very little, but the variance grows as you move along the x-axis.

Bear in mind that the residual-fitted plot need not resemble a funnel—any residual-fitted plot that clearly shows the variance changing as a function of the x-axis indicates a violation of this assumption.
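If you want to see this for yourself, a small simulated sketch (the particular numbers are arbitrary) will produce the characteristic funnel shape, since the error's standard deviation is made to grow with the predictor:

  > set.seed(1)
  > x <- runif(200, 1, 10)
  > # the spread of the errors grows with x, violating homoscedasticity
  > y <- 2 + 3*x + rnorm(200, mean=0, sd=x)
  > hetero.model <- lm(y ~ x)
  > plot(hetero.model, which=1)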

Looking back on Anscombe's Quartet, you may think that the three relationships' unsuitability for linear modeling was obvious, and you may not immediately see the benefit of diagnostic plots. But before you write off the art (not science) of linear regression diagnostics, consider that these were all relationships with a single predictor. In multiple regression, with tens of predictors (or more), it is very difficult to diagnose problems by just plotting different cuts of the data. It is in this domain where linear regression diagnostics really shine.

Finally, the last hazard to be mindful of when linearly regressing is the problem of collinearity or multicollinearity. Collinearity occurs when two (or more) predictors are very highly correlated. This causes multiple problems for regression models, including highly uncertain and unstable coefficient estimates. An extreme example of this would be if we were trying to predict weight from height, and we had both height in feet and height in meters as predictors. In its simplest case, collinearity can be checked for by looking at the correlation matrix of all the regressors (using the cor function); any cell with a high correlation coefficient implicates two predictors that are highly correlated and, therefore, hold redundant information in the model. In theory, one of these predictors should be removed.
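For example, a minimal check on a handful of the mtcars columns (the choice of predictors here is purely illustrative) might look like this:

  > # pairwise correlations between candidate predictors
  > round(cor(mtcars[, c("wt", "disp", "hp", "qsec")]), 2)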

A sneakier issue presents itself when no two individual predictors are highly correlated, but there are multiple predictors that are collectively correlated. This is multicollinearity. This would occur to a small extent, for example, if instead of predicting mpg from other variables in the mtcars data set, we were trying to predict a (non-existent) new variable using mpg and the other predictors. Since we know that mpg can be fairly reliably estimated from some of the other variables in mtcars, when it is a predictor in a regression modeling another variable, it would be difficult to tell whether the target's variance is truly explained by mpg, or whether it is explained by mpg's predictors.
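To convince yourself that mpg really can be estimated well from other columns of mtcars, a quick check like the following (the particular predictors are just an illustration) reports the R-squared of such a model:

  > # how well do other variables explain mpg?
  > summary(lm(mpg ~ wt + hp + disp, data=mtcars))$r.squared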

The most common technique to detect multicollinearity is to calculate each predictor variable's Variance Inflation Factor (VIF). The VIF measures how much larger the variance of a coefficient is because of its collinearity. Mathematically, the VIF of a predictor, $x_i$, is:

$$ VIF_i = \frac{1}{1 - R_i^2} $$

where $R_i^2$ is the $R^2$ of a linear model predicting $x_i$ from all the other predictors ($x_j$, $j \neq i$).

As such, the VIF has a lower bound of one (in the case that the predictor cannot be predicted at all from the other predictors) and no upper bound. In general, most view VIFs of more than four as cause for concern, and VIFs of 10 or above as indicative of a very high degree of multicollinearity. You can calculate VIFs for a model, post hoc, with the vif function from the car package:

  > model <- lm(mpg ~ am + wt + qsec, data=mtcars)
  > library(car)
  > vif(model)
        am       wt     qsec 
  2.541437 2.482952 1.364339
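If you'd like to verify the formula from earlier, you can compute one of these VIFs by hand; the result should agree (up to rounding) with vif's output for am:

  > # R-squared of a model predicting am from the other predictors
  > r.squared.am <- summary(lm(am ~ wt + qsec, data=mtcars))$r.squared
  > # VIF of am, per the definition above
  > 1 / (1 - r.squared.am)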