Problems with linear regression

In this chapter, we've already seen some examples where trying to build a linear regression model might run into problems. One big class of problems that we've discussed relates to our model assumptions of linearity, feature independence, and the homoscedasticity and normality of errors. In particular, we saw methods of diagnosing these problems either via plots, such as the residual plot, or by using functions that identify linearly dependent components. In this section, we'll investigate a few more issues that can arise with linear regression.

Multicollinearity

As part of our preprocessing steps, we were diligent in removing features that were exact linear functions of other features; this is an example of perfect collinearity. Collinearity is the property that describes two features that are approximately in a linear relationship. This creates a problem for linear regression because we are trying to assign separate coefficients to variables that are almost linear functions of each other. One common symptom is that the coefficients of two highly collinear features both have high p-values, suggesting that they are not related to the output variable, yet if we remove one of them and retrain the model, the one left in has a low p-value. Another classic indication of collinearity is an unusual sign on one of the coefficients; for example, a negative coefficient on educational background in a linear model that predicts income. Collinearity between two features can be detected through their pairwise correlation. One way to deal with collinearity is to combine the two features into a new one (for example, by averaging them); another is simply to discard one of the features.
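As a quick sketch of the pairwise correlation check (assuming cars_train is the training data frame used to fit cars_model2, with Price as the output column and the remaining columns numeric), we can compute the correlation matrix of the input features and flag any pairs whose absolute correlation exceeds an illustrative cutoff such as 0.8:

> # Pairwise correlations between the input features (Price is the output)
> cars_cor <- cor(cars_train[, names(cars_train) != "Price"])
> # Report any feature pairs above an illustrative 0.8 cutoff
> which(abs(cars_cor) > 0.8 & upper.tri(cars_cor), arr.ind = TRUE)

Any pair of features that shows up here is a candidate for being averaged into a single feature or for having one of its members dropped, as described previously.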

Multicollinearity occurs when the linear relationship involves more than two features. A standard method for detecting this is to calculate the variance inflation factor (VIF) for every input feature in a linear model. In a nutshell, the VIF estimates the increase in the variance of a particular coefficient estimate that is a direct result of that feature being collinear with other features. It is typically computed by fitting a linear regression model in which we treat the feature in question as the output feature and the remaining features as regular input features. We then compute the R2 statistic for this linear model and, from this, the VIF for our chosen feature using the formula 1 / (1 - R2). In R, the car package contains the vif() function, which conveniently calculates the VIF value for every feature in a linear regression model. A rule of thumb is that a VIF score of 4 or more for a feature is suspect, and a score in excess of 10 indicates a strong likelihood of multicollinearity. Since we saw that our cars data had linearly dependent features that we had to remove, let's investigate whether we have multicollinearity in those that remain:

> library("car")
> vif(cars_model2)
    Mileage    Cylinder       Doors      Cruise       Sound 
   1.010779    2.305737    4.663813    1.527898    1.137607 
    Leather       Buick    Cadillac       Chevy     Pontiac 
   1.205977    2.464238    3.158473    4.138318    3.201605 
       Saab convertible   hatchback       sedan 
   3.515018    1.620590    2.481131    4.550556

We see three values here that are slightly above 4, but none approaching the threshold of 10 that would indicate a strong likelihood of multicollinearity. As an example, the following code shows how the VIF value for sedan was calculated:

> sedan_model <- lm(sedan ~ .-Price -Saturn, data = cars_train)
> sedan_R2 <- compute_rsquared(sedan_model$fitted.values, cars_train$sedan)
> 1 / (1-sedan_R2)
[1] 4.550556

Outliers

When we looked at the residuals of our two models, we saw that certain observations had significantly higher residuals than others. For example, referring to the residual plot for the CPU model, we can see that observation 200 has a very high residual. This is an example of an outlier, an observation whose predicted value is very far from its actual value. Due to the squaring of residuals, outliers tend to have a significant impact on the RSS, giving us a sense that we don't have a good model fit. Outliers can occur due to measurement errors, and detecting them may be important, as they may signify data that is inaccurate or invalid. On the other hand, outliers may simply be the result of not having the right features or building the wrong kind of model.

As we generally won't know whether an outlier is an error or a genuine observation made during data collection, handling outliers can be very tricky. A common recourse, especially when we have very few outliers, is to remove them, because including them frequently changes the estimated model coefficients significantly. We say that outliers are often points with high influence.
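As a minimal sketch of how to spot such points (assuming machine_model is the name of the original CPU model fit earlier in the chapter), we can rank the training observations by the absolute size of their residuals; observation 200 should appear at or near the top of this list:

> # Largest absolute residuals in the original CPU model
> head(sort(abs(residuals(machine_model)), decreasing = TRUE), 5)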

Note

Outliers are not the only observations that can have high influence. High leverage points are observations that have an extreme value for at least one of their features and thus lie far away from most of the other observations. Cook's distance is a typical metric that combines the notions of outlier and high leverage to identify points that have a high influence on the fitted model. For a more thorough exploration of linear regression diagnostics, a wonderful reference is An R Companion to Applied Regression, John Fox, Sage Publications.
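As a brief sketch of these diagnostics (again assuming machine_model is the original CPU model), base R provides cooks.distance() and hatvalues() for Cook's distance and leverage, and the car package we loaded earlier offers influencePlot() to visualize them together:

> # Cook's distance and leverage (hat values) for each training observation
> machine_cooks <- cooks.distance(machine_model)
> machine_leverage <- hatvalues(machine_model)
> # The most influential observations according to Cook's distance
> head(sort(machine_cooks, decreasing = TRUE), 3)
> # Plot of studentized residuals versus leverage, with points sized by Cook's distance
> influencePlot(machine_model)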

To illustrate the effect of removing an outlier, we will create a new CPU model by using our training data without observation number 200. Then, we will see whether our model has an improved fit on the training data. Here, we've shown the steps taken and a truncated model summary with only the final three lines:

> machine_model2 <- lm(PRP ~ ., data = machine_train[!(rownames(machine_train) %in% c(200)), ])
> summary(machine_model2)
...
Residual standard error: 51.37 on 171 degrees of freedom
Multiple R-squared:  0.8884,	Adjusted R-squared:  0.8844
F-statistic: 226.8 on 6 and 171 DF,  p-value: < 2.2e-16

As we can see from the reduced RSE and improved R2, we have a better fit on our training data. Of course, the real measure of model accuracy is the performance on the test data, and there are no guarantees that our decision to label observation 200 as a spurious outlier was the right one.

> machine_model2_predictions <- predict(machine_model2, 
                                        machine_test)
> compute_mse(machine_model2_predictions, machine_test$PRP)
[1] 2555.355

We have a lower test MSE than before, which is usually a good sign that we made the right choice. Again, because we have a small test set, we cannot be certain of this fact despite the positive indication from the MSE.
