How to do it...

In this recipe, we will work with a dataset containing house prices. The intention will be to identify influential observations:

  1. We first load our dataset and formulate our model:
library(car)
data = read.csv("./house_prices_aug.csv")
model = lm(Property_price ~ size + number.bathrooms + number.bedrooms +
  number.entrances + size_balcony + size_entrance, data = data)
  2. We can build a simple plot to identify influential observations. The X axis represents the leverage, while the Y axis represents the size of the standardized residuals. A quick rule is to flag observations as influential if Cook's D is greater than 1. R draws two curves, one for Cook's D = 0.5 and another one for Cook's D = 1. Since we usually focus on observations with Cook's D > 1, we want to find the points outside the curve that delimits Cook's D = 1. Observation 408 has a very large Cook's D, whereas observation 1 has a low Cook's D, even though it has a huge leverage:
plot(model, which = 5)  # which = 5 selects the residuals versus leverage plot

The following screenshot shows the residuals versus leverage plot (observation 408 has a very large Cook's D):
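
If we prefer a programmatic check over reading the plot, we can flag the same points directly from the fitted model. This is a minimal sketch using base R's cooks.distance and hatvalues functions; the threshold of 1 follows the rule of thumb mentioned previously:

# Flag observations whose Cook's D exceeds the rule-of-thumb threshold of 1
cd <- cooks.distance(model)
which(cd > 1)

# The highest-leverage observations, for comparison
hv <- hatvalues(model)
head(sort(hv, decreasing = TRUE))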

  3. With the previous plot, we can't really see whether the high leverage is caused by a specific variable. If we use the leveragePlots function, we can get that easily. This function draws the partial regression leverage plots. The Y axis shows the residuals of the dependent variable, obtained by regressing it on all the variables except for the selected regressor. The X axis shows the residuals of the selected regressor, obtained by regressing it on all the other regressors. We can see that observation 408 has a large leverage mostly because of size_balcony, size_entrance, and size (this is measured along the X axis). Observation 1 has an enormous leverage (because it has a large value of size), but because it falls almost exactly on the regression line, it has a low residual; and because the residual is low, its Cook's D is small:
leveragePlots(model) 

The following screenshot shows the leverage plots: 
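
To make the construction behind these panels concrete, we can reproduce one of them by hand with the closely related added-variable plot (for a single coefficient, the two constructions are essentially equivalent). This is a minimal sketch for size_balcony, assuming the variable names used previously:

# Residuals of the response, regressed on everything except size_balcony
res_y <- resid(lm(Property_price ~ size + number.bathrooms + number.bedrooms +
  number.entrances + size_entrance, data = data))

# Residuals of size_balcony, regressed on all the other regressors
res_x <- resid(lm(size_balcony ~ size + number.bathrooms + number.bedrooms +
  number.entrances + size_entrance, data = data))

plot(res_x, res_y)          # reproduces the size_balcony panel
abline(lm(res_y ~ res_x))   # the slope equals the size_balcony coefficient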

  4. We can test whether we have outliers using the following code. It tests whether each studentized residual is an outlier, applying an appropriate Bonferroni correction (bear in mind that if we do multiple tests with, for instance, an alpha of 0.05, the effective alpha per test is no longer 0.05; therefore, it needs to be corrected). Because we only have one problematic residual, the Bonferroni correction is not relevant here. We get a small p-value, so we reject the null hypothesis that there are no outliers. It's worth noting that observation 1 has a low residual (but a high leverage), so it doesn't appear here. Remember that the presence (or absence) of outliers is not an indication that our model is wrong:
outlierTest(model) 

The following screenshot shows the outlierTest results:
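
To see what the test does under the hood, the following minimal sketch recomputes the Bonferroni-adjusted p-value for the most extreme studentized residual (assuming the usual n - k - 1 degrees of freedom for deletion residuals, with k the number of coefficients):

r <- rstudent(model)                    # studentized (deletion) residuals
n <- nrow(data)
k <- length(model$coefficients)
p_raw <- 2 * pt(abs(r), df = n - k - 1, lower.tail = FALSE)
i <- which.max(abs(r))                  # the most extreme residual
p_raw[i]                                # unadjusted p-value
min(n * p_raw[i], 1)                    # Bonferroni: multiply by n tests, cap at 1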

  5. We can examine the leverage in more detail. Naturally, observations 1 and 408 appear here as well:
plot(hatvalues(model), type = "h") 

The following screenshot shows another leverage plot:
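
A common rule of thumb flags hat values larger than two or three times the average leverage; the average hat value in any linear model equals k/n, where k is the number of coefficients. This is a minimal sketch, assuming that convention:

hv <- hatvalues(model)
k  <- length(model$coefficients)
n  <- nrow(data)
avg_leverage <- k / n             # the mean hat value is always k/n
which(hv > 3 * avg_leverage)      # flag high-leverage observations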

  6. Let's direct our attention towards the Cook's D values. We will print all the observations with a Cook's D larger than 4/(n - k - 1), where n is the number of observations and k is the number of coefficients. As we already know, only one case calls our attention (observation 408):
cooksd <- cooks.distance(model)
cutoff <- 4 / (nrow(data) - length(model$coefficients) - 1)
sort(cooksd[cooksd > cutoff])                 # print values above the cutoff
plot(model, which = 4, cook.levels = cutoff)

The following screenshot shows the Cook's D plot:
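
The car package also offers a compact summary of all three diagnostics in one figure: studentized residuals against hat values, with circle areas proportional to Cook's D. A quick sketch (influencePlot also returns the noteworthy observations as a data frame):

influencePlot(model)   # bubble plot: residuals vs. leverage, sized by Cook's D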

  7. The recommended approach here is to remove observation 408, since it is distorting our results. Once we remove it, the coefficients change dramatically for the variables causing the large leverage (size_balcony and size_entrance):
model2 = lm(Property_price ~ size + number.bathrooms + number.bedrooms +
  number.entrances + size_balcony + size_entrance, data = data[-c(408), ])
model
model2

The following screenshot shows standard results (top), and corrected data results (bottom):
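
To see the shift side by side with standard errors, car provides the compareCoefs function; alternatively, a simple base R column bind of the point estimates works as well. A minimal sketch:

compareCoefs(model, model2)   # coefficients and standard errors for both fits

# Point estimates only, side by side
round(cbind(with_408 = coef(model), without_408 = coef(model2)), 3)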
