Using robust methods

Fortunately, there are robust methods for analyzing datasets that are generally less sensitive to extreme values. Robust statistical methods have been actively developed since the 1960s, although some well-known related ideas are much older, such as using the median instead of the mean as a measure of central tendency. Robust methods are often used when the underlying distribution of our data cannot be assumed to follow the Gaussian curve, so most good old regression models do not work (see more details in Chapter 5, Building Models (authored by Renata Nemeth and Gergely Toth) and Chapter 6, Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth)).
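As a quick illustration of why the median counts as a robust statistic, compare how the mean and the median react to a single corrupted value (the numbers here are made up for the demonstration):

```r
x <- c(1, 2, 3, 4, 5)
mean(x)      # 3
median(x)    # 3

x[5] <- 500  # simulate a data entry error
mean(x)      # 102 -- dragged away by the single extreme value
median(x)    # 3   -- unaffected by the outlier
```

A single bad observation can move the mean arbitrarily far, while the median stays put until roughly half of the values are contaminated.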

Let's take the traditional linear regression example of predicting the sepal length of iris flowers based on the petal length, this time with some missing data. For this, we will use the previously defined miris dataset:
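If you do not have the earlier code at hand, miris can be reconstructed along the same lines as in the previous section: take the standard iris dataset and delete some values completely at random. The sketch below uses a hypothetical seed and counts, so the exact rows hit by sample, and thus the fitted coefficients, will differ slightly from the output shown here:

```r
set.seed(42)  # hypothetical seed, only for reproducibility of this sketch
miris <- iris
# knock out some values of the two variables completely at random
miris$Sepal.Length[sample(nrow(miris), 30)] <- NA
miris$Petal.Length[sample(nrow(miris), 30)] <- NA
# verify the number of missing values introduced
sum(is.na(miris$Sepal.Length))
```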

> summary(lm(Sepal.Length ~ Petal.Length, data = miris))

Call:
lm(formula = Sepal.Length ~ Petal.Length, data = miris)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.26216 -0.36157  0.01461  0.35293  1.01933 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.27831    0.11721   36.50   <2e-16 ***
Petal.Length  0.41863    0.02683   15.61   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4597 on 92 degrees of freedom
  (56 observations deleted due to missingness)
Multiple R-squared:  0.7258,  Adjusted R-squared:  0.7228 
F-statistic: 243.5 on 1 and 92 DF,  p-value: < 2.2e-16

So it seems that our estimate for the slope of sepal length on petal length is around 0.42, which, by the way, is not too far from the real value:

> lm(Sepal.Length ~ Petal.Length, data = iris)$coefficients
 (Intercept) Petal.Length 
   4.3066034    0.4089223

The difference between the estimated and real coefficients is due to the missing values artificially introduced in a previous section. Can we produce even better estimates? We might impute the missing data with any of the previously mentioned methods, or we can instead fit a robust linear regression with the rlm function from the MASS package, predicting Sepal.Length with the Petal.Length variable:

> library(MASS)
> summary(rlm(Sepal.Length ~ Petal.Length, data = miris))

Call: rlm(formula = Sepal.Length ~ Petal.Length, data = miris)
Residuals:
     Min       1Q   Median       3Q      Max 
-1.26184 -0.36098  0.01574  0.35253  1.02262 

Coefficients:
             Value   Std. Error t value
(Intercept)   4.2739  0.1205    35.4801
Petal.Length  0.4195  0.0276    15.2167

Residual standard error: 0.5393 on 92 degrees of freedom
  (56 observations deleted due to missingness)

Now let's compare the coefficients of the models run against the original (full) and the simulated data (with missing values):

> f <- formula(Sepal.Length ~ Petal.Length)
> cbind(
+     orig =  lm(f, data = iris)$coefficients,
+     lm   =  lm(f, data = miris)$coefficients,
+     rlm  = rlm(f, data = miris)$coefficients)
                  orig        lm       rlm
(Intercept)  4.3066034 4.2783066 4.2739350
Petal.Length 0.4089223 0.4186347 0.4195341

To be honest, there's not much difference between the standard linear regression and the robust version. Surprised? Well, the dataset included values missing completely at random, but what happens if the dataset includes other types of missing values or an outlier? Let's verify this by simulating a dirtier data issue (updating the sepal length of the first observation to 14, let's say due to a data entry error) and rebuilding the models:

> miris$Sepal.Length[1] <- 14
> cbind(
+     orig = lm(f, data = iris)$coefficients,
+     lm   = lm(f, data = miris)$coefficients,
+     rlm  = rlm(f, data = miris)$coefficients)
                  orig        lm       rlm
(Intercept)  4.3066034 4.6873973 4.2989589
Petal.Length 0.4089223 0.3399485 0.4147676

It seems that the lm model's performance decreased considerably, while the coefficients of the robust model are almost identical to those of the original model, regardless of the outlier in the data. We can conclude that robust methods are impressive and powerful tools when it comes to extreme values! For more information on the related methods already implemented in R, visit the related CRAN Task View at http://cran.r-project.org/web/views/Robust.html.
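By the way, rlm is not the only robust fitter shipped with the MASS package: its lqs function implements resistant regression (least trimmed squares by default), which tolerates an even higher share of contaminated observations than M-estimation. A minimal sketch on the complete iris data (lqs uses random resampling internally, so the exact coefficients vary slightly between runs):

```r
library(MASS)
# least trimmed squares: minimizes the sum of the smallest squared residuals,
# so a sizable fraction of outlying points is simply ignored by the fit
fit <- lqs(Sepal.Length ~ Petal.Length, data = iris, method = "lts")
coef(fit)
```

The resulting slope should land in the same neighborhood as the lm and rlm estimates above, since the clean iris data contains no gross outliers.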
