Fortunately, there are some robust methods for analyzing datasets, which are generally less sensitive to extreme values. These robust statistical methods have been developed since 1960, but there are some well-known related methods from even earlier, like using the median instead of the mean as a measure of central tendency. Robust methods are often used when the underlying distribution of our data is not considered to follow the Gaussian curve, so most good old regression models do not work well (see more details in Chapter 5, Building Models (authored by Renata Nemeth and Gergely Toth) and Chapter 6, Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth)).
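To make the mean versus median remark concrete, here is a minimal sketch on a made-up vector (the values in x and the extreme value 50 are hypothetical, chosen only for illustration). It shows how a single extreme value drags the mean away while the median barely moves:

```r
# Hypothetical measurements, not taken from the iris data
x <- c(4.9, 5.1, 5.0, 5.2, 4.8)

# The same values plus one extreme value, e.g. a data input error
x_bad <- c(x, 50)

# Mean and median side by side, before and after adding the extreme value
rbind(
    clean   = c(mean = mean(x),     median = median(x)),
    extreme = c(mean = mean(x_bad), median = median(x_bad)))
```

The mean jumps from 5 to 12.5, while the median only changes from 5 to 5.05.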
Let's take the traditional linear regression example of predicting the sepal length of iris flowers based on the petal length with some missing data. For this, we will use the previously defined miris dataset:
> summary(lm(Sepal.Length ~ Petal.Length, data = miris))

Call:
lm(formula = Sepal.Length ~ Petal.Length, data = miris)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.26216 -0.36157  0.01461  0.35293  1.01933 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.27831    0.11721   36.50   <2e-16 ***
Petal.Length  0.41863    0.02683   15.61   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4597 on 92 degrees of freedom
  (56 observations deleted due to missingness)
Multiple R-squared:  0.7258,    Adjusted R-squared:  0.7228 
F-statistic: 243.5 on 1 and 92 DF,  p-value: < 2.2e-16
So it seems that our estimate for the slope, that is, the expected increase in sepal length per unit increase in petal length, is around 0.42, which is not too far from the value computed on the complete dataset, by the way:
> lm(Sepal.Length ~ Petal.Length, data = iris)$coefficients
 (Intercept) Petal.Length 
   4.3066034    0.4089223 
The difference between the estimated and real coefficients is due to the missing values artificially introduced in a previous section. Can we produce even better estimates? We might impute the missing data with any of the previously mentioned methods, or we could instead fit a robust linear regression with the rlm function from the MASS package, predicting Sepal.Length from the Petal.Length variable:
> library(MASS)
> summary(rlm(Sepal.Length ~ Petal.Length, data = miris))

Call: rlm(formula = Sepal.Length ~ Petal.Length, data = miris)
Residuals:
     Min       1Q   Median       3Q      Max 
-1.26184 -0.36098  0.01574  0.35253  1.02262 

Coefficients:
             Value   Std. Error t value
(Intercept)   4.2739  0.1205    35.4801
Petal.Length  0.4195  0.0276    15.2167

Residual standard error: 0.5393 on 92 degrees of freedom
  (56 observations deleted due to missingness)
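Both summaries report the same number of observations deleted due to missingness, because lm and rlm drop incomplete rows in the same way by default. The following is a minimal sketch of this behavior, assuming a hypothetical stand-in dataset d (the miris object itself was defined in an earlier section and is not recreated here; the seed and the number of injected NA values are arbitrary):

```r
library(MASS)  # provides rlm()

# Hypothetical stand-in for miris: iris with NA values injected at random
set.seed(42)
d <- iris
d$Sepal.Length[sample(nrow(d), 20)] <- NA
d$Petal.Length[sample(nrow(d), 20)] <- NA

# Rows where both model variables are observed
used <- sum(complete.cases(d[, c("Sepal.Length", "Petal.Length")]))

fit_lm  <- lm(Sepal.Length ~ Petal.Length, data = d)
fit_rlm <- rlm(Sepal.Length ~ Petal.Length, data = d)

# Both models are fitted on the complete rows only
c(complete = used,
  lm       = length(fit_lm$residuals),
  rlm      = length(fit_rlm$residuals))
```

All three counts agree, so any difference between the two fits comes from how the remaining rows are weighted, not from which rows are used.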
Now let's compare the coefficients of the models run against the original (full) and the simulated data (with missing values):
> f <- formula(Sepal.Length ~ Petal.Length)
> cbind(
+     orig = lm(f, data = iris)$coefficients,
+     lm   = lm(f, data = miris)$coefficients,
+     rlm  = rlm(f, data = miris)$coefficients)
                  orig        lm       rlm
(Intercept)  4.3066034 4.2783066 4.2739350
Petal.Length 0.4089223 0.4186347 0.4195341
To be honest, there's not much difference between the standard linear regression and the robust version. Surprised? Well, the values in this dataset are missing completely at random, but what happens if the dataset includes other types of missing values or an outlier? Let's verify this by simulating a dirtier data issue (overwriting the sepal length of the first observation, which is 5.1 in the original iris data, with 14, let's say due to a data input error) and rebuilding the models:
> miris$Sepal.Length[1] <- 14
> cbind(
+     orig = lm(f, data = iris)$coefficients,
+     lm   = lm(f, data = miris)$coefficients,
+     rlm  = rlm(f, data = miris)$coefficients)
                  orig        lm       rlm
(Intercept)  4.3066034 4.6873973 4.2989589
Petal.Length 0.4089223 0.3399485 0.4147676
It seems that the lm model was hit hard by this single value: its slope dropped from around 0.42 to 0.34 and its intercept rose from around 4.28 to 4.69. The coefficients of the robust model, on the other hand, are almost identical to those of the original model despite the outlier in the data. We can conclude that robust methods are pretty impressive and powerful tools when it comes to extreme values! For more information on the related methods already implemented in R, visit the related CRAN Task View at http://cran.r-project.org/web/views/Robust.html.
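One way to see why the robust fit barely moves is to look at the weights that rlm assigns to the observations. The following is a minimal sketch, assuming a hypothetical dataset called dirty (the complete iris data with the same typo in the first row, used here instead of miris so that the example is self-contained). By default, rlm fits a Huber M-estimator by iteratively reweighted least squares and stores the final weights in the w element of the returned object:

```r
library(MASS)  # provides rlm()

f <- Sepal.Length ~ Petal.Length

# Hypothetical stand-in for the corrupted data: complete iris plus one typo
dirty <- iris
dirty$Sepal.Length[1] <- 14

fit_orig <- lm(f, data = iris)
fit_lm   <- lm(f, data = dirty)
fit_rlm  <- rlm(f, data = dirty)  # default: Huber M-estimator

# Final weights: 1 means full influence, values near 0 mean almost ignored
round(head(fit_rlm$w), 3)

# The observation with the smallest weight is the corrupted first row
unname(which.min(fit_rlm$w))

# Coefficients side by side
cbind(orig = coef(fit_orig), lm = coef(fit_lm), rlm = coef(fit_rlm))
```

The corrupted row still takes part in the robust fit, but with a much smaller weight than the rest, which is why the rlm coefficients stay close to the ones estimated on the clean data.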