An outlier, or extreme value, is defined as a data point that deviates so markedly from the other observations that it becomes suspicious of being generated by a totally different mechanism, or simply by error. Identifying outliers is important because such extreme values can distort summary statistics and bias the estimates of statistical models.
Or in other words, let's say your raw dataset is a piece of rounded stone to be used as a perfect ball in some game, which has to be cleaned and polished before actually using it. The stone has some small holes on its surface, like missing values in the data, which should be filled in – with data imputation.
On the other hand, the stone not only has holes on its surface, but mud also covers some parts of the item, which is to be removed. But how can we distinguish mud from the real stone? In this section, we will focus on what the outliers package and some related methods have to offer for identifying extreme values.
As this package has some conflicting function names with the randomForest package (automatically loaded by the missForest package), it's wise to detach the latter before heading to the following examples:

> detach('package:missForest')
> detach('package:randomForest')
The outlier function returns the value with the largest difference from the mean, which, contrary to its name, does not necessarily have to be an outlier. Instead, the function can be used to give the analyst an idea about which values might be outliers:

> library(outliers)
> outlier(hflights$DepDelay)
[1] 981
So there was a flight with more than 16 hours of delay before actually taking off! This is impressive, isn't it? Let's see if it's normal to be so late:
> summary(hflights$DepDelay)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
-33.000  -3.000   0.000   9.445   9.000 981.000    2905
Well, the mean is around 10 minutes, but as it's even larger than the third quartile, while the median is zero, it's not that hard to guess that the relatively large mean is due to some extreme values:
> library(lattice)
> bwplot(hflights$DepDelay)
The preceding boxplot clearly shows that most flights were delayed by only a few minutes, and the interquartile range is around 10 minutes:
> IQR(hflights$DepDelay, na.rm = TRUE)
[1] 12
All the blue circles beyond the whiskers in the preceding image are possible extreme values, as they lie more than 1.5 times the IQR above the upper quartile. But how can we (statistically) test a value?
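Based on the numbers above, the upper whisker can also be computed by hand – a minimal sketch of the 1.5 * IQR rule, assuming the hflights package is installed and loaded:

```r
## 1.5 * IQR rule: values above Q3 + 1.5 * IQR are flagged as suspicious
library(hflights)

q3        <- quantile(hflights$DepDelay, 0.75, na.rm = TRUE)
threshold <- q3 + 1.5 * IQR(hflights$DepDelay, na.rm = TRUE)
unname(threshold)                                 # 9 + 1.5 * 12 = 27 minutes

## how many departure delays fall above the upper whisker
sum(hflights$DepDelay > threshold, na.rm = TRUE)
```

Note that this rule only marks values as worth inspecting; it is a rough visual convention, not a formal statistical test.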
The outliers package comes with several bundled extreme value detection algorithms, like:

- Dixon's Q test (dixon.test)
- Grubbs's test (grubbs.test)
- Cochran's test for outlying variances (cochran.test)
- The chi-squared test for detecting outliers (chisq.out.test)

These functions are extremely easy to use. Just pass a vector to the statistical test, and the returned p-value of the significance test will clearly indicate whether the data has any outliers. For example, let's test 10 random numbers between 0 and 1 against a relatively large number to verify that it's an extreme value in this small sample:
> set.seed(83)
> dixon.test(c(runif(10), pi))

	Dixon test for outliers

data:  c(runif(10), pi)
Q = 0.7795, p-value < 2.2e-16
alternative hypothesis: highest value 3.14159265358979 is an outlier
But unfortunately, we cannot use these convenient functions on our live dataset, as the methods assume a normal distribution, which is definitely not true in our case, as we all know from experience: flights tend to be late far more often than to arrive at their destination well ahead of schedule.
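As a quick illustration of what "robust" means here – this is only a sketch, not the approach used below – the scores function of the same outliers package can compute median/MAD-based scores, which are far less sensitive to skewed data than mean-based z-scores:

```r
library(outliers)

## MAD-based scores: roughly (x - median(x)) / mad(x), robust to skew
set.seed(42)
x <- c(rnorm(100), 50)         # 100 standard normal values plus one outlier
mad_scores <- scores(x, type = "mad")

## count how many values lie beyond 3 robust standard deviations
sum(abs(mad_scores) > 3)
```

Because the median and the MAD are barely affected by a handful of extreme values, the single outlier at 50 stands out clearly, while it would inflate the mean and standard deviation used by classic z-scores.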
For this, we should use some more robust methods, such as the mvoutlier package, or some very simple approaches like the one Lund suggested around 40 years ago. This test basically computes the distance of each value from the mean with the help of a very simple linear regression:
> model <- lm(hflights$DepDelay ~ 1)
Just to verify we are now indeed measuring the distance from the mean:
> model$coefficients
(Intercept)
   9.444951
> mean(hflights$DepDelay, na.rm = TRUE)
[1] 9.444951
Now let's compute the critical value based on the F distribution and two helper variables (where a stands for the alpha value and n represents the number of cases):
> a <- 0.1
> (n <- length(hflights$DepDelay))
[1] 227496
> (F <- qf(1 - (a/n), 1, n-2, lower.tail = TRUE))
[1] 25.5138
This can be passed to Lund's formula:
> (L <- ((n - 1) * F / (n - 2 + F))^0.5)
[1] 5.050847
Now let's see how many values have a higher standardized residual than this computed critical value:
> sum(abs(rstandard(model)) > L)
[1] 1684
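If we wanted to look at the flagged values themselves before deciding what to do with them, a sketch like the following would work. The objects model and L come from the preceding steps; dropping the NA values mirrors what lm did internally, so the logical index lines up with the standardized residuals:

```r
## Inspect the departure delays flagged by Lund's test
dd      <- na.omit(hflights$DepDelay)  # lm() silently dropped the NAs as well
extreme <- abs(rstandard(model)) > L   # one flag per non-missing delay

range(dd[extreme])                     # the smallest and largest flagged delay
cleaned <- dd[!extreme]                # the delays with the outliers removed
```

This keeps the decision in the analyst's hands: the flagged values can be examined, capped, or removed, rather than being silently discarded.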
But do we really have to remove these outliers from our data? Aren't extreme values normal? Sometimes these artificial edits to the raw data, like imputing missing values or removing outliers, cause more trouble than they are worth.