Extreme values and outliers

An outlier or extreme value is a data point that deviates so markedly from the other observations that it becomes suspect: it may have been generated by a totally different mechanism, or simply by error. Identifying outliers is important because those extreme values can:

  • Increase error variance
  • Influence estimates
  • Decrease normality

Or in other words, let's say your raw dataset is a piece of rounded stone that has to be cleaned and polished into a perfect ball before it can be used in a game. The stone has some small holes on its surface, like missing values in the data, which should be filled in by data imputation.

On the other hand, the stone does not only have holes on its surface; mud also covers some parts of it, which has to be removed. But how can we distinguish mud from the real stone? In this section, we will focus on what the outliers package and some related methods have to offer for identifying extreme values.

As this package has some function names that conflict with the randomForest package (automatically loaded by the missForest package), it's wise to detach both packages before heading into the following examples:

> detach('package:missForest')
> detach('package:randomForest')

The outlier function returns the value with the largest difference from the mean, which, contrary to the function's name, does not necessarily have to be an outlier. Instead, the function can be used to give the analyst an idea about which values might be outliers:

> library(outliers)
> outlier(hflights$DepDelay)
[1] 981

So there was a flight with more than 16 hours of delay before actually taking off! This is impressive, isn't it? Let's see if it's normal to be so late:

> summary(hflights$DepDelay)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-33.000  -3.000   0.000   9.445   9.000 981.000    2905

Well, the mean is around 10 minutes, but as it's even larger than the third quartile (while the median is zero), it's not hard to guess that the relatively large mean is due to some extreme values:

> library(lattice)
> bwplot(hflights$DepDelay)
[Figure: box-and-whisker plot of hflights$DepDelay]

The preceding boxplot clearly shows that most flights were delayed by only a few minutes, and the interquartile range is around 10 minutes:

> IQR(hflights$DepDelay, na.rm = TRUE)
[1] 12

All the blue circles in the preceding image beyond the whiskers are possible extreme values, as they lie more than 1.5 times the IQR above the upper quartile. But how can we (statistically) test a value?
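Before reaching for a formal test, we can at least count how many observations fall beyond this upper fence. The following minimal sketch applies the same rule of thumb; fence is just a helper variable introduced here, and the output is omitted:

> fence <- quantile(hflights$DepDelay, 0.75, na.rm = TRUE) +
+     1.5 * IQR(hflights$DepDelay, na.rm = TRUE)
> sum(hflights$DepDelay > fence, na.rm = TRUE)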

Testing extreme values

The outliers package comes with several bundled extreme value detection algorithms, such as:

  • Dixon's Q test (dixon.test)
  • Grubbs' test (grubbs.test)
  • Outlying and inlying variance (cochran.test)
  • Chi-squared test (chisq.out.test)

These functions are extremely easy to use: just pass a vector to the statistical test, and the returned p-value of the significance test will clearly indicate whether the data has any outliers. For example, let's test 10 random numbers between 0 and 1 against a relatively large number to verify that the latter is an extreme value in this small sample:

> set.seed(83)
> dixon.test(c(runif(10), pi))

  Dixon test for outliers

data:  c(runif(10), pi)
Q = 0.7795, p-value < 2.2e-16
alternative hypothesis: highest value 3.14159265358979 is an outlier
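The other bundled tests share the same interface. For example, Grubbs' test can be run on the same simulated data as follows; the output is omitted here, but with the above seed we would again expect a small p-value flagging the highest value, pi:

> set.seed(83)
> grubbs.test(c(runif(10), pi))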

But unfortunately, we cannot use these convenient functions on our live dataset, as the methods assume a normal distribution, which is definitely not true in our case. As we all know from experience, flights tend to be late more often than they arrive much earlier at their destinations.

For this, we should use some more robust methods, such as the mvoutlier package, or a very simple approach like the one Lund suggested around 40 years ago. This test basically computes the distance of each value from the mean with the help of a very simple linear regression:

> model <- lm(hflights$DepDelay ~ 1)

Just to verify that we are now indeed measuring the distance from the mean:

> model$coefficients
(Intercept) 
   9.444951 
> mean(hflights$DepDelay, na.rm = TRUE)
[1] 9.444951

Now let's compute the critical value based on the F distribution and two helper variables (where a stands for the alpha value and n represents the number of cases):

> a <- 0.1
> (n <- length(hflights$DepDelay))
[1] 227496
> (F <- qf(1 - (a/n), 1, n-2, lower.tail = TRUE))
[1] 25.5138

This critical value can then be passed to Lund's formula:

> (L <- ((n - 1) * F / (n - 2 + F))^0.5)
[1] 5.050847

Now let's see how many values have a higher standardized residual than this computed critical value:

> sum(abs(rstandard(model)) > L)
[1] 1684
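To see which delays were flagged, and not just how many, we can pull the values back from the model frame, which holds exactly the rows used in the fit (after the missing values were dropped). This is just a quick sketch; extremes is a helper variable introduced here, and the output is omitted:

> extremes <- model$model[[1]][abs(rstandard(model)) > L]
> summary(extremes)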

But do we really have to remove these outliers from our data? Aren't extreme values normal? Sometimes these artificial edits to the raw data, like imputing missing values or removing outliers, make more trouble than they are worth.
