Outlier detection

Besides forecasting, another major time-series task is identifying suspicious or abnormal observations that might distort the results of our analysis. One way to do so is to build an ARIMA model and analyze the distance between the predicted and actual values. The tsoutliers package provides a very convenient way to do so. Let's build a model on the number of cancelled flights in 2011:

> library(forecast)
> cts <- ts(daily$Cancelled)
> fit <- auto.arima(cts)
> fit
Series: ts 
ARIMA(1,1,2)

Coefficients:
          ar1      ma1      ma2
      -0.2601  -0.1787  -0.7752
s.e.   0.0969   0.0746   0.0640

sigma^2 estimated as 539.8:  log likelihood=-1662.95
AIC=3333.9   AICc=3334.01   BIC=3349.49
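
To make the idea of comparing predicted and actual values more concrete, here is a minimal sketch that does it by hand with base R only. It does not use the flights data: the series is simulated, the three spike positions are arbitrary, and the AR(1) order and the cutoff of three robust standard deviations are assumptions chosen for illustration, not what tsoutliers does internally.

```r
# Simulated stand-in for a daily series (NOT the real cancellation counts)
set.seed(42)
n <- 365
x <- as.numeric(arima.sim(model = list(ar = 0.5), n = n)) * 5 + 50

# Inject three artificial spikes at arbitrary, hypothetical positions
spikes <- c(60, 180, 300)
x[spikes] <- x[spikes] + 100
sim <- ts(x)

# Fit a simple AR(1) model and extract the one-step-ahead residuals,
# that is, the distance between the actual and the predicted values
fit_sim <- arima(sim, order = c(1, 0, 0))
res <- as.numeric(residuals(fit_sim))

# Flag residuals further than three robust standard deviations
# (based on the median absolute deviation) from the median residual
threshold <- 3 * mad(res)
flagged <- which(abs(res - median(res)) > threshold)
flagged
```

The observation right after a spike may be flagged as well, because the model expects the series to stay high for another step. Dedicated procedures such as tso are designed to handle this kind of side effect more carefully than a plain residual cutoff.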

So now we can use an ARIMA(1,1,2) model and the tso function to highlight (and optionally remove) the outliers from our dataset:

Tip

Please note that the following tso call can run for several minutes with a full load on a CPU core as it may be performing heavy computations in the background.

> library(tsoutliers)
> outliers <- tso(cts, tsmethod = 'arima',
+   args.tsmethod  = list(order = c(1, 1, 2)))
> plot(outliers)

Alternatively, we can run all the preceding steps in one go by automatically calling auto.arima inside tso without specifying any extra arguments besides the time-series object:

> plot(tso(ts(daily$Cancelled)))

The results flag every observation with a high number of cancelled flights as an outlier, suggesting that these should be removed from the dataset. Well, considering any day with many cancelled flights an outlier sounds really optimistic! But this is very useful information; it suggests that, for example, forecasting an outlier event is not manageable with the previously discussed methods.

Traditionally, time-series analysis deals with trends and seasonality of data, and how to stationarize the time-series. If we are interested in deviations from normal events, some other methods need to be used.

Twitter recently published one of its R packages to detect anomalies in time-series. Now we will use its AnomalyDetection package to identify the preceding outliers in a much faster way. As you may have noticed, the tso function was really slow to run, and it cannot really handle large amounts of data, whereas the AnomalyDetection package performs pretty well.

We can provide the input data as a vector or as a data.frame with the first column storing the timestamps. Unfortunately, the AnomalyDetectionTs function does not really work well with data.table objects, so let's revert to the traditional data.frame class:

> dfc <- as.data.frame(daily[, c('date', 'Cancelled'), with = FALSE])
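
If the daily object is not at hand, a data frame with the same layout can be sketched in base R. The toy name, the Poisson-distributed counts, and the choice of POSIXct timestamps below are assumptions made for illustration; only the two-column layout with the timestamps first comes from the description above.

```r
# Hypothetical stand-in for the daily cancellation counts of 2011
dates <- seq(as.Date('2011-01-01'), as.Date('2011-12-31'), by = 'day')
set.seed(1)
toy <- data.frame(
    date      = as.POSIXct(dates, tz = 'UTC'),          # timestamps first
    Cancelled = rpois(length(dates), lambda = 40))      # made-up counts

# Sanity checks on the layout: a plain two-column data.frame with
# timestamps in the first column and numeric values in the second
stopifnot(is.data.frame(toy), ncol(toy) == 2)
stopifnot(inherits(toy[[1]], 'POSIXct'), is.numeric(toy[[2]]))
str(toy)
```

Running the same checks on dfc is a cheap way to catch a leftover data.table class or a character date column before passing the object on.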

Now let's load the package and plot the anomalies identified among the observations:

> library(AnomalyDetection)
> AnomalyDetectionTs(dfc, plot = TRUE)$plot

The results are very similar to the previous plots, with two differences that you might have already noticed: the computation was extremely quick, and this plot shows human-friendly dates instead of some lame indexes on the x axis.
