Chapter 12. Analyzing Time-series

A time-series is a sequence of data points ordered in time, widely used in economics and the social sciences, among other fields. The great advantage of collecting data over a long period, compared to cross-sectional observations, is that we can analyze how the values of the very same object change over time instead of comparing different objects at a single point in time.

This special characteristic of the data requires new methods and data structures for time-series analysis. We will cover these in this chapter:

  • First, we learn how to load or transform observations into time-series objects
  • Then we visualize them and try to improve the plots by smoothing and filtering the observations
  • Besides seasonal decomposition, we introduce forecasting methods based on time-series models, and we also cover methods to identify outliers, extreme values, and anomalies in time-series

Creating time-series objects

Most tutorials on time-series analysis start with the ts function of the stats package, which can create time-series objects in a very straightforward way. Simply pass a vector or matrix of numeric values (time-series analysis mostly deals with continuous variables), specify the frequency of your data, and it's all set!

The frequency is the number of observations per natural period (seasonal cycle) of the data. Thus, you should set it to 12 for monthly data, 4 for quarterly data, and 365 or 7 for daily data, depending on the most characteristic seasonality of the events. For example, if your data has a significant weekly seasonality, which is quite common in the social sciences, it should be 7; but if the calendar date is the main differentiator, as with weather data, it should be 365.
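To make the role of the frequency argument concrete, here is a minimal sketch using only base R. The values are made up for illustration (they are not taken from any dataset in this chapter), and the object names monthly and quarterly are hypothetical:

```r
# Hypothetical monthly values for three years (36 observations), not real data
set.seed(42)
monthly_values <- round(100 + 10 * sin(2 * pi * (1:36) / 12) + rnorm(36), 1)

# frequency = 12: twelve observations per seasonal cycle (one year)
monthly <- ts(monthly_values, start = c(2011, 1), frequency = 12)

frequency(monthly)   # observations per cycle
start(monthly)       # first period, as (year, month)
end(monthly)         # last period, as (year, month)

# Hypothetical quarterly values for two years: four observations per cycle
quarterly <- ts(c(12, 15, 14, 18, 13, 16, 15, 19),
                start = c(2011, 1), frequency = 4)

cycle(quarterly)     # position of each observation within its year
```

The start argument is optional; it is set here only so that the periods are labeled with calendar years instead of starting from 1.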

In the forthcoming pages, let's use daily summary statistics from the hflights dataset. First let's load the related dataset and transform it to data.table for easy aggregation. We also have to create a date variable from the provided Year, Month, and DayofMonth columns:

> library(hflights)
> library(data.table)
> dt <- data.table(hflights)
> dt[, date := ISOdate(Year, Month, DayofMonth)]
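Note that ISOdate returns a date-time (POSIXct) object rather than a plain date: when no time is given, the hour defaults to 12 and the time zone to GMT. The following sketch, run on a single hard-coded day rather than on the hflights columns, shows what that looks like and how as.Date could be used if only the calendar day were needed. The variable names stamp and day are illustrative only:

```r
# ISOdate() builds a POSIXct timestamp; hour defaults to 12, time zone to GMT
stamp <- ISOdate(2011, 1, 1)
class(stamp)
format(stamp, "%Y-%m-%d %H:%M:%S")

# If only the calendar day matters, as.Date() drops the time-of-day part
day <- as.Date(stamp)
class(day)
format(day)
```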

Now let's compute, for each day in 2011, the number of flights, the overall sum of arrival delays, the number of cancelled flights, and the average distance of the related flights:

> daily <- dt[, list(
+     N         = .N,
+     Delays    = sum(ArrDelay, na.rm = TRUE),
+     Cancelled = sum(Cancelled),
+     Distance  = mean(Distance)
+ ), by = date]
> str(daily)
Classes 'data.table' and 'data.frame':	365 obs. of  5 variables:
 $ date     : POSIXct, format: "2011-01-01 12:00:00" ...
 $ N        : int  552 678 702 583 590 660 661 500 602 659 ...
 $ Delays   : int  5507 7010 4221 4631 2441 3994 2571 1532 ...
 $ Cancelled: int  4 11 2 2 3 0 2 1 21 38 ...
 $ Distance : num  827 787 772 755 760 ...
 - attr(*, ".internal.selfref")=<externalptr>
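With one row per day, a single column of such a table can be passed to ts as described at the beginning of this section. The following is only a sketch under stated assumptions: daily_sketch is a hypothetical stand-in with made-up counts, so that the block runs without the hflights package, and the frequency of 7 assumes that weekly seasonality dominates. With the real aggregate, the N column of daily would take the place of the made-up vector:

```r
# Hypothetical stand-in for the `daily` table above: 28 days of made-up counts
daily_sketch <- data.frame(
    date = seq(as.Date("2011-01-01"), by = "day", length.out = 28),
    N    = rep(c(550, 680, 700, 580, 590, 660, 660), times = 4)
)

# Assumption: weekly seasonality dominates, hence frequency = 7
nts <- ts(daily_sketch$N, frequency = 7)

length(nts)      # one value per day
frequency(nts)   # seven observations per weekly cycle
```

Whether 7 or 365 is the better choice for the flight counts depends on which seasonality is stronger in the data, as discussed above.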