Working with dates and time series

In this section, we'll cover a concept closely related to vectors—time series. A time series is a sequence of values, each associated with a time index. For convenience, the values are usually ordered from the earliest to latest. The time difference between consecutive time indices can be fixed (in which case we have a regular time series) or variable (in which case we have an irregular time series), although an irregular time series can also be considered as a regular time series with missing data. For example, daily rainfall amounts in New York or Dollar to Euro currency exchange rates for the period of January 1, 2014 to January 15, 2014 would comprise two different time series.

Following this definition, the simplest way to represent a time series is to keep two vectors of the same length: one holding the data values and one holding the times, with each element of the data values vector corresponding to the respective element of the time vector. The only new thing you need to learn in order to do this in R is how to represent time, which is the topic of the present section.
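As a minimal sketch of this two-vector representation, the code below builds a short, made-up rainfall series. The values and the names rain and days are hypothetical, and as.Date (covered later in this section) is used only to build the time vector.

```r
# Hypothetical daily rainfall amounts (mm); the values are invented for illustration
rain = c(0.0, 2.5, 11.2, 0.0, 4.1)

# Matching time vector: one calendar date per data value
days = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03",
                 "2014-01-04", "2014-01-05"))

# The two vectors must have the same length
length(rain) == length(days)

# Elements at the same position belong together:
# the rainfall amount and the date of the third observation
rain[3]
days[3]
```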

Specialized time series classes in R

Several special classes to represent time series exist in R. Basically, such classes encompass the time and data values parts of a time series within a single object. For example, ts, zoo, and xts are different time series classes in R. The ts class is defined in the base packages, whereas the zoo and xts classes are defined in the contributed packages of the same respective names. The concept of working with packages in R will be introduced in the next chapter.

Working with time series objects has certain advantages such as having the ability to use specialized functions (for example, linear or spline interpolation of missing values in a time series using a single function call) or making sure that every object satisfies the class rules (for example, the number of data values and time indices in a time series must be equal). For the purposes of this book, we will stick to the basic manual representation of a time series. This way, we will have a chance to gain a better understanding of R's general principles, while the next step towards specialized time series classes would be easily executed by interested readers. There are numerous resources devoted to the time series analysis with R; for example, Paul S.P. Cowpertwait and Andrew V. Metcalfe in their book Introductory Time Series with R, Springer, (2009), provide an excellent applied introduction on this subject.
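For readers curious about what such an object looks like, here is a small sketch using the ts class, which is available without installing anything. The monthly values are invented and the name x_ts is hypothetical; this is only a glimpse and not the approach used in the rest of the book.

```r
# Hypothetical monthly values for January-June 2014 (invented numbers)
values = c(5.1, 6.3, 9.8, 14.2, 18.9, 23.4)

# ts() stores the values together with a regular time index:
# 'start' is c(year, period) and 'frequency' is the number of periods per year
x_ts = ts(values, start = c(2014, 1), frequency = 12)

class(x_ts)   # "ts"
time(x_ts)    # the time index, expressed as fractional years

# Extract a sub-period (March-April 2014) by time rather than by position
window(x_ts, start = c(2014, 3), end = c(2014, 4))
```

Note that the values and the time index travel together in one object, which is the main difference from the two-vector representation used in this chapter.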

Reading climatic data from a CSV file

You are now going to learn how to use dates in R using our very first real-world example. We are going to use the comma separated values (CSV) file named 338284.csv, which was downloaded from the National Oceanic and Atmospheric Administration (NOAA) National Climatic Data Center. This file contains daily rainfall and temperature data from a meteorological station at the Albuquerque International Airport, New Mexico, from March 1, 1931 to May 15, 2014.

A CSV file is used to store plain tabular data with no additional features that are common in spreadsheet files such as XLS. This is how the file looks when opened in Excel:

[Screenshot: the 338284.csv file opened in Excel]

The following three lines of code read the file into R and assign the values in the DATE and TMAX columns to two separate vectors named time (since the data in the DATE column represents time) and tmax (which stands for maximum temperature). This involves operations on tables, which will be explained in the next chapter; the code is provided here only for completeness. Note that each backslash in the Windows path is doubled, since a single backslash starts an escape sequence inside an R character string (forward slashes work as well):

> dat = read.csv("C:\\Data\\338284.csv", stringsAsFactors = FALSE)
> time = dat$DATE
> tmax = dat$TMAX

The important point is that we now have two vectors to work with, time and tmax. Working with them will serve as an exercise summarizing most of the topics we dealt with in this chapter.
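If you do not have the file at hand, the same steps can be tried on a tiny stand-in. The sketch below writes three made-up rows to a temporary file and reads them back; only the DATE and TMAX column names follow the text, while the rows and the toy_ names are hypothetical.

```r
# Write a tiny stand-in CSV to a temporary file
# (the rows are invented; only the column names follow the real file)
csv_path = tempfile(fileext = ".csv")
writeLines(c("DATE,TMAX",
             "19310301,72",
             "19310302,133",
             "19310303,-9999"), csv_path)

# Read the table and pull out two columns as separate vectors
toy_dat = read.csv(csv_path, stringsAsFactors = FALSE)
toy_time = toy_dat$DATE
toy_tmax = toy_dat$TMAX

toy_time
toy_tmax

# Both columns contain whole numbers only, so they are read as integers
class(toy_time)
class(toy_tmax)
```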

Converting character values to dates

Dates can be represented in R (as in many other types of software) using a special format. This allows certain special operations (such as finding the time difference between two dates) to be performed, which is not possible when dates are represented simply as characters. There are several classes for date and time data in R. The simplest class (and the only one we will use in this book) is called Date, and it is used to represent calendar dates. Other classes exist to represent either longer time intervals (for example, months) or shorter ones (for example, a date plus the time of day).

Note

Note that the Date and factor objects are not vectors in R terminology since they have additional attributes not present in the vector class. However, from the user's perspective, working with them often follows the same principles as seen in vectors. For example, creating subsets of Date objects works the same way as creating subsets of vectors.
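The following sketch illustrates the point of this note on a small Date object (the name d3 is hypothetical). It shows the additional class attribute, the plain numbers stored underneath, and that subsetting nevertheless behaves as it does for vectors.

```r
d3 = as.Date(c("2014-05-20", "2014-05-21", "2014-05-22"))

is.vector(d3)    # FALSE, because of the additional class attribute
attributes(d3)   # the additional attribute: class "Date"
unclass(d3)      # the underlying numbers: days since 1970-01-01

# Subsetting works the same way as with ordinary vectors
d3[2:3]
d3[d3 > as.Date("2014-05-20")]
```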

For example, the Sys.Date and Sys.time functions return the current date or date plus the time of day, respectively. The object returned by Sys.Date belongs to class Date, while the object returned by Sys.time is an object of a different class (POSIXct). Let's take a look at the following examples:

> x = Sys.Date()
> x
[1] "2014-05-22"
> class(x)
[1] "Date"
> y = Sys.time()
> y
[1] "2014-05-22 10:04:56 IDT"
> class(y)
[1] "POSIXct" "POSIXt"

As we can see in the first half of the previous example, a Date object is printed the same way as a character vector holding the value "2014-05-22" would be. However, as already mentioned, we can conduct calculations involving time intervals with the Date class, which makes it worthwhile to represent dates in such a specialized format. For example, we can tell what date it will be seven days from today or what the date was 1,000 days ago:

> x + 7
[1] "2014-05-29"
> x - 1000
[1] "2011-08-26"
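The time difference between two dates, mentioned earlier as a motivation for the Date class, can be obtained by subtraction. The sketch below uses two arbitrary dates; the names d1, d2, and delta are hypothetical.

```r
d1 = as.Date("2014-01-01")
d2 = as.Date("2014-01-15")

# Subtracting one Date from another gives a 'difftime' object
delta = d2 - d1
delta                              # Time difference of 14 days
class(delta)                       # "difftime"

# Convert to a plain number when the units label is no longer needed
as.numeric(delta, units = "days")  # 14

# Dates can also be compared directly
d1 < d2                            # TRUE
```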

We can switch between the character vector and Date classes, using the as.character and as.Date functions. For example, we can convert our Date object x to a character vector using as.character:

> x = as.character(x)
> x
[1] "2014-05-22"
> class(x)
[1] "character"

We can convert the character vector back to Date using as.Date:

> x = as.Date(x)
> x
[1] "2014-05-22"
> class(x)
[1] "Date"

We can create a sequence of consecutive dates using seq, since this function accepts Date objects as well:

> seq(from = as.Date("2013-01-01"), 
+ to = as.Date("2013-02-01"), 
+ by = 3)
 [1] "2013-01-01" "2013-01-04" "2013-01-07" "2013-01-10"
 [5] "2013-01-13" "2013-01-16" "2013-01-19" "2013-01-22"
 [9] "2013-01-25" "2013-01-28" "2013-01-31"

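This gives us dates separated by three days from each other, starting on January 1, 2013. The last date is January 31, 2013, since the next step (February 3) would go beyond the specified end date of February 1, 2013.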
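When the first argument is a Date, the by parameter of seq can also be given as a character value naming a calendar unit, which is handy for steps that do not have a fixed number of days. A short sketch with arbitrary dates:

```r
# Weekly steps: 'by' may be a string such as "day", "week", "month" or "year"
seq(from = as.Date("2013-01-01"), to = as.Date("2013-02-01"), by = "week")

# The first day of four consecutive months; 'length.out' is used instead of 'to'
seq(from = as.Date("2013-01-01"), by = "month", length.out = 4)
```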

The conversions from character to Date shown earlier worked so easily because the year-month-day configuration of "2014-05-22" is a default one. This way, the as.Date function knew that the first four characters in "2014-05-22" represent the year, the next two characters (following a hyphen) represent the month, and the last two characters represent the day. When we have characters representing a date in a different configuration, we need to use the format parameter of as.Date, where we specify the encoding types of the elements, their order, and the characters separating them (if any).

The common encoding types of the year, month, and day elements, and their respective symbols in R, are summarized in the following table:

Symbol    Meaning
%d        Day of the month (for example, 15)
%m        Month as a number (for example, 08)
%b        The first three characters of a month name (for example, Aug)
%B        The full name of a month (for example, August)
%y        The last two digits of a year (for example, 14)
%Y        The full year (for example, 2014)

Using this symbology, along with the format parameter of the as.Date function, we can convert character values of other formats to dates. Let's take a look at the following examples:

> as.Date("07/Aug/12")
Error in charToDate(x) : 
  character string is not in a standard unambiguous format
> as.Date("07/Aug/12", format = "%d/%b/%y")
[1] "2012-08-07"
> as.Date("2012-August-07")
Error in charToDate(x) : 
  character string is not in a standard unambiguous format
> as.Date("2012-August-07", format = "%Y-%B-%d")
[1] "2012-08-07"

In each of these two example pairs, the first expression resulted in an error since we were trying to convert a character value of a non-standard date format to a Date without specifying the format, while the second expression worked since we did specify the format.
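A few more conversions, sketched with arbitrary numeric-only date strings (month names such as Aug depend on the language settings of the R session, so they are avoided here). The last expression shows a case worth knowing about: when a format is supplied but does not match the input, the result is NA rather than an error.

```r
# Day.month.year with dots as separators
as.Date("15.08.2014", format = "%d.%m.%Y")   # "2014-08-15"

# Month/day/two-digit year
as.Date("08/15/14", format = "%m/%d/%y")     # "2014-08-15"

# The conversion is vectorized: the same format is applied to every element
as.Date(c("01.03.1931", "02.03.1931"), format = "%d.%m.%Y")

# A format that does not match the input gives NA, not an error
as.Date("15.08.2014", format = "%Y-%m-%d")   # NA
```

Because a mismatched format fails silently, it can be worth checking the result with is.na after converting a long vector.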

Once we have a Date object, we can extract one or two (or all) of its three elements (year, month, and day), and encode them as we wish using the format function, specifying the required format the same way as shown earlier. Note that the results are no longer Date objects, but character vectors:

> d = as.Date("1955-11-30")
> d
[1] "1955-11-30"
> format(d, "%d")
[1] "30"
> format(d, "%B")
[1] "November"
> format(d, "%Y")
[1] "1955"
> format(d, "%m/%Y")
[1] "11/1955"
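Since format returns character values, the result needs one more conversion if we want to calculate with it or compare it to numbers. A minimal sketch, with hypothetical names dd, yr, and mo:

```r
dd = as.Date(c("1955-11-30", "1955-12-01", "1956-01-15"))

# format returns character values ...
format(dd, "%Y")                # "1955" "1955" "1956"

# ... so we wrap it in as.numeric when numbers are needed
yr = as.numeric(format(dd, "%Y"))
mo = as.numeric(format(dd, "%m"))
yr                              # 1955 1955 1956
mo                              # 11 12 1

# For example, keep only the dates that fall in 1955
dd[yr == 1955]
```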

We are now ready to proceed with our example involving the time and tmax vectors. First, we can find out that both vectors are numeric (integers, numbers without a fractional component, to be precise) as follows:

> class(time)
[1] "integer"
> class(tmax)
[1] "integer"

Then, let's see what the values of these vectors look like by printing the first 10 values from each one of them:

> time[1:10]
 [1] 19310301 19310302 19310303 19310304 19310305 19310306
 [7] 19310307 19310308 19310309 19310310
> tmax[1:10]
 [1]  72 133 178 183 111  67  78  83 139 156

The time vector contains dates in the %Y%m%d configuration (year, month, and day indicated by full numeric values, without separating characters). Therefore, we can convert it to a Date object, as follows:

> time = as.Date(as.character(time), format = "%Y%m%d")
> time[1:10]
 [1] "1931-03-01" "1931-03-02" "1931-03-03" "1931-03-04"
 [5] "1931-03-05" "1931-03-06" "1931-03-07" "1931-03-08"
 [9] "1931-03-09" "1931-03-10"
> class(time)
[1] "Date"

Note that we first needed to convert the time vector from numeric to character since the as.Date function works on character vectors. Now that time is a vector of dates, we have more freedom to treat the data as a time series.

Examining our time series

Looking into the documentation on climatic data from NOAA (which is also provided on the book's website), we can see that the temperature is provided in tenths of a degree Celsius, with missing values marked as -9999. First, we will convert the -9999 values to NA by selecting the respective subset and making an assignment:

> tmax[tmax == -9999] = NA

Then, to convert the data into degrees Celsius units, we will divide each of the values by 10:

> tmax = tmax / 10
> tmax[1:10]
 [1]  7.2 13.3 17.8 18.3 11.1  6.7  7.8  8.3 13.9 15.6
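The two steps can be followed more closely on a short vector where the missing value is visible. The sketch below uses four invented values; the names raw and deg are hypothetical.

```r
# Hypothetical raw values in tenths of a degree, with -9999 marking missing data
raw = c(72L, 133L, -9999L, 183L)

# A logical subset on the left-hand side replaces only the matching elements
raw[raw == -9999] = NA

# Division keeps the NA in place and turns the integers into decimal numbers
deg = raw / 10
deg               # 7.2 13.3 NA 18.3

is.na(deg)        # FALSE FALSE TRUE FALSE
sum(is.na(deg))   # the number of missing values: 1
```

Note that the order matters: dividing first would turn -9999 into -999.9, and the comparison with -9999 would then no longer find it.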

Now, let's check the range of values each vector contains:

> range(time)
[1] "1931-03-01" "2014-05-15"
> range(tmax, na.rm = TRUE)
[1] -14.4  41.7

This means that the range of the measured maximum daily temperatures from March 1, 1931 to May 15, 2014 was -14.4 to 41.7 degrees Celsius.
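The na.rm = TRUE argument was needed because tmax now contains NA values. As a small sketch with invented numbers (the name temps is hypothetical), summary functions return NA when any element is missing, unless the missing values are explicitly removed:

```r
temps = c(7.2, NA, 18.3, -1.5)

range(temps)                 # NA NA
range(temps, na.rm = TRUE)   # -1.5 18.3

# The same applies to other summary functions, such as max and mean
max(temps)                   # NA
max(temps, na.rm = TRUE)     # 18.3
mean(temps, na.rm = TRUE)
```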

Regarding the dates of measurement, looking at the first few values of the time vector (or at the original CSV file in a spreadsheet, for that matter), it seems that the days are consecutive. However, we may want to make sure that all days of the respective period are indeed present in the file. We can do this by comparing a consecutive sequence all_dates covering the time period from March 1, 1931 to May 15, 2014 with our time vector:

> range_t = range(time)
> all_dates = seq(range_t[1], range_t[length(range_t)], 1)
> length(all_dates)
[1] 30392
> length(time)
[1] 30391

This already indicates that the two do not fully agree. Our time vector contains 30,391 values, while there are 30,392 dates in the period from March 1, 1931 to May 15, 2014. Therefore, the CSV file is missing at least one date.

We will next check how many dates (and which ones) are missing. First, we will verify that, indeed, not all dates appear in the time vector using the %in% operator (asking for each element in all_dates whether it appears in the time vector) and the all function (asking whether all of the values in the resulting logical vector are TRUE).

> all(all_dates %in% time)
[1] FALSE

The answer is no; at least one of the dates in the range of March 1, 1931 to May 15, 2014 is indeed missing from the time vector. The next question is which date (or dates) is missing. We can get the indices of the dates that appear in all_dates but not in time with the which function:

> which(!(all_dates %in% time))
[1] 5499

The missing date is the 5499th element of the all_dates vector. Its value is as follows:

> all_dates[which(!(all_dates %in% time))]
[1] "1946-03-20"

Manually examining the CSV file in spreadsheet software will confirm that the date March 20, 1946 was indeed skipped for some reason.
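The same check can be tried on a short, made-up series where the gap is easy to see. The sketch below repeats the %in% approach and adds an alternative based on diff, which is not used in the text and assumes that the dates are sorted; the toy_ names and steps are hypothetical.

```r
# Hypothetical short series in which January 4 is missing
toy_time = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03",
                     "2014-01-05", "2014-01-06"))

# Complete daily sequence covering the same period
toy_all = seq(min(toy_time), max(toy_time), by = 1)

# Approach from the text: which dates of the complete sequence are absent?
toy_all[!(toy_all %in% toy_time)]     # "2014-01-04"

# Alternative: look for steps between consecutive dates longer than one day
steps = as.numeric(diff(toy_time))    # 1 1 2 1
toy_time[which(steps > 1)]            # "2014-01-03", the last date before the gap
```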

Another interesting question we can ask is on what day the highest temperature (which was 41.7 degrees Celsius, as we saw earlier) was observed:

> max(tmax, na.rm = TRUE)
[1] 41.7
> time[which.max(tmax)]
[1] "1994-06-26"

The highest maximum daily temperature was observed on June 26, 1994.
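One detail of which.max is worth keeping in mind: it ignores NA values and returns the position of the first maximum only. The sketch below, on four invented observations with hypothetical toy_ names, shows how to retrieve every date on which the maximum occurred.

```r
toy_time = as.Date(c("2014-07-01", "2014-07-02", "2014-07-03", "2014-07-04"))
toy_tmax = c(35.0, NA, 38.3, 38.3)

# which.max skips the NA and returns the position of the first maximum only
which.max(toy_tmax)               # 3
toy_time[which.max(toy_tmax)]     # "2014-07-03"

# To get all dates sharing the maximum, compare each value against max()
toy_time[which(toy_tmax == max(toy_tmax, na.rm = TRUE))]
```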

Creating subsets based on dates

If we are interested in a particular subset of the time series, say the period from December 31, 2005 to January 1, 2014, we could create a subset of the dates in that period based on the time vector and a respective subset of data values based on the tmax vector. We can do this in two steps. First, we will create a logical vector, w, pointing at those dates we would like to keep:

> w = time > as.Date("2005-12-31") & time < as.Date("2014-01-01")

To find out the ratio between the number of days we would like to keep in the subset and the number of days in the complete series, we can type the following expression:

> sum(w) / length(w)
[1] 0.09614689

The subset we are interested in (December 31, 2005 to January 1, 2014) holds about 9.6 percent of the data, since the proportion of TRUE values in the logical vector w is 0.096 (remember that before a logical vector is summed, it is converted to a numeric one, with ones instead of TRUE and zeroes instead of FALSE).
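The conversion of logical values to ones and zeroes can be seen on a short hypothetical vector, w_toy. Since the proportion of TRUE values is just the average of those ones and zeroes, mean gives the same result in a single call:

```r
w_toy = c(TRUE, FALSE, FALSE, TRUE, FALSE)

as.numeric(w_toy)            # 1 0 0 1 0
sum(w_toy)                   # 2
sum(w_toy) / length(w_toy)   # 0.4
mean(w_toy)                  # 0.4
```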

Secondly, we will use the w vector to create subsets of both the time and tmax vectors:

> time = time[w]
> tmax = tmax[w]

Note that the selection was non-inclusive of the end dates since we used the > and < operators:

> range(time)
[1] "2006-01-01" "2013-12-31"

If we wanted to include the first and last dates (December 31, 2005 and January 1, 2014), we would use the >= and <= operators instead.
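The difference between the two pairs of operators is easy to see on a short hypothetical sequence of five days (the names toy_time, first_day, and last_day are made up for this sketch):

```r
toy_time = seq(as.Date("2005-12-30"), as.Date("2006-01-03"), by = 1)
first_day = as.Date("2005-12-31")
last_day = as.Date("2006-01-02")

# Non-inclusive selection: the end dates themselves are dropped
toy_time[toy_time > first_day & toy_time < last_day]
# "2006-01-01"

# Inclusive selection: the end dates are kept
toy_time[toy_time >= first_day & toy_time <= last_day]
# "2005-12-31" "2006-01-01" "2006-01-02"
```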
