In this section, you will learn several more advanced operations involving tables. These include, in particular, reshaping of tables and joining the information from table pairs. The presented methods, together with the ones presented earlier, will compose quite a powerful toolbox, which will suffice for all table-related operations that you will use in this book. Since you will be using functions from contributed packages, you will first learn how to download and install them. The following three sections will then introduce functions to reshape, aggregate, and join tables, respectively.
All predefined objects in R (such as the functions and classes we have been using so far) are collected in libraries, or packages, in R terminology. In order to use an object defined in a certain package, the package first needs to be loaded into memory, which is done using the library function. So far, we have not used the library function, so how come we could use all of the functions we have been working with? The answer is that several packages are distributed with the R installation file (~30 of them as of May 2014), and some of them are automatically loaded into memory when starting R (these are called base R packages); otherwise, we need to load a package into memory before using its functions. For example, to use graphical functions from the lattice package (see the previous chapter), which is automatically installed with R, we first need to execute the following expression:
> library("lattice")
The argument for the library function is the name of the package we would like to load.
In addition to the preinstalled packages, many contributed packages (~5500 as of May 2014) are located on the Comprehensive R Archive Network (CRAN), which is a network of FTP and web servers storing up-to-date versions of official R packages (unofficial packages, or packages currently under development, are available from various other online sources, such as GitHub). To use one of the packages on CRAN, we first have to download it to our computer, which can be done automatically in R using the install.packages function. For example, to install the reshape2 package (which we are going to use shortly), we need to execute the following expression:
> install.packages("reshape2")
Another contributed package we are going to use in this chapter is called plyr. Thus, before executing the upcoming code examples, you should download both reshape2 and plyr (using the install.packages function for each one) and load them into memory (using the library function, again for each one).
To save space, from now on, the install.packages commands will not be written as part of the code sections, nor will commands loading packages with library be replicated in each and every instance a package is used. Instead, every time a code section requires a newly introduced contributed package, you should make sure the respective package is installed and loaded.
Remember that downloading a package (using the install.packages function) is a one-time procedure (unless a new version of the package has come out and we would like to reinstall it). On the other hand, loading the package into memory (using the library function) is something we need to do in each new R session.
In this section and the following one, you are going to learn several useful methods to reshape data. Reshaping operations are an inevitable step of almost every data analysis process, since much (if not most) of the time, the data we get to work with will be structured differently from what is required to use a given function or type of software. In data reshaping, we change the form our data takes (possibly also reducing its amount, by aggregation), but not the data itself. An example of a tool used for data reshaping in other software is the PivotTable tool in Excel.
The functions we are going to use in this and the upcoming sections belong to the contributed packages reshape2 and plyr. There are other ways to perform the presented operations in R, some of which use only base packages; however, the methods shown here are considered more intuitive and flexible. Introductions to these two packages, by their author Hadley Wickham, can be found in the Journal of Statistical Software (see Appendix B, Cited References). Note that one of these papers addresses the reshape package (rather than the more efficient reshape2 package, which was developed later), but the principles of reshape and reshape2 are the same, so the paper is well worth reading for reshape2 users. A good introduction to data manipulation with R using (mostly) base packages can be found in the excellent book Introduction to Data Technologies (2009) by Paul Murrell (which is also available online).
The first operation you are going to learn about is the transformation between wide and long formats. A wide table consists of a column for each measured variable. For example, our dat table is in a wide format, since it has a column for the station name, a column for the precipitation amount, a column for the minimum temperature, and so on. A long table is one where a single column holds the variable names and another holds the measured values. When switching from wide to long format, we will usually be interested in intermediate forms, where some of the variables occupy their own columns, while the others are identified in a single column holding variable names. In reshape2 terminology, the former are called identifier variables (id.vars), while the latter are called measured variables (measure.vars). This will become clearer with an example.
The iris dataset we saw earlier is also in a wide format. To convert it to a long format, we can use the melt function. When using melt, we need to specify the data.frame object to reshape, along with the identity of the ID and measure variables (as character vectors). In fact, we can specify either the ID or the measure variables, and the function will assume that the rest of the columns belong to the other kind. For example, the most reasonable approach in this particular case would be to use the species as an ID variable (since it describes each measured unit) and the flower dimension traits as measure variables (since they are independent measurements conducted on each measured unit). Let's take a look at the following example:
> library(reshape2)
> iris_melt = melt(iris, id.vars = "Species")
The first several rows of the resulting table are printed as follows:
> head(iris_melt)
     Species     variable value
1 versicolor Sepal.Length   5.7
2     setosa Sepal.Length   5.1
3 versicolor Sepal.Length   5.5
4     setosa Sepal.Length   5.5
5     setosa Sepal.Length   4.9
6  virginica Sepal.Length   6.3
We can see that the ID variable (in this case, there was only one: Species) retained its status as one of the columns. The rest of the columns, the measure variables (in this case, there were four), disappeared; instead, two new columns were created (variable and value), holding the measure variable names and values, respectively. The number of rows is now 24 (four times the original number of rows, six), since the ID part is replicated four times, once for each of the measure variables.
Similarly, we can use the melt function to convert the climatic data table dat to a long format, specifying that the tpcp, mmxt, and mmnt columns contain measured variables. We shall assign the result to a different object named dat_melt:
> dat_melt = melt(dat, measure.vars = c("tpcp","mmxt","mmnt"))
> head(dat_melt)
            station station_name elevation latitude longitude
1 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
2 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
3 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
4 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
5 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
6 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
        date month year variable value
1 1984-01-01     1 1984     tpcp  51.4
2 1984-02-01     2 1984     tpcp   0.0
3 1984-03-01     3 1984     tpcp  68.7
4 1984-04-01     4 1984     tpcp  13.6
5 1984-05-01     5 1984     tpcp  21.4
6 1984-06-01     6 1984     tpcp   6.5
We can check and see that the molten table dat_melt has exactly three times as many rows as the original table dat.
The long format is useful in its own right in many cases; for example, when we make a plot with three panels, one for each measured variable, we need to have the panel IDs in a single column, which is exactly what we have now. In addition, molten tables serve as the input to another function in the reshape2 package, called dcast, which is used to cast the data back into a wide format. However, this time we do not have to return exactly to the original table (on which we previously applied melt). Instead, we can specify exactly what we would like to have in the rows and columns. The way we specify the variables to appear in rows and columns is through a formula object (see the previous chapter), which may have the form var1+var2+var3~var4+var5. The variables to the left of the ~ operator (in this case, var1, var2, and var3) are going to appear as individual columns in the new table; the variables to the right of the ~ operator (in this case, var4 and var5) are going to populate new columns, with the values going back from the value column into these new columns. For convenience, we can use the . symbol to indicate no variable, or the ... symbol to indicate all remaining variables, on either side of the ~ operator.
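To make the formula notation concrete, here is a minimal self-contained sketch (the toy table and its column names are hypothetical, not part of the climatic dataset):

```r
library(reshape2)

# A tiny molten table: one ID variable (unit) and two measured variables
toy = data.frame(unit     = c("a", "a", "b", "b"),
                 variable = c("x", "y", "x", "y"),
                 value    = c(1, 2, 3, 4))

dcast(toy, unit ~ variable)  # the levels "x" and "y" become new columns
dcast(toy, ... ~ variable)   # same result here, as unit is the only other variable
```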
The behavior of dcast is best demonstrated through examples (for additional ones, see the 2007 paper by Hadley Wickham). For instance, to get back our original table, we can indicate that we would like the values in the variable column to form new columns, as follows:
> dat2 = dcast(dat_melt, ... ~ variable)
The order of the columns in the resulting table is slightly different; otherwise, the table is identical to the original dat table:
> head(dat2)
            station station_name elevation latitude longitude
1 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
2 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
3 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
4 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
5 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
6 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
        date month year tpcp mmxt mmnt
1 1984-01-01     1 1984 51.4  5.6 -0.5
2 1984-02-01     2 1984  0.0  9.8  2.8
3 1984-03-01     3 1984 68.7  7.2  0.2
4 1984-04-01     4 1984 13.6 12.2  3.5
5 1984-05-01     5 1984 21.4 11.8  3.0
6 1984-06-01     6 1984  6.5 19.6 10.5
Alternatively, we can have the months form new columns, as follows:
> dat2 = dcast(dat_melt, station+station_name+variable+year~month)
> head(dat2)
            station     station_name variable year    1    2    3
1 GHCND:SP000003195 MADRID RETIRO SP     tpcp 1984 25.3 37.2 58.0
2 GHCND:SP000003195 MADRID RETIRO SP     tpcp 1985 67.8 45.1  6.0
3 GHCND:SP000003195 MADRID RETIRO SP     tpcp 1986 10.6 57.0 22.2
4 GHCND:SP000003195 MADRID RETIRO SP     tpcp 1987 93.3 42.9  6.7
5 GHCND:SP000003195 MADRID RETIRO SP     tpcp 1988 60.0 20.9  1.1
6 GHCND:SP000003195 MADRID RETIRO SP     tpcp 1989  9.9 19.5 23.8
     4    5    6    7    8    9   10    11    12
1 39.2 82.4 35.6  0.0  7.1  7.1 30.1 161.0  11.3
2 36.8 29.9 24.1  5.4  0.0  0.0  0.0  39.9  83.8
3 57.3 12.0  1.0 37.1 16.4 47.5 93.8  13.6  19.0
4 63.1 58.4  7.8 43.9 14.5 11.9 58.5  64.8  79.1
5 96.6 46.5 49.8  9.5  0.0  0.0 79.2  50.1   0.2
6 52.2 97.7 12.7 13.3  3.2 37.5  5.9 146.4 138.6
Note that this time, we omitted some of the variables from the resulting table (elevation, latitude, and so on), and made the month levels (there are 12 of these) appear in separate columns, with the values of the respective climatic variable for each month in the cells of the given column. This form is ideal for answering questions such as which month is the warmest in each station (using apply, for example).
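As a hedged sketch of that "warmest month" question (assuming dat2 has the shape shown above, with the 12 month columns in positions 5 to 16):

```r
# Keep only the maximum-temperature rows, then find, per row (that is,
# per station/year combination), the month column holding the largest value
mmxt_wide = dat2[dat2$variable == "mmxt", ]
warmest_month = apply(mmxt_wide[, 5:16], 1, which.max)
```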
The casting operations we have performed so far involved the retention of all the original data in the resulting table. What happens when we request a table that cannot contain all of our original data? In this case, aggregation takes place, and we need to specify the function that will be used for aggregation (otherwise, the default function, length, will be used). For example, we can calculate the mean climatic conditions in Spain per year, as follows:
> dat2 = dcast(dat_melt, year ~ variable, mean, na.rm = TRUE)
> head(dat2)
  year     tpcp     mmxt      mmnt
1 1984 54.34180 19.38194  9.381115
2 1985 45.13103 20.31096  9.793890
3 1986 47.55329 19.75327  9.681250
4 1987 57.09826 20.31684 10.350206
5 1988 47.78863 20.07662  9.934514
6 1989 54.82944 20.97615 10.654617
A disadvantage of aggregation with dcast is that we must apply the same function across all variables. In the next section, you will learn about a more flexible aggregation method.
There are several functions in the plyr package that are used to apply operations on subsets of the data and then combine the subsets once again into a single object. This may sound familiar; indeed, the plyr package was intended to provide an alternative, in many cases an easier one, to apply and other base R functions. One of the most commonly used functions from this package, and the one you are going to learn about in this section, is called ddply.
The ddply function operates on a data.frame object, returning a new data.frame. It first splits the table into subsets according to the unique levels in one or more columns. The data from each subset is then used to calculate a single value for each of the new columns in the resulting data.frame object. The user specifies exactly how this is done; most importantly, the new columns in the resulting table can be calculated based on values from more than one column in the original table.
Let's demonstrate the functionality of ddply on iris. We will calculate the average area of a flower's petals and sepals, per species:
> library(plyr)
> ddply(iris,
+ .(Species),
+ summarize,
+ sepal_area = mean(Sepal.Length * Sepal.Width),
+ petal_area = mean(Petal.Length * Petal.Width))
     Species sepal_area petal_area
1     setosa      20.04  0.3933333
2 versicolor         NA  5.2650000
3  virginica      20.79 15.0000000
As we can see, the ddply function call contains several arguments:

- The input data.frame object (for example, iris).
- The name(s) of the column(s) defining the subsets, given within .() (for example, .(Species)). If there is more than one name, they will be separated by commas.
- The mode of operation, which is either:
  - summarize: The new columns form a new, aggregated, table
  - transform: The new columns are appended back to the input table
- The definitions of the new columns (a name and an expression for each).

The preceding function call thus indicates that we would like to break the iris table into subsets based on the unique values in the Species column, and to create a new data.frame object with a column that specifies the levels (Species) and two new columns, sepal_area and petal_area. These columns will contain the means of the products of length and width for the respective trait. Note that the NA value in the sepal_area column for the species versicolor is due to the NA value we previously inserted.
The instruction to create a new table is given by the word summarize. If we replaced summarize with transform, the values from the new columns would be added to the input table, rather than forming a new (aggregated) table, as follows:
> ddply(iris,
+ .(Species),
+ transform,
+ sepal_area = mean(Sepal.Length * Sepal.Width),
+ petal_area = mean(Petal.Length * Petal.Width))
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.1         3.8          1.9         0.4     setosa
2          5.5         4.2          1.4         0.2     setosa
3          4.9         3.6          1.4         0.1     setosa
4          5.7         2.8          4.1         1.3 versicolor
5          5.5          NA          4.0         1.3 versicolor
6          6.3         3.3          6.0         2.5  virginica
  sepal_area petal_area
1      20.04  0.3933333
2      20.04  0.3933333
3      20.04  0.3933333
4         NA  5.2650000
5         NA  5.2650000
6      20.79 15.0000000
As you can see, the original table has been preserved; our two newly calculated columns (sepal_area and petal_area) have simply been joined to it on the right. Note that the values in these columns are the averages over the subsets (the species); thus, within each subset, the values are duplicated.
As another example, we will now use ddply to aggregate our climatic data table from a monthly to an annual timescale. In other words, we would like to obtain annual averages (in the case of temperature) or annual sums (in the case of precipitation) for the climatic variables. For this, we will first filter out those variable/year combinations where not all 12 months are present. For example, if the minimum monthly temperature data at the "IZANA SP" station for 1985 is available for only 11 (rather than 12) months, we would like to remove the minimum temperature data for that year and that station altogether, to reduce the bias in the annual average. To do this, we need to find out how many non-missing values we have for every station/year/variable combination. We will apply ddply to our molten dat_melt table:
> dat3 = ddply(dat_melt,
+ .(station, year, variable),
+ transform,
+ months_available = length(value[!is.na(value)]))
> head(dat3)
            station     station_name elevation latitude longitude
1 GHCND:SP000003195 MADRID RETIRO SP       667  40.4117   -3.6781
2 GHCND:SP000003195 MADRID RETIRO SP       667  40.4117   -3.6781
3 GHCND:SP000003195 MADRID RETIRO SP       667  40.4117   -3.6781
4 GHCND:SP000003195 MADRID RETIRO SP       667  40.4117   -3.6781
5 GHCND:SP000003195 MADRID RETIRO SP       667  40.4117   -3.6781
6 GHCND:SP000003195 MADRID RETIRO SP       667  40.4117   -3.6781
        date month year variable value months_available
1 1984-01-01     1 1984     tpcp  25.3               12
2 1984-02-01     2 1984     tpcp  37.2               12
3 1984-03-01     3 1984     tpcp  58.0               12
4 1984-04-01     4 1984     tpcp  39.2               12
5 1984-05-01     5 1984     tpcp  82.4               12
6 1984-06-01     6 1984     tpcp  35.6               12
The new table, dat3, contains all of the data from dat_melt (since transform was used), in addition to the new months_available column, which contains the number of non-NA elements for the respective .(station, year, variable) subset. Using this column, we can now remove those station/year/variable subsets that have less than 12 months of data:
> nrow(dat3)
[1] 85608
> dat3 = dat3[dat3$months_available == 12, ]
> nrow(dat3)
[1] 80976
Overall, 4,632 rows have been removed. Now we can aggregate the dat3 table, knowing that the annual values will always be based on 12 months of data. We will do this in two steps.
First, we will create a table holding only the location data (the latitude, longitude, and elevation columns) for each meteorological station. It is frequently useful to have a table such as this; for example, to plot the stations' spatial locations (which we are going to do in Chapter 7, Combining Vector and Raster Datasets):
> spain_stations = ddply(dat3,
+ .(station),
+ summarize,
+ latitude = latitude[1],
+ longitude = longitude[1],
+ elevation = elevation[1])
> head(spain_stations)
            station latitude longitude elevation
1 GHCND:SP000003195  40.4117   -3.6781       667
2 GHCND:SP000004452  38.8831   -6.8292       185
3 GHCND:SP000006155  36.6667   -4.4881         7
4 GHCND:SP000008027  43.3075   -2.0392       251
5 GHCND:SP000008181  41.2928    2.0697         4
6 GHCND:SP000008202  40.9592   -5.4981       790
Here, the aggregation was performed by the station column only; therefore, we obtain a rather short table, with one row for each meteorological station (96 rows in total):
> nrow(spain_stations)
[1] 96
Note that with latitude = latitude[1] we say, in plain language: take the first latitude value you see and assign it to the aggregated table, per station. Since the location of a given station should be constant over time, we can take any of the latitude values. However, we do not know exactly how many rows of data each station has (actually, in this particular case, we do know it is at least 12, since otherwise the data for that station could not have formed a complete 12-month series and would have been removed altogether); therefore, selecting the first one is a reasonable option.
We will save this data.frame object to a file, since we will use it in subsequent chapters:
> write.csv(spain_stations, "C:\\Data\\spain_stations.csv",
+ row.names = FALSE)
The additional parameter row.names indicates whether we would like the row names to be saved as an additional column in the CSV file (in this case, we do not).
Next, we will aggregate the climatic data itself, per station/variable/year combination. Here, our purpose is to find the sum of the monthly values (in the case of rainfall) or their average (in the case of temperature). We will use ifelse to assign the sum of the 12 values when the variable is tpcp, or their average otherwise (when the variable is either mmxt or mmnt). Let's take a look at the following example:
> spain_annual = ddply(dat3,
+ .(station, variable, year),
+ summarize,
+ value = ifelse(variable[1] == "tpcp",
+ sum(value, na.rm = TRUE),
+ mean(value, na.rm = TRUE)))
> head(spain_annual)
            station variable year value
1 GHCND:SP000003195     tpcp 1984 494.3
2 GHCND:SP000003195     tpcp 1985 338.8
3 GHCND:SP000003195     tpcp 1986 387.5
4 GHCND:SP000003195     tpcp 1987 544.9
5 GHCND:SP000003195     tpcp 1988 413.9
6 GHCND:SP000003195     tpcp 1989 560.7
Note that, once again, we consider only the first element of the variable column (variable[1]) when deciding whether to use the sum or mean function, since all values of the variable column are, by definition, identical within a given station/year/variable combination.
We will save this data.frame object to a file for later use:
> write.csv(spain_annual, "C:\\Data\\spain_annual.csv",
+ row.names = FALSE)
Our final exercise related to the processing of tabular data will be to join the spain_stations and spain_annual tables into a single table, containing both the station coordinates and the climatic data. For this, you first have to learn how to join tables, which we shall do in the next section.
Joining tables is another common operation in data analysis. Those working with spatial data may be familiar with the task of joining data from an external table (such as an Excel file) with the attribute table of a spatial dataset (such as an ESRI Shapefile), which is an example of a join operation.
The plyr package offers a very convenient function called join, for joining data.frame objects. Note, once again, that there are other ways to perform this task in R, such as using the merge function from the base packages. However, in addition to its simplicity, an important advantage of join is that it always preserves the original order of the rows in the data.frame object we join to. This feature will be especially important later, when performing the previously mentioned task of joining tables to the attributes of spatial datasets (see Chapter 5, Working with Points, Lines, and Polygons).
The first two parameters of the join function are x and y, which indicate the two data.frame objects to join, and the third parameter is by, which indicates the column(s) to join by. The remaining two parameters indicate, by default, that we would like to perform a left join (type="left", retaining all rows of x, as opposed to a "right" join, where we retain all rows of y) and to retain all records if there are duplicates (match="all"), which is what we would like to do in most cases (see ?join for more details).
For example, let's say we have a table where each row corresponds to a date, and we would like to create a new column indicating the season that date belongs to (winter, spring, summer, or fall). One way of doing this is to create a second table indicating the season each month belongs to, and then join it to the first table according to the common month column.
For this example, we will read another CSV file with a series of dates. The dates table looks as follows:
> dates = read.csv("C:\\Data\\modis_dates.csv")
> head(dates)
  image day month year
1     1  18     2 2000
2     2   5     3 2000
3     3  21     3 2000
4     4   6     4 2000
5     5  22     4 2000
6     6   8     5 2000
This table indicates the acquisition dates of MODIS satellite images, with the image column corresponding to bands in a multiband raster we will work with later (see the next chapter).
As another exercise in working with dates, we will create a column of the Date class from the day, month, and year columns, as follows:
> dates$date = as.Date(
+ paste(dates$year, dates$month, dates$day, sep = "-"))
> head(dates)
  image day month year       date
1     1  18     2 2000 2000-02-18
2     2   5     3 2000 2000-03-05
3     3  21     3 2000 2000-03-21
4     4   6     4 2000 2000-04-06
5     5  22     4 2000 2000-04-22
6     6   8     5 2000 2000-05-08
Let's now create a table of seasons. To do this, we will use the rep function with a parameter we have not used so far, each, which indicates that we want to repeat each element of a given vector several times (rather than repeating the whole vector):
> month = c(12, 1:11)
> month
 [1] 12  1  2  3  4  5  6  7  8  9 10 11
> season = rep(c("winter","spring","summer","fall"), each = 3)
> season
 [1] "winter" "winter" "winter" "spring" "spring" "spring"
 [7] "summer" "summer" "summer" "fall"   "fall"   "fall"
> seasons = data.frame(month, season)
> seasons
   month season
1     12 winter
2      1 winter
3      2 winter
4      3 spring
5      4 spring
6      5 spring
7      6 summer
8      7 summer
9      8 summer
10     9   fall
11    10   fall
12    11   fall
The seasons table now indicates which season a given month belongs to. The final step is to join the two tables, dates and seasons. The following expression states that we would like to join the seasons table to the dates table by month:
> dates = join(dates, seasons, "month")
> head(dates)
  image day month year       date season
1     1  18     2 2000 2000-02-18 winter
2     2   5     3 2000 2000-03-05 spring
3     3  21     3 2000 2000-03-21 spring
4     4   6     4 2000 2000-04-06 spring
5     5  22     4 2000 2000-04-22 spring
6     6   8     5 2000 2000-05-08 spring
We will use this table in several examples in the subsequent chapters.
Returning to our climatic data example, we will now join the two tables we created in the previous section: the station coordinates summary (spain_stations) and the aggregated annual climatic data (spain_annual). The resulting data.frame object will be named combined, as shown in the following expression:
> combined = join(spain_stations,
+ spain_annual,
+ by = "station",
+ type = "right")
Note that here we use the type="right" option, since we would like to retain all rows of the second table, spain_annual (rather than all rows of the first table, as in the previous example).
The table looks as follows:
> head(combined)
            station latitude longitude elevation variable year
1 GHCND:SP000003195  40.4117   -3.6781       667     tpcp 1984
2 GHCND:SP000003195  40.4117   -3.6781       667     tpcp 1985
3 GHCND:SP000003195  40.4117   -3.6781       667     tpcp 1986
4 GHCND:SP000003195  40.4117   -3.6781       667     tpcp 1987
5 GHCND:SP000003195  40.4117   -3.6781       667     tpcp 1988
6 GHCND:SP000003195  40.4117   -3.6781       667     tpcp 1989
  value
1 494.3
2 338.8
3 387.5
4 544.9
5 413.9
6 560.7
As the output shows, this table contains duplicated data, since the latitude, longitude, and elevation records are identical for each station/variable/year combination. In such cases, it is more efficient to keep two separate tables (such as spain_stations and spain_annual) rather than join all of the data into a single table (such as combined).