In this section, you will learn several more advanced operations involving tables. These include, in particular, reshaping of tables and joining the information from table pairs. The presented methods, together with the ones presented earlier, will compose quite a powerful toolbox, which will suffice for all table-related operations that you will use in this book. Since you will be using functions from contributed packages, you will first learn how to download and install them. The following three sections will then introduce functions to reshape, aggregate, and join tables, respectively.
All predefined objects in R (such as the functions and classes we have been using so far) are collected in libraries, or packages, in R terminology. In order to use an object defined in a certain package, the package first needs to be loaded into memory, which is done using the library function. So far, we have not used the library function, so how come we could use all of the functions we have been working with? The answer is that several packages are distributed with the R installation file (~30 of them as of May 2014), and some of them are automatically loaded into memory when starting R (these are called base R packages); otherwise, we need to load a package into memory before using its functions. For example, to use graphical functions from the lattice package (see the previous chapter), which is automatically installed with R, we first need to execute the following expression:
> library("lattice")
The argument for the library function is the name of the package we would like to load.
In addition to the preinstalled packages, many contributed packages (~5500 as of May 2014) are located on the Comprehensive R Archive Network (CRAN), which is a network of FTP and web servers storing up-to-date versions of official R packages (unofficial packages, or packages currently under development, are available from various other online sources, such as GitHub). To use one of the packages on CRAN, we first have to download it to our computer, which can be done automatically in R using the install.packages function. For example, to install the reshape2 package (which we are going to use shortly), we need to execute the following expression:
> install.packages("reshape2")
Another contributed package we are going to use in this chapter is called plyr. Thus, before executing the upcoming code examples, you should download both reshape2 and plyr (using the install.packages function for each one) and load them into memory (using the library function, again for each one).
To save space, from now on, the install.packages commands will not be written as part of the code sections, nor will commands loading packages with library be replicated in each and every instance a package is used. Instead, every time a code section requires a newly introduced contributed package, you should make sure the respective package is installed and loaded.
Remember that downloading a package (using the install.packages function) is a one-time procedure (unless a new version of the package has come out and we would like to reinstall it). On the other hand, loading the package into memory (using the library function) is something we need to do in each new R session.
In this section and the following one, you are going to learn several useful methods to reshape data. Reshaping operations are an inevitable step of almost every data analysis process, since much (if not most) of the time, the data we get to work with will be structured differently from what is required to use a given function or type of software. In data reshaping, we change the form our data takes (possibly also reducing its amount, by aggregation), but not the data itself. An example of a tool used for data reshaping in other software is the PivotTable tool in Excel.
The functions we are going to use in this and the upcoming sections belong to the contributed packages reshape2 and plyr. There are other ways to perform the presented operations in R, some of which use only base packages; however, the methods shown here are considered more intuitive and flexible. Introductions to these two packages, by their author Hadley Wickham, can be found in the Journal of Statistical Software (see Appendix B, Cited References). Note that one of these papers addresses the reshape package (rather than the more efficient reshape2 package, which was developed later), but the principles of reshape and reshape2 are the same, so the paper is well worth reading for reshape2 users. A good introduction to data manipulation with R using (mostly) base packages can be found in the excellent book Introduction to Data Technologies (2009) by Paul Murrell (which is also available online).
The first operation you are going to learn about is the transformation between wide and long formats. A wide table consists of a column for each measured variable. For example, our dat table is in a wide format, since it has a column for the station name, a column for the precipitation amount, a column for the minimum temperature, and so on. A long table is one where a single column holds the variable names and another holds the measured values. When switching from wide to long format, we will usually be interested in intermediate forms, where some of the variables occupy their own columns, while the others are identified in a single column holding variable names. In reshape2 terminology, the former are called identifier variables (id.vars), while the latter are called measured variables (measure.vars). This will become clearer with an example.
The iris dataset we saw earlier is also in a wide format. To convert it to a long format, we can use the melt function. When using melt, we need to specify the data.frame object to reshape, along with the identity of the ID and measure variables (as character vectors). In fact, we can specify either the ID or the measure variables, and the function will assume that the rest of the columns belong to the other kind. For example, the most reasonable approach in this particular case would be to use the species as an ID variable (since it describes each measured unit) and the flower dimension traits as measure variables (since they are independent measurements conducted on each measured unit). Let's take a look at the following example:
> library(reshape2)
> iris_melt = melt(iris, id.vars = "Species")
The first several rows of the resulting table are printed as follows:
> head(iris_melt)
     Species     variable value
1 versicolor Sepal.Length   5.7
2     setosa Sepal.Length   5.1
3 versicolor Sepal.Length   5.5
4     setosa Sepal.Length   5.5
5     setosa Sepal.Length   4.9
6  virginica Sepal.Length   6.3
We can see that the ID variable (in this case, there was only one: Species) retained its status as one of the columns. The rest of the columns, the measure variables (in this case, there were four), disappeared; instead, two new columns were created (variable and value), holding the measure variable names and values, respectively. The number of rows is now 24 (four times the original number of rows, six), since the ID part is replicated four times, once for each of the measure variables.
Similarly, we can use the melt function to convert the climatic data table dat to a long format, specifying that the tpcp, mmxt, and mmnt columns contain measured variables. We shall assign the result to a different object named dat_melt:
> dat_melt = melt(dat, measure.vars = c("tpcp","mmxt","mmnt"))
> head(dat_melt)
            station station_name elevation latitude longitude
1 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
2 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
3 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
4 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
5 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
6 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
        date month year variable value
1 1984-01-01     1 1984     tpcp  51.4
2 1984-02-01     2 1984     tpcp   0.0
3 1984-03-01     3 1984     tpcp  68.7
4 1984-04-01     4 1984     tpcp  13.6
5 1984-05-01     5 1984     tpcp  21.4
6 1984-06-01     6 1984     tpcp   6.5
We can check and see that the molten table dat_melt has exactly three times as many rows as the original table dat.
The long format is useful in its own right in many cases; for example, when we make a plot with three panels, one for each measured variable, we need to have the panel IDs in a single column, which is exactly what we have now. In addition, molten tables serve as the input to another function in the reshape2 package, called dcast, which is used to cast the data back into a wide format. However, this time we do not have to return exactly to the original table (on which we previously applied melt). Instead, we can specify exactly what we would like to have in the rows and columns. The way we specify the variables to appear in rows and columns is through a formula object (see the previous chapter), which may have the form var1+var2+var3~var4+var5. The variables to the left of the ~ operator (in this case, var1, var2, and var3) are going to appear as individual columns in the new table; the variables to the right of the ~ operator (in this case, var4 and var5) are going to populate new columns, with the values going back from the value column into these new columns. For convenience, we can use the . symbol to indicate no variable, or the ... symbol to indicate all remaining variables, on either side of the ~ operator.
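To make the formula notation concrete, here is a minimal self-contained sketch (the toy table and its column names are hypothetical, not part of the climatic dataset):

```r
library(reshape2)

# A tiny molten table: one ID variable (unit) and two measured variables
toy = data.frame(unit     = c("a", "a", "b", "b"),
                 variable = c("x", "y", "x", "y"),
                 value    = c(1, 2, 3, 4))

dcast(toy, unit ~ variable)  # the levels "x" and "y" become new columns
dcast(toy, ... ~ variable)   # same result here, as unit is the only other variable
```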
The behavior of dcast is best demonstrated through examples (for additional ones, see the 2007 paper by Hadley Wickham). For instance, to get back our original table, we can indicate that we would like the values in the variable column to form new columns, as follows:
> dat2 = dcast(dat_melt, ... ~ variable)
The order of the columns in the resulting table is slightly different; otherwise, the table is identical to the original dat table:
> head(dat2)
            station station_name elevation latitude longitude
1 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
2 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
3 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
4 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
5 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
6 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
        date month year tpcp mmxt mmnt
1 1984-01-01     1 1984 51.4  5.6 -0.5
2 1984-02-01     2 1984  0.0  9.8  2.8
3 1984-03-01     3 1984 68.7  7.2  0.2
4 1984-04-01     4 1984 13.6 12.2  3.5
5 1984-05-01     5 1984 21.4 11.8  3.0
6 1984-06-01     6 1984  6.5 19.6 10.5
Alternatively, we can have the months form new columns, as follows:
> dat2 = dcast(dat_melt, station+station_name+variable+year~month)
> head(dat2)
            station     station_name variable year    1    2    3
1 GHCND:SP000003195 MADRID RETIRO SP     tpcp 1984 25.3 37.2 58.0
2 GHCND:SP000003195 MADRID RETIRO SP     tpcp 1985 67.8 45.1  6.0
3 GHCND:SP000003195 MADRID RETIRO SP     tpcp 1986 10.6 57.0 22.2
4 GHCND:SP000003195 MADRID RETIRO SP     tpcp 1987 93.3 42.9  6.7
5 GHCND:SP000003195 MADRID RETIRO SP     tpcp 1988 60.0 20.9  1.1
6 GHCND:SP000003195 MADRID RETIRO SP     tpcp 1989  9.9 19.5 23.8
     4    5    6    7    8    9   10    11    12
1 39.2 82.4 35.6  0.0  7.1  7.1 30.1 161.0  11.3
2 36.8 29.9 24.1  5.4  0.0  0.0  0.0  39.9  83.8
3 57.3 12.0  1.0 37.1 16.4 47.5 93.8  13.6  19.0
4 63.1 58.4  7.8 43.9 14.5 11.9 58.5  64.8  79.1
5 96.6 46.5 49.8  9.5  0.0  0.0 79.2  50.1   0.2
6 52.2 97.7 12.7 13.3  3.2 37.5  5.9 146.4 138.6
Note that this time, we omitted some of the variables from the resulting table (elevation, latitude, and so on), and made the month levels (there are 12 of these) appear in separate columns, with the values of the respective climatic variable for each month in the cells of the given column. This form is ideal for answering questions such as which month is the warmest in each station (using apply, for example).
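As a hedged sketch of that "warmest month" question (assuming dat2 has the shape shown above, with the 12 month columns in positions 5 to 16):

```r
# Keep only the maximum-temperature rows, then find, per row (that is,
# per station/year combination), the month column holding the largest value
mmxt_wide = dat2[dat2$variable == "mmxt", ]
warmest_month = apply(mmxt_wide[, 5:16], 1, which.max)
```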
The casting operations we have performed so far involved the retention of all the original data in the resulting table. What happens when we request a table that cannot contain all of our original data? In this case, aggregation takes place, and we need to specify the function that will be used for aggregation (otherwise, the default function, length, will be used). For example, we can calculate the mean climatic conditions in Spain per year, as follows:
> dat2 = dcast(dat_melt, year ~ variable, mean, na.rm = TRUE)
> head(dat2)
  year     tpcp     mmxt      mmnt
1 1984 54.34180 19.38194  9.381115
2 1985 45.13103 20.31096  9.793890
3 1986 47.55329 19.75327  9.681250
4 1987 57.09826 20.31684 10.350206
5 1988 47.78863 20.07662  9.934514
6 1989 54.82944 20.97615 10.654617
A disadvantage of aggregation with dcast is that we must apply the same function across all variables. In the next section, you will learn about a more flexible aggregation method.
There are several functions in the plyr package that are used to apply operations on subsets of the data and then combine the subsets once again into a single object. This may sound familiar; indeed, the plyr package was intended to provide an alternative, in many cases an easier one, to apply and other base R functions. One of the most commonly used functions from this package, and the one you are going to learn about in this section, is called ddply.
The ddply function operates on a data.frame object, returning a new data.frame. It first splits the table into subsets according to the unique levels in one or more columns. The data from each subset is then used to calculate a single value for each of the new columns in the resulting data.frame object. The user specifies exactly how this is done; most importantly, the new columns in the resulting table can be calculated based on values from more than one column in the original table.
Let's demonstrate the functionality of ddply on iris. We will calculate the average area of a flower's petals and sepals, per species:
> library(plyr)
> ddply(iris,
+ .(Species),
+ summarize,
+ sepal_area = mean(Sepal.Length * Sepal.Width),
+ petal_area = mean(Petal.Length * Petal.Width))
     Species sepal_area petal_area
1     setosa      20.04  0.3933333
2 versicolor         NA  5.2650000
3  virginica      20.79 15.0000000
As we can see, the ddply function call contains several arguments:

- The input data.frame object (for example, iris).
- The name(s) of the column(s) defining the subsets, given within .() (for example, .(Species)). If there is more than one name, they will be separated by commas.
- The mode of operation, which is either:
  - summarize: The new columns form a new, aggregated, table
  - transform: The new columns are appended back to the input table
- The definitions of the new columns (a name and an expression for each).

The preceding function call thus indicates that we would like to break the iris table into subsets based on the unique values in the Species column, and to create a new data.frame object with a column that specifies the levels (Species) and two new columns, sepal_area and petal_area. These columns will contain the means of the products of length and width for the respective trait. Note that the NA value in the sepal_area column for the species versicolor is due to the NA value we previously inserted.
The instruction to create a new table is given by the word summarize. If we replaced summarize with transform, the values from the new columns would be added to the input table, rather than forming a new (aggregated) table, as follows:
> ddply(iris,
+ .(Species),
+ transform,
+ sepal_area = mean(Sepal.Length * Sepal.Width),
+ petal_area = mean(Petal.Length * Petal.Width))
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.1         3.8          1.9         0.4     setosa
2          5.5         4.2          1.4         0.2     setosa
3          4.9         3.6          1.4         0.1     setosa
4          5.7         2.8          4.1         1.3 versicolor
5          5.5          NA          4.0         1.3 versicolor
6          6.3         3.3          6.0         2.5  virginica
  sepal_area petal_area
1      20.04  0.3933333
2      20.04  0.3933333
3      20.04  0.3933333
4         NA  5.2650000
5         NA  5.2650000
6      20.79 15.0000000
As you can see, the original table has been preserved; our two newly calculated columns (sepal_area and petal_area) have simply been joined to it on the right. Note that the values in these columns are the averages over the subsets (the species); thus, within each subset, the values are duplicated.
As another example, we will now use ddply to aggregate our climatic data table from a monthly to an annual timescale. In other words, we would like to obtain annual averages (in the case of temperature) or annual sums (in the case of precipitation) for the climatic variables. For this, we will first filter out those variable/year combinations where not all 12 months are present. For example, if the minimum monthly temperature data at the "IZANA SP" station for 1985 is available for only 11 (rather than 12) months, we would like to remove the minimum temperature data for that year and that station altogether, to reduce the bias in the annual average. To do this, we need to find out how many non-missing values we have for every station/year/variable combination. We will apply ddply to our molten dat_melt table:
> dat3 = ddply(dat_melt,
+ .(station, year, variable),
+ transform,
+ months_available = length(value[!is.na(value)]))
> head(dat3)
            station     station_name elevation latitude longitude
1 GHCND:SP000003195 MADRID RETIRO SP       667  40.4117   -3.6781
2 GHCND:SP000003195 MADRID RETIRO SP       667  40.4117   -3.6781
3 GHCND:SP000003195 MADRID RETIRO SP       667  40.4117   -3.6781
4 GHCND:SP000003195 MADRID RETIRO SP       667  40.4117   -3.6781
5 GHCND:SP000003195 MADRID RETIRO SP       667  40.4117   -3.6781
6 GHCND:SP000003195 MADRID RETIRO SP       667  40.4117   -3.6781
        date month year variable value months_available
1 1984-01-01     1 1984     tpcp  25.3               12
2 1984-02-01     2 1984     tpcp  37.2               12
3 1984-03-01     3 1984     tpcp  58.0               12
4 1984-04-01     4 1984     tpcp  39.2               12
5 1984-05-01     5 1984     tpcp  82.4               12
6 1984-06-01     6 1984     tpcp  35.6               12
The new table, dat3, contains all of the data from dat_melt (since transform was used), in addition to the new months_available column, which contains the number of non-NA elements for the respective .(station, year, variable) subset. Using this column, we can now remove those station/year/variable subsets that have less than 12 months of data:
> nrow(dat3)
[1] 85608
> dat3 = dat3[dat3$months_available == 12, ]
> nrow(dat3)
[1] 80976
Overall, 4,632 rows have been removed. Now we can aggregate the dat3 table, knowing that the annual values will always be based on 12 months of data. We will do this in two steps.
First, we will create a table holding only the location data (the latitude, longitude, and elevation columns) for each meteorological station. It is frequently useful to have a table such as this; for example, to plot the stations' spatial locations (which we are going to do in Chapter 7, Combining Vector and Raster Datasets):
> spain_stations = ddply(dat3,
+ .(station),
+ summarize,
+ latitude = latitude[1],
+ longitude = longitude[1],
+ elevation = elevation[1])
> head(spain_stations)
            station latitude longitude elevation
1 GHCND:SP000003195  40.4117   -3.6781       667
2 GHCND:SP000004452  38.8831   -6.8292       185
3 GHCND:SP000006155  36.6667   -4.4881         7
4 GHCND:SP000008027  43.3075   -2.0392       251
5 GHCND:SP000008181  41.2928    2.0697         4
6 GHCND:SP000008202  40.9592   -5.4981       790
Here, the aggregation was performed by the station column only; therefore, we obtain a rather short table, with one row for each meteorological station (96 rows in total):
> nrow(spain_stations)
[1] 96
Note that with latitude = latitude[1] we say, in plain language: take the first latitude value you see and assign it to the aggregated table, per station. Since the location of a given station should be constant over time, we can take any of the latitude values. However, we do not know exactly how many rows of data each station has (actually, in this particular case, we do know it is at least 12, since otherwise the data for that station could not have formed a complete 12-month series and would have been removed altogether); therefore, selecting the first one is a reasonable option.
We will save this data.frame object to a file, since we will use it in subsequent chapters:
> write.csv(spain_stations, "C:\\Data\\spain_stations.csv",
+ row.names = FALSE)
The additional parameter row.names indicates whether we would like the row names to be saved as an additional column in the CSV file (in this case, we do not).
Next, we will aggregate the climatic data itself, per station/variable/year combination. Here, our purpose is to find the sum of the monthly values (in the case of rainfall) or their average (in the case of temperature). We will use ifelse to assign the sum of the 12 values when the variable is tpcp, or their average otherwise (when the variable is either mmxt or mmnt). Let's take a look at the following example:
> spain_annual = ddply(dat3,
+ .(station, variable, year),
+ summarize,
+ value = ifelse(variable[1] == "tpcp",
+ sum(value, na.rm = TRUE),
+ mean(value, na.rm = TRUE)))
> head(spain_annual)
            station variable year value
1 GHCND:SP000003195     tpcp 1984 494.3
2 GHCND:SP000003195     tpcp 1985 338.8
3 GHCND:SP000003195     tpcp 1986 387.5
4 GHCND:SP000003195     tpcp 1987 544.9
5 GHCND:SP000003195     tpcp 1988 413.9
6 GHCND:SP000003195     tpcp 1989 560.7
Note that, once again, we consider only the first element of the variable column (variable[1]) when deciding whether to use the sum or mean function, since all values of the variable column are, by definition, identical within a given station/year/variable combination.
We will save this data.frame object to a file for later use:
> write.csv(spain_annual, "C:\\Data\\spain_annual.csv",
+ row.names = FALSE)
Our final exercise related to the processing of tabular data will be to join the spain_stations and spain_annual tables into a single table, containing both the station coordinates and the climatic data. For this, you first have to learn how to join tables, which we shall do in the next section.
Joining tables is another common operation in data analysis. Those working with spatial data may be familiar with the task of joining data from an external table (such as an Excel file) with the attribute table of a spatial dataset (such as an ESRI Shapefile), which is an example of a join operation.
The plyr package offers a very convenient function called join, for joining data.frame objects. Note, once again, that there are other ways to perform this task in R, such as using the merge function from the base packages. However, in addition to its simplicity, an important advantage of join is that it always preserves the original order of the rows in the data.frame object we join to. This feature will be especially important later, when performing the previously mentioned task of joining tables to the attributes of spatial datasets (see Chapter 5, Working with Points, Lines, and Polygons).
The first two parameters of the join function are x and y, which indicate the two data.frame objects to join, and the third parameter is by, which indicates the column(s) to join by. The remaining two parameters indicate, by default, that we would like to perform a left join (type="left", retaining all rows of x, as opposed to a "right" join, where we retain all rows of y) and to retain all records if there are duplicates (match="all"), which is what we would like to do in most cases (see ?join for more details).
For example, let's say we have a table where each row corresponds to a date, and we would like to create a new column indicating the season that date belongs to (winter, spring, summer, or fall). One way of doing this is to create a second table indicating the season each month belongs to, and then join it to the first table according to the common month column.
For this example, we will read another CSV file with a series of dates. The dates table looks as follows:
> dates = read.csv("C:\\Data\\modis_dates.csv")
> head(dates)
  image day month year
1     1  18     2 2000
2     2   5     3 2000
3     3  21     3 2000
4     4   6     4 2000
5     5  22     4 2000
6     6   8     5 2000
This table indicates the acquisition dates of MODIS satellite images, with the image column corresponding to bands in a multiband raster we will work with later (see the next chapter).
As another exercise in working with dates, we will create a column of the Date class from the day, month, and year columns, as follows:
> dates$date = as.Date(
+ paste(dates$year, dates$month, dates$day, sep = "-"))
> head(dates)
  image day month year       date
1     1  18     2 2000 2000-02-18
2     2   5     3 2000 2000-03-05
3     3  21     3 2000 2000-03-21
4     4   6     4 2000 2000-04-06
5     5  22     4 2000 2000-04-22
6     6   8     5 2000 2000-05-08
Let's now create a table of seasons. To do this, we will use the rep function with a parameter we have not used so far, each, which indicates that we want to repeat each element of a given vector several times (rather than repeating the whole vector):
> month = c(12, 1:11)
> month
 [1] 12  1  2  3  4  5  6  7  8  9 10 11
> season = rep(c("winter","spring","summer","fall"), each = 3)
> season
 [1] "winter" "winter" "winter" "spring" "spring" "spring"
 [7] "summer" "summer" "summer" "fall"   "fall"   "fall"
> seasons = data.frame(month, season)
> seasons
   month season
1     12 winter
2      1 winter
3      2 winter
4      3 spring
5      4 spring
6      5 spring
7      6 summer
8      7 summer
9      8 summer
10     9   fall
11    10   fall
12    11   fall
The seasons table now indicates which season a given month belongs to. The final step is to join the two tables, dates and seasons. The following expression states that we would like to join the seasons table to the dates table by month:
> dates = join(dates, seasons, "month")
> head(dates)
  image day month year       date season
1     1  18     2 2000 2000-02-18 winter
2     2   5     3 2000 2000-03-05 spring
3     3  21     3 2000 2000-03-21 spring
4     4   6     4 2000 2000-04-06 spring
5     5  22     4 2000 2000-04-22 spring
6     6   8     5 2000 2000-05-08 spring
We will use this table in several examples in the subsequent chapters.
Returning to our climatic data example, we will now join the two tables we created in the previous section: the station coordinates summary (spain_stations) and the aggregated annual climatic data (spain_annual). The resulting data.frame object will be named combined, as shown in the following expression:
> combined = join(spain_stations,
+ spain_annual,
+ by = "station",
+ type = "right")
Note that here we use the type="right" option, since we would like to retain all rows of the second table, spain_annual (rather than all rows of the first table, as in the previous example).
The table looks as follows:
> head(combined)
            station latitude longitude elevation variable year
1 GHCND:SP000003195  40.4117   -3.6781       667     tpcp 1984
2 GHCND:SP000003195  40.4117   -3.6781       667     tpcp 1985
3 GHCND:SP000003195  40.4117   -3.6781       667     tpcp 1986
4 GHCND:SP000003195  40.4117   -3.6781       667     tpcp 1987
5 GHCND:SP000003195  40.4117   -3.6781       667     tpcp 1988
6 GHCND:SP000003195  40.4117   -3.6781       667     tpcp 1989
  value
1 494.3
2 338.8
3 387.5
4 544.9
5 413.9
6 560.7
As the output shows, this table contains duplicated data, since the latitude, longitude, and elevation records are identical for each station/variable/year combination. In such cases, it is more efficient to keep two separate tables (such as spain_stations and spain_annual) rather than join all of the data into a single table (such as combined).