Merging datasets

Besides the previously described elementary actions on a single dataset, joining multiple data sources is one of the most used methods in everyday action. The most often used solution for such a task is to simply call the merge S3 method, which can act as a traditional SQL inner and left/right/full outer joiner of operations—represented in a brief summary by C.L. Moffatt (2008) as follows:

Merging datasets

The dplyr package provides some easy ways for doing the previously presented join operations right from R, in an easy way:

  • inner_join: This joins the variables of all the rows, which are found in both datasets
  • left_join: This includes all the rows from the first dataset and join variables from the other table
  • semi_join: This includes only those rows from the first dataset that are found in the other one as well
  • anti_join: This is similar to semi_join, but includes only those rows from the first dataset that are not found in the other one

    Note

    For more examples, take a look at the Two-table verbs dplyr vignette, and the Data Wrangling cheat sheet listed in the References chapter at the end of the book.

These features are also supported by the mult argument of [ operator of data.table call, but for the time being, let's stick to the simpler use cases.

In the following example, we will merge a tiny dataset with the hflights data. Let's create the data.frame demo by assigning names to the possible values of the DayOfWeek variable:

> (wdays <- data.frame(
+     DayOfWeek       = 1:7,
+     DayOfWeekString = c("Sunday", "Monday", "Tuesday",
+         "Wednesday", "Thursday", "Friday", "Saturday")
+     ))
  DayOfWeek DayOfWeekString
1         1          Sunday
2         2          Monday
3         3         Tuesday
4         4       Wednesday
5         5        Thursday
6         6          Friday
7         7        Saturday

Let's see how we can left-join the previously defined data.frame with another data.frame and other tabular objects, as merge also supports fast operations on, for example, data.table:

> system.time(merge(hflights, wdays))
   user  system elapsed 
  0.700   0.000   0.699 
> system.time(merge(hflights_dt, wdays, by = 'DayOfWeek'))
   user  system elapsed 
  0.006   0.000   0.009

The prior example automatically merged the two tables via the DayOfWeek variable, which was part of both datasets and resulted in an extra variable in the original hflights dataset. However, we had to pass the variable name in the second example, as the by argument of merge.data.table defaults to the key variable of the object, which was missing then. One thing to note is that merging with data.table was a lot faster than the traditional tabular object type.

Note

Any ideas on how to improve the previous didactical example? Instead of merging, the new variable could be computed as well. See for example, the weekdays function from base R: weekdays(as.Date(with(hflights, paste(Year, Month, DayofMonth, sep = '-')))).

A much simpler way of merging datasets is when you simply want to add new rows or columns to the dataset with the same structure. For this end, rbind and cbind, or rBind and cBind for sparse matrices, do a wonderful job.

One of the most often used functions along with these base commands is do.call, which can execute the rbind or cbind call on all elements of a list, thus enabling us, for example, to join a list of data frames. Such lists are usually created by lapply or the related functions from the plyr package. Similarly, rbindlist can be called to merge a list of data.table objects in a much faster way.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.39.133