Using dplyr pipelines to manipulate data frames

Another popular package is dplyr, which implements a grammar of data manipulation. Instead of relying on the subsetting operator ([]), dplyr defines a set of basic verb functions as the building blocks of data operations and imports a pipeline operator to chain these functions together to perform complex, multistep tasks.

Run the following code to install dplyr from CRAN if you don't have it yet:

install.packages("dplyr") 

First, we will reload the product tables to reset all data to their original forms:

library(readr) 
product_info <- read_csv("data/product-info.csv") 
product_stats <- read_csv("data/product-stats.csv") 
product_tests <- read_csv("data/product-tests.csv") 
toy_tests <- read_csv("data/product-toy-tests.csv") 

Then, we will load the dplyr package:

library(dplyr) 
##  
## Attaching package: 'dplyr' 
## The following objects are masked from 'package:data.table': 
##  
##     between, last 
## The following objects are masked from 'package:stats': 
##  
##     filter, lag 
## The following objects are masked from 'package:base': 
##  
##     intersect, setdiff, setequal, union 

The preceding output indicates that dplyr generalizes a number of built-in functions, so they are masked after the package is attached.
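
If you still need one of the masked functions, you can always call it with an explicit namespace prefix. The following lines are a minimal illustration using two of the functions listed in the preceding output:

stats::filter(1:10, rep(1/3, 3))  # the moving-average filter from stats
base::setdiff(1:5, 3:8)           # the set difference function from base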

Now, we can start to play with the verb functions it provides. First, we will use select to choose columns from the provided data frame, creating a new table with only the given columns:

select(product_info, id, name, type, class) 
## Source: local data frame [6 x 4] 
##  
##      id      name  type   class 
##   (chr)     (chr) (chr)   (chr) 
## 1   T01    SupCar   toy vehicle 
## 2   T02  SupPlane   toy vehicle 
## 3   M01     JeepX model vehicle 
## 4   M02 AircraftX model vehicle 
## 5   M03    Runner model  people 
## 6   M04    Dancer model  people 

The printing of the preceding table is a bit different from the way both data.frame and data.table are printed: it not only shows the table itself but also includes a header indicating the size of the data frame and the data type of each column.
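
If you prefer the plain data.frame printing, the result can simply be converted with as.data.frame(). A small sketch:

as.data.frame(select(product_info, id, name, type, class))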

It is clear that select() uses non-standard evaluation, which allows us to use the column names of the given data frame directly as arguments. It works in a way similar to how subset(), transform(), and with() work.
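
As a quick comparison, the following base R calls produce an equivalent subset of product_info. This is only a sketch; note that the [] form requires quoted column names, while subset() evaluates its select argument non-standardly:

# base R subsetting with quoted column names
product_info[, c("id", "name", "type", "class")]
# subset() also uses non-standard evaluation
subset(product_info, select = c(id, name, type, class))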

We can use filter to filter the data frame by a logical condition, which is also evaluated in the context of the data frame:

filter(product_info, released == "yes") 
## Source: local data frame [4 x 5] 
##  
##      id      name  type   class released 
##   (chr)     (chr) (chr)   (chr)    (chr) 
## 1   T01    SupCar   toy vehicle      yes 
## 2   M01     JeepX model vehicle      yes 
## 3   M02 AircraftX model vehicle      yes 
## 4   M03    Runner model  people      yes 

If we want to filter records with multiple conditions, we only need to write each condition as an argument of filter():

filter(product_info, 
  released == "yes", type == "model") 
## Source: local data frame [3 x 5] 
##  
##      id      name  type   class released 
##   (chr)     (chr) (chr)   (chr)    (chr) 
## 1   M01     JeepX model vehicle      yes 
## 2   M02 AircraftX model vehicle      yes 
## 3   M03    Runner model  people      yes 

The mutate function is used to create a new data frame with new columns added or existing columns replaced. It works like transform(), but it also supports in-place assignment with := if the provided data is a data.table:

mutate(product_stats, density = size / weight) 
## Source: local data frame [6 x 5] 
##  
##      id material  size weight   density 
##   (chr)    (chr) (int)  (dbl)     (dbl) 
## 1   T01    Metal   120   10.0 12.000000 
## 2   T02    Metal   350   45.0  7.777778 
## 3   M01 Plastics    50     NA        NA 
## 4   M02 Plastics    85    3.0 28.333333 
## 5   M03     Wood    15     NA        NA 
## 6   M04     Wood    16    0.6 26.666667 
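
Unlike transform(), mutate() also allows a new column to refer to columns created earlier in the same call. Here is a small sketch based on the same table; the heavy column is hypothetical:

mutate(product_stats,
  density = size / weight,
  heavy = density > 20)  # refers to the density column created above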

The arrange function is used to create a new data frame sorted by one or more columns. The desc() function indicates descending order:

arrange(product_stats, material, desc(size), desc(weight)) 
## Source: local data frame [6 x 4] 
##  
##      id material  size weight 
##   (chr)    (chr) (int)  (dbl) 
## 1   T02    Metal   350   45.0 
## 2   T01    Metal   120   10.0 
## 3   M02 Plastics    85    3.0 
## 4   M01 Plastics    50     NA 
## 5   M04     Wood    16    0.6 
## 6   M03     Wood    15     NA 

The dplyr package provides a rich set of join operations, including inner_join, left_join, right_join, full_join, semi_join, and anti_join. If the two tables to be joined have records that do not match, these join operations may behave very differently. For product_info and product_tests, the records match exactly, so left_join should return the same results as merge:

product_info_tests <- left_join(product_info, product_tests, by = "id") 
product_info_tests 
## Source: local data frame [6 x 8] 
##  
##      id      name  type   class released quality durability 
##   (chr)     (chr) (chr)   (chr)    (chr)   (int)      (int) 
## 1   T01    SupCar   toy vehicle      yes      NA         10 
## 2   T02  SupPlane   toy vehicle       no      10          9 
## 3   M01     JeepX model vehicle      yes       6          4 
## 4   M02 AircraftX model vehicle      yes       6          5 
## 5   M03    Runner model  people      yes       5         NA 
## 6   M04    Dancer model  people       no       6          6 
## Variables not shown: waterproof (chr) 

To know more about the difference between those join operations, run ?dplyr::join.
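
For a quick taste of the filtering joins, semi_join() keeps the rows of the left table that have a match in the right table, while anti_join() keeps the rows that do not. A minimal sketch with our tables; since the records here match exactly, anti_join() should return no rows:

# rows of product_info with a matching id in product_tests
semi_join(product_info, product_tests, by = "id")
# rows of product_info with no match in product_tests (none here)
anti_join(product_info, product_tests, by = "id")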

To summarize the data by groups, we need to first create a grouped table with group_by() and then aggregate it with summarize(). For example, we will group product_info_tests by type and class, and then, for each group, we will calculate the average values of quality and durability:

summarize(group_by(product_info_tests, type, class), 
  mean_quality = mean(quality, na.rm = TRUE), 
  mean_durability = mean(durability, na.rm = TRUE)) 
## Source: local data frame [3 x 4] 
## Groups: type [?] 
##  
##    type   class mean_quality mean_durability 
##   (chr)   (chr)        (dbl)           (dbl) 
## 1 model  people          5.5             6.0 
## 2 model vehicle          6.0             4.5 
## 3   toy vehicle         10.0             9.5 
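
Helper functions such as n() can also be used within summarize() to count the records in each group. A small sketch:

# count the number of products in each type group
summarize(group_by(product_info_tests, type), count = n())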

From the preceding code examples, you learned the verb functions select(), filter(), mutate(), arrange(), group_by(), and summarize(). Each of them is designed to do one small thing, but together they can perform comprehensive data operations when properly composed. Apart from these functions, dplyr imports the pipeline operator %>% from the magrittr package to compose functions into pipelines.

Suppose we have product_info and product_tests. We need to analyze the released products by computing the average values of quality and durability for each type and class group and present the summary data in descending order of average quality. This can be done nicely with the dplyr verb functions composed by the pipeline operator:

product_info %>% 
  filter(released == "yes") %>% 
  inner_join(product_tests, by = "id") %>% 
  group_by(type, class) %>% 
  summarize( 
    mean_quality = mean(quality, na.rm = TRUE), 
    mean_durability = mean(durability, na.rm = TRUE)) %>% 
  arrange(desc(mean_quality)) 
## Source: local data frame [3 x 4] 
## Groups: type [2] 
##  
##    type   class mean_quality mean_durability 
##   (chr)   (chr)        (dbl)           (dbl) 
## 1 model vehicle            6             4.5 
## 2 model  people            5             NaN 
## 3   toy vehicle          NaN            10.0 

But how does %>% work? The pipeline operator basically does only one thing: it passes the result of the expression on its left-hand side as the first argument of the function call on its right-hand side; that is, x %>% f(...) is basically evaluated as f(x, ...). Since %>% is a package-defined binary operator, it allows us to chain function calls to avoid redundant intermediate values or to decompose deeply nested calls.

Suppose we need to transform d0 to d3 through three steps, each of which calls a function with the previous result and an additional argument. If we manipulate data like this, there will be many intermediate results, which can consume a lot of memory when the data is large:

d1 <- f1(d0, arg1) 
d2 <- f2(d1, arg2) 
d3 <- f3(d2, arg3) 

If we want to avoid intermediate results, we have to write nested calls. This does not look straightforward at all, especially when each function call takes numerous arguments:

f3(f2(f1(d0, arg1), arg2), arg3) 

Using the pipeline operator, the workflow can be rearranged as follows:

d0 %>% 
  f1(arg1) %>% 
  f2(arg2) %>% 
  f3(arg3) 

The code looks much cleaner and more straightforward. The whole expression not only looks like a pipeline but also works like one: d0 %>% f1(arg1) is evaluated as f1(d0, arg1), whose result is piped into f2(arg2), whose result is then piped into f3(arg3). The output of each step becomes the input of the next step.
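
A tiny example with base R functions makes the equivalence concrete. This is only a sketch, assuming nothing but base R and the %>% operator:

# both expressions compute the same value
sqrt(sum(1:10))
1:10 %>% sum() %>% sqrt()
# the . placeholder sends the value to another argument position
10 %>% seq(1, .)  # equivalent to seq(1, 10)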

The pipeline operator, therefore, works not only with dplyr functions but also with any other function. Suppose we want to make a density plot of the diamond prices:

data(diamonds, package = "ggplot2") 
plot(density(diamonds$price, from = 0), 
  main = "Density plot of diamond prices") 

The plot generated is shown as follows:

(Figure: density plot of diamond prices)

Using the pipeline operator, we can rewrite code as follows:

diamonds$price %>% 
  density(from = 0) %>% 
  plot(main = "Density plot of diamond prices") 

Like data.table, dplyr also supplies do() to perform arbitrary operations on each group of data. For example, we can group diamonds by cut, and for each group, we can fit a linear model of log(price) ~ carat. Unlike with data.table, we need to name such operations so that the results can be stored as columns. Also, the expression in do() is not directly evaluated in the context of the grouped data; instead, we need to use . to represent the data:

models <- diamonds %>% 
  group_by(cut) %>% 
  do(lmod = lm(log(price) ~ carat, data = .)) 
models 
## Source: local data frame [5 x 2] 
## Groups: <by row> 
##  
##         cut    lmod 
##      (fctr)   (chr) 
## 1      Fair <S3:lm> 
## 2      Good <S3:lm> 
## 3 Very Good <S3:lm> 
## 4   Premium <S3:lm> 
## 5     Ideal <S3:lm> 

Note that a new column, lmod, is created. It is not a typical data column of atomic vectors; rather, it is a list of linear model objects, that is, the model for each value of cut is stored in the list-typed column lmod. We can access each model using an index:

models$lmod[[1]] 
##  
## Call: 
## lm(formula = log(price) ~ carat, data = .) 
##  
## Coefficients: 
## (Intercept)        carat   
##       6.785        1.251 
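
Since the models live in an ordinary list column, standard list tools apply to all of them at once. For example, the following sketch extracts the carat coefficient from every model with base R's sapply():

# extract the carat slope from each group's model
sapply(models$lmod, function(m) coef(m)["carat"])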

The do() function can be very helpful for performing highly customized operations. For example, suppose we need to analyze the toy_tests data by summarizing the quality and durability of each product, but we only need the top three test records with the most samples, and the quality and durability of each product should be averages weighted by the number of samples.

Using dplyr functions and the pipeline operator, the preceding task can be done easily with the following code:

toy_tests %>% 
  group_by(id) %>% 
  arrange(desc(sample)) %>% 
  do(head(., 3)) %>% 
  summarize( 
    quality = sum(quality * sample) / sum(sample),  
    durability = sum(durability * sample) / sum(sample)) 
## Source: local data frame [2 x 3] 
##  
##      id  quality durability 
##   (chr)    (dbl)      (dbl) 
## 1   T01 9.319149   9.382979 
## 2   T02 9.040000   8.340000 
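
As a side note, the two weighted averages can equivalently be written with base R's weighted.mean(). The following sketch reproduces the preceding pipeline with that function:

toy_tests %>%
  group_by(id) %>%
  arrange(desc(sample)) %>%
  do(head(., 3)) %>%
  summarize(
    quality = weighted.mean(quality, sample),
    durability = weighted.mean(durability, sample))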

Note that when the data is grouped, all subsequent operations are performed by group. To see the intermediate result, we will run the pipeline up to, but not including, do(head(., 3)):

toy_tests %>% 
  group_by(id) %>% 
  arrange(desc(sample)) 
## Source: local data frame [8 x 5] 
## Groups: id [2] 
##  
##      id     date sample quality durability 
##   (chr)    (int)  (int)   (int)      (int) 
## 1   T01 20160405    180       9         10 
## 2   T01 20160302    150      10          9 
## 3   T01 20160502    140       9          9 
## 4   T01 20160201    100       9          9 
## 5   T02 20160403     90       9          8 
## 6   T02 20160502     85      10          9 
## 7   T02 20160303     75       8          8 
## 8   T02 20160201     70       7          9 

We get all records sorted by sample in descending order. Then, do(head(., 3)) will evaluate head(., 3) for each group, where . represents the data in the group:

toy_tests %>% 
  group_by(id) %>% 
  arrange(desc(sample)) %>% 
  do(head(., 3)) 
## Source: local data frame [6 x 5] 
## Groups: id [2] 
##  
##      id     date sample quality durability 
##   (chr)    (int)  (int)   (int)      (int) 
## 1   T01 20160405    180       9         10 
## 2   T01 20160302    150      10          9 
## 3   T01 20160502    140       9          9 
## 4   T02 20160403     90       9          8 
## 5   T02 20160502     85      10          9 
## 6   T02 20160303     75       8          8 

Now, we get the top three records with the most samples for each product, which makes it easy to summarize the data as intended.

The dplyr package defines a very intuitive grammar of data manipulation and provides high-performance verb functions designed for use in pipelines. To learn more, I recommend that you read the package vignettes (https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html) and visit the interactive tutorial (https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial) at DataCamp.
