
6. Pipelines: magrittr

Thomas Mailund, Aarhus, Denmark

Data analysis consists of several steps, where your data moves through different stages of transformation and cleaning before you finally get to model construction. In practical terms, this means that your R code will consist of a series of function calls where the output of one is the input of the next. The pattern is typical, but a straightforward implementation of it has several drawbacks. The Tidyverse has for many years provided a “pipe operator” to alleviate this, and since R 4.1, there is also a built-in operator in the language itself. Although this chapter is mainly about the magrittr pipe operator, %>%, I will describe the native |> operator as well and point out a few places where they differ in their behavior.

The native operator, |>, is readily available in R version 4.1.0 and later. You get the pipe operator implemented in the magrittr package if you load the tidyverse package:
library(tidyverse)
or explicitly load the package:
library(magrittr)

If you load magrittr through the tidyverse package, you will get the most common pipe operator, %>%, but not the alternative ones described later in this chapter. For those, you need to load magrittr explicitly.
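If you are unsure whether your R installation is recent enough for the native operator, you can check from within R (a quick check of my own, not from the chapter's examples):
# TRUE when the native |> operator is available
getRversion() >= "4.1.0"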

The Problem with Pipelines

Consider this data from Chapter 5.
write_csv(
    tribble(
        ~country,  ~`2002`, ~`2003`, ~`2004`, ~`2005`,
        "Numenor",  123456,  132654,      NA,  324156,
        "Westeros", 314256,  NA,          NA,  465321,
        "Narnia",   432156,  NA,          NA,      NA,
        "Gondor",   531426,  321465,  235461,  463521,
        "Laputa",    14235,   34125,   45123,   51234,
    ),
    "data/income.csv"
)

I have written it to a file, data/income.csv, to make the following examples easier to read.

In Chapter 5, we reformatted and cleaned this data using pivot_longer() and drop_na():
mydata <- read_csv("data/income.csv",
                   col_types = "cdddd")
mydata <- pivot_longer(
    data = mydata,
    names_to = "year",
    values_to = "mean_income",
    !country
)
mydata <- drop_na(mydata)

This is a typical pipeline, although a short one. It consists of three steps: (1) read the data into R, (2) tidy the data, and (3) clean the data.

There is nothing inherently wrong with code like this. Each step is easy to read; it is a function call that transforms your data from one form to another. The pipeline nature of the code, however, is not explicit. We can read the code and see that the output of one function is the input of the next, but what if it isn’t?

Consider this:
mydata <- read_csv("data/income.csv",
                   col_types = "cdddd")
data <- pivot_longer(...)    # Are we assigning to the right name?
mydata <- drop_na(mydata)    # Input from read_csv(), not pivot_longer()

The result of pivot_longer() is assigned to the variable data, and then drop_na() is called on mydata. If our intent were the preceding pipeline, this would be a mistake. But maybe we wanted the result of pivot_longer() to be a separate data table needed for some downstream analysis.

A series of transformation steps that constitute a pipeline can only be recognized as such by following programming conventions, and unless the code is well documented and well understood, we cannot immediately see from the code whether assigning to data instead of mydata is an error.

We can make pipelines more explicit. If the output of one function should be the input of the next, we can nest the function calls:
mydata <- drop_na(
    pivot_longer(
        data = read_csv("data/income.csv",
                        col_types = "cdddd"),
        names_to = "year",
        values_to = "mean_income",
        !country
    )
)

This makes the pipeline intention explicit but makes the code very hard to read; you have to work out the pipeline from the innermost function call to the outermost.

The magrittr package gives us notation for specifying readable pipelines explicitly.

Pipeline Notation

The pipeline operators, |> and %>%, introduce syntactic sugar for function calls. The code
x %>% f()
or
x |> f()
is equivalent to
f(x)
and
x %>% f() %>% g() %>% h()
or
x |> f() |> g() |> h()
is equivalent to
h(g(f(x)))
So, by the way, is
x %>% f %>% g %>% h
but not
x |> f |> g |> h
The native operator needs a function call, f(), on the right-hand side of the operator, while the magrittr operator will accept either a function call, f(), or just a function name, x %>% f. You can also give %>% an anonymous function:
"hello, "%>% ((y){ paste(y, "world!")})
## [1] "hello, world!"
but |> still needs a function call, so if you give it an anonymous function, you must call the function as well (add () after the function definition).
"hello," |> ((y){ paste(y, "world!")})()
## [1] "hello, world!"
In either case, you must put the function definition in parentheses. This will be an error:
x |> \(y) { ... }
This is because both operators are syntactic sugar that translate x %>% f(), x %>% f, or x |> f() into f(x), and the function definition, function(y) { ... } or \(y) { ... }, is itself a function call. The left-hand side of the pipeline is put into the function definition:
x |> \(y) { ... }
is translated into
\(x, y) { ... }

for both operators. Putting the function definition in parentheses prevents this; while function(y) { ... } is a function call (to the function called function), (function(y) { ... }) isn’t. That is, when the function definition is in parentheses, it is no longer syntactically a function call (but an expression in parentheses). If it is not a call, |> will not accept it as the right-hand side, but %>% will. For %>%, it still isn’t a call, just something that evaluates to one, so the function definition isn’t modified, but the result gets the left-hand side as an argument.

Anyway, that got technical, but it is something you need to know if you use anonymous functions in pipelines.

With the %>% operator, you do not need the parentheses for the function calls, but most people prefer them to make it clear that we are dealing with functions, and the syntax then remains valid if you change the code to the |> operator. Also, if your functions take more than one argument, parentheses are needed, so if you always include them, you get a consistent notation.
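To see these equivalences with real functions, here is a small illustration of my own using base R's sqrt() and sum():
c(4, 9, 16) |> sqrt() |> sum()   # the same as sum(sqrt(c(4, 9, 16)))
## [1] 9
c(4, 9, 16) %>% sqrt %>% sum     # %>% also accepts bare function names
## [1] 9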

The nested function call pipeline from earlier
mydata <- drop_na(pivot_longer(read_csv(...)))
can thus be rewritten in pipe form
mydata <- read_csv(...) %>% pivot_longer(...) %>% drop_na()
or, with all the arguments included, as
mydata <- read_csv("data/income.csv", col_types = "cdddd") %>%
    pivot_longer(names_to = "year", values_to = "mean_income", !country) %>%
    drop_na()

The pipeline notation combines the readability of single function call steps with explicit notation for data processing pipelines.

Pipelines and Function Arguments

The pipe operator is left-associative; that is, it evaluates from left to right. The expression lhs %>% rhs expects the left-hand side (lhs) to be data and the right-hand side (rhs) to be a function or a function call (always a function call for |>). The result will be rhs(lhs). The output of this function call is the left-hand side of the next pipe operator in the pipeline or the result of the entire pipeline. If the right-hand side function takes more than one argument, you provide these in the rhs expression. The expression
lhs %>% rhs(x, y, z)
will be evaluated as the function call
rhs(lhs, x, y, z)

By default, the left-hand side is given as the first argument to the right-hand side function. The arguments that you explicitly write on the right-hand side expression are the additional function parameters.
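For example, with paste() (an illustration of my own):
# "a" is inserted as the first argument; "b" and "c" are the additional ones
"a" %>% paste("b", "c")
## [1] "a b c"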

With the magrittr operator, %>%, but not the |> operator, you can change the default input position using the special variable . (dot). If the right-hand side expression has . as a parameter, then that is where the left-hand side value goes. The following four pipelines are all equivalent:
mydata <- read_csv("data/income.csv",
                   col_types = "cdddd") |>
    pivot_longer(names_to = "year", values_to = "mean_income", !country) |>
    drop_na()
mydata <- read_csv("data/income.csv",
                   col_types = "cdddd") %>%
    pivot_longer(names_to = "year", values_to = "mean_income", !country) %>%
    drop_na()
mydata <- read_csv("data/income.csv",
                   col_types = "cdddd") %>%
    pivot_longer(data = ., names_to = "year", values_to = "mean_income", !country) %>%
    drop_na()
mydata <- read_csv("data/income.csv",
                   col_types = "cdddd") %>%
    pivot_longer(names_to = "year", values_to = "mean_income", !country, data = .) %>%
    drop_na()
You can use a dot more than once on the right-hand side:
rnorm(5) %>% tibble(x = ., y = .)
## # A tibble: 5 × 2
##         x       y
##     <dbl>   <dbl>
## 1 -0.951  -0.951
## 2  0.0206  0.0206
## 3  1.14    1.14
## 4 -0.172  -0.172
## 5  0.0795  0.0795
You can also use it in nested function calls.
rnorm(5) %>% tibble(x = ., y = abs(.))
## # A tibble: 5 × 2
##          x       y
##      <dbl>   <dbl>
## 1  1.61    1.61
## 2  1.86    1.86
## 3 -0.211   0.211
## 4 -0.00352 0.00352
## 5 -0.522   0.522
If the dot is only found in nested function calls, however, magrittr will still add it as the first argument to the right-hand side function.
rnorm(5) %>% tibble(x = sin(.), y = abs(.))
## # A tibble: 5 × 3
##        .      x     y
##    <dbl>  <dbl> <dbl>
## 1  1.10   0.893 1.10
## 2 -0.155 -0.155 0.155
## 3  0.102  0.102 0.102
## 4  0.627  0.587 0.627
## 5  0.280  0.276 0.280
To avoid this, you can put the right-hand side expression in curly brackets:
rnorm(5) %>% { tibble(x = sin(.), y = abs(.)) }
## # A tibble: 5 × 2
##         x      y
##     <dbl>  <dbl>
## 1  0.661  0.723
## 2  0.0588 0.0588
## 3  0.960  1.86
## 4  0.820  0.961
## 5 -0.366  0.375

In general, you can put expressions in curly brackets as the right-hand side of a pipe operator and have magrittr evaluate them. Think of it as a way to write one-parameter anonymous functions. The input variable is the dot, and the expression inside the curly brackets is the body of the function.
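As a small illustration of my own:
# The curly-bracket expression acts as a one-parameter anonymous
# function with the dot as its argument
3 %>% { .^2 + 1 }
## [1] 10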

Function Composition

It is trivial to write your own functions such that they work well with pipelines. All functions map input to output (ideally with no side effects), so all functions can be used in a pipeline. If the key input is the first argument of a function you write, the default placement of the left-hand side value will work, so that is preferable; otherwise, you can explicitly use the dot.
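As a sketch, here is a hypothetical function, add_decade(), written with the data frame as its first argument so that the default placement works; the name and the computation are my own, not from the chapter:
# A pipeline-friendly function: the data frame is the first argument
add_decade <- function(data) {
    mutate(data, decade = 10 * (as.numeric(year) %/% 10))
}
read_csv("data/income.csv", col_types = "cdddd") %>%
    pivot_longer(names_to = "year", values_to = "mean_income", !country) %>%
    drop_na() %>%
    add_decade()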

When writing pipelines, however, often you do not need to write a function from scratch. If a function is merely a composition of other function calls—it is a pipeline function itself—we can define it as such.

In mathematics, the function composition operator, ∘, defines the composition of functions f and g, g ∘ f, to be the function

(g ∘ f)(x) = g(f(x))

Many functional programming languages encourage you to write functions by combining other functions and have operators for that. It is not frequently done in R, and while you can implement function composition, there is no built-in operator. The magrittr package, however, gives you this syntax:
h <- . %>% f() %>% g()

This defines the function h, such that h(x) = g(f(x)).
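If you want to see what such an operator could look like under the hood, a minimal hand-rolled version might be the following; the name %.% is my own choice and is not part of magrittr or base R:
# A minimal composition operator (illustration only)
`%.%` <- function(g, f) function(...) g(f(...))
h <- sqrt %.% abs
h(-16)
## [1] 4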

If we take the tidy-then-clean pipeline we saw earlier, and imagine that we need to do the same for several input files, we can define a pipeline function for this as
pipeline <- . %>%
    pivot_longer(names_to = "year", values_to = "mean_income", !country) %>%
    drop_na()
pipeline
## Functional sequence with the following components:
##
##  1. pivot_longer(., names_to = "year", values_to = "mean_income", !country)
##  2. drop_na(.)
##
## Use 'functions' to extract the individual functions.
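The resulting function can then be applied like any other function:
read_csv("data/income.csv", col_types = "cdddd") %>%
    pipeline()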

Other Pipe Operations

There are three other pipe operators in magrittr. These are not imported when you load the tidyverse package, so to get access to them, you have to load magrittr explicitly.
library(magrittr)

The %<>% operator is used to run a pipeline starting with a variable and ending by assigning the result of the pipeline back to that variable.

The code
mydata <- read_csv("data/income.csv", col_types = "cdddd")
mydata %<>% pipeline()
is equivalent to
mydata <- read_csv("data/income.csv", col_types ="cdddd")
mydata <- mydata %>% pipeline()

This reassignment operator behaves similarly to the stepwise pipeline convention we considered at the start of the chapter, but it makes explicit that we are updating an existing variable. You cannot accidentally assign the result to a different variable.

If your right-hand side is an expression in curly brackets, you can refer to the input through the dot variable:
mydata <- tibble(x = rnorm(5), y = rnorm(5))
mydata %>% { .$x - .$y }
## [1] 0.3597903 1.8108120 2.3444173 3.1367824
## [5] 0.2507438
If, as here, the input is a data frame and you want to work with its columns, you need the .$ notation to access them. The %$% pipe operator opens the data frame for you so you can refer to the columns by name.
mydata %$% { x - y }
## [1] 0.3597903 1.8108120 2.3444173 3.1367824
## [5] 0.2507438
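The exposed columns are available to any expression, not only curly brackets. As a small illustration of my own, we can compute the correlation between the two columns:
mydata %$% cor(x, y)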

The tee pipe operator, %T>%, behaves like the regular pipe operator, %>%, except that it returns its input rather than the result of calling the function on its right-hand side. The regular pipe operation x %>% f() will return f(x), but the tee pipe operation x %T>% f() will call f(x) but return x. This is useful for calling functions with side effects as a step inside a pipeline, such as saving intermediate results to files or plotting data.
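As a minimal illustration of my own, print() below is called for its side effect, but the original vector is what flows on to sqrt():
c(1, 4, 9) %T>% print() %>% sqrt()
## [1] 1 4 9
## [1] 1 2 3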

If you call a function in a usual pipeline, you will pass the result of the function call on to the next function in the pipeline. If you, for example, want to plot intermediate data, you might not be so lucky that the plotting function will return the data. The ggplot2 functions will not (see Chapter 13).

If you want to plot intermediate values, you need to save the data in a variable, summarize or plot it, and then start a new pipeline with the data as input.
tidy_income <- read_csv("data/income.csv", col_types = "cdddd") |>
    pivot_longer(names_to = "year", values_to = "mean_income", !country)
# Summarize and visualize data before we continue
summary(tidy_income)
ggplot(tidy_income, aes(x = year, y = mean_income)) + geom_point()
# Continue processing
tidy_income |>
    drop_na() |>
    write_csv("data/tidy-income.csv")
With the tee operator , you can call the plotting function and continue the pipeline.
mydata <- read_csv("data/income.csv", col_types = "cdddd") %>%
    pivot_longer(names_to = "year", values_to = "mean_income", !country) %T>%
    # Summarize and then continue
    { print(summary(.)) } %T>%
    # Plot and then continue
    { print(ggplot(., aes(x = year, y = mean_income)) + geom_point()) } %>%
    drop_na() %>% write_csv("data/tidy-income.csv")