
6. Pipelines: magrittr

Thomas Mailund, Aarhus, Denmark

Data analysis consists of several steps, where your data moves through different stages of transformation and cleaning before you finally get to model construction. In practical terms, this means that your R code will consist of a series of function calls where the output of one is the input of the next. The pattern is typical, but a straightforward implementation of it has several drawbacks. The Tidyverse has for many years provided a “pipe operator” to alleviate this, and since R 4.1, there is also a built-in operator in the language itself. Although this chapter is mainly about the magrittr pipe operator, %>%, I will describe the native |> operator as well and point out a few places where they differ in their behavior.

The native operator, |>, is readily available in R version 4.1.0 and later. You get the pipe operator implemented in the magrittr package if you load the tidyverse package:
library(tidyverse)
or explicitly load the package:
library(magrittr)

If you load magrittr through the tidyverse package, you will get the most common pipe operator, %>%, but not the alternative ones described later in this chapter. For those, you need to load magrittr explicitly.
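If you are unsure whether your R installation is recent enough for the native operator, you can check from within R (a quick check of my own, not from the chapter's examples):
# TRUE when the native |> operator is available
getRversion() >= "4.1.0"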

The Problem with Pipelines

Consider this data from Chapter 5.
write_csv(
    tribble(
        ~country,  ~`2002`, ~`2003`, ~`2004`, ~`2005`,
        "Numenor",  123456,  132654,      NA,  324156,
        "Westeros", 314256,  NA,          NA,  465321,
        "Narnia",   432156,  NA,          NA,      NA,
        "Gondor",   531426,  321465,  235461,  463521,
        "Laputa",    14235,   34125,   45123,   51234,
    ),
    "data/income.csv"
)

I have written it to a file, data/income.csv, to make the following examples easier to read.

In Chapter 5, we reformatted and cleaned this data using pivot_longer() and drop_na():
mydata <- read_csv("data/income.csv",
                   col_types = "cdddd")
mydata <- pivot_longer(
    data = mydata,
    names_to = "year",
    values_to = "mean_income",
    !country
)
mydata <- drop_na(mydata)

This is a typical pipeline, although a short one. It consists of three steps: (1) read the data into R, (2) tidy the data, and (3) clean the data.

There is nothing inherently wrong with code like this. Each step is easy to read; it is a function call that transforms your data from one form to another. The pipeline nature of the code, however, is not explicit. We can read the code and see that the output of one function is the input of the next, but what if it isn’t?

Consider this:
mydata <- read_csv("data/income.csv",
                   col_types = "cdddd")
data <- pivot_longer(...)    # Are we assigning to the right name?
mydata <- drop_na(mydata)    # Input from read_csv(), not pivot_longer()

The result of pivot_longer() is assigned to the variable data, and then drop_na() is called on mydata. If our intent were the preceding pipeline, this would be a mistake. But maybe we wanted the result of pivot_longer() to be a separate data table needed for some downstream analysis.

A series of transformation steps that constitute a pipeline can only be recognized as such by following programming conventions, and unless the code is well documented and well understood, we cannot immediately see from the code whether assigning to data instead of mydata is an error.

We can make pipelines more explicit. If the output of one function should be the input of the next, we can nest the function calls:
mydata <- drop_na(
    pivot_longer(
        data = read_csv("data/income.csv",
                        col_types = "cdddd"),
        names_to = "year",
        values_to = "mean_income",
        !country
    )
)

This makes the pipeline intention explicit but makes the code very hard to read; you have to work out the pipeline from the innermost function call to the outermost.

The magrittr package gives us notation for specifying readable pipelines explicitly.

Pipeline Notation

The pipeline operators, |> and %>%, introduce syntactic sugar for function calls. The code
x %>% f()
or
x |> f()
is equivalent to
f(x)
and
x %>% f() %>% g() %>% h()
or
x |> f() |> g() |> h()
is equivalent to
h(g(f(x)))
So, by the way, is
x %>% f %>% g %>% h
but not
x |> f |> g |> h
The native operator needs a function call, f(), on the right-hand side of the operator, while the magrittr operator will accept either a function call, f(), or just a function name, x %>% f. You can also give %>% an anonymous function:
"hello, "%>% ((y){ paste(y, "world!")})
## [1] "hello, world!"
but |> still needs a function call, so if you give it an anonymous function, you must call the function as well (add () after the function definition).
"hello," |> ((y){ paste(y, "world!")})()
## [1] "hello, world!"
In either case, you must put the function definition in parentheses. This will be an error:
x |> \(y) { ... }
This is because both operators are syntactic sugar that translate x %>% f(), x %>% f, or x |> f() into f(x), and the function definition, function(y) { ... } or \(y) { ... }, is itself a function call. The left-hand side of the pipeline is put into the function definition:
x |> \(y) { ... }
is translated into
\(x, y) { ... }

for both operators. Putting the function definition in parentheses prevents this; while function(y) { ... } is a function call (to the function called function), (function(y) { ... }) isn’t. That is, when the function definition is in parentheses, it is no longer syntactically a function call (but an expression in parentheses). If it is not a call, |> will not accept it as the right-hand side, but %>% will. For %>%, it still isn’t a call, just something that evaluates to one, so the function definition isn’t modified, but the result gets the left-hand side as an argument.

Anyway, that got technical, but it is something you need to know if you use anonymous functions in pipelines.

With the %>% operator, you do not need the parentheses for the function calls, but most people prefer them to make it clear that we are dealing with functions, and the syntax then remains valid if you change the code to the |> operator. Also, if your functions take more than one argument, parentheses are needed, so if you always include them, you get a consistent notation.
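To see these equivalences with real functions, here is a small illustration of my own using base R's sqrt() and sum():
c(4, 9, 16) |> sqrt() |> sum()   # the same as sum(sqrt(c(4, 9, 16)))
## [1] 9
c(4, 9, 16) %>% sqrt %>% sum     # %>% also accepts bare function names
## [1] 9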

The nested function call pipeline from earlier
mydata <- drop_na(pivot_longer(read_csv(...)))
can thus be rewritten in pipe form
mydata <- read_csv(...) %>% pivot_longer(...) %>% drop_na()
or, with all the arguments included, as
mydata <- read_csv("data/income.csv", col_types = "cdddd") %>%
    pivot_longer(names_to = "year", values_to = "mean_income", !country) %>%
    drop_na()

The pipeline notation combines the readability of single function call steps with explicit notation for data processing pipelines.

Pipelines and Function Arguments

The pipe operator is left-associative; that is, it evaluates from left to right. The expression lhs %>% rhs expects the left-hand side (lhs) to be data and the right-hand side (rhs) to be a function or a function call (always a function call for |>). The result will be rhs(lhs). The output of this function call is the left-hand side of the next pipe operator in the pipeline or the result of the entire pipeline. If the right-hand side function takes more than one argument, you provide these in the rhs expression. The expression
lhs %>% rhs(x, y, z)
will be evaluated as the function call
rhs(lhs, x, y, z)

By default, the left-hand side is given as the first argument to the right-hand side function. The arguments that you explicitly write on the right-hand side expression are the additional function parameters.
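For example, with paste() (an illustration of my own):
# "a" is inserted as the first argument; "b" and "c" are the additional ones
"a" %>% paste("b", "c")
## [1] "a b c"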

With the magrittr operator, %>%, but not the |> operator, you can change the default input position using the special variable . (dot). If the right-hand side expression has . as a parameter, then that is where the left-hand side value goes. The following four pipelines are all equivalent:
mydata <- read_csv("data/income.csv",
                   col_types = "cdddd") |>
    pivot_longer(names_to = "year", values_to = "mean_income", !country) |>
    drop_na()
mydata <- read_csv("data/income.csv",
                   col_types = "cdddd") %>%
    pivot_longer(names_to = "year", values_to = "mean_income", !country) %>%
    drop_na()
mydata <- read_csv("data/income.csv",
                   col_types = "cdddd") %>%
    pivot_longer(data = ., names_to = "year", values_to = "mean_income", !country) %>%
    drop_na()
mydata <- read_csv("data/income.csv",
                   col_types = "cdddd") %>%
    pivot_longer(names_to = "year", values_to = "mean_income", !country, data = .) %>%
    drop_na()
You can use a dot more than once on the right-hand side:
rnorm(5) %>% tibble(x = ., y = .)
## # A tibble: 5 × 2
##         x       y
##     <dbl>   <dbl>
## 1 -0.951  -0.951
## 2  0.0206  0.0206
## 3  1.14    1.14
## 4 -0.172  -0.172
## 5  0.0795  0.0795
You can also use it in nested function calls.
rnorm(5) %>% tibble(x = ., y = abs(.))
## # A tibble: 5 × 2
##          x       y
##      <dbl>   <dbl>
## 1  1.61    1.61
## 2  1.86    1.86
## 3 -0.211   0.211
## 4 -0.00352 0.00352
## 5 -0.522   0.522
If the dot is only found in nested function calls, however, magrittr will still add it as the first argument to the right-hand side function.
rnorm(5) %>% tibble(x = sin(.), y = abs(.))
## # A tibble: 5 × 3
##        .      x     y
##    <dbl>  <dbl> <dbl>
## 1  1.10   0.893 1.10
## 2 -0.155 -0.155 0.155
## 3  0.102  0.102 0.102
## 4  0.627  0.587 0.627
## 5  0.280  0.276 0.280
To avoid this, you can put the right-hand side expression in curly brackets:
rnorm(5) %>% { tibble(x = sin(.), y = abs(.)) }
## # A tibble: 5 × 2
##         x      y
##     <dbl>  <dbl>
## 1  0.661  0.723
## 2  0.0588 0.0588
## 3  0.960  1.86
## 4  0.820  0.961
## 5 -0.366  0.375

In general, you can put expressions in curly brackets as the right-hand side of a pipe operator and have magrittr evaluate them. Think of it as a way to write one-parameter anonymous functions. The input variable is the dot, and the expression inside the curly brackets is the body of the function.
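As a small illustration of my own:
# The curly-bracket expression acts as a one-parameter anonymous
# function with the dot as its argument
3 %>% { .^2 + 1 }
## [1] 10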

Function Composition

It is trivial to write your own functions such that they work well with pipelines. All functions map input to output (ideally with no side effects), so all functions can be used in a pipeline. If the key input is the first argument of a function you write, the default placement of the left-hand side value will work, so that is preferable; otherwise, you can explicitly use the dot.
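As a sketch, here is a hypothetical function, add_decade(), written with the data frame as its first argument so that the default placement works; the name and the computation are my own, not from the chapter:
# A pipeline-friendly function: the data frame is the first argument
add_decade <- function(data) {
    mutate(data, decade = 10 * (as.numeric(year) %/% 10))
}
read_csv("data/income.csv", col_types = "cdddd") %>%
    pivot_longer(names_to = "year", values_to = "mean_income", !country) %>%
    drop_na() %>%
    add_decade()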

When writing pipelines, however, often you do not need to write a function from scratch. If a function is merely a composition of other function calls—it is a pipeline function itself—we can define it as such.

In mathematics, the function composition operator, ∘, defines the composition of functions f and g, g ∘ f, to be the function

(g ∘ f)(x) = g(f(x))

Many functional programming languages encourage you to write functions by combining other functions and have operators for that. It is not frequently done in R, and while you can implement function composition, there is no built-in operator. The magrittr package, however, gives you this syntax:
h <- . %>% f() %>% g()

This defines the function h, such that h(x) = g(f(x)).
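If you want to see what such an operator could look like under the hood, a minimal hand-rolled version might be the following; the name %.% is my own choice and is not part of magrittr or base R:
# A minimal composition operator (illustration only)
`%.%` <- function(g, f) function(...) g(f(...))
h <- sqrt %.% abs
h(-16)
## [1] 4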

If we take the tidy-then-clean pipeline we saw earlier, and imagine that we need to do the same for several input files, we can define a pipeline function for this as
pipeline <- . %>%
    pivot_longer(names_to = "year", values_to = "mean_income", !country) %>%
    drop_na()
pipeline
## Functional sequence with the following components:
##
##  1. pivot_longer(., names_to = "year", values_to = "mean_income", !country)
##  2. drop_na(.)
##
## Use 'functions' to extract the individual functions.
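The resulting function can then be applied like any other function:
read_csv("data/income.csv", col_types = "cdddd") %>%
    pipeline()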

Other Pipe Operations

There are three other pipe operators in magrittr. These are not imported when you load the tidyverse package, so to get access to them, you have to load magrittr explicitly.
library(magrittr)

The %<>% operator is used to run a pipeline starting with a variable and ending by assigning the result of the pipeline back to that variable.

The code
mydata <- read_csv("data/income.csv", col_types = "cdddd")
mydata %<>% pipeline()
is equivalent to
mydata <- read_csv("data/income.csv", col_types ="cdddd")
mydata <- mydata %>% pipeline()

This reassignment operator behaves similarly to the stepwise pipeline convention we considered at the start of the chapter, but it makes explicit that we are updating an existing variable. You cannot accidentally assign the result to a different variable.

If your right-hand side is an expression in curly brackets, you can refer to the input through the dot variable:
mydata <- tibble(x = rnorm(5), y = rnorm(5))
mydata %>% { .$x - .$y }
## [1] 0.3597903 1.8108120 2.3444173 3.1367824
## [5] 0.2507438
If, as here, the input is a data frame and you want to work with its columns, you need the .$ notation to access them. The %$% pipe operator opens the data frame for you so you can refer to the columns by name.
mydata %$% { x - y }
## [1] 0.3597903 1.8108120 2.3444173 3.1367824
## [5] 0.2507438
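The exposed columns are available to any expression, not only curly brackets. As a small illustration of my own, we can compute the correlation between the two columns:
mydata %$% cor(x, y)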

The tee pipe operator, %T>%, behaves like the regular pipe operator, %>%, except that it returns its input rather than the result of calling the function on its right-hand side. The regular pipe operation x %>% f() will return f(x), but the tee pipe operation x %T>% f() will call f(x) but return x. This is useful for calling functions with side effects as a step inside a pipeline, such as saving intermediate results to files or plotting data.
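As a minimal illustration of my own, print() below is called for its side effect, but the original vector is what flows on to sqrt():
c(1, 4, 9) %T>% print() %>% sqrt()
## [1] 1 4 9
## [1] 1 2 3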

If you call a function in a usual pipeline, you will pass the result of the function call on to the next function in the pipeline. If you, for example, want to plot intermediate data, you might not be so lucky that the plotting function will return the data. The ggplot2 functions will not (see Chapter 13).

If you want to plot intermediate values, you need to save the data in a variable, summarize or plot it, and then start a new pipeline with the data as input.
tidy_income <- read_csv("data/income.csv", col_types = "cdddd") |>
    pivot_longer(names_to = "year", values_to = "mean_income", !country)
# Summarize and visualize data before we continue
summary(tidy_income)
ggplot(tidy_income, aes(x = year, y = mean_income)) + geom_point()
# Continue processing
tidy_income |>
    drop_na() |>
    write_csv("data/tidy-income.csv")
With the tee operator , you can call the plotting function and continue the pipeline.
mydata <- read_csv("data/income.csv", col_types = "cdddd") %>%
    pivot_longer(names_to = "year", values_to = "mean_income", !country) %T>%
    # Summarize and then continue
    { print(summary(.)) } %T>%
    # Plot and then continue
    { print(ggplot(., aes(x = year, y = mean_income)) + geom_point()) } %>%
    drop_na() %>% write_csv("data/tidy-income.csv")