T. Mailund, R 4 Data Science Quick Reference, https://doi.org/10.1007/978-1-4842-8780-4_7

7. Functional Programming: purrr

Thomas Mailund¹

(1)

Aarhus, Denmark

A pipeline-based approach to data processing necessitates a functional programming approach. After all, pipelines are compositions of functions, and loops and variable assignment do not work well with pipelines. You are not precluded from imperative programming, but you need to wrap it in functions.

The package purrr makes functional programming easier. As with the other packages in this book, it will be loaded if you import tidyverse, but you can load it explicitly with

library(purrr)

I will not describe all the functions in the purrr package; there are many, and more could be added between the time I write this and the time you read it. I will, however, describe the functions you will likely use often.

Many of the examples in this chapter are so trivial that you would not use purrr for them. Many reduce to vector expressions, and vector expressions are both more straightforward to read and faster to evaluate. However, applications where you cannot use vector expressions are more involved, and I do not want the examples to overshadow the syntax.

General Features of purrr Functions

The functions in the purrr package are “higher-order” functions, which means that the functions either take other functions as input or return functions when you call them (or both). Most of the functions take data as their first argument and return modified data. Thus, they are immediately usable with pipelines. As additional arguments, they will accept one or more functions that specify how you want to translate the input into the output.

Other functions do not transform data but instead modify functions. You use these so that you do not need to explicitly write every function you pass to the data transformation functions; you can use a small set of functions and adapt them where needed.

Filtering

One of the most straightforward functional programming patterns is filtering. Here, you use a higher-order filter function. This function takes a predicate function, that is, a function that returns a logical value, and then returns all elements where the predicate evaluates to TRUE (the function keep()) or all elements where the predicate evaluates to FALSE (the function discard()).

is_even <- function(x) x %% 2 == 0

1:6 |> keep(is_even)

## [1] 2 4 6

1:6 |> discard(is_even)

## [1] 1 3 5

When you work with predicates, the negate() function can come in handy. It changes the truth value of your predicate, such that if your predicate, p, returns TRUE, then negate(p) returns FALSE and vice versa.

1:6 |> keep(negate(is_even))

## [1] 1 3 5

1:6 |> discard(negate(is_even))

## [1] 2 4 6

Since you already have complementary functions in keep() and discard(), you would not use negate() for filtering, though.

A special-case function, compact(), removes NULL elements from a list.

y <- list(NULL, 1:3, NULL)

y |> compact()

## [[1]]

## [1] 1 2 3

If you access attributes on objects, and those are not set, R will give you NULL as a result, and for such cases, compact() can be useful.

If we define the vectors

x <- y <- 1:3

names(y) <- c("one", "two", "three")

then y’s values will be named, while x’s will not. If we try to get the names from x, we will get NULL.

names(x)

## NULL

names(y)

## [1] "one" "two" "three"

Using compact() with names(), we will discard x.

z <- list(x = x, y = y)

z |> compact(names)

## $y

## one two three

## 1 2 3

Mapping

Mapping a function, f, over a sequence $x = x_1, x_2, \ldots, x_n$ returns a new sequence of the same length as the input but where each element is an application of the function: $f(x_1), f(x_2), \ldots, f(x_n)$.

The function map() does this and returns a list as output. Lists are the generic sequence data structure in R since they can hold all types of R objects.

is_even <- function(x) x %% 2 == 0

1:4 |> map(is_even)

## [[1]]

## [1] FALSE

## [[2]]

## [1] TRUE

## [[3]]

## [1] FALSE

## [[4]]

## [1] TRUE

Often, we want to work with vectors of specific types. For all the atomic types, purrr has a specific mapping function. The function for logical values is named map_lgl().

1:4 |> map_lgl(is_even)

## [1] FALSE TRUE FALSE TRUE

With something as simple as this example, you should not use purrr. Vector expressions are faster and easier to use.

1:4 %% 2 == 0

## [1] FALSE TRUE FALSE TRUE

You cannot always use vector expressions, however. Say you want to sample n elements from a normal distribution, for a sequence of different n values, and then calculate the standard error of the mean. The vector expression

sd(rnorm(n = n)) / sqrt(n) # not SEM!

does not compute this. If you give rnorm() a sequence for its n parameter, it takes the length of the input as the number of elements to sample.

n <- seq.int(100, 1000, 300) # the different n values we want

n

## [1] 100 400 700 1000

# Are we sampling for each of our n values?

rnorm(n) # no, but length(n) is used for the number of samples in rnorm()

## [1] 0.6691846 -0.4554893 -0.8175801 1.7794664

When you have a function that is not vectorized, you can use a map function to apply it on all elements in a list.

sem <- function(n) sd(rnorm(n = n)) / sqrt(n)

n |> map_dbl(sem)

## [1] 0.10197308 0.05138140 0.03673840 0.03057261

Here, I used map_dbl() to get doubles (numerics) as output. The functions map_chr() and map_int() will give you strings and integers instead.

library(magrittr) # provides the tee pipe, %T>%

1:3 %>% map_dbl(identity) %T>% print() %>% class()

## [1] 1 2 3

## [1] "numeric"

1:3 %>% map_chr(identity) %T>% print() %>% class()

## [1] "1" "2" "3"

## [1] "character"

1:3 %>% map_int(identity) %T>% print() %>% class()

## [1] 1 2 3

## [1] "integer"

The different map functions will give you an error if the function you apply does not return values of the type the function should. Sometimes, a map function will do type conversion, as before, but not always. It will usually be happy to convert in a direction that doesn’t lose information, for example, from logical to integer and integer to double, but not in the other direction.

For example, map_lgl() will not convert from integers to logical if you give it the identity() function, even though you can convert these types. Similarly, you can convert strings to numbers using as.numeric(), but map_dbl() will not do so if we give it identity(). The mapping functions are stricter than the base R conversion functions, so you should use functions that give you the right type.
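The strictness described here can be checked directly. The following is a minimal sketch, assuming a reasonably recent purrr version; fails() is a hypothetical helper introduced only for this illustration, not part of purrr, and it merely reports whether an expression raised an error.

```r
library(purrr)

# Hypothetical helper: evaluate an expression, return TRUE if it raised an error.
fails <- function(expr) {
  tryCatch({ expr; FALSE }, error = function(e) TRUE)
}

# Widening conversions are accepted: logical -> integer, integer -> double.
c(TRUE, FALSE) |> map_int(identity)
1:3 |> map_dbl(identity)

# Conversions in the other direction are rejected, even though base R
# would happily do them with as.logical() or as.numeric().
fails(1:3 |> map_lgl(identity))
fails(c("1", "2") |> map_dbl(identity))

# Convert explicitly inside the mapped function instead.
c("1", "2") |> map_dbl(as.numeric)
```

The design choice is that the suffix of the map function is a promise about the output type; if the mapped function cannot keep that promise, you hear about it immediately rather than getting a silently converted vector.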

The map_dfr() and map_dfc() functions return data frames (tibbles). The function you map should return data frames, and these will be combined, row-wise with map_dfr() and column-wise with map_dfc().

x <- tibble(a = 1:2, b = 3:4)

list(a = x, b = x) |> map_dfr(identity)

## # A tibble: 4 × 2

## a b

## <int> <int>

## 1 1 3

## 2 2 4

## 3 1 3

## 4 2 4

list(a = x, b = x) |> map_dfc(identity)

## New names:

## · `a` -> `a...1`

## · `b` -> `b...2`

## · `a` -> `a...3`

## · `b` -> `b...4`

## # A tibble: 2 × 4

## a...1 b...2 a...3 b...4

## <int> <int> <int> <int>

## 1 1 3 1 3

## 2 2 4 2 4

The map_df() function does the same as map_dfr():

list(a = x, b = x) |> map_df(identity)

## # A tibble: 4 × 2

## a b

## <int> <int>

## 1 1 3

## 2 2 4

## 3 1 3

## 4 2 4

You do not need to give the data frame functions data frames as input, as long as the function you apply to the input returns data frames. This goes for all the map functions. They will accept any sequence input; they only restrict and convert the output.
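As a small sketch of this point, the input below is a plain integer vector and only the mapped function produces tibbles. The function describe() and its column names are made up for the example. Note, as assumptions about your installation, that map_dfr() needs the dplyr package to be installed and that purrr 1.0.0 and later mark it as superseded in favor of map() followed by list_rbind().

```r
library(purrr)
library(tibble)

# Hypothetical helper: describe one integer as a one-row tibble.
describe <- function(i) {
  tibble(value = i, square = i^2, even = i %% 2 == 0)
}

# The input is not a data frame; the rows returned by describe()
# are stacked into a single tibble with one row per input element.
res <- 1:3 |> map_dfr(describe)
res
```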

If the items you map over are sequences themselves, you can extract elements by index; you do not need to provide a function to the map function.

x <- list(1:3, 4:6)

x |> map_dbl(1)

## [1] 1 4

x |> map_dbl(3)

## [1] 3 6

If the items have names, you can also extract values using these.

x <- list(

c(a = 42, b = 13),

c(a = 24, b = 31)

)

x |> map_dbl("a")

## [1] 42 24

x |> map_dbl("b")

## [1] 13 31

This is mostly used when you map over data frames.

a <- tibble(foo = 1:3, bar = 11:13)

b <- tibble(foo = 4:6, bar = 14:16)

ab <- list(a = a, b = b)

ab |> map("foo")

## $a

## [1] 1 2 3

## $b

## [1] 4 5 6

Related to extracting elements by name or index, you can apply functions to different depths of the input using map_depth(). Depth zero is the list itself, so mapping over this depth is the same as applying the function directly on the input.

ab |> map_depth(0, length)

## [1] 2

ab |> length()

## [1] 2

Depth 1 gives us each element in the sequence, so this behaves like a normal map. Depth 2 provides us with a map over the nested elements. Consider the list ab earlier. The top level, depth 0, is the list. Depth 1 is the data frames a and b. Depth 2 is the columns in these data frames. Depth 3 is the individual items in these columns.

ab |> map_depth(1, sum) |> unlist()

## a b

## 42 60

ab |> map_depth(2, sum) |> unlist()

## a.foo a.bar b.foo b.bar

## 6 36 15 45

ab |> map_depth(3, sum) |> unlist()

## a.foo1 a.foo2 a.foo3 a.bar1 a.bar2 a.bar3 b.foo1

## 1 2 3 11 12 13 4

## b.foo2 b.foo3 b.bar1 b.bar2 b.bar3

## 5 6 14 15 16

If you only want to apply a function to some of the elements, you can use map_if(). It takes a predicate and a function and applies the function to those elements where the predicate is true. It returns a list, but you can convert it if you want another type.

is_even <- function(x) x %% 2 == 0

add_one <- function(x) x + 1

1:6 |> map_if(is_even, add_one) |> as.numeric()

## [1] 1 3 3 5 5 7

Notice that this is different from combining filtering and mapping; that combination would remove the elements that do not satisfy the predicate.

1:6 |> keep(is_even) |> map_dbl(add_one)

## [1] 3 5 7

With map_if(), you keep all elements, but the function is only applied to some of them.

If you want to apply one function to the elements where the predicate is true and another to the elements where it is false, you can provide a function to the .else argument:

add_two <- function(x) x + 2

1:6 |>

map_if(is_even, add_one, .else = add_two) |>

as.numeric()

## [1] 3 3 5 5 7 7

If you know which indices you want to apply the function to, instead of a predicate they must satisfy, you can use map_at(). This function takes a sequence of indices instead of the predicate but otherwise works the same as map_if().

1:6 |> map_at(2:5, add_one) |> as.numeric()

## [1] 1 3 4 5 6 6

If you map over a list, x, then your function will be called with the elements in the list, x[[i]]. If you want to get the elements wrapped in a length-one list, that is, use indexing x[i], you can use lmap().

list(a = 1:3, b = 4:6) |> map(print) |> invisible()

## [1] 1 2 3

## [1] 4 5 6

list(a = 1:3, b = 4:6) |> lmap(print) |> invisible()

## $a

## [1] 1 2 3

## $b

## [1] 4 5 6

The function you apply must always return a list, and lmap() will concatenate them.

f <- function(x) list("foo")

1:2 |> lmap(f)

## [[1]]

## [1] "foo"

## [[2]]

## [1] "foo"

f <- function(x) list("foo", "bar")

1:2 |> lmap(f)

## [[1]]

## [1] "foo"

## [[2]]

## [1] "bar"

## [[3]]

## [1] "foo"

## [[4]]

## [1] "bar"

For example, while you can get the length of elements in a list using map() and length()

list(a = 1:3, b = 4:8) |> map(length) |> unlist()

## a b

## 3 5

you will get an error if you try the same using lmap(). This is because length() returns a numeric and not a list. You need to wrap length() with list() so the result is the length in a (length one) list.

wrapped_length <- function(x) {

x |> length() |> # get the length of x (result will be numeric)

list() |> # go back to a list

set_names(names(x)) # and give it the original name

}

list(a = 1:3, b = 4:8) |> lmap(wrapped_length) |> unlist()

## a b

## 1 1

If it surprises you that the lengths are one here, remember that the function is called with the length-one lists at each index. If you want the length of what they contain, you need to extract that.

wrapped_length <- function(x) {

x |> pluck(1) |> # pluck(x, 1) from purrr does the same as x[[1]]

length() |> # now we have the underlying vector and get the length

list() |> # go back to a list

set_names(names(x)) # and give it the original name

}

list(a = 1:3, b = 4:8) %>% lmap(wrapped_length) %>% unlist()

## a b

## 3 5

If you want to extract the nested data, though, you probably want map() and not lmap().

The functions lmap_if() and lmap_at() work as map_if() and map_at() except for how they index the input and handle the output as lists to be concatenated.

Sometimes, we only want to call a function for its side effect. In that case, you can pipe the result of a map into invisible(). The function walk() does that for you, and using it makes it explicit that this is what you want. It is close to syntactic sugar for map() + invisible(), except that walk() returns its input rather than the mapped results.
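One practical difference between the two is what comes back from the call. A minimal sketch, with variable names chosen just for this example:

```r
library(purrr)

# map() |> invisible() hands back the results of the mapped function,
# here whatever print() returned for each element, as a list.
mapped <- 1:3 |> map(print) |> invisible()

# walk() hands back the original input, so a pipeline can carry on
# working with the data after the side effect has happened.
walked <- 1:3 |> walk(print)

is.list(mapped)
identical(walked, 1:3)
```

This is why walk() fits in the middle of a pipeline, for example to write or plot something, while the map() version is only useful at the end.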

1:3 |> map(print) |> invisible()

## [1] 1

## [1] 2

## [1] 3

1:3 |> walk(print)

## [1] 1

## [1] 2

## [1] 3

If you need to map over multiple sequences, you have two kinds of map functions to choose from. Some functions map over exactly two sequences. For each of the map() functions, there is a similar map2() function. These take two sequences as the first two arguments.

x <- 1:3

y <- 3:1

map2_dbl(x, y, `+`)

## [1] 4 4 4

You can also create lists of sequences and use the pmap() functions.

list(x, y) |> pmap_dbl(`+`)

## [1] 4 4 4

There are the same type-specific versions as there are for map() and map2(), but with the pmap() functions, you can map over more than one or two input sequences.

z <- 4:6

f <- function(x, y, z) x + y - z

list(x, y, z) |> pmap_dbl(f)

## [1] 0 -1 -2
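The pmap() functions match an unnamed list to the function's parameters by position, but if the list has names, they match by name instead. The sketch below reuses the three-argument function from the text; the variable by_name is introduced only for the illustration.

```r
library(purrr)

f <- function(x, y, z) x + y - z

# Unnamed list: elements are matched to f's parameters by position.
list(1:3, 3:1, 4:6) |> pmap_dbl(f)

# Named list: elements are matched by name, so their order in the
# list no longer matters.
by_name <- list(z = 4:6, x = 1:3, y = 3:1) |> pmap_dbl(f)
by_name
```

Since a data frame is a named list of columns, the same mechanism lets you apply a function row by row, provided the column names match the parameter names.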

If you need to know the indices for each value you map over, you can use the imap() variations. When you use these to map over a sequence, your function needs to take two arguments where the first argument is the sequence value and the second the value’s index in the input.

x <- c("foo", "bar", "baz")

f <- function(x, i) paste0(i, ": ", x)

x |> imap_chr(f)

## [1] "1: foo" "2: bar" "3: baz"
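If the input has names, the imap() functions pass the name rather than the position as the second argument. A short sketch using the same function as in the text; the variable named is introduced only for this example:

```r
library(purrr)

f <- function(x, i) paste0(i, ": ", x)

# Unnamed input: the second argument is the element's position.
c("foo", "bar") |> imap_chr(f)

# Named input: the second argument is the element's name instead.
named <- c(a = "foo", b = "bar") |> imap_chr(f)
named
```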

There is yet another variant of the mapping functions, the modify() functions. These do not have the type variants (although they do have the _at, _if, _depth, and similar variants); instead, they will always give you an output of the same type as the input:

modify2(1:3, 3:1, `+`)

## [1] 4 4 4

x <- c("foo", "bar", "baz")

f <- function(x, i) paste0(i, ": ", x)

x |> imodify(f)

## [1] "1: foo" "2: bar" "3: baz"

Reduce and Accumulate

If you want to summarize all your input into a single value, you probably want to reduce() it. Reduce repeatedly applies a function over your input sequence. If you have a function of two arguments, f(a, x), and a sequence $x_1, x_2, \ldots, x_n$, then reduce(f) will compute $f(\ldots f(f(x_1, x_2), x_3), \ldots, x_n)$, that is, it will be called on the first two elements of the sequence, the result will be paired with the next element, and so forth. Think of the argument a as an accumulator that keeps the result of the calculation so far.

To make the order of function application clear, I define a “pair” type:

pair <- function(first, second) {

structure(list(first = first, second = second),

class = "pair")

}

toString.pair <- function(x, ...) {

first <- toString(x$first, ...)

rest <- toString(x$second, ...)

paste('[', first, ', ', rest, ']', sep = '')

}

print.pair <- function(x, ...) {

x |> toString() |> cat() |> invisible()

}

If we reduce using pair(), we see how the values are paired when the function is called:

1:4 |> reduce(pair)

## [[[1, 2], 3], 4]

If you reverse the input, you can reduce in the opposite order, combining the last pair first and propagating the accumulator in that order.

1:4 |> rev() |> reduce(pair)

## [[[4, 3], 2], 1]

If, for some reason, you want to apply the function and have the accumulator as the last argument, you can use the .dir = "backward" argument.

1:4 |> reduce(pair, .dir = "backward")

## [1, [2, [3, 4]]]

The first (or last) element in the input does not have to be the value for the initial accumulator. If you want a specific starting value, you can pass that to reduce() using the .init argument.

1:3 |> reduce(pair, .init = 0)

## [[[0, 1], 2], 3]

1:3 |> rev() |> reduce(pair, .init = 4)

## [[[4, 3], 2], 1]

1:3 |> reduce(pair, .init = 4, .dir = "backward")

## [1, [2, [3, 4]]]
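The pair() examples show the order of application; with an ordinary binary function, the same mechanics give you a summary value. The following is a small sketch with made-up data, including the case where .init is what makes an empty input work:

```r
library(purrr)

# Summing with reduce(): ((1 + 2) + 3) + 4.
total <- 1:4 |> reduce(`+`)

# With .init, an empty input simply returns the initial value.
empty_total <- integer(0) |> reduce(`+`, .init = 0)

# Any two-argument function works, for example base R's union():
# union(union(c(1, 2), c(2, 3)), c(3, 4)).
sets <- list(c(1, 2), c(2, 3), c(3, 4))
all_values <- sets |> reduce(union)

total
empty_total
all_values
```

Supplying .init is a sensible habit whenever the input might be empty, since there is otherwise no first element to start the accumulator from.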

If your function takes more than two arguments, you can provide the additional arguments to reduce() after the input sequence and function. Consider, for example, a three-argument function like this:

# additional arguments

loud_pair <- function(acc, next_val, volume) {

# Build a pair

ret <- pair(acc, next_val)

# Announce that pair to the world

ret |> toString() |>

paste(volume, '\n', sep = '') |>

cat()

# Then return the new pair

ret

}

It builds a pair object but, as a side effect, prints the pair followed by a string that indicates how “loud” the printed value is. We can provide the volume as an extra argument to reduce():

1:3 |>

reduce(loud_pair, volume = '!') |>

invisible()

## [1, 2]!

## [[1, 2], 3]!

1:3 |>

reduce(loud_pair, volume = '!!') |>

invisible()

## [1, 2]!!

## [[1, 2], 3]!!

If you want to reduce two sequences instead of one—similar to a second argument to reduce() but a sequence instead of a single value—you can use reduce2():

volumes <- c('!', '!!')

1:3 |> reduce2(volumes, loud_pair) |> invisible()

## [1, 2]!

## [[1, 2], 3]!!

1:3 |>

reduce2(c('!', '!!', '!!!'), .init = 0, loud_pair) |>

invisible()

## [0, 1]!

## [[0, 1], 2]!!

## [[[0, 1], 2], 3]!!!

If you want all the intermediate values of the reductions, you can use the accumulate() function. It returns a sequence of the results of each function application.

res <- 1:3 |> accumulate(pair)

print(res[[1]])

## [1] 1

print(res[[2]])

## [1, 2]

print(res[[3]])

## [[1, 2], 3]

res <- 1:3 |> accumulate(pair, .init = 0)

print(res[[1]])

## [1] 0

print(res[[4]])

## [[[0, 1], 2], 3]

res <- 1:3 |> accumulate(

pair, .init = 0,

.dir = "backward"

)

print(res[[1]])

## [1, [2, [3, 0]]]

print(res[[4]])

## [1] 0

The accumulate2() function works like reduce2(), except that it keeps the intermediate values like accumulate() does.
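As a minimal sketch of accumulate2() next to reduce2(), here is a numeric example. The function weighted_add() and the data are invented for the illustration. I pipe the result through unlist() so the example does not depend on whether accumulate2() hands back a list or an already simplified vector.

```r
library(purrr)

# Hypothetical three-argument step function:
# accumulator, next value, and a weight from the second sequence.
weighted_add <- function(acc, x, w) acc + x * w

# Without .init, the second sequence must be one element shorter than
# the first, because the first element only seeds the accumulator.
steps <- accumulate2(1:3, c(10, 100), weighted_add) |> unlist()
steps

# reduce2() gives only the final value of the same computation.
final <- reduce2(1:3, c(10, 100), weighted_add)
final
```

The intermediate values are 1, then 1 + 2 * 10 = 21, then 21 + 3 * 100 = 321, and the last of these is what reduce2() returns.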

Partial Evaluation and Function Composition

When you filter, map, or reduce over sequences, you sometimes want to modify a function to match the interface of purrr’s functions. If you have a function that takes too many arguments for the interface, but where you can fix some of the parameters to get the application you want, you can do what is called a partial evaluation. This just means that you create a new function that calls the original function with some of the parameters fixed.

For example, if you filter, you want a function that takes one input value and returns one (Boolean) output value. If you want to filter the values that are less than or greater than, say, three, you can create functions for this.

greater_than_three <- function(x) 3 < x

less_than_three <- function(x) x < 3

1:6 |> keep(greater_than_three)

## [1] 4 5 6

1:6 |> keep(less_than_three)

## [1] 1 2

The drawback of doing this is that you might need to define many such functions, even if you only use each once in your pipeline.

Using the partial() function, you can bind parameters without explicitly defining new functions. For example, to bind the first parameter of <, as in the greater_than_three() function, you can use partial():

1:6 |> keep(partial(`<`, 3))

## [1] 4 5 6

By default, you always bind the first parameter(s). To bind others, you need to name which parameters to bind. The less than operator has these parameter names:

`<`

## function (e1, e2) .Primitive("<")

so you can use this partial evaluation for less_than_three():

1:6 |> keep(partial(`<`, e2 = 3))

## [1] 1 2

Similarly, you can use partial evaluation for mapping:

1:6 |> map_dbl(partial(`+`, 2))

## [1] 3 4 5 6 7 8

1:6 |> map_dbl(partial(`-`, 1))

## [1] 0 -1 -2 -3 -4 -5

1:3 |> map_dbl(partial(`-`, e1 = 4))

## [1] 3 2 1

1:3 |> map_dbl(partial(`-`, e2 = 4))

## [1] -3 -2 -1

If you need to apply more than one function, for example

1:3 |>

map_dbl(partial(`+`, 2)) |>

map_dbl(partial(`*`, 3))

## [1] 9 12 15

you can also simply combine the functions. Function composition, ∘, works like this: (g ∘ f)(x) = g(f(x)).

So the pipeline earlier can also be written:

1:3 |> map_dbl(

compose(partial(`*`, 3), partial(`+`, 2))

)

## [1] 9 12 15
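The order of the arguments to compose() is easy to get wrong, since the function that is applied first is written last. A small sketch with two named helper functions (add_two() and times_three() are made up for the example) shows the default order and the .dir = "forward" alternative, which lists the functions in the order they are applied:

```r
library(purrr)

add_two <- function(x) x + 2
times_three <- function(x) 3 * x

# Default: right to left, like the mathematical notation,
# so this computes times_three(add_two(x)).
backward <- compose(times_three, add_two)

# .dir = "forward": left to right, the order you would write the
# same steps in a pipeline. This is the same function as above.
forward <- compose(add_two, times_three, .dir = "forward")

1:3 |> map_dbl(backward)
1:3 |> map_dbl(forward)
```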

With partial() and compose(), you can modify functions, but using them does not exactly give you code that is easy to read. A more readable alternative is using lambda expressions.

Lambda Expressions

Lambda expressions are a concise syntax for defining anonymous functions, that is, functions that we do not name. The name “lambda expressions” comes from “lambda calculus,”¹ a discipline in formal logic, but in computer science, it is mostly used as a synonym for anonymous functions. In some programming languages, you cannot create anonymous functions; you need to name a function to define it. In other languages, you have special syntax for lambda expressions. In R, you define anonymous functions the same way that you define named functions. You always define functions the same way; you only give them a name when you assign a function definition to a variable.

If this sounds too theoretical, consider this example. When we filtered values that are even, we defined a function, is_even(), to use as a predicate.

is_even <- function(x) x %% 2 == 0

1:6 |> keep(is_even)

## [1] 2 4 6

We defined the function using the expression function(x) x %% 2 == 0, and then we assigned the function to the name is_even. Instead of assigning the function to a name, we could use the function definition directly in the call to keep():

1:6 |> keep(function(x) x %% 2 == 0)

## [1] 2 4 6

R 4.1 introduced new syntax for defining functions, giving us a slightly shorter form:

1:6 |> keep(\(x) x %% 2 == 0)

## [1] 2 4 6

that I tend to use in situations such as this, but this is entirely a question of taste. There is nothing wrong with the function(…) … syntax.

So, R already has lambda expressions, but the syntax can be verbose, especially before R 4.1 and the \(...) ... syntax. In purrr, you can use a formula as a lambda expression instead of a function. You can define an is_even() function using a formula like this:

is_even_lambda <- ~ .x %% 2 == 0

1:6 |> keep(is_even_lambda)

## [1] 2 4 6

or use the formula directly in your pipeline.

1:6 |> keep(~ .x %% 2 == 0)

## [1] 2 4 6

The variable .x in the expression will be interpreted as the first argument to the lambda expression.

Lambda expressions are not an approach you can use to define functions in general. They only work because purrr functions understand formulae as function specifications. You cannot write

is_even_lambda <- ~ .x %% 2 == 0

is_even_lambda(3)

This will give you an error. Unfortunately, the error message does not tell you that you tried to use a formula as a function, just that it cannot find the function is_even_lambda. This is because R will look for a function when you call a variable as a function and ignore other variables with the same name. If you reassign to a variable

f <- function(x) 2 * x

f <- 5

f(2)

## Error in f(2): could not find function "f"

you are told that it cannot find a function with the name you call, not that the variable refers to a nonfunction. That is the error you get if you attempt to call a lambda expression outside a purrr function (or certain other Tidyverse functions).
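If you do want to call a formula lambda yourself, purrr's as_mapper() turns it into an ordinary function; this is the kind of conversion the purrr functions apply to their function argument. A minimal sketch:

```r
library(purrr)

is_even_lambda <- ~ .x %% 2 == 0

# A formula is not a function, which is why calling it directly fails.
is.function(is_even_lambda)

# as_mapper() builds a real function from the formula.
is_even <- as_mapper(is_even_lambda)
is.function(is_even)

is_even(3)
is_even(4)
```

This can be handy when you write your own functions that should accept either a function or a formula, in the same way keep() and map() do.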

R’s rule for looking for functions can be even more confusing if there is a variable in an inner scope and a function in an outer scope:

f <- function(x) 2 * x

g <- function() {

f <- 5 # not a function

f(2) # will look for a function

}

g()

## [1] 4

Here, the f() function in the outer scope is called because it is a function; the variable in the inner scope is ignored.

Getting back to lambda expressions in purrr functions, you can use them as more readable versions of partial evaluation.

1:4 |> map_dbl(~ .x / 2)

## [1] 0.5 1.0 1.5 2.0

1:3 |> map_dbl(~ 2 + .x)

## [1] 3 4 5

1:3 |> map_dbl(~ 4 - .x)

## [1] 3 2 1

1:3 |> map_dbl(~ .x - 4)

## [1] -3 -2 -1

Or you can use them for more readable versions of function composition.

1:3 |> map_dbl(~ 3 * (.x + 2))

## [1] 9 12 15

If you need a lambda expression with two arguments, you can use .x and .y as the first and second arguments, respectively.

map2_dbl(1:3, 1:3, ~ .x + .y)

## [1] 2 4 6

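If you need more than two arguments, you can use ..n for the nth argument, that is, ..1, ..2, ..3, and so on. Note the two leading dots: .1 with a single dot is just the number 0.1.

list(1:3, 1:3, 1:3) %>% pmap_dbl(~ ..1 + ..2 + ..3)

## [1] 3 6 9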
