A pipeline-based approach to data processing necessitates a functional programming style. After all, pipelines are compositions of functions, and loops and variable assignment do not work well with pipelines. You are not precluded from imperative programming, but you need to wrap it in functions.
Mapping
Mapping a function, f, over a sequence x = x1, x2, …, xn returns a new sequence of the same length as the input, where each element is an application of the function: f(x1), f(x2), …, f(xn).
The function map() does this and returns a list as output. Lists are the generic sequence data structure in R since they can hold all types of R objects.
is_even <- function(x) x %% 2 == 0
1:4 |> map(is_even)
## [[1]]
## [1] FALSE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] FALSE
##
## [[4]]
## [1] TRUE
Often, we want to work with vectors of specific types. For each of the atomic types, purrr has a specific mapping function. The function for logical values is named map_lgl().
1:4 |> map_lgl(is_even)
## [1] FALSE TRUE FALSE TRUE
With something as simple as this example, you should not use purrr. Vector expressions are faster and easier to use.
is_even(1:4)
## [1] FALSE TRUE FALSE TRUE
You cannot always use vector expressions, however. Say you want to sample n elements from a normal distribution, for a sequence of different n values, and then calculate the standard error of the mean. The vector expression
sd(rnorm(n = n)) / sqrt(n) # not SEM!
does not compute this. If you give rnorm() a sequence for its n parameter, it takes the length of the input as the number of elements to sample.
n <- seq.int(100, 1000, 300) # the different n values we want
n
# Are we sampling for each of our n values?
rnorm(n) # no, but length(n) is used for the number of samples in rnorm()
## [1] 0.6691846 -0.4554893 -0.8175801 1.7794664
When you have a function that is not vectorized, you can use a map function to apply it on all elements in a list.
sem <- function(n) sd(rnorm(n = n)) / sqrt(n)
n |> map_dbl(sem)
## [1] 0.10197308 0.05138140 0.03673840 0.03057261
Here, I used map_dbl() to get doubles (numerics) as output. The functions map_chr() and map_int() will give you strings and integers instead.
1:3 %>% map_dbl(identity) %T>% print() %>% class()
1:3 %>% map_chr(identity) %T>% print() %>% class()
1:3 %>% map_int(identity) %T>% print() %>% class()
The different map functions will give you an error if the function you apply does not return values of the type the map function expects. Sometimes, a map function will do type conversion, as before, but not always. It will usually be happy to convert in a direction that doesn’t lose information, for example, from logical to integer and integer to double, but not in the other direction.
For example, map_lgl() will not convert from integers to logical if you give it the identity() function, even though you can convert these types. Similarly, you can convert strings to numbers using as.numeric(), but map_dbl() will not do so if we give it identity(). The mapping functions are stricter than the base R conversion functions, so you should use functions that give you the right type.
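To make this strictness concrete, here is a small sketch. The exact error messages differ between purrr versions, so the example only records whether an error occurred; the helper fails() is a name made up for this illustration and is not part of purrr.

```r
library(purrr)

# Hypothetical helper: TRUE if evaluating expr raises an error.
fails <- function(expr) {
  tryCatch({ expr; FALSE }, error = function(e) TRUE)
}

# identity() hands integers to map_lgl(), which refuses to coerce them.
fails(1:3 |> map_lgl(identity))

# Strings are not silently parsed as numbers either.
fails(c("1", "2") |> map_dbl(identity))

# Converting explicitly inside the mapped function works.
c("1", "2") |> map_dbl(as.numeric)
0:2 |> map_lgl(function(x) x > 0)
```

The first two calls should report an error, while the last two return a double vector and a logical vector, respectively.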
The map_dfr() and map_dfc() functions return data frames (tibbles). The function you map should return data frames, and these will be combined row-wise with map_dfr() and column-wise with map_dfc().
x <- tibble(a = 1:2, b = 3:4)
list(a = x, b = x) |> map_dfr(identity)
## # A tibble: 4 × 2
## a b
## <int> <int>
## 1 1 3
## 2 2 4
## 3 1 3
## 4 2 4
list(a = x, b = x) |> map_dfc(identity)
## New names:
## · `a` -> `a...1`
## · `b` -> `b...2`
## · `a` -> `a...3`
## · `b` -> `b...4`
## # A tibble: 2 × 4
## a...1 b...2 a...3 b...4
## <int> <int> <int> <int>
## 1 1 3 1 3
## 2 2 4 2 4
The map_df() function does the same as map_dfr():
list(a = x, b = x) |> map_df(identity)
## # A tibble: 4 × 2
## a b
## <int> <int>
## 1 1 3
## 2 2 4
## 3 1 3
## 4 2 4
You do not need to give the data frame functions data frames as input, as long as the function you apply to the input returns data frames. This goes for all the map functions. They will accept any sequence input; they only restrict and convert the output.
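As a sketch of this point, the input below is a plain integer vector, and only the mapped function produces data frames. The column names i and square are chosen for this illustration, and the example assumes the tibble and dplyr packages are installed, since map_dfr() relies on dplyr to bind rows.

```r
library(purrr)
library(tibble)

# The input is an integer vector; each call returns a one-row tibble,
# and map_dfr() stacks those rows into a single tibble.
squares <- 1:3 |> map_dfr(function(i) tibble(i = i, square = i^2))
squares
```

The result should have one row per input element.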
If the items you map over are sequences themselves, you can extract elements by index; you do not need to provide a function to the map function.
x <- list(1:3, 4:6)
x |> map_dbl(1)
If the items have names, you can also extract values using these.
x <- list(
  c(a = 42, b = 13),
  c(a = 24, b = 31)
)
x |> map_dbl("a")
This is mostly used when you map over data frames.
a <- tibble(foo = 1:3, bar = 11:13)
b <- tibble(foo = 4:6, bar = 14:16)
ab <- list(a = a, b = b)
ab |> map("foo")
## $a
## [1] 1 2 3
##
## $b
## [1] 4 5 6
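The index and name shorthands can be read as abbreviations for an explicit extraction function. The following sketch, with variable names chosen for illustration, compares the three forms on a small list of named vectors.

```r
library(purrr)

x <- list(
  c(a = 42, b = 13),
  c(a = 24, b = 31)
)

# Extracting by position or by name ...
by_index <- x |> map_dbl(2)
by_name  <- x |> map_dbl("b")

# ... gives the same values as an explicit extraction function.
explicit <- x |> map_dbl(function(v) v[["b"]])

by_index  # 13 31
```

All three vectors hold the second element, b, of each item.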
Related to extracting elements by name or index, you can apply functions to different depths of the input using map_depth(). Depth zero is the list itself, so mapping over this depth is the same as applying the function directly on the input.
ab |> map_depth(0, length)
Depth 1 gives us each element in the sequence, so this behaves like a normal map. Depth 2 provides us with a map over the nested elements. Consider the list ab from earlier. The top level, depth 0, is the list. Depth 1 is the data frames a and b. Depth 2 is the columns in these data frames. Depth 3 is the individual items in these columns.
ab |> map_depth(1, sum) |> unlist()
ab |> map_depth(2, sum) |> unlist()
## a.foo a.bar b.foo b.bar
## 6 36 15 45
ab |> map_depth(3, sum) |> unlist()
## a.foo1 a.foo2 a.foo3 a.bar1 a.bar2 a.bar3 b.foo1
## 1 2 3 11 12 13 4
## b.foo2 b.foo3 b.bar1 b.bar2 b.bar3
## 5 6 14 15 16
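One way to think about depth 2 is as a map nested inside a map. The sketch below uses plain nested lists rather than data frames (an assumption made to keep the example independent of how a given purrr version treats data frames) and checks that the two formulations agree.

```r
library(purrr)

nested <- list(
  a = list(foo = 1:3, bar = 11:13),
  b = list(foo = 4:6, bar = 14:16)
)

# Apply sum() two levels down ...
at_depth_two <- nested |> map_depth(2, sum)

# ... which matches writing the two levels of mapping by hand.
by_hand <- nested |> map(function(inner) map(inner, sum))

identical(at_depth_two, by_hand)
at_depth_two |> unlist()
```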
If you only want to apply a function to some of the elements, you can use map_if(). It takes a predicate and a function and applies the function to those elements where the predicate is true. It returns a list, but you can convert it if you want another type.
is_even <- function(x) x %% 2 == 0
add_one <- function(x) x + 1
1:6 |> map_if(is_even, add_one) |> as.numeric()
Notice that this is different from combining filtering and mapping; that combination would remove the elements that do not satisfy the predicate.
1:6 |> keep(is_even) |> map_dbl(add_one)
With map_if(), you keep all elements, but the function is only applied to some of them.
If you want to apply one function to the elements where the predicate is true and another to the elements where it is false, you can provide a second function through the .else argument:
add_two <- function(x) x + 2
1:6 |>
  map_if(is_even, add_one, .else = add_two) |>
  as.numeric()
If you know which indices you want to apply the function to, instead of a predicate they must satisfy, you can use map_at(). This function takes a sequence of indices instead of the predicate but otherwise works the same as map_if().
1:6 |> map_at(2:5, add_one) |> as.numeric()
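Besides positions, map_at() also accepts a character vector of names. The sketch below uses a small named list, scores, invented for this illustration.

```r
library(purrr)

add_one <- function(x) x + 1
scores <- list(a = 1, b = 2, c = 3)

# Select elements by name; unselected elements pass through unchanged.
updated <- scores |> map_at(c("a", "c"), add_one) |> unlist()
updated
```

Here a and c are incremented while b keeps its original value.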
If you map over a list, x, then your function will be called with the elements in the list, x[[i]]. If you want to get the elements wrapped in a length-one list, that is, use indexing x[i], you can use lmap().
list(a = 1:3, b = 4:6) |> map(print) |> invisible()
list(a = 1:3, b = 4:6) |> lmap(print) |> invisible()
## $a
## [1] 1 2 3
##
## $b
## [1] 4 5 6
The function you apply must always return a list, and lmap() will concatenate them.
f <- function(x) list("foo")
1:2 |> lmap(f)
## [[1]]
## [1] "foo"
##
## [[2]]
## [1] "foo"
f <- function(x) list("foo", "bar")
1:2 |> lmap(f)
## [[1]]
## [1] "foo"
##
## [[2]]
## [1] "bar"
##
## [[3]]
## [1] "foo"
##
## [[4]]
## [1] "bar"
For example, while you can get the length of elements in a list using map() and length()
list(a = 1:3, b = 4:8) |> map(length) |> unlist()
you will get an error if you try the same using lmap(). This is because length() returns a number and not a list. You need to wrap length() with list() so the result is the length in a (length-one) list.
wrapped_length <- function(x) {
  x |> length() |>        # get the length of x (result will be numeric)
    list() |>             # go back to a list
    set_names(names(x))   # and give it the original name
}
list(a = 1:3, b = 4:8) |> lmap(wrapped_length) |> unlist()
If it surprises you that the lengths are one here, remember that the function is called with the length-one lists at each index. If you want the length of what they contain, you need to extract that.
wrapped_length <- function(x) {
  x |> pluck(1) |>        # pluck(x, 1) from purrr does the same as x[[1]]
    length() |>           # now we have the underlying vector and get the length
    list() |>             # go back to a list
    set_names(names(x))   # and give it the original name
}
list(a = 1:3, b = 4:8) %>% lmap(wrapped_length) %>% unlist()
If you want to extract the nested data, though, you probably want map() and not lmap().
The functions lmap_if() and lmap_at() work as map_if() and map_at() except for how they index the input and handle the output as lists to be concatenated.
Sometimes, we only want to call a function for its side effect. In that case, you can pipe the result of a map into invisible(). The function walk() does that for you, and using it makes it explicit that this is what you want, but it is essentially syntactic sugar for map() + invisible().
1:3 |> map(print) |> invisible()
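For comparison, here is a sketch of the same side effect written with walk(). One difference worth knowing is the return value: walk() hands back its input invisibly, whereas map() followed by invisible() hands back the list of results from the mapped function.

```r
library(purrr)

# walk() calls print() for its side effect ...
returned <- 1:3 |> walk(print)

# ... and returns its input, so a pipeline can continue with the data.
identical(returned, 1:3)

# map() + invisible() instead yields the list of values print() returned.
mapped <- 1:3 |> map(print) |> invisible()
is.list(mapped)
```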
If you need to map over multiple sequences, you have two families of map functions to choose from. Some functions map over exactly two sequences. For each of the map() functions, there is a similar map2() function. These take two sequences as the first two arguments.
x <- 1:3
y <- 3:1
map2_dbl(x, y, `+`)
You can also create lists of sequences and use the pmap() functions.
list(x, y) |> pmap_dbl(`+`)
There are the same type-specific versions as there are for map() and map2(), but with the pmap() functions, you can map over any number of input sequences, not just one or two.
z <- 4:6
f <- function(x, y, z) x + y - z
list(x, y, z) |> pmap_dbl(f)
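If the list you pass to a pmap() function has names, they are matched against the parameter names of the function rather than by position. A minimal sketch, reusing the same three-argument function with the list elements deliberately shuffled:

```r
library(purrr)

f <- function(x, y, z) x + y - z

# Names in the list are matched to the parameters of f,
# so the order of the list elements does not matter.
res <- list(z = 4:6, x = 1:3, y = 3:1) |> pmap_dbl(f)
res  # 0 -1 -2
```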
If you need to know the indices for each value you map over, you can use the imap() variations. When you use these to map over a sequence, your function needs to take two arguments, where the first argument is the sequence value and the second is the value’s index in the input.
x <- c("foo", "bar", "baz")
f <- function(x, i) paste0(i, ": ", x)
x |> imap_chr(f)
## [1] "1: foo" "2: bar" "3: baz"
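When the input has names, the imap() functions pass the name instead of the position as the second argument. The sketch below uses a named character vector made up for this illustration.

```r
library(purrr)

x <- c(first = "foo", second = "bar")

# The second argument is the element's name because x is named.
labelled <- x |> imap_chr(function(value, name) paste0(name, ": ", value))
unname(labelled)  # "first: foo" "second: bar"
```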
There is yet another variant of the mapping functions, the modify() functions. These do not have the type variants (but they do have the _at, _if, _depth variants, and so on); instead, they will always give you an output of the same type as the input:
x <- c("foo", "bar", "baz")
f <- function(x, i) paste0(i, ": ", x)
x |> imodify(f)
## [1] "1: foo" "2: bar" "3: baz"
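The difference from map() is easiest to see by comparing the class of the results. In this sketch, double_it is a helper named for the illustration; it returns integers so that the output can keep the integer type of the input.

```r
library(purrr)

double_it <- function(x) x * 2L

1:3 |> map(double_it) |> class()     # "list"
1:3 |> modify(double_it) |> class()  # "integer"
```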
Reduce and Accumulate
If you want to summarize all your input into a single value, you probably want to reduce() them. Reduce repeatedly applies a function over your input sequence. If you have a function of two arguments, f(a, x), and a sequence x1, x2, …, xn, then reduce(f) will compute f(… f(f(x1, x2), x3), …, xn); that is, it will be called on the first two elements of the sequence, the result will be paired with the next element, and so forth. Think of the argument a as an accumulator that keeps the result of the calculation so far.
To make the order of function application clear, I define a “pair” type:
pair <- function(first, second) {
  structure(list(first = first, second = second),
            class = "pair")
}
toString.pair <- function(x, ...) {
  first <- toString(x$first, ...)
  rest <- toString(x$second, ...)
  paste('[', first, ', ', rest, ']', sep = '')
}
print.pair <- function(x, ...) {
  x |> toString() |> cat() |> invisible()
}
If we reduce using pair(), we see how the values are paired when the function is called:
1:4 |> reduce(pair)
## [[[1, 2], 3], 4]
If you reverse the input, you can reduce in the opposite order, combining the last pair first and propagating the accumulator in that order.
1:4 |> rev() |> reduce(pair)
If, for some reason, you want to apply the function and have the accumulator as the last argument, you can use the .dir = "backward" argument.
1:4 |> reduce(pair, .dir = "backward")
The first (or last) element in the input does not have to be the value for the initial accumulator. If you want a specific starting value, you can pass that to reduce() using the .init argument.
1:3 |> reduce(pair, .init = 0)
1:3 |> rev() |> reduce(pair, .init = 4)
1:3 |> reduce(pair, .init = 4, .dir = "backward")
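The same orderings can be seen without defining a class by building a string that records each call. This is only a sketch; wrap() is a helper invented for the illustration.

```r
library(purrr)

# Hypothetical helper that records the order of calls in a string.
wrap <- function(a, b) paste0("(", a, b, ")")
letters4 <- c("a", "b", "c", "d")

forward   <- letters4 |> reduce(wrap)                     # "(((ab)c)d)"
backward  <- letters4 |> reduce(wrap, .dir = "backward")  # "(a(b(cd)))"
with_init <- letters4 |> reduce(wrap, .init = "0")        # "((((0a)b)c)d)"
```

The nesting of the parentheses shows which pair was combined first and where the accumulator sits in each call.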
If your function takes more than two arguments, you can provide the additional arguments to reduce() after the input sequence and function. Consider, for example, a three-argument function like this:
# additional arguments
loud_pair <- function(acc, next_val, volume) {
  # Build a pair
  ret <- pair(acc, next_val)
  # Announce that pair to the world
  ret |> toString() |>
    paste(volume, '\n', sep = '') |>
    cat()
  # Then return the new pair
  ret
}
It builds a pair object but, as a side effect, prints the pair followed by a string that indicates how “loud” the printed value is. We can provide the volume as an extra argument to reduce():
1:3 |>
  reduce(loud_pair, volume = '!') |>
  invisible()
## [1, 2]!
## [[1, 2], 3]!
1:3 |>
  reduce(loud_pair, volume = '!!') |>
  invisible()
## [1, 2]!!
## [[1, 2], 3]!!
If you want to reduce two sequences instead of one (similar to passing an extra argument to reduce(), but with a sequence of values instead of a single value), you can use reduce2():
volumes <- c('!', '!!')
1:3 |> reduce2(volumes, loud_pair) |> invisible()
## [1, 2]!
## [[1, 2], 3]!!
1:3 |>
  reduce2(c('!', '!!', '!!!'), .init = 0, loud_pair) |>
  invisible()
## [0, 1]!
## [[0, 1], 2]!!
## [[[0, 1], 2], 3]!!!
If you want all the intermediate values of the reductions, you can use the accumulate() function. It returns a sequence of the results of each function application.
res <- 1:3 |> accumulate(pair)
print(res[[1]])
res <- 1:3 |> accumulate(pair, .init = 0)
print(res[[1]])
res <- 1:3 |> accumulate(
  pair, .init = 0,
  .dir = "backward"
)
print(res[[1]])
The accumulate2() function works like reduce2(), except that it keeps the intermediate values like accumulate() does.
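With a plain arithmetic function, the relationship between reduce() and accumulate() is perhaps easier to see than with pairs. This short sketch uses addition, so accumulate() yields running sums and the last of them equals the reduce() result.

```r
library(purrr)

total   <- 1:4 |> reduce(`+`)                 # 10
running <- 1:4 |> accumulate(`+`)             # 1 3 6 10
seeded  <- 1:4 |> accumulate(`+`, .init = 0)  # 0 1 3 6 10
```

With .init, the starting value is included as the first element, so the output is one element longer than the input.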
Partial Evaluation and Function Composition
When you filter, map, or reduce over sequences, you sometimes want to modify a function to match the interface of purrr’s functions. If you have a function that takes too many arguments for the interface, but where you can fix some of the parameters to get the application you want, you can do what is called a partial evaluation. This just means that you create a new function that calls the original function with some of the parameters fixed.
For example, if you filter, you want a function that takes one input value and returns one (Boolean) output value. If you want to filter the values that are less than or greater than, say, three, you can create functions for this.
greater_than_three <- function(x) 3 < x
less_than_three <- function(x) x < 3
1:6 |> keep(greater_than_three)
1:6 |> keep(less_than_three)
The drawback of doing this is that you might need to define many such functions, even if you only use each once in your pipeline.
Using the partial() function, you can bind parameters without explicitly defining new functions. For example, to bind the first parameter of <, as in the greater_than_three() function, you can use partial():
1:6 |> keep(partial(`<`, 3))
By default, you always bind the first parameter(s). To bind others, you need to name which parameters to bind. The less than operator has these parameter names:
## function (e1, e2) .Primitive("<")
so you can use this partial evaluation for less_than_three():
1:6 |> keep(partial(`<`, e2 = 3))
Similarly, you can use partial evaluation for mapping:
1:6 |> map_dbl(partial(`+`, 2))
1:6 |> map_dbl(partial(`-`, 1))
1:3 |> map_dbl(partial(`-`, e1 = 4))
1:3 |> map_dbl(partial(`-`, e2 = 4))
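Partial evaluation works the same way for ordinary functions you write yourself. The following sketch uses a two-parameter function, power(), invented for this illustration, and binds one parameter by name and one by position.

```r
library(purrr)

# Hypothetical two-parameter function.
power <- function(base, exponent) base^exponent

# Fix exponent by name; the remaining argument fills base.
square <- partial(power, exponent = 2)

# Fix base by position; the remaining argument fills exponent.
two_to_the <- partial(power, 2)

1:4 |> map_dbl(square)      # 1 4 9 16
1:4 |> map_dbl(two_to_the)  # 2 4 8 16
```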
If you need to apply more than one function, for example
1:3 |>
  map_dbl(partial(`+`, 2)) |>
  map_dbl(partial(`*`, 3))
you can also simply compose the functions. Function composition, ∘, works like this: (g ∘ f)(x) = g(f(x)).
So the pipeline earlier can also be written:
1:3 |> map_dbl(
  compose(partial(`*`, 3), partial(`+`, 2))
)
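The order of the arguments to compose() is easy to get backward, so here is a small sketch with two named helper functions (add_two and times_three, defined only for this illustration). By default, the last function listed is applied first, matching the mathematical notation; the .dir argument reverses that.

```r
library(purrr)

add_two <- function(x) x + 2
times_three <- function(x) x * 3

# Default: the last function listed is applied first.
h <- compose(times_three, add_two)
h(1)  # (1 + 2) * 3 = 9

# .dir = "forward" applies the functions in the order listed.
k <- compose(add_two, times_three, .dir = "forward")
k(1)  # (1 + 2) * 3 = 9

# Swapping the order without .dir changes the result.
compose(add_two, times_three)(1)  # (1 * 3) + 2 = 5
```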
With partial() and compose(), you can modify functions, but using them does not exactly give you code that is easy to read. A more readable alternative is using lambda expressions.
Lambda Expressions
Lambda expressions are a concise syntax for defining anonymous functions, that is, functions that we do not name. The name “lambda expressions” comes from “lambda calculus,” a discipline in formal logic, but in computer science, it is mostly used as a synonym for anonymous functions. In some programming languages, you cannot create anonymous functions; you need to name a function to define it. In other languages, you have special syntax for lambda expressions. In R, you define anonymous functions the same way that you define named functions. You always define functions the same way; you only give them a name when you assign a function definition to a variable.
If this sounds too theoretical, consider this example. When we filtered values that are even, we defined a function, is_even(), to use as a predicate.
is_even <- function(x) x %% 2 == 0
1:6 |> keep(is_even)
We defined the function using the expression function(x) x %% 2 == 0, and then we assigned the function to the name is_even. Instead of assigning the function to a name, we could use the function definition directly in the call to keep():
1:6 |> keep(function(x) x %% 2 == 0)
R 4.1 introduced new syntax for defining functions, giving us a slightly shorter form:
1:6 |> keep(\(x) x %% 2 == 0)
that I tend to use in situations such as this, but this is entirely a question of taste. There is nothing wrong with the function(…) … syntax.
So, R already has lambda expressions, but the syntax can be verbose, especially before R 4.1 and the \(...) ... syntax. In purrr, you can use a formula as a lambda expression instead of a function. You can define an is_even() function using a formula like this:
is_even_lambda <- ~ .x %% 2 == 0
1:6 |> keep(is_even_lambda)
or use the formula directly in your pipeline.
1:6 |> keep(~ .x %% 2 == 0)
The variable .x in the expression will be interpreted as the first argument to the lambda expression.
Lambda expressions are not an approach you can use to define functions in general. They only work because purrr functions understand formulae as function specifications. You cannot write
is_even_lambda <- ~ .x %% 2 == 0
is_even_lambda(3)
This will give you an error. Unfortunately, the error message does not tell you that you tried to use a formula as a function, but just that it cannot find the function is_even_lambda. This is because R will look for a function when you call a variable as a function and ignore other variables with the same name. If you reassign to a variable
f <- function(x) 2 * x
f <- 5
f(2)
## Error in f(2): could not find function "f"
you are told that it cannot find a function with the name you call, not that the variable refers to something that is not a function. That is the error you get if you attempt to call a lambda expression outside a purrr function (or certain other Tidyverse functions).
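If you do want to call a formula-style lambda directly, purrr exports as_mapper(), which converts a formula into an ordinary function. The sketch below shows the conversion explicitly; the name is_even_fn is chosen for this illustration.

```r
library(purrr)

is_even_lambda <- ~ .x %% 2 == 0
class(is_even_lambda)  # "formula"

# as_mapper() turns the formula into a function that can be called.
is_even_fn <- as_mapper(is_even_lambda)
is_even_fn(3)  # FALSE
is_even_fn(4)  # TRUE
```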
R’s rule for looking for functions can be even more confusing if there is a variable in an inner scope and a function in an outer scope:
f <- function(x) 2 * x
g <- function() {
  f <- 5 # not a function
  f(2)   # will look for a function
}
g()
Here, the f() function in the outer scope is called because it is a function; the variable in the inner scope is ignored.
Getting back to lambda expressions in purrr functions, you can use them as more readable versions of partial evaluation.
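As a sketch of what this might look like, the filters and subtractions from the partial evaluation section can be written as formulas, where the position of .x makes it explicit which operand varies.

```r
library(purrr)

1:6 |> keep(~ .x < 3)     # 1 2
1:6 |> keep(~ 3 < .x)     # 4 5 6
1:3 |> map_dbl(~ .x - 4)  # -3 -2 -1
1:3 |> map_dbl(~ 4 - .x)  # 3 2 1
```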
Or you can use them for more readable versions of function composition.
1:3 |> map_dbl(~ 3 * (.x + 2))
If you need a lambda expression with two arguments, you can use .x and .y as the first and second arguments, respectively.
map2_dbl(1:3, 1:3, ~ .x + .y)
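If you need more than two arguments, you can use ..n for the nth argument, that is, ..1, ..2, ..3, and so on. (A single leading dot will not work here, since R parses .1 as the number 0.1.)
list(1:3, 1:3, 1:3) %>% pmap_dbl(~ ..1 + ..2 + ..3)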