The dplyr package resembles the functionality in the purrr package, but it is designed for manipulating data frames. It is loaded with the tidyverse package, but you can also load it on its own with library(dplyr).
The usual way that you use dplyr is similar to how you use purrr. You string together a sequence of actions in a pipeline, with the actions separated by the pipe operator (|> or magrittr's %>%). The difference between the two packages is that the purrr functions work on sequences while the dplyr functions work on data frames.
This package is huge and more functionality is added in each new release, so it would be impossible for me to describe everything you can do with dplyr. I can only give you a flavor of the package functionality and refer you to the documentation for more information.
Selecting Columns
One of the simplest operations you can do on a data frame is selecting one or more of its columns. You can do this by indexing with $:
iris_df <- as_tibble(iris)
print(iris_df, n = 3)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3 1.4
## 3 4.7 3.2 1.3
## # . . . with 147 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
head(iris_df$Species)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
This, however, does not work well with pipelines where a data frame is flowing through from action to action. In dplyr, you have the function select() for picking out columns. We explored it in Chapter 4, but let's see a few more examples.
You can give the function one or more column names, and it will give you a data frame containing only those columns.
iris_df |>
select(Sepal.Length, Species) |>
print(n = 3)
## # A tibble: 150 × 2
## Sepal.Length Species
## <dbl> <fct>
## 1 5.1 setosa
## 2 4.9 setosa
## 3 4.7 setosa
## # . . . with 147 more rows
You can also use complements when selecting columns; just put a ! in front of the column names.
iris_df |>
select(!Species) |>
print(n = 3)
## # A tibble: 150 × 4
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3 1.4
## 3 4.7 3.2 1.3
## # . . . with 147 more rows, and 1 more variable:
## # Petal.Width <dbl>
iris_df |>
select(!c(Species, Sepal.Length)) |>
print(n = 3)
## # A tibble: 150 × 3
## Sepal.Width Petal.Length Petal.Width
## <dbl> <dbl> <dbl>
## 1 3.5 1.4 0.2
## 2 3 1.4 0.2
## 3 3.2 1.3 0.2
## # . . . with 147 more rows
You can also use minus to
remove columns
from a selection:
iris_df |>
select(-Species) |>
print(n = 3)
## # A tibble: 150 × 4
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3 1.4
## 3 4.7 3.2 1.3
## # . . . with 147 more rows, and 1 more variable:
## # Petal.Width <dbl>
iris_df |>
select(-c(Species, Sepal.Length)) |>
print(n = 3)
## # A tibble: 150 × 3
## Sepal.Width Petal.Length Petal.Width
## <dbl> <dbl> <dbl>
## 1 3.5 1.4 0.2
## 2 3 1.4 0.2
## 3 3.2 1.3 0.2
## # . . . with 147 more rows
However, ! and - do not behave exactly the same way when you provide more than one argument to select(). With ! you ask for all columns except those you exclude, so if you write
iris_df |>
select(!Species, !Sepal.Length) |>
print(n = 3)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3 1.4
## 3 4.7 3.2 1.3
## # . . . with 147 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
you first ask for all columns except Species and then for all columns except
Sepal.Length
, and you get the union of the two selections (which includes everything).
With minus, you ask for the columns to be removed instead, so with
iris_df |>
select(-Species, -Sepal.Length) |>
print(n = 3)
## # A tibble: 150 × 3
## Sepal.Width Petal.Length Petal.Width
## <dbl> <dbl> <dbl>
## 1 3.5 1.4 0.2
## 2 3 1.4 0.2
## 3 3.2 1.3 0.2
## # . . . with 147 more rows
you first remove Species and then Sepal.Length so neither are included in the final table.
You do not need to use column names. You can also use indices.
iris_df |>
select(1) |>
print(n = 3)
## # A tibble: 150 × 1
## Sepal.Length
## <dbl>
## 1 5.1
## 2 4.9
## 3 4.7
## # . . . with 147 more rows
iris_df |>
select(1:3) |>
print(n = 3)
## # A tibble: 150 × 3
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3 1.4
## 3 4.7 3.2 1.3
## # . . . with 147 more rows
This, of course, is less informative about what is being selected when you read the code. The second selection shows that you can extract ranges using indices the same way as with column names.
iris_df |>
select(Petal.Length:Species) |>
print(n = 3)
## # A tibble: 150 × 3
## Petal.Length Petal.Width Species
## <dbl> <dbl> <fct>
## 1 1.4 0.2 setosa
## 2 1.4 0.2 setosa
## 3 1.3 0.2 setosa
## # . . . with 147 more rows
All the other functions described in Chapter
4 work for
select()
, of course:
iris_df |>
select(starts_with("Petal")) |>
print(n = 3)
## # A tibble: 150 × 2
## Petal.Length Petal.Width
## <dbl> <dbl>
## 1 1.4 0.2
## 2 1.4 0.2
## 3 1.3 0.2
## # . . . with 147 more rows
iris_df |>
select(-starts_with("Petal")) |>
print(n = 3)
## # A tibble: 150 × 3
## Sepal.Length Sepal.Width Species
## <dbl> <dbl> <fct>
## 1 5.1 3.5 setosa
## 2 4.9 3 setosa
## 3 4.7 3.2 setosa
## # . . . with 147 more rows
iris_df |>
select(starts_with("Petal"), Species) |>
print(n = 3)
## # A tibble: 150 × 3
## Petal.Length Petal.Width Species
## <dbl> <dbl> <fct>
## 1 1.4 0.2 setosa
## 2 1.4 0.2 setosa
## 3 1.3 0.2 setosa
## # . . . with 147 more rows
iris_df |>
select(starts_with("PETAL", ignore.case = TRUE)) |>
print(n = 3)
## # A tibble: 150 × 2
## Petal.Length Petal.Width
## <dbl> <dbl>
## 1 1.4 0.2
## 2 1.4 0.2
## 3 1.3 0.2
## # . . . with 147 more rows
iris_df |>
select(starts_with("S")) |>
print(n = 3)
## # A tibble: 150 × 3
## Sepal.Length Sepal.Width Species
## <dbl> <dbl> <fct>
## 1 5.1 3.5 setosa
## 2 4.9 3 setosa
## 3 4.7 3.2 setosa
## # . . . with 147 more rows
iris_df |>
select(ends_with("Length")) |>
print(n = 3)
## # A tibble: 150 × 2
## Sepal.Length Petal.Length
## <dbl> <dbl>
## 1 5.1 1.4
## 2 4.9 1.4
## 3 4.7 1.3
## # . . . with 147 more rows
iris_df |>
select(contains("ng")) |>
print(n = 3)
## # A tibble: 150 × 2
## Sepal.Length Petal.Length
## <dbl> <dbl>
## 1 5.1 1.4
## 2 4.9 1.4
## 3 4.7 1.3
## # . . . with 147 more rows
You can also use
select() to rename columns. If you use a named parameter, the columns you select will get the names you use for the function parameters.
iris_df |>
select(sepal_length = Sepal.Length,
sepal_width = Sepal.Width) |>
print(n = 3)
## # A tibble: 150 × 2
## sepal_length sepal_width
## <dbl> <dbl>
## 1 5.1 3.5
## 2 4.9 3
## 3 4.7 3.2
## # . . . with 147 more rows
Since select() removes the columns you do not select, it might not always be the best choice for the task at hand. A similar function, where you do not lose columns, is rename(). With this function, you can rename columns and keep the remaining columns as well.
iris_df |>
rename(sepal_length = Sepal.Length,
sepal_width = Sepal.Width) |>
print(n = 3)
## # A tibble: 150 × 5
## sepal_length sepal_width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3 1.4
## 3 4.7 3.2 1.3
## # . . . with 147 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
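If you want to transform many names at once rather than list them one by one, newer dplyr versions (1.0 and later) also provide rename_with(), which applies a function to the column names. A small sketch; the dot-to-underscore tweak is just an illustration:

```r
library(dplyr)

as_tibble(iris) |>
  rename_with(tolower) |>                 # lowercase every column name
  rename_with(~ sub("\\.", "_", .x)) |>   # replace the first dot with an underscore
  colnames()
## [1] "sepal_length" "sepal_width" "petal_length" "petal_width" "species"
```

The second call uses a lambda; any function that maps a character vector to a character vector of the same length will work.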
Filter
While select() extracts a subset of columns
, the filter() function
does the same for rows.
The
iris_df
contains three different species. We can see this using the
distinct() function
. This function is also from the
dplyr package, and it gives you all the unique rows from selected columns.
iris_df |> distinct(Species)
## # A tibble: 3 × 1
## Species
## <fct>
## 1 setosa
## 2 versicolor
## 3 virginica
We can select the rows where the species is “setosa” using
filter():
iris_df |>
filter(Species == "setosa") |>
print(n = 3)
## # A tibble: 50 × 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3 1.4
## 3 4.7 3.2 1.3
## # . . . with 47 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
We can combine
filter() and
select()
in a pipeline to get a subset of the columns as well as a subset of the rows.
iris_df |>
filter(Species == "setosa") |>
select(ends_with("Length"), Species) |>
print(n = 3)
## # A tibble: 50 × 3
## Sepal.Length Petal.Length Species
## <dbl> <dbl> <fct>
## 1 5.1 1.4 setosa
## 2 4.9 1.4 setosa
## 3 4.7 1.3 setosa
## # . . . with 47 more rows
Generally, we can string together as many
dplyr
(or
purrr) functions as we desire in pipelines.
iris_df |>
filter(Species != "setosa") |>
distinct(Species) |>
print(n = 3)
## # A tibble: 2 × 1
## Species
## <fct>
## 1 versicolor
## 2 virginica
We can use more than one column to filter by:
iris_df |>
filter(Sepal.Length > 5, Petal.Width < 0.4) |>
print(n = 3)
## # A tibble: 15 × 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 5.4 3.7 1.5
## 3 5.8 4 1.2
## # . . . with 12 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
We can also use functions as a predicate, for example, the
between() function
to select numbers in a given range.
iris_df |>
filter(between(Sepal.Width, 2, 2.5)) |>
print(n = 3)
## # A tibble: 19 × 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 4.5 2.3 1.3
## 2 5.5 2.3 4
## 3 4.9 2.4 3.3
## # . . . with 16 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
We cannot use the functions
starts_with(),
ends_with(), etc. that we can use with
select(). This does not mean, however, that we cannot filter rows using string patterns. We just need different functions. For example, we can use
str_starts() from the
stringr package
. We return to
stringr in Chapter
9.
iris_df |>
filter(str_starts(Species, "v")) |>
print(n = 3)
## # A tibble: 100 × 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 7 3.2 4.7
## 2 6.4 3.2 4.5
## 3 6.9 3.1 4.9
## # . . . with 97 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
iris_df |>
filter(str_ends(Species, "r")) |>
print(n = 3)
## # A tibble: 50 × 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 7 3.2 4.7
## 2 6.4 3.2 4.5
## 3 6.9 3.1 4.9
## # . . . with 47 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
While filter() selects rows by applying predicates to individual columns, other filter variants apply predicates over more than one column.
The filter_all() function applies the predicate over all columns. You provide an expression where . (dot) stands for the table cell value and wrap the expression in one of two functions: any_vars() or all_vars(). If you use any_vars(), it suffices that one column satisfies the predicate for the row to be included; if you use all_vars(), then all columns must satisfy the predicate.
We can keep the rows of iris_df where at least one value is larger than five:
iris_df |>
select(-Species) |>
filter_all(any_vars(. > 5)) |>
print(n = 3)
## # A tibble: 118 × 4
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 5.4 3.9 1.7
## 3 5.4 3.7 1.5
## # . . . with 115 more rows, and 1 more variable:
## # Petal.Width <dbl>
The expression . > 5 is not meaningful if . is a string, so I had to remove the Species column before filtering. We rarely want to lose an informative variable, though. If you want to keep a column, but not apply the predicate to its values, you can use the filter_at() function.
iris_df |>
filter_at(vars(-Species), any_vars(. > 5)) |>
print(n = 3)
## # A tibble: 118 × 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 5.4 3.9 1.7
## 3 5.4 3.7 1.5
## # . . . with 115 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
You can give filter_at() a vector of strings or a vector of variables (unquoted column names wrapped in vars()). Because I needed to negate Species to select all other columns, I used vars() above. In other cases, you can use either option.
iris_df |>
filter_at(c("Petal.Length", "Sepal.Length"),
any_vars(. > 0)) |>
print(n = 3)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3 1.4
## 3 4.7 3.2 1.3
## # . . . with 147 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
iris_df |>
filter_at(vars(Petal.Length, Sepal.Length),
any_vars(. > 0)) |>
print(n = 3)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3 1.4
## 3 4.7 3.2 1.3
## # . . . with 147 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
You can use filter_if() to choose which columns to test, using a predicate over the columns themselves:
iris_df |>
filter_if(is.numeric, all_vars(. < 5)) |>
print(n = 3)
## # A tibble: 22 × 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 4.9 3 1.4
## 2 4.7 3.2 1.3
## 3 4.6 3.1 1.5
## # . . . with 19 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
You can use functions or lambda expressions as the column predicate.
Consider a case where we have a table of numbers with missing data. If we want to keep the rows where all values are greater than three, we can use filter_all():
df <- tribble(
~A, ~B, ~C,
1, 2, 3,
4, 5, NA,
11, 12, 13,
22, 22, 1
)
df |> filter_all(all_vars(. > 3))
## # A tibble: 1 × 3
## A B C
## <dbl> <dbl> <dbl>
## 1 11 12 13
It removes the first and the last row because both contain values smaller than or equal to three. It also deletes the second row because NA is not considered greater than three. We can restrict the tests to the columns that do not contain missing values:
df |> filter_if(~ all(!is.na(.)), all_vars(. > 3))
## # A tibble: 3 × 3
## A B C
## <dbl> <dbl> <dbl>
## 1 4 5 NA
## 2 11 12 13
## 3 22 22 1
This removes column C from the tests. It means that we also keep the last row, even though its value in C is smaller than three.
If you want to keep the two middle rows but not the first or last, you can write a more complex predicate and use filter_all().
df |> filter_all(all_vars(is.na(.) | . > 3))
## # A tibble: 2 × 3
## A B C
## <dbl> <dbl> <dbl>
## 1 4 5 NA
## 2 11 12 13
You need the vectorized or, |, here; the expression must be a vector expression, so you cannot use ||.
Many dplyr functions have _all, _at, and _if variants, but I will only describe the frequently used functions here. For the others, I refer to the package documentation.
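One caveat worth knowing: in recent dplyr releases (1.0.4 and later), the _all/_at/_if filter variants are superseded by if_any() and if_all(), used inside an ordinary filter() together with a tidyselect column selection. A sketch of rough equivalents, assuming a recent dplyr and R 4.1+ for the \(x) lambda syntax:

```r
library(dplyr)

# Roughly equivalent to filter_at(vars(-Species), any_vars(. > 5)):
as_tibble(iris) |>
  filter(if_any(!Species, \(x) x > 5))

# Roughly equivalent to filter_if(is.numeric, all_vars(. < 5)):
as_tibble(iris) |>
  filter(if_all(where(is.numeric), \(x) x < 5))
```

The same if_any()/if_all() helpers, and their sibling across(), replace the scoped variants of the other verbs as well.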
Sorting
If you want to sort rows by values in selected columns, the
function
you want is called
arrange()
. You can sort by one or more columns:
iris_df |>
arrange(Petal.Length) |>
print(n = 5)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 4.6 3.6 1
## 2 4.3 3 1.1
## 3 5.8 4 1.2
## 4 5 3.2 1.2
## 5 4.7 3.2 1.3
## # . . . with 145 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
iris_df |>
arrange(Sepal.Length, Petal.Length) |>
print(n = 5)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 4.3 3 1.1
## 2 4.4 3 1.3
## 3 4.4 3.2 1.3
## 4 4.4 2.9 1.4
## 5 4.5 2.3 1.3
## # . . . with 145 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
If you want to sort in descending order, you can use the function
desc()
.
iris_df |>
arrange(desc(Petal.Length)) |>
print(n = 5)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 7.7 2.6 6.9
## 2 7.7 3.8 6.7
## 3 7.7 2.8 6.7
## 4 7.6 3 6.6
## 5 7.9 3.8 6.4
## # . . . with 145 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
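One detail worth knowing: arrange() ignores any grouping by default; if you want rows sorted within each group, you must ask for it with the .by_group argument. A small sketch:

```r
library(dplyr)

# Sort by descending Petal.Length within each species rather than globally:
as_tibble(iris) |>
  group_by(Species) |>
  arrange(desc(Petal.Length), .by_group = TRUE)
```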
Modifying Data Frames
When we work with data frames, we usually want to compute values based on the variables (columns) in our tables. Selecting columns and filtering rows will only get us so far.
The
mutate() function
lets us add columns to a data frame based on expressions that can involve any of the existing columns. Consider a table of widths and heights.
df <- tribble(
~height, ~width,
10, 12,
42, 24,
14, 12
)
We can add an area column using
mutate() and this expression:
df |> mutate(area = height * width)
## # A tibble: 3 × 3
## height width area
## <dbl> <dbl> <dbl>
## 1 10 12 120
## 2 42 24 1008
## 3 14 12 168
If you add columns to a data frame, you can refer to variables you have already added. For example, if your height and width data are in inches and you want them in centimeters, and you also want the area in
centimeters
squared, you can do this:
cm_per_inch <- 2.54
df |> mutate(
height_cm = cm_per_inch * height,
width_cm = cm_per_inch * width,
area_cm = height_cm * width_cm
)
## # A tibble: 3 × 5
## height width height_cm width_cm area_cm
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10 12 25.4 30.5 774.
## 2 42 24 107. 61.0 6503.
## 3 14 12 35.6 30.5 1084.
The following expression, however, will not work since you cannot refer to variables you are adding later in the
mutate()
call:
df |> mutate(
area_cm = height_cm * width_cm,
height_cm = cm_per_inch * height,
width_cm = cm_per_inch * width
)
## Error in `mutate()`:
## ! Problem while computing `area_cm = height_cm * width_cm`.
## Caused by error:
## ! object 'height_cm' not found
If you do not give the new variable a name, that is, if you call mutate() with unnamed arguments, then the expression becomes the variable name:
df |> mutate(cm_per_inch * height)
## # A tibble: 3 × 3
## height width `cm_per_inch * height`
## <dbl> <dbl> <dbl>
## 1 10 12 25.4
## 2 42 24 107.
## 3 14 12 35.6
In this example, the units (inches and centimeters) are not encoded with the data, and in future computations you might end up mixing the wrong units. If you use the units package, which is not part of the Tidyverse, you can make the units explicit, automatically convert between units where that makes sense, and get unit type-safety as a bonus.
library(units)
df |> mutate(
height_in = units::as_units(height, "in"),
width_in = units::as_units(width, "in"),
area_in = height_in * width_in,
height_cm = units::set_units(height_in, "cm"),
width_cm = units::set_units(width_in, "cm"),
area_cm = units::set_units(area_in, "cm^2")
)
## # A tibble: 3 × 8
## height width height_in width_in area_in
## <dbl> <dbl> [in] [in] [in^2]
## 1 10 12 10 12 120
## 2 42 24 42 24 1008
## 3 14 12 14 12 168
## # . . . with 3 more variables: height_cm [cm],
## # width_cm [cm], area_cm [cm^2]
This is possible because tibble columns can hold vectors of various classes, which the units package exploits. Tibbles use an interface that lets you define how your own classes should be displayed, but this is beyond the scope of this book.
Earlier, we created unit-carrying values from the original data table, but the original height and width columns are still around. We could use select() to remove them, but the function transmute() already does this for us.
df |> transmute(
height_in = units::as_units(height, "in"),
width_in = units::as_units(width, "in"),
area_in = height_in * width_in,
height_cm = units::set_units(height_in, "cm"),
width_cm = units::set_units(width_in, "cm"),
area_cm = units::set_units(area_in, "cm^2")
)
## # A tibble: 3 × 6
## height_in width_in area_in height_cm width_cm
## [in] [in] [in^2] [cm] [cm]
## 1 10 12 120 25.4 30.5
## 2 42 24 1008 107. 61.0
## 3 14 12 168 35.6 30.5
## # . . . with 1 more variable: area_cm [cm^2]
When you use transmute() instead of mutate(), the variables that you are not assigning to will be removed.
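In dplyr 1.1.0 and later, transmute() is superseded; the same effect is available through mutate()'s .keep argument, which may be preferable in new code. A sketch, assuming a recent dplyr:

```r
library(dplyr)

df <- tibble::tibble(height = c(10, 42, 14), width = c(12, 24, 12))

# .keep = "none" keeps only the columns created in this mutate() call:
df |> mutate(area = height * width, .keep = "none")
## A one-column tibble with area values 120, 1008, and 168.
```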
As with filtering, the expressions you use when mutating a data frame must be vector expressions. Consider this example:
df <- tibble(
x = rnorm(3, mean = 12, sd = 5),
)
df |> mutate(~ if (.x < 0) -.x else .x)
## Error in `mutate()`:
## ! Problem while computing `..1 = ~if (.x < 0) -.x else .x`.
## x `..1` must be a vector, not a `formula` object.
It fails because the lambda is a formula object, not a vector expression; each argument to mutate() must evaluate to a vector. Writing the computation as an ordinary function, say my_abs() (my version of the abs() function), would not work directly either, because the if expression only tests a single value and cannot handle a whole column.
If we use the built-in abs() function, there is no problem; it handles vectors.
df |> mutate(abs(x))
## # A tibble: 3 × 2
## x `abs(x)`
## <dbl> <dbl>
## 1 11.5 11.5
## 2 20.9 20.9
## 3 17.2 17.2
If you use ifelse(), the vectorized version of if expressions, you get a vector expression.
df |> mutate(ifelse(x < 0, -x, x))
## # A tibble: 3 × 2
## x `ifelse(x < 0, -x, x)`
## <dbl> <dbl>
## 1 11.5 11.5
## 2 20.9 20.9
## 3 17.2 17.2
Or you can use the Vectorize() function to make a vector expression out of a function that does not handle vectors.
my_abs <- Vectorize(\(x) if (x < 0) -x else x)
df |> mutate(my_abs(x))
## # A tibble: 3 × 2
## x `my_abs(x)`
## <dbl> <dbl>
## 1 11.5 11.5
## 2 20.9 20.9
## 3 17.2 17.2
If you need to map the input to several different output values, you can nest ifelse() expressions arbitrarily deep, but that quickly gets difficult to read. A function that alleviates this problem substantially is case_when(). You give it a sequence of predicates with matching expressions, and it returns the expression for the first matching predicate.
Consider this example where we use
case_when()
to categorize the variable
x based on its value:
df <- tibble(x = rnorm(100))
df |>
mutate(
x_category = case_when(
x - mean(x) < -2 * sd(x) ~ "small",
x - mean(x) > 2 * sd(x) ~ "large",
TRUE ~ "medium"
)
) |>
print(n = 3)
## # A tibble: 100 × 2
## x x_category
## <dbl> <chr>
## 1 -0.736 medium
## 2 -0.478 medium
## 3 0.796 medium
## # . . . with 97 more rows
The TRUE line corresponds to the else part of an if-else statement.
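If you are on dplyr 1.1.0 or later, the TRUE ~ ... catch-all can be written more explicitly with the .default argument; a sketch:

```r
library(dplyr)

x <- c(-1, 0.5, 3)
case_when(
  x < 0 ~ "negative",
  x > 1 ~ "large",
  .default = "medium"
)
## [1] "negative" "medium" "large"
```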
Grouping and Summarizing
In many analyses, we need summary statistics of our data. If you can map one or more of your columns into a single summary, you can use the
summarise() function
.
df <- tibble(x = rnorm(100), y = rnorm(100))
df |> summarise(mean_x = mean(x), mean_y = mean(y))
## # A tibble: 1 × 2
## mean_x mean_y
## <dbl> <dbl>
## 1 -0.154 0.0224
In this example, we summarized our data as the mean of variables x and y.
If you split your data into different classes and want to work on the data per group, you can use the function
group_by()
.
classify <- function(x) {
case_when(
x - mean(x) < -2 * sd(x) ~ "small",
x - mean(x) > 2 * sd(x) ~ "large",
TRUE ~ "medium"
)
}
df |> mutate(x_category = classify(x)) |>
group_by(x_category) |>
print(n = 3)
## # A tibble: 100 × 3
## # Groups: x_category [3]
## x y x_category
## <dbl> <dbl> <chr>
## 1 -2.08 -1.52 medium
## 2 0.580 -0.990 medium
## 3 -0.995 -0.443 medium
## # . . . with 97 more rows
The result is a data frame that, when we print it, doesn't look different from before we grouped the data. In the header, however, you will notice the line # Groups: x_category [3]. This tells you that something is different when you have grouped your data. The difference becomes apparent when you combine grouping with summarizing.
df |> mutate(x_category = classify(x)) |>
group_by(x_category) |>
summarise(mean_x = mean(x), no_x = n())
## # A tibble: 3 × 3
## x_category mean_x no_x
## <chr> <dbl> <int>
## 1 large 2.04 2
## 2 medium -0.155 96
## 3 small -2.30 2
When you group your data and then summarize, you get per-group summary statistics. Here, we calculate the mean and the number of observations in each category (the function n() gives you the number of observations).
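The group-then-count pattern is common enough that dplyr gives it a shortcut: count() groups by the given variables, computes n(), and ungroups again. A small sketch:

```r
library(dplyr)

# Equivalent to group_by(Species) |> summarise(n = n()):
as_tibble(iris) |> count(Species)
## Three rows: setosa, versicolor, and virginica, each with n = 50.
```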
You can get the variables your data is grouped by using the
group_vars() function
.
df |> mutate(x_category = classify(x)) |>
group_by(x_category) |>
group_vars()
## [1] "x_category"
You can group data by more than one
variable
.
df <- tibble(x = rnorm(100), y = rnorm(100)) |>
mutate(
x_category = classify(x),
y_category = classify(y)
)
df |> group_by(x_category, y_category) |>
group_vars()
## [1] "x_category" "y_category"
If you do, you will get summaries for each combination of the
grouping variables
.
df |> group_by(x_category, y_category) |>
summarise(mean_x = mean(x), mean_y = mean(y))
## `summarise()` has grouped output by 'x_category'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 4
## # Groups: x_category [3]
## x_category y_category mean_x mean_y
## <chr> <chr> <dbl> <dbl>
## 1 large medium 2.08 1.33
## 2 medium large 0.808 2.69
## 3 medium medium 0.0755 0.141
## 4 medium small 1.05 -2.38
## 5 small medium -2.55 -1.12
The message saying that the output is "grouped output by 'x_category'" means just that: the table we get out of the call to summarise() is still grouped by x_category, as you can also see in the printed output of the result (the Groups: x_category [3] line in the header).
When you group by multiple variables, it works in effect as if you add groups one by one, and you can also do this via multiple calls to group_by(). In that case, however, you need the argument .add = TRUE to add a group rather than replace one, which is what group_by() does without this argument.
# Adding two categories at the same time
df |> group_by(x_category, y_category) |> group_vars()
## [1] "x_category" "y_category"
# Adding categories in two steps
df |> group_by(x_category) |> group_by(y_category, .add = TRUE) |> group_vars()
## [1] "x_category" "y_category"
# Replacing one grouping with another
df |> group_by(x_category) |> group_by(y_category) |> group_vars()
## [1] "y_category"
When you summarize, you remove the last grouping you created; the summary only has one row per value in that group anyway, so the grouping is no longer useful. The other groups remain. This was the default behavior in early versions of dplyr, and still is, but it is a common source of errors. Some data scientists expect this behavior, others expect all groups to be removed, and if your expectations are not met, there is a good chance that the following analysis will be wrong. If you think you are working on the data as a whole, but all operations are grouped by one or more variables, you are unlikely to get the results you want.
Because of this, summarise() now takes the argument .groups where you can specify the behavior you want. It is not yet required, since that would break backward compatibility with a lot of code, but you get the message about using it that we saw above.
# Drop the last group we made
df |> group_by(x_category, y_category) |>
summarise(mean_x = mean(x), mean_y = mean(y), .groups = "drop_last") |>
group_vars()
## [1] "x_category"
# Drop all groups
df |> group_by(x_category, y_category) |>
summarise(mean_x = mean(x), mean_y = mean(y), .groups = "drop") |>
group_vars()
## character(0)
# Keep all groups
df |> group_by(x_category, y_category) |>
summarise(mean_x = mean(x), mean_y = mean(y), .groups = "keep") |>
group_vars()
## [1] "x_category" "y_category"
If you have grouped your data, or you get grouped data from one summary as before, you can remove the groups again using
ungroup()
. Once you have done this, you can compute global summaries again.
df |> group_by(x_category, y_category) |>
ungroup() |> # Remove the grouping again
group_vars()
## character(0)
Grouping is not only useful when collecting summaries of your data. You can also add columns to your data frame based on per-group computations using
mutate()
:
df |> group_by(x_category) |>
mutate(mean_x = mean(x), mean_y = mean(y)) |>
ungroup() |>
print(n = 5)
## # A tibble: 100 × 6
## x y x_category y_category mean_x
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 -0.416 -0.374 medium medium 0.123
## 2 -0.845 -1.21 medium medium 0.123
## 3 -0.0176 0.103 medium medium 0.123
## 4 -1.18 -0.0132 medium medium 0.123
## 5 0.508 0.504 medium medium 0.123
## # . . . with 95 more rows, and 1 more variable:
## # mean_y <dbl>
When you mutate grouped data, the new columns take values computed per group, but they are combined with the existing data rather than collapsed into a new, smaller data frame as summarise() would do.
A
mutate() without a
group_by()
will give you summaries for the entire data.
df |> mutate(mean_x = mean(x), mean_y = mean(y)) |> # Calculate mean x and y for entire data
distinct(mean_x, mean_y) # Get unique values for printing
## # A tibble: 1 × 2
## mean_x mean_y
## <dbl> <dbl>
## 1 0.0895 0.0520
I have used distinct() to get the unique combinations of mean_x and mean_y, as this illustrates the result well. Here, we only have one row of values, because we have computed the mean values for the entire data set and not for different groupings of the data.
In contrast, if you group before adding variables to the data, then the summaries are per group.
df |> group_by(x_category) |> # Group by x categories only
mutate(mean_x = mean(x), mean_y = mean(y)) |> # Calculate mean x and y for each x group
distinct(mean_x, mean_y) # Get unique values for printing
## # A tibble: 3 × 3
## # Groups: x_category [3]
## x_category mean_x mean_y
## <chr> <dbl> <dbl>
## 1 medium 0.123 0.0631
## 2 large 2.08 1.33
## 3 small -2.55 -1.12
Here, there are values for different x_category factors, showing us that we have computed means for each different x_category in the data.
You can combine data-wide
summaries
with grouped summaries if you calculate the global summaries before you group.
df |> mutate(mean_y = mean(y)) |> # Get the mean for y over all data points
group_by(x_category) |> # Then group so the following works per group
mutate(mean_x = mean(x)) |> # Get the mean x for each group (not globally)
distinct(mean_x, mean_y) # Get the unique values so we can print the result
## # A tibble: 3 × 3
## # Groups: x_category [3]
## x_category mean_y mean_x
## <chr> <dbl> <dbl>
## 1 medium 0.0520 0.123
## 2 large 0.0520 2.08
## 3 small 0.0520 -2.55
Notice that the mean_y values are the same in the output rows; this is because we have computed the mean globally and not for each group.
If you need to compute summaries with different groups, for example, the mean of
x for each
x_category as well as the mean of
y for each
y_category, then you can call
group_by(x_category) for summarizing
x, followed by
group_by(
y_category) for changing the grouping so you can summarize
y.
df |> group_by(x_category) |>
mutate(mean_x = mean(x)) |>
group_by(y_category) |>
mutate(mean_y = mean(y)) |>
distinct(
x_category, mean_x,
y_category, mean_y
)
## # A tibble: 5 × 4
## # Groups: y_category [3]
## x_category y_category mean_x mean_y
## <chr> <chr> <dbl> <dbl>
## 1 medium medium 0.123 0.127
## 2 medium small 0.123 -2.38
## 3 large medium 2.08 0.127
## 4 medium large 0.123 2.69
## 5 small medium -2.55 0.127
Joining Tables
It is not uncommon to have your data in more than one table. This could be because the tables are created from different calculations, for example, different kinds of summaries, or it can be because you got the data from different files.
If you merely need to combine tables by row or column, you can use the bind_rows() and bind_cols() functions, which do precisely what you would expect.
df1 <- tibble(
A = paste0("a", 1:2),
B = paste0("b", 1:2)
)
df2 <- tibble(
A = paste0("a", 3:4),
B = paste0("b", 3:4)
)
df3 <- tibble(
C = paste0("c", 1:2),
D = paste0("d", 1:2)
)
bind_rows(df1, df2)
## # A tibble: 4 × 2
## A B
## <chr> <chr>
## 1 a1 b1
## 2 a2 b2
## 3 a3 b3
## 4 a4 b4
bind_cols(df1, df3)
## # A tibble: 2 × 4
## A B C D
## <chr> <chr> <chr> <chr>
## 1 a1 b1 c1 d1
## 2 a2 b2 c2 d2
When you combine rows, the tables must have the same columns, and when you combine by column, the tables must have the same number of rows.
If you have tables that represent different relations between variables—the underlying principle of relational databases aimed at avoiding duplicated data—then you can combine them using join functions.
Say you have a table that maps students to grades for each class you teach. You can join tables from two different classes using
inner_join()
. You use a key to join them on, specified by argument
by.
grades_maths <- tribble(
~name, ~grade,
"Marko Polo", "D",
"Isaac Newton", "A+",
"Charles Darwin", "B"
)
grades_biology <- tribble(
~name, ~grade,
"Marko Polo", "F",
"Isaac Newton", "D",
"Charles Darwin", "A+"
)
inner_join(grades_maths, grades_biology, by = "name")
## # A tibble: 3 × 3
## name grade.x grade.y
## <chr> <chr> <chr>
## 1 Marko Polo D F
## 2 Isaac Newton A+ D
## 3 Charles Darwin B A+
This tells inner_join() that you want to combine each row in the first table with every row in the second table that has the same name. You can use more than one key in a join by giving by a vector of variable names.
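To join on more than one key, give by a vector of column names. A small sketch with hypothetical tables keyed by both class and name:

```r
library(dplyr)

attendance <- tibble::tribble(
  ~class,    ~name,            ~attended,
  "maths",   "Isaac Newton",   42,
  "biology", "Charles Darwin", 40
)
grades <- tibble::tribble(
  ~class,    ~name,            ~grade,
  "maths",   "Isaac Newton",   "A+",
  "biology", "Charles Darwin", "A+"
)
# Both class and name must match for two rows to be combined:
inner_join(attendance, grades, by = c("class", "name"))
## Two rows, each with class, name, attended, and grade.
```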
Earlier, each name appeared once per table. If a key appears more than once, the result of an inner join will contain a row for every combination of matching rows from the two tables.
grades_maths2 <- tribble(
~name, ~grade,
"Marko Polo", "D",
"Isaac Newton", "A+", # so good at physics
"Isaac Newton", "A+", # that he got an A+ twice
"Charles Darwin", "B"
)
grades_biology2 <- tribble(
~name, ~grade,
"Marko Polo", "F",
"Isaac Newton", "D",
"Charles Darwin", "A+", # so good at biology that we
"Charles Darwin", "A+" # listed him twice
)
inner_join(grades_maths2, grades_biology2, by = "name")
## # A tibble: 5 × 3
## name grade.x grade.y
## <chr> <chr> <chr>
## 1 Marko Polo D F
## 2 Isaac Newton A+ D
## 3 Isaac Newton A+ D
## 4 Charles Darwin B A+
## 5 Charles Darwin B A+
inner_join(grades_maths2, grades_biology2, by = "grade")
## # A tibble: 5 × 3
## name.x grade name.y
## <chr> <chr> <chr>
## 1 Marko Polo D Isaac Newton
## 2 Isaac Newton A+ Charles Darwin
## 3 Isaac Newton A+ Charles Darwin
## 4 Isaac Newton A+ Charles Darwin
## 5 Isaac Newton A+ Charles Darwin
In the last join, you see that an inner join can give you the same row multiple times. Combine the join with distinct() if you want to avoid this.
inner_join(grades_maths2, grades_biology2, by = "grade") |> distinct()
## # A tibble: 2 × 3
## name.x grade name.y
## <chr> <chr> <chr>
## 1 Marko Polo D Isaac Newton
## 2 Isaac Newton A+ Charles Darwin
The tables can have different variables—if they represent different relations—or they can share several or all variables. In the grade example, each table contains the same kind of name/grade relation, just for different classes. Therefore, they also share variables.
With shared variables, the data from the two tables are differentiated by a suffix. The defaults, as you can see above, are ".x" and ".y"; you can change them with the suffix argument.
inner_join(
grades_maths, grades_biology,
by = "name", suffix = c(".maths", ".biology")
)
## # A tibble: 3 × 3
## name grade.maths grade.biology
## <chr> <chr> <chr>
## 1 Marko Polo D F
## 2 Isaac Newton A+ D
## 3 Charles Darwin B A+
Students might take different classes, and you have several choices on how to combine tables with different keys.
An
inner_join()
will only give you the rows where a key is in both tables.
grades_geography <- tribble(
~name, ~grade,
"Marko Polo", "A",
"Charles Darwin", "A",
"Immanuel Kant", "A+"
)
grades_physics <- tribble(
~name, ~grade,
"Isaac Newton", "A+",
"Albert Einstein", "A+",
"Charles Darwin", "C"
)
inner_join(
grades_geography, grades_physics,
by = "name", suffix = c(".geography", ".physics")
)
## # A tibble: 1 × 3
## name grade.geography grade.physics
## <chr> <chr> <chr>
## 1 Charles Darwin A C
The
full_join() function
, in contrast, gives you a table containing all keys. If a key is only found in one of the tables, the variables from the other table will be set to
NA.
full_join(
grades_geography, grades_physics,
by = "name", suffix = c(".geography", ".physics")
)
## # A tibble: 5 × 3
## name grade.geography grade.physics
## <chr> <chr> <chr>
## 1 Marko Polo A <NA>
## 2 Charles Darwin A C
## 3 Immanuel Kant A+ <NA>
## 4 Isaac Newton <NA> A+
## 5 Albert Einstein <NA> A+
If you want all keys from the left or right table (but not both left and right)—potentially with
NA if the other table does not have the key—then you need
left_join() or
right_join()
.
left_join(
grades_geography, grades_physics,
by = "name", suffix = c(".geography", ".physics")
)
## # A tibble: 3 × 3
## name grade.geography grade.physics
## <chr> <chr> <chr>
## 1 Marko Polo A <NA>
## 2 Charles Darwin A C
## 3 Immanuel Kant A+ <NA>
right_join(
grades_maths, grades_physics,
by = "name", suffix = c(".maths", ".physics")
)
## # A tibble: 3 × 3
## name grade.maths grade.physics
## <chr> <chr> <chr>
## 1 Isaac Newton A+ A+
## 2 Charles Darwin B C
## 3 Albert Einstein <NA> A+
A semi_join() will give you all the rows in the first table whose key appears in the second table.
semi_join(grades_maths2, grades_biology2, by = "name")
## # A tibble: 4 × 2
## name grade
## <chr> <chr>
## 1 Marko Polo D
## 2 Isaac Newton A+
## 3 Isaac Newton A+
## 4 Charles Darwin B
You only get one copy per row in the first table when a key matches, regardless of how many times the key appears in the second table.
This is unlike inner_join(), where you get all combinations of matching rows from the two tables. A semi_join() is, therefore, not equivalent to an inner_join() combined with a select().
inner_join(
grades_maths2, grades_biology2,
by = "name", suffix = c(".maths", ".biology")
) |> select(1:2)
## # A tibble: 5 × 2
## name grade.maths
## <chr> <chr>
## 1 Marko Polo D
## 2 Isaac Newton A+
## 3 Isaac Newton A+
## 4 Charles Darwin B
## 5 Charles Darwin B
You can still get multiple identical rows from a semi_join(). If the first table has duplicated rows, you will get the same duplication of the rows. If you do not want that, you can combine the join with distinct().
An
anti_join()
gives you all the rows in the first table where the key or keys are not found in the second table. Think of it as the complement of
semi_join().
anti_join(grades_maths2, grades_physics, by = "name")
## # A tibble: 1 × 2
## name grade
## <chr> <chr>
## 1 Marko Polo D
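Because anti_join() keeps exactly the rows that semi_join() drops, the two filtering joins partition the first table. A quick sanity check, reusing the tables from above:

```r
library(tibble)
library(dplyr)

grades_maths2 <- tribble(
  ~name, ~grade,
  "Marko Polo", "D",
  "Isaac Newton", "A+",
  "Isaac Newton", "A+",
  "Charles Darwin", "B"
)
grades_physics <- tribble(
  ~name, ~grade,
  "Isaac Newton", "A+",
  "Albert Einstein", "A+",
  "Charles Darwin", "C"
)

matched   <- semi_join(grades_maths2, grades_physics, by = "name")
unmatched <- anti_join(grades_maths2, grades_physics, by = "name")

# Every row of the first table ends up in exactly one of the two results
nrow(matched) + nrow(unmatched) == nrow(grades_maths2)
```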
The join functions only take two tables as input, so you might wonder how you can combine more than two. One solution is to use purrr's reduce() function:
grades <- list(
grades_maths, grades_biology,
grades_geography, grades_physics
)
grades |>
reduce(full_join, by = "name") |>
rename_at(2:5, ~ c("maths", "biology", "geography", "physics"))
## # A tibble: 5 × 5
## name maths biology geography physics
## <chr> <chr> <chr> <chr> <chr>
## 1 Marko Polo D F A <NA>
## 2 Isaac Newton A+ D <NA> A+
## 3 Charles Darwin B A+ A C
## 4 Immanuel Kant <NA> <NA> A+ <NA>
## 5 Albert Einstein <NA> <NA> <NA> A+
The rename_at() function
works similarly to select_at(), and here, I use it to rename the last four columns.
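Note that in current dplyr releases the _at() verbs are superseded, and rename_with() is the recommended replacement. A minimal sketch of the same kind of renaming, on two small tables:

```r
library(tibble)
library(dplyr)
library(purrr)

grades_maths   <- tribble(~name, ~grade, "Isaac Newton", "A+")
grades_biology <- tribble(~name, ~grade, "Isaac Newton", "D")

# rename_with() applies a function to the selected column names;
# here the function simply returns the new names
list(grades_maths, grades_biology) |>
  reduce(full_join, by = "name") |>
  rename_with(~ c("maths", "biology"), .cols = 2:3)
```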
Income in Fictional Countries
We had an example with income in fictional countries
in Chapter 5. There, we did not have the proper tools needed to deal with missing data. We wanted to replace NA with the mean income for a country in rows with missing data, but the functions from tidyr didn’t suffice. With dplyr we have the tools.
Recall the data:
mean_income <- tribble(
~country, ~`2002`, ~`2003`, ~`2004`, ~`2005`,
"Numenor", 123456, 132654, NA, 324156,
"Westeros", 314256, NA, NA, 465321,
"Narnia", 432156, NA, NA, NA,
"Gondor", 531426, 321465, 235461, 463521,
"Laputa", 14235, 34125, 45123, 51234,
)
The following pipeline does what we want. I will explain the steps below, but first try to work them out yourself:
mean_income |>
pivot_longer(
names_to = "year",
values_to = "mean_income",
!country
) |>
group_by(
country
) |>
mutate(
mean_per_country = mean(mean_income, na.rm = TRUE),
mean_income = ifelse(
is.na(mean_income),
mean_per_country,
mean_income
)
) |>
pivot_wider(
names_from = "year",
values_from = "mean_income"
)
## # A tibble: 5 × 6
## # Groups: country [5]
## country mean_per_country `2002` `2003` `2004`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Numenor 193422 123456 132654 193422
## 2 Westeros 389788. 314256 389788. 389788.
## 3 Narnia 432156 432156 432156 432156
## 4 Gondor 387968. 531426 321465 235461
## 5 Laputa 36179. 14235 34125 45123
## # … with 1 more variable: `2005` <dbl>
The first step in the
pipeline
is reformatting the data. We saw how this works in Chapter
5.
mean_income |>
pivot_longer(
names_to = "year",
values_to = "mean_income",
!country
)
## # A tibble: 20 × 3
## country year mean_income
## <chr> <chr> <dbl>
## 1 Numenor 2002 123456
## 2 Numenor 2003 132654
## 3 Numenor 2004 NA
## 4 Numenor 2005 324156
## 5 Westeros 2002 314256
## 6 Westeros 2003 NA
## 7 Westeros 2004 NA
## 8 Westeros 2005 465321
## 9 Narnia 2002 432156
## 10 Narnia 2003 NA
## 11 Narnia 2004 NA
## 12 Narnia 2005 NA
## 13 Gondor 2002 531426
## 14 Gondor 2003 321465
## 15 Gondor 2004 235461
## 16 Gondor 2005 463521
## 17 Laputa 2002 14235
## 18 Laputa 2003 34125
## 19 Laputa 2004 45123
## 20 Laputa 2005 51234
We use pivot_longer() to get all the income data from separate columns into a single mean_income column, where the year column will hold the information about which original column the data came from.
In the next two steps, we group by country, so the means we calculate are per country rather than over the full data set, and then we compute the means and replace NA cells with them. To get a glimpse into the mutate() call, you can replace it with summarise() and see the mean for each country.
mean_income |>
pivot_longer(
names_to = "year",
values_to = "mean_income",
!country
) |>
group_by(
country
) |>
summarise(
per_country_mean = mean(mean_income, na.rm = TRUE)
)
## # A tibble: 5 × 2
## country per_country_mean
## <chr> <dbl>
## 1 Gondor 387968.
## 2 Laputa 36179.
## 3 Narnia 432156
## 4 Numenor 193422
## 5 Westeros 389788.
In the next step, we can replace the missing data with their country mean in the
mutate() call because we can refer to earlier columns when we create later columns in
mutate().
mean_income |>
pivot_longer(
names_to = "year",
values_to = "mean_income",
!country
) |>
group_by(
country
) |>
mutate(
mean_per_country = mean(mean_income, na.rm = TRUE),
mean_income = ifelse(
is.na(mean_income),
mean_per_country,
mean_income
)
)
## # A tibble: 20 × 4
## # Groups: country [5]
## country year mean_income mean_per_country
## <chr> <chr> <dbl> <dbl>
## 1 Numenor 2002 123456 193422
## 2 Numenor 2003 132654 193422
## 3 Numenor 2004 193422 193422
## 4 Numenor 2005 324156 193422
## 5 Westeros 2002 314256 389788.
## 6 Westeros 2003 389788. 389788.
## 7 Westeros 2004 389788. 389788.
## 8 Westeros 2005 465321 389788.
## 9 Narnia 2002 432156 432156
## 10 Narnia 2003 432156 432156
## 11 Narnia 2004 432156 432156
## 12 Narnia 2005 432156 432156
## 13 Gondor 2002 531426 387968.
## 14 Gondor 2003 321465 387968.
## 15 Gondor 2004 235461 387968.
## 16 Gondor 2005 463521 387968.
## 17 Laputa 2002 14235 36179.
## 18 Laputa 2003 34125 36179.
## 19 Laputa 2004 45123 36179.
## 20 Laputa 2005 51234 36179.
If your pipeline had steps between the group_by() call and the mutate() call that sets mean_income, you would have to be careful not to put an ungroup() in there. The pipeline works because mean_income is grouped per country when we call mutate(). If we ungrouped, we would get the mean over all countries.
mean_income |>
pivot_longer(
names_to = "year",
values_to = "mean_income",
!country
) |>
group_by(
country
) |>
# Imagine we are doing something here that removes the group...
ungroup() |>
# The mutate that follows is not grouped so the mean is global...
mutate(
mean_per_country = mean(mean_income, na.rm = TRUE),
mean_income = ifelse(
is.na(mean_income),
mean_per_country,
mean_income
)
)
## # A tibble: 20 × 4
## country year mean_income mean_per_country
## <chr> <chr> <dbl> <dbl>
## 1 Numenor 2002 123456 249185.
## 2 Numenor 2003 132654 249185.
## 3 Numenor 2004 249185. 249185.
## 4 Numenor 2005 324156 249185.
## 5 Westeros 2002 314256 249185.
## 6 Westeros 2003 249185. 249185.
## 7 Westeros 2004 249185. 249185.
## 8 Westeros 2005 465321 249185.
## 9 Narnia 2002 432156 249185.
## 10 Narnia 2003 249185. 249185.
## 11 Narnia 2004 249185. 249185.
## 12 Narnia 2005 249185. 249185.
## 13 Gondor 2002 531426 249185.
## 14 Gondor 2003 321465 249185.
## 15 Gondor 2004 235461 249185.
## 16 Gondor 2005 463521 249185.
## 17 Laputa 2002 14235 249185.
## 18 Laputa 2003 34125 249185.
## 19 Laputa 2004 45123 249185.
## 20 Laputa 2005 51234 249185.
In the last step, we just use pivot_wider() to transform the long table back to the original format; there is nothing special here that you haven’t seen in Chapter 5.
This is how we replace all missing data with the mean for the corresponding countries.
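As an aside, dplyr's coalesce(), which returns the first non-NA value at each position, lets you write the ifelse() step more compactly. A sketch on a smaller version of the data:

```r
library(tibble)
library(dplyr)
library(tidyr)

income <- tribble(
  ~country, ~`2002`, ~`2003`,
  "Numenor", 100, NA,
  "Gondor",  200, 300
)

income |>
  pivot_longer(!country, names_to = "year", values_to = "mean_income") |>
  group_by(country) |>
  # coalesce() keeps observed values and falls back to the group mean for NA
  mutate(mean_income = coalesce(mean_income, mean(mean_income, na.rm = TRUE))) |>
  pivot_wider(names_from = "year", values_from = "mean_income")
```

Because the mutate() is grouped, the fallback mean is computed per country, just as in the longer pipeline above.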