© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
T. Mailund, R 4 Data Science Quick Reference, https://doi.org/10.1007/978-1-4842-8780-4_8

8. Manipulating Data Frames: dplyr

Thomas Mailund
Aarhus, Denmark
The dplyr package resembles the purrr package in how you use it, but it is designed for manipulating data frames. It is loaded with the tidyverse package, but you can also load it on its own using
library(dplyr)

The usual way that you use dplyr is similar to how you use purrr. You string together a sequence of actions in a pipeline, with the actions separated by a pipe operator (either magrittr's %>% or the base R |> operator used in the examples in this chapter). The difference between the two packages is that the purrr functions work on sequences while the dplyr functions work on data frames.
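For example, a minimal pipeline might look like this (a sketch using the built-in iris data; both pipe operators behave the same way here):

```r
library(dplyr)

# The magrittr pipe, exported by dplyr and the tidyverse:
iris %>%
    filter(Species == "setosa") %>%
    summarise(mean_length = mean(Sepal.Length))

# The base R pipe (R >= 4.1), used in the examples in this chapter:
iris |>
    filter(Species == "setosa") |>
    summarise(mean_length = mean(Sepal.Length))
```

Both calls return a one-row table holding the mean sepal length of the setosa observations.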

This package is huge and more functionality is added in each new release, so it would be impossible for me to describe everything you can do with dplyr. I can only give you a flavor of the package functionality and refer you to the documentation for more information.

Selecting Columns

One of the simplest operations you can do on a data frame is selecting one or more of its columns. You can do this by indexing with $:
iris_df <- as_tibble(iris)
print(iris_df, n = 3)
## # A tibble: 150 × 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## # . . . with 147 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
head(iris_df$Species)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica

This, however, does not work well with pipelines where a data frame is flowing through from action to action. In dplyr, you have the function select() for picking out columns. We explored it in Chapter 4, but let’s see a few more examples.

You can give the function one or more column names, and it will give you a data frame containing only those columns.
iris_df |>
    select(Sepal.Length, Species) |>
    print(n = 3)
## # A tibble: 150 × 2
##   Sepal.Length Species
##          <dbl> <fct>
## 1          5.1 setosa
## 2          4.9 setosa
## 3          4.7 setosa
## # . . . with 147 more rows
You can also use complements when selecting columns; just put a ! in front of the column names.
iris_df |>
    select(!Species) |>
    print(n = 3)
## # A tibble: 150 × 4
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## # . . . with 147 more rows, and 1 more variable:
## # Petal.Width <dbl>
iris_df |>
    select(!c(Species, Sepal.Length)) |>
    print(n = 3)
## # A tibble: 150 × 3
##   Sepal.Width Petal.Length Petal.Width
##         <dbl>        <dbl>       <dbl>
## 1         3.5          1.4         0.2
## 2         3            1.4         0.2
## 3         3.2          1.3         0.2
## # . . . with 147 more rows
You can also use minus to remove columns from a selection:
iris_df |>
    select(-Species) |>
    print(n = 3)
## # A tibble: 150 × 4
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## # . . . with 147 more rows, and 1 more variable:
## # Petal.Width <dbl>
iris_df |>
    select(-c(Species, Sepal.Length)) |>
    print(n = 3)
## # A tibble: 150 × 3
##   Sepal.Width Petal.Length Petal.Width
##         <dbl>        <dbl>       <dbl>
## 1         3.5          1.4         0.2
## 2         3            1.4         0.2
## 3         3.2          1.3         0.2
## # . . . with 147 more rows

However, ! and - do not behave exactly the same way when you provide more than one argument to select().

With ! you ask for all columns except those you exclude, so if you write
iris_df |>
    select(!Species, !Sepal.Length) |>
    print(n = 3)
## # A tibble: 150 × 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## # . . . with 147 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>

you first ask for all columns except Species and then for all columns except Sepal.Length, and you get the union of the two selections (which includes everything).

With minus, you ask for the columns to be removed instead, so with
iris_df |>
    select(-Species, -Sepal.Length) |>
    print(n = 3)
## # A tibble: 150 × 3
##   Sepal.Width Petal.Length Petal.Width
##         <dbl>        <dbl>       <dbl>
## 1         3.5          1.4         0.2
## 2         3            1.4         0.2
## 3         3.2          1.3         0.2
## # . . . with 147 more rows

you first remove Species and then Sepal.Length, so neither is included in the final table.

You do not need to use column names. You can also use indices.
iris_df |>
    select(1) |>
    print(n = 3)
## # A tibble: 150 × 1
##   Sepal.Length
##          <dbl>
## 1          5.1
## 2          4.9
## 3          4.7
## # . . . with 147 more rows
iris_df |>
    select(1:3) |>
    print(n = 3)
## # A tibble: 150 × 3
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## # . . . with 147 more rows
Indices, of course, are less informative about what is being selected when you read the code. The second selection also shows that you can extract ranges using indices, the same way as you can using column names:
iris_df |>
   select(Petal.Length:Species) |>
   print(n = 3)
## # A tibble: 150 × 3
##   Petal.Length Petal.Width Species
##          <dbl>       <dbl> <fct>
## 1          1.4         0.2 setosa
## 2          1.4         0.2 setosa
## 3          1.3         0.2 setosa
## # . . . with 147 more rows
All the other functions described in Chapter 4 work with select(), of course:
iris_df |>
    select(starts_with("Petal")) |>
    print(n = 3)
## # A tibble: 150 × 2
##   Petal.Length Petal.Width
##          <dbl>       <dbl>
## 1          1.4         0.2
## 2          1.4         0.2
## 3          1.3         0.2
## # . . . with 147 more rows
iris_df |>
    select(-starts_with("Petal")) |>
    print(n = 3)
## # A tibble: 150 × 3
##   Sepal.Length Sepal.Width Species
##          <dbl>       <dbl>   <fct>
## 1          5.1         3.5 setosa
## 2          4.9         3   setosa
## 3          4.7         3.2 setosa
## # . . . with 147 more rows
iris_df |>
    select(starts_with("Petal"), Species) |>
    print(n = 3)
## # A tibble: 150 × 3
##   Petal.Length Petal.Width Species
##          <dbl>       <dbl> <fct>
## 1          1.4         0.2 setosa
## 2          1.4         0.2 setosa
## 3          1.3         0.2 setosa
## # . . . with 147 more rows
iris_df |>
    select(starts_with("PETAL", ignore.case = TRUE)) |>
    print(n = 3)
## # A tibble: 150 × 2
##   Petal.Length Petal.Width
##          <dbl>       <dbl>
## 1          1.4         0.2
## 2          1.4         0.2
## 3          1.3         0.2
## # . . . with 147 more rows
iris_df |>
    select(starts_with("S")) |>
    print(n = 3)
## # A tibble: 150 × 3
##   Sepal.Length Sepal.Width Species
##          <dbl>       <dbl> <fct>
## 1          5.1         3.5 setosa
## 2          4.9         3   setosa
## 3          4.7         3.2 setosa
## # . . . with 147 more rows
iris_df |>
    select(ends_with("Length")) |>
    print(n = 3)
## # A tibble: 150 × 2
##   Sepal.Length Petal.Length
##          <dbl>        <dbl>
## 1          5.1          1.4
## 2          4.9          1.4
## 3          4.7          1.3
## # . . . with 147 more rows
iris_df |>
    select(contains("ng")) |>
    print(n = 3)
## # A tibble: 150 × 2
##   Sepal.Length Petal.Length
##          <dbl>        <dbl>
## 1          5.1          1.4
## 2          4.9          1.4
## 3          4.7          1.3
## # . . . with 147 more rows
You can also use select() to rename columns. If you use a named parameter, the columns you select will get the names you use for the function parameters.
iris_df |>
    select(sepal_length = Sepal.Length,
           sepal_width = Sepal.Width) |>
    print(n = 3)
## # A tibble: 150 × 2
##   sepal_length sepal_width
##          <dbl>       <dbl>
## 1          5.1         3.5
## 2          4.9         3
## 3          4.7         3.2
## # . . . with 147 more rows
Since select() removes the columns you do not select, it might not always be the best choice for what you need to do. A similar function that does not lose columns is rename(). With this function, you can rename columns and keep all the remaining columns as well.
iris_df |>
    rename(sepal_length = Sepal.Length,
           sepal_width = Sepal.Width) |>
    print(n = 3)
## # A tibble: 150 × 5
##   sepal_length sepal_width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## # . . . with 147 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>

Filter

While select() extracts a subset of columns, the filter() function does the same for rows.

The iris_df contains three different species. We can see this using the distinct() function. This function is also from the dplyr package, and it gives you all the unique rows from the selected columns.
iris_df |> distinct(Species)
## # A tibble: 3 × 1
##   Species
##   <fct>
## 1 setosa
## 2 versicolor
## 3 virginica
We can select the rows where the species is “setosa” using filter():
iris_df |>
    filter(Species == "setosa") |>
    print(n = 3)
## # A tibble: 50 × 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## # . . . with 47 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
We can combine filter() and select() in a pipeline to get a subset of the columns as well as a subset of the rows.
iris_df |>
    filter(Species == "setosa") |>
    select(ends_with("Length"), Species) |>
    print(n = 3)
## # A tibble: 50 × 3
##   Sepal.Length Petal.Length Species
##          <dbl>        <dbl> <fct>
## 1          5.1          1.4 setosa
## 2          4.9          1.4 setosa
## 3          4.7          1.3 setosa
## # . . . with 47 more rows
Generally, we can string together as many dplyr (or purrr) functions as we desire in pipelines.
iris_df |>
    filter(Species != "setosa") |>
    distinct(Species) |>
    print(n = 3)
## # A tibble: 2 × 1
##   Species
##   <fct>
## 1 versicolor
## 2 virginica
We can use more than one column to filter by:
iris_df |>
    filter(Sepal.Length > 5, Petal.Width < 0.4) |>
    print(n = 3)
## # A tibble: 15 × 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          5.4         3.7          1.5
## 3          5.8         4            1.2
## # . . . with 12 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
We can also use functions as predicates, for example, the between() function to select numbers in a given range.
iris_df |>
    filter(between(Sepal.Width, 2, 2.5)) |>
    print(n = 3)
## # A tibble: 19 × 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          4.5         2.3          1.3
## 2          5.5         2.3          4
## 3          4.9         2.4          3.3
## # . . . with 16 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
We cannot use the functions starts_with(), ends_with(), etc., that we use with select(). This does not mean, however, that we cannot filter rows using string patterns; we just need different functions. For example, we can use str_starts() from the stringr package. We return to stringr in Chapter 9.
iris_df |>
    filter(str_starts(Species, "v")) |>
    print(n = 3)
## # A tibble: 100 × 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          7           3.2          4.7
## 2          6.4         3.2          4.5
## 3          6.9         3.1          4.9
## # . . . with 97 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
iris_df |>
    filter(str_ends(Species, "r")) |>
    print(n = 3)
## # A tibble: 50 × 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          7           3.2          4.7
## 2          6.4         3.2          4.5
## 3          6.9         3.1          4.9
## # . . . with 47 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>

While filter() selects rows by applying predicates on individual columns, other filter variants will use predicates over more than one column.

The filter_all() function will apply the predicate over all columns. You must provide an expression where you use . (dot) for the table cell value and wrap the expression in one of two functions: any_vars() or all_vars(). If you use any_vars(), it suffices that one column satisfies the predicate for the row to be included; if you use all_vars(), then all columns must satisfy the predicate.

We can require that at least one value in each row of the iris_df data is larger than five:
iris_df |>
    select(-Species) |>
    filter_all(any_vars(. > 5)) |>
    print(n = 3)
## # A tibble: 118 × 4
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          5.4         3.9          1.7
## 3          5.4         3.7          1.5
## # . . . with 115 more rows, and 1 more variable:
## # Petal.Width <dbl>
The expression . > 5 is not meaningful if . is a string, so I had to remove the Species column before I filtered. We rarely want to lose an informative variable, though. If you want to keep a column but not apply the predicate to its values, you can use the filter_at() function.
iris_df |>
    filter_at(vars(-Species), any_vars(. > 5)) |>
    print(n = 3)
## # A tibble: 118 × 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          5.4         3.9          1.7
## 3          5.4         3.7          1.5
## # . . . with 115 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
You can give filter_at() a vector of strings or a vector of variables (unquoted column names). Because I need to negate Species to select all other columns, I used vars() here. In other cases, you can use either option.
iris_df |>
    filter_at(c("Petal.Length", "Sepal.Length"),
              any_vars(. > 0)) |>
    print(n = 3)
## # A tibble: 150 × 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## # . . . with 147 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
iris_df |>
    filter_at(vars(Petal.Length, Sepal.Length),
              any_vars(. > 0)) |>
    print(n = 3)
## # A tibble: 150 × 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## # . . . with 147 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
You can use filter_if() to pick the columns to test using a predicate on the columns themselves:
iris_df |>
    filter_if(is.numeric, all_vars(. < 5)) |>
    print(n = 3)
## # A tibble: 22 × 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          4.9          3           1.4
## 2          4.7          3.2         1.3
## 3          4.6          3.1         1.5
## # . . . with 19 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>

You can use functions or lambda expressions as the predicate.

Consider a case where we have numbers with missing data in a table. If we want to keep only the rows where all values are greater than three, we can use filter_all():
df <- tribble(
   ~A, ~B, ~C,
    1,  2,  3,
    4,  5, NA,
   11, 12, 13,
   22, 22,  1
)
df |> filter_all(all_vars(. > 3))
## # A tibble: 1 × 3
##       A     B     C
##   <dbl> <dbl> <dbl>
## 1    11    12    13
It removes the first and last rows because they contain values smaller than or equal to three. It also deletes the second row because NA is not considered greater than three. We can restrict the tests to the columns that do not contain missing values:
df |> filter_if(~ all(!is.na(.)), all_vars(. > 3))
## # A tibble: 3 × 3
##       A     B     C
##   <dbl> <dbl> <dbl>
## 1     4     5    NA
## 2    11    12    13
## 3    22    22     1

This removes column C from the tests. It means that we also keep the last row even though C has a value smaller than three.

If you want to keep the two middle rows but not the first or last, you can write a more complex predicate and use filter_all().
df |> filter_all(all_vars(is.na(.) | . > 3))
## # A tibble: 2 × 3
##       A     B     C
##   <dbl> <dbl> <dbl>
## 1     4     5    NA
## 2    11    12    13

You need to use the vectorized or-operator | here. The expression must be a vector expression, so you cannot use ||, which only works on single logical values.
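A quick base R illustration of the difference (a sketch; note that recent R versions raise an error when || is given multi-element operands):

```r
x <- c(NA, 2, 10)

# `|` is vectorized: it combines two logical vectors elementwise,
# which is what filter() needs when testing whole columns.
is.na(x) | x > 3
## [1]  TRUE FALSE  TRUE

# `||` only accepts single logical values, so it is fine for
# scalars but cannot be used on whole columns:
TRUE || FALSE
## [1] TRUE
```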

Many dplyr functions have _all, _at, and _if variants, but I will only describe the frequently used functions here. For the others, I refer to the package documentation.
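As a flavor of these variants, mutate_at() follows the same pattern as filter_at(): you pick columns with vars() or a character vector and supply an expression with . for the column value. A sketch, converting the sepal measurements from centimeters to millimeters:

```r
library(dplyr)
iris_df <- as_tibble(iris)  # as defined earlier in the chapter

# Multiply every column whose name starts with "Sepal" by ten:
iris_df |>
    mutate_at(vars(starts_with("Sepal")), ~ . * 10) |>
    print(n = 3)
```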

Sorting

If you want to sort rows by values in selected columns, the function you want is called arrange() . You can sort by one or more columns:
iris_df |>
    arrange(Petal.Length) |>
    print(n = 5)
## # A tibble: 150 × 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          4.6         3.6          1
## 2          4.3         3            1.1
## 3          5.8         4            1.2
## 4          5           3.2          1.2
## 5          4.7         3.2          1.3
## # . . . with 145 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
iris_df |>
    arrange(Sepal.Length, Petal.Length) |>
    print(n = 5)
## # A tibble: 150 × 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          4.3         3            1.1
## 2          4.4         3            1.3
## 3          4.4         3.2          1.3
## 4          4.4         2.9          1.4
## 5          4.5         2.3          1.3
## # . . . with 145 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
If you want to sort in descending order, you can use the function desc().
iris_df |>
    arrange(desc(Petal.Length)) |>
    print(n = 5)
## # A tibble: 150 × 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          7.7         2.6          6.9
## 2          7.7         3.8          6.7
## 3          7.7         2.8          6.7
## 4          7.6         3            6.6
## 5          7.9         3.8          6.4
## # . . . with 145 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>

Modifying Data Frames

When we work with data frames, we usually want to compute values based on the variables (columns) in our tables. Filtering rows and columns will only get us so far.

The mutate() function lets us add columns to a data frame based on expressions that can involve any of the existing columns. Consider a table of widths and heights.
df <- tribble(
    ~height, ~width,
         10,     12,
         42,     24,
         14,     12
)
We can add an area column using mutate() and this expression:
df |> mutate(area = height * width)
## # A tibble: 3 × 3
##   height  width  area
##    <dbl>  <dbl> <dbl>
## 1     10     12   120
## 2     42     24  1008
## 3     14     12   168
If you add columns to a data frame, you can refer to variables you have already added. For example, if your height and width data are in inches and you want them in centimeters, and you also want the area in centimeters squared, you can do this:
cm_per_inch <- 2.54
df |> mutate(
    height_cm = cm_per_inch * height,
    width_cm = cm_per_inch * width,
    area_cm = height_cm * width_cm
)
## # A tibble: 3 × 5
##   height width height_cm width_cm area_cm
##    <dbl> <dbl>      <dbl>    <dbl>  <dbl>
## 1     10    12       25.4      30.5  774.
## 2     42    24      107.       61.0 6503.
## 3     14    12       35.6      30.5 1084.
The following expression, however, will not work since you cannot refer to variables you are adding later in the mutate() call:
df |> mutate(
    area_cm = height_cm * width_cm,
    height_cm = cm_per_inch * height,
    width_cm = cm_per_inch * width
)
## Error in `mutate()`:
## ! Problem while computing `area_cm = height_cm * width_cm`.
## Caused by error:
## ! object 'height_cm' not found
If you do not give the new variable a name, that is, you call mutate() without named parameters, then the expression itself will become the variable name:
df |> mutate(cm_per_inch * height)
## # A tibble: 3 × 3
##   height width `cm_per_inch * height`
##    <dbl> <dbl>                  <dbl>
## 1     10    12                   25.4
## 2     42    24                  107.
## 3     14    12                   35.6
In this example, the units (inches and centimeters) are not encoded with the data, so in future computations you might end up mixing the wrong units. If you use the units package, which is not part of the tidyverse, you can make the units explicit, automatically convert between units where that makes sense, and get unit type-safety as a bonus.
library(units)
df |> mutate(
    height_in = units::as_units(height, "in"),
    width_in = units::as_units(width, "in"),
    area_in = height_in * width_in,
    height_cm = units::set_units(height_in, "cm"),
    width_cm = units::set_units(width_in, "cm"),
    area_cm = units::set_units(area_in, "cm^2")
)
## # A tibble: 3 × 8
##   height width height_in width_in area_in
##    <dbl> <dbl>       [in]     [in] [in^2]
## 1     10    12        10        12    120
## 2     42    24        42        24   1008
## 3     14    12        14        12    168
## # . . . with 3 more variables: height_cm [cm],
## # width_cm [cm], area_cm [cm^2]

This is possible because tibbles can contain columns of various classes, which the units package exploits. Tibbles use an interface that lets you define how your own classes should be displayed, but this is beyond the scope of this book.

Earlier, we created unit-carrying values for the original data table, but the original height and width columns are still around. We could use select() to remove them, but the function transmute() already does this for us.
df |> transmute(
    height_in = units::as_units(height, "in"),
    width_in = units::as_units(width, "in"),
    area_in = height_in * width_in,
    height_cm = units::set_units(height_in, "cm"),
    width_cm = units::set_units(width_in, "cm"),
    area_cm = units::set_units(area_in, "cm^2")
)
## # A tibble: 3 × 6
##   height_in width_in area_in height_cm width_cm
##         [in]     [in]  [in^2]      [cm]    [cm]
## 1        10       12     120       25.4    30.5
## 2        42       24    1008      107.     61.0
## 3        14       12     168       35.6    30.5
## # . . . with 1 more variable: area_cm [cm^2]

When you use transmute() instead of mutate(), the variables that you are not assigning to will be removed.

As with filtering, the expressions you use when mutating a data frame must be vector expressions. Consider this example:
df <- tibble(
    x = rnorm(3, mean = 12, sd = 5),
)
df |> mutate(~ if (.x < 0) -.x else .x)
## Error in `mutate()`:
## ! Problem while computing `..1 = ~if (.x < 0) -.x else .x`.
## x `..1` must be a vector, not a `formula` object.

It fails because mutate() expects vector expressions and does not accept a formula (a lambda expression). Even wrapped in an ordinary function, this computation would not be a vector expression: the if expression only tests a single value, not a vector.

If we use the built-in abs() function, then there is no problem with the expression; it handles vectors.
df |> mutate(abs(x))
## # A tibble: 3 × 2
##       x `abs(x)`
##   <dbl>    <dbl>
## 1  11.5     11.5
## 2  20.9     20.9
## 3  17.2     17.2
If you use the ifelse() vector version of if expressions, you get a vector expression.
df |> mutate(ifelse(x < 0, -x, x))
## # A tibble: 3 × 2
##       x `ifelse(x < 0, -x, x)`
##   <dbl>                 <dbl>
## 1  11.5                  11.5
## 2  20.9                  20.9
## 3  17.2                  17.2
Or you can use the Vectorize() function to make a vector expression out of a function that does not handle vectors.
my_abs <- Vectorize(\(x) if (x < 0) -x else x)
df |> mutate(my_abs(x))
## # A tibble: 3 × 2
##       x `my_abs(x)`
##   <dbl>      <dbl>
## 1  11.5       11.5
## 2  20.9       20.9
## 3  17.2       17.2

If you need to map the input to several different output values, you can nest ifelse() expressions arbitrarily deep, but it gets difficult to read. A function that alleviates this problem substantially is case_when(). You can give it a sequence of predicates with matching expressions, and it will return the expression for the first matching predicate.

Consider this example where we use case_when() to categorize the variable x based on its value:
df <- tibble(x = rnorm(100))
df |>
    mutate(
       x_category = case_when(
           x - mean(x) < -2 * sd(x) ~ "small",
           x - mean(x) >  2 * sd(x) ~ "large",
           TRUE                     ~ "medium"
       )
    ) |>
    print(n = 3)
## # A tibble: 100 × 2
##        x x_category
##     <dbl> <chr>
## 1  -0.736 medium
## 2  -0.478 medium
## 3   0.796 medium
## # . . . with 97 more rows

The TRUE line corresponds to the else part of an if-else statement.
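For comparison, the same classification written with nested ifelse() calls looks like this; with three categories it is still manageable, but every extra category adds another level of nesting:

```r
x <- rnorm(100)

# Nested ifelse(): the final argument plays the role of the
# TRUE ~ "medium" line in case_when().
x_category <- ifelse(x - mean(x) < -2 * sd(x), "small",
              ifelse(x - mean(x) >  2 * sd(x), "large",
                                               "medium"))
# Every observation falls into exactly one of the three categories.
```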

Grouping and Summarizing

In many analyses, we need summary statistics of our data. If you can map one or more of your columns into a single summary, you can use the summarise() function.
df <- tibble(x = rnorm(100), y = rnorm(100))
df |> summarise(mean_x = mean(x), mean_y = mean(y))
## # A tibble: 1 × 2
##   mean_x mean_y
##    <dbl>  <dbl>
## 1 -0.154 0.0224

In this example, we summarized our data as the mean of variables x and y.

If you split your data into different classes and want to work on the data per group, you can use the function group_by().
classify <- function(x) {
    case_when(
        x - mean(x) < -2 * sd(x) ~ "small",
        x - mean(x) >  2 * sd(x) ~ "large",
        TRUE                     ~ "medium"
    )
}
df |> mutate(x_category = classify(x)) |>
    group_by(x_category) |>
    print(n = 3)
## # A tibble: 100 × 3
## # Groups:    x_category [3]
##        x       y x_category
##    <dbl>  <dbl>  <chr>
## 1  -2.08 -1.52  medium
## 2  0.580 -0.990 medium
## 3 -0.995 -0.443 medium
## # . . . with 97 more rows
The result is a data frame that, when we print it, doesn’t look different from before we grouped the data. In the header, however, you will notice the line:
# Groups:   x_category [3]
This tells you that something is different when you have grouped your data. This is apparent when you combine grouping with summarizing.
df |> mutate(x_category = classify(x)) |>
      group_by(x_category) |>
      summarise(mean_x = mean(x), no_x = n())
## # A tibble: 3 × 3
##   x_category mean_x  no_x
##   <chr>       <dbl> <int>
## 1 large       2.04      2
## 2 medium     -0.155    96
## 3 small      -2.30      2

When you group your data and then summarize, you get per-group summary statistics. Here, we calculate the mean and the number of observations in each category (the function n() gives you the number of observations in a group).

You can get the variables your data is grouped by using the group_vars() function.
df |> mutate(x_category = classify(x)) |>
      group_by(x_category) |>
      group_vars()
## [1] "x_category"
You can group data by more than one variable.
df <- tibble(x = rnorm(100), y = rnorm(100)) |>
      mutate(
        x_category = classify(x),
        y_category = classify(y)
      )
df |> group_by(x_category, y_category) |>
      group_vars()
## [1] "x_category" "y_category"
If you do, you will get summaries for each combination of the grouping variables.
df |> group_by(x_category, y_category) |>
      summarise(mean_x = mean(x), mean_y = mean(y))
## `summarise()` has grouped output by 'x_category'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 4
## # Groups:   x_category [3]
##   x_category y_category  mean_x  mean_y
##   <chr>      <chr>        <dbl>   <dbl>
## 1 large      medium      2.08    1.33
## 2 medium     large       0.808   2.69
## 3 medium     medium      0.0755  0.141
## 4 medium     small       1.05   -2.38
## 5 small      medium     -2.55   -1.12

The message that summarise() has "grouped output by 'x_category'" means just that: the table we get out of the call to summarise() is still grouped by x_category, as you can also see in the printed output of the result (the Groups: x_category [3] line in the header).

When you group by multiple variables, it works in effect as if you added the groups one by one, and you can also do this via multiple calls to group_by(). In that case, however, you need the argument .add = TRUE to add a group rather than replace the existing grouping, which is what group_by() does without this argument.
# Adding two categories at the same time
df |> group_by(x_category, y_category) |> group_vars()
## [1] "x_category" "y_category"
# Adding categories in two steps
df |> group_by(x_category) |> group_by(y_category, .add = TRUE) |> group_vars()
## [1] "x_category" "y_category"
# Replacing one grouping with another
df |> group_by(x_category) |> group_by(y_category) |> group_vars()
## [1] "y_category"

When you summarize, you remove the last grouping you created; the summary only has one row per value in that group anyway, so it is no longer useful. The other groups remain. This was the default behavior in early versions of dplyr, and it still is, but it is a common source of errors. Some data scientists expect this behavior, and others expect all groups to be removed. If your expectations are not met, there is a good chance that the following analysis will be wrong: if you think you are working on the data as a whole, but all operations are grouped by one or more variables, you are unlikely to get the results that you want.

Because of this, summarise() now takes the argument .groups where you can specify the behavior you want. It is not yet required, since requiring it would break backward compatibility with a lot of existing code, but until you use it, you get a message such as the one shown earlier.
# Drop the last group we made
df |> group_by(x_category, y_category) |>
      summarise(mean_x = mean(x), mean_y = mean(y), .groups = "drop_last") |>
      group_vars()
## [1] "x_category"
# Drop all groups
df |> group_by(x_category, y_category) |>
      summarise(mean_x = mean(x), mean_y = mean(y), .groups = "drop") |>
      group_vars()
## character(0)
# Keep all groups
df |> group_by(x_category, y_category) |>
      summarise(mean_x = mean(x), mean_y = mean(y), .groups = "keep") |>
      group_vars()
## [1] "x_category" "y_category"
If you have grouped your data, or you get grouped data from one summary as before, you can remove the groups again using ungroup(). Once you have done this, you can compute global summaries again.
df |> group_by(x_category, y_category) |>
      ungroup() |> # Remove the grouping again
      group_vars()
## character(0)
Grouping is not only useful when collecting summaries of your data. You can also add columns to your data frame based on per-group computations using mutate():
df |> group_by(x_category) |>
      mutate(mean_x = mean(x), mean_y = mean(y)) |>
      ungroup() |>
      print(n = 5)
## # A tibble: 100 × 6
##        x         y x_category y_category mean_x
##    <dbl>     <dbl> <chr>      <chr>       <dbl>
## 1 -0.416   -0.374  medium     medium      0.123
## 2 -0.845   -1.21   medium     medium      0.123
## 3 -0.0176   0.103  medium     medium      0.123
## 4 -1.18    -0.0132 medium     medium      0.123
## 5  0.508    0.504  medium     medium      0.123
## # . . . with 95 more rows, and 1 more variable:
## #   mean_y <dbl>

When you calculate a summary inside mutate(), you create columns where a summary statistic takes its values based on the groupings, but they are combined with the existing data rather than extracted into a new data frame as summarise() would do.

A mutate() without a group_by() will give you summaries for the entire data.
df |> mutate(mean_x = mean(x), mean_y = mean(y)) |> # Calculate mean x and y for entire data
      distinct(mean_x, mean_y)                      # Get unique values for printing
## # A tibble: 1 × 2
##   mean_x mean_y
##    <dbl>  <dbl>
## 1 0.0895 0.0520

I have used distinct() to get the unique combinations of mean_x and mean_y values, as this illustrates the result well. Here, we only have one row of values because we have computed the mean values for the entire data set and not for different groupings of the data.

In contrast, if you group before adding variables to the data, then the summaries are per group.
df |> group_by(x_category) |>                       # Group by x categories only
      mutate(mean_x = mean(x), mean_y = mean(y)) |> # Calculate mean x and y for each x group
      distinct(mean_x, mean_y)                      # Get unique values for printing
## # A tibble: 3 × 3
## # Groups:   x_category [3]
##   x_category mean_x  mean_y
##   <chr>       <dbl>   <dbl>
## 1 medium      0.123  0.0631
## 2 large       2.08   1.33
## 3 small      -2.55  -1.12

Here, there are values for different x_category factors, showing us that we have computed means for each different x_category in the data.

You can combine data-wide summaries with grouped summaries if you calculate the global summaries before you group.
df |> mutate(mean_y = mean(y)) |> # Get the mean for y over all data points
      group_by(x_category)     |> # Then group so the following works per group
      mutate(mean_x = mean(x)) |> # Get the mean x for each group (not globally)
      distinct(mean_x, mean_y)    # Get the unique values so we can print the result
## # A tibble: 3 × 3
## # Groups:    x_category [3]
##   x_category mean_y  mean_x
##   <chr>       <dbl>   <dbl>
## 1 medium     0.0520   0.123
## 2 large      0.0520   2.08
## 3 small      0.0520  -2.55

Notice that the mean_y values are the same in the output rows; this is because we have computed the mean globally and not for each group.

If you need to compute summaries with different groups, for example, the mean of x for each x_category as well as the mean of y for each y_category, then you can call group_by(x_category) for summarizing x, followed by group_by(y_category) for changing the grouping so you can summarize y.
df |> group_by(x_category) |>
      mutate(mean_x = mean(x)) |>
      group_by(y_category) |>
      mutate(mean_y = mean(y)) |>
      distinct(
        x_category, mean_x,
        y_category, mean_y
      )
## # A tibble: 5 × 4
## # Groups: y_category [3]
##   x_category y_category mean_x mean_y
##   <chr>    <chr>     <dbl>      <dbl>
## 1 medium   medium    0.123      0.127
## 2 medium   small     0.123     -2.38
## 3 large    medium    2.08       0.127
## 4 medium   large     0.123      2.69
## 5 small    medium   -2.55       0.127
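Recent dplyr releases (1.1.0 and later) also support per-operation grouping through the .by argument, which sidesteps the group_by()/ungroup() dance entirely. This is a sketch, assuming a sufficiently new dplyr; the small dd tibble stands in for the chapter's df:

```r
library(dplyr)

# A small example data frame standing in for the chapter's df
dd <- tibble(
    x = c(1, 2, 10, 20),
    x_category = c("small", "small", "large", "large")
)

# .by groups this one mutate() call only; the result comes back ungrouped
res <- dd |> mutate(mean_x = mean(x), .by = x_category)
group_vars(res)
## character(0)
```

Because the grouping applies to a single verb, there is no leftover grouping for later pipeline steps to trip over.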

Joining Tables

It is not uncommon to have your data in more than one table. This could be because the tables are created from different calculations, for example, different kinds of summaries, or it can be because you got the data from different files.

If you merely need to combine tables by row or column, then you can use the bind_rows() and bind_cols() functions, which do precisely what you would expect.
df1 <- tibble(
    A = paste0("a", 1:2),
    B = paste0("b", 1:2)
)
df2 <- tibble(
    A = paste0("a", 3:4),
    B = paste0("b", 3:4)
)
df3 <- tibble(
    C = paste0("c", 1:2),
    D = paste0("d", 1:2)
)
bind_rows(df1, df2)
## # A tibble: 4 × 2
##   A     B
##   <chr> <chr>
## 1 a1    b1
## 2 a2    b2
## 3 a3    b3
## 4 a4    b4
bind_cols(df1, df3)
## # A tibble: 2 × 4
##   A     B     C     D
##   <chr> <chr> <chr> <chr>
## 1 a1    b1    c1    d1
## 2 a2    b2    c2    d2

When you combine rows, the tables must have the same columns, and when you combine by column, the tables must have the same number of rows.
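Strictly speaking, bind_rows() is a little more forgiving than that: if a column is missing from one of the tables, it is filled with NA rather than raising an error. A small sketch:

```r
library(dplyr)

df_ab <- tibble(A = "a1", B = "b1")
df_a  <- tibble(A = "a2")   # no B column

combined <- bind_rows(df_ab, df_a)
combined
## # A tibble: 2 × 2
##   A     B
##   <chr> <chr>
## 1 a1    b1
## 2 a2    <NA>
```

bind_cols(), in contrast, really does require matching row counts.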

If you have tables that represent different relations between variables—the underlying principle of relational databases aimed at avoiding duplicated data—then you can combine them using join functions.

Say you have a table that maps students to grades for each class you teach. You can join tables from two different classes using inner_join(). You join them on a key, specified by the by argument.
grades_maths <- tribble(
    ~name, ~grade,
    "Marko Polo", "D",
    "Isaac Newton", "A+",
    "Charles Darwin", "B"
)
grades_biology <- tribble(
    ~name, ~grade,
    "Marko Polo", "F",
    "Isaac Newton", "D",
    "Charles Darwin", "A+"
)
inner_join(grades_maths, grades_biology, by = "name")
## # A tibble: 3 × 3
##   name           grade.x grade.y
##   <chr>          <chr>   <chr>
## 1 Marko Polo     D       F
## 2 Isaac Newton   A+      D
## 3 Charles Darwin B       A+

This tells inner_join() that you want to combine all rows in the first table with all rows in the second, where the two rows have the same name. You can use more than one key in a join if you give by a vector of variable names.
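As a sketch of a multi-key join (the per-year tables here are made up for illustration), imagine grades recorded per student per year; joining on both name and year pairs rows only when both keys match:

```r
library(dplyr)

# Hypothetical per-year grade tables, invented for this example
maths_by_year <- tribble(
    ~name, ~year, ~maths,
    "Isaac Newton", 1664, "A",
    "Isaac Newton", 1665, "A+"
)
physics_by_year <- tribble(
    ~name, ~year, ~physics,
    "Isaac Newton", 1665, "A+",
    "Isaac Newton", 1666, "A+"
)

# Rows pair up only when both name and year agree
joined <- inner_join(maths_by_year, physics_by_year, by = c("name", "year"))
joined
## # A tibble: 1 × 4
##   name          year maths physics
##   <chr>        <dbl> <chr> <chr>
## 1 Isaac Newton  1665 A+    A+
```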

Above, each name appears once per table. If a key appears more than once, then the result of an inner join will have a row for every combination of matching rows, that is, all pairs of rows that share a name.
grades_maths2 <- tribble(
    ~name, ~grade,
    "Marko Polo", "D",
    "Isaac Newton", "A+", # so good at physics
    "Isaac Newton", "A+", # that he got an A+ twice
    "Charles Darwin", "B"
)
grades_biology2 <- tribble(
    ~name, ~grade,
    "Marko Polo", "F",
    "Isaac Newton", "D",
    "Charles Darwin", "A+", # so good at biology that we
    "Charles Darwin", "A+" # listed him twice
)
inner_join(grades_maths2, grades_biology2, by = "name")
## # A tibble: 5 × 3
##   name            grade.x grade.y
##   <chr>           <chr>   <chr>
## 1 Marko Polo      D       F
## 2 Isaac Newton    A+      D
## 3 Isaac Newton    A+      D
## 4 Charles Darwin  B       A+
## 5 Charles Darwin  B       A+
inner_join(grades_maths2, grades_biology2, by = "grade")
## # A tibble: 5 × 3
##   name.x          grade name.y
##   <chr>           <chr> <chr>
## 1 Marko Polo      D     Isaac Newton
## 2 Isaac Newton    A+    Charles Darwin
## 3 Isaac Newton    A+    Charles Darwin
## 4 Isaac Newton    A+    Charles Darwin
## 5 Isaac Newton    A+    Charles Darwin
In the last join, you see that you can get the same line multiple times from an inner join. Combine the join with distinct() if you want to avoid this.
inner_join(grades_maths2, grades_biology2, by = "grade") |> distinct()
## # A tibble: 2 × 3
##   name.x          grade name.y
##   <chr>           <chr> <chr>
## 1 Marko Polo      D     Isaac Newton
## 2 Isaac Newton    A+    Charles Darwin

The tables can have different variables—if they represent different relations—or they can share several or all variables. In the grade example, each table contains the same name/grade relationships, just for different classes. Therefore, they also share variables.

With shared variables, the data from the two tables can be differentiated by a suffix. The default, as seen above, is ".x" and ".y". You can change that using the suffix argument.
inner_join(
    grades_maths, grades_biology,
    by = "name", suffix = c(".maths", ".biology")
)
## # A tibble: 3 × 3
##   name            grade.maths grade.biology
##   <chr>           <chr>       <chr>
## 1 Marko Polo      D           F
## 2 Isaac Newton    A+          D
## 3 Charles Darwin  B           A+

Students might take different classes, and you have several choices on how to combine tables with different keys.

An inner_join() will only give you the rows where a key is in both tables.
grades_geography <- tribble(
    ~name, ~grade,
    "Marko Polo", "A",
    "Charles Darwin", "A",
    "Immanuel Kant", "A+"
)
grades_physics <- tribble(
    ~name, ~grade,
    "Isaac Newton", "A+",
    "Albert Einstein", "A+",
    "Charles Darwin", "C"
)
inner_join(
    grades_geography, grades_physics,
    by = "name", suffix = c(".geography", ".physics")
)
## # A tibble: 1 × 3
##   name           grade.geography grade.physics
##   <chr>          <chr>           <chr>
## 1 Charles Darwin A               C
The full_join() function, in contrast, gives you a table containing all keys. If a key is only found in one of the tables, the variables from the other table will be set to NA.
full_join(
    grades_geography, grades_physics,
    by = "name", suffix = c(".geography", ".physics")
)
## # A tibble: 5 × 3
##   name            grade.geography grade.physics
##   <chr>           <chr>           <chr>
## 1 Marko Polo      A               <NA>
## 2 Charles Darwin  A               C
## 3 Immanuel Kant   A+              <NA>
## 4 Isaac Newton    <NA>            A+
## 5 Albert Einstein <NA>            A+
If you want all keys from the left or right table (but not both left and right)—potentially with NA if the other table does not have the key—then you need left_join() or right_join() .
left_join(
    grades_geography, grades_physics,
    by = "name", suffix = c(".geography", ".physics")
)
## # A tibble: 3 × 3
##   name            grade.geography grade.physics
##   <chr>           <chr>           <chr>
## 1 Marko Polo      A               <NA>
## 2 Charles Darwin  A               C
## 3 Immanuel Kant   A+              <NA>
right_join(
    grades_maths, grades_physics,
    by = "name", suffix = c(".maths", ".physics")
)
## # A tibble: 3 × 3
##   name            grade.maths grade.physics
##   <chr>           <chr>       <chr>
## 1 Isaac Newton    A+          A+
## 2 Charles Darwin  B           C
## 3 Albert Einstein <NA>        A+
A semi_join() will give you all the rows in the first table whose key appears in the second table. Since the result only contains columns from the first table, there is no suffix argument here.
semi_join(
    grades_maths2, grades_biology2,
    by = "name"
)
## # A tibble: 4 × 2
##   name            grade
##   <chr>           <chr>
## 1 Marko Polo      D
## 2 Isaac Newton    A+
## 3 Isaac Newton    A+
## 4 Charles Darwin  B

You only get one copy per row in the first table when a key matches, regardless of how many times the key appears in the second table.

This is unlike inner_join() where you get all combinations of rows from the two tables. The semi_join() is, therefore, not equivalent to an inner_join() combined with a select().
inner_join(
    grades_maths2, grades_biology2,
    by = "name", suffix = c(".maths", ".biology")
) |> select(1:2)
## # A tibble: 5 × 2
##   name            grade.maths
##   <chr>           <chr>
## 1 Marko Polo      D
## 2 Isaac Newton    A+
## 3 Isaac Newton    A+
## 4 Charles Darwin  B
## 5 Charles Darwin  B

You can still get multiple identical rows from a semi_join(). If the first table has duplicated rows, you will get the same duplication of the rows. If you do not want that, you can combine the join with distinct().

An anti_join() gives you all the rows in the first table where the key or keys are not found in the second table. Think of it as the complement of semi_join().
anti_join(
    grades_maths2, grades_physics,
    by = "name"
)
## # A tibble: 1 × 2
##   name            grade
##   <chr>           <chr>
## 1 Marko Polo      D
The join functions only take two tables as input, so you might wonder how you can combine multiple tables. One solution is to use purrr's reduce() function:
grades <- list(
    grades_maths, grades_biology,
    grades_geography, grades_physics
)
grades |>
    reduce(full_join, by = "name") |>
    rename_at(2:5, ~ c("maths", "biology", "geography", "physics"))
## # A tibble: 5 × 5
##   name            maths biology geography physics
##   <chr>           <chr> <chr>   <chr>     <chr>
## 1 Marko Polo      D     F       A         <NA>
## 2 Isaac Newton    A+    D       <NA>      A+
## 3 Charles Darwin  B     A+      A         C
## 4 Immanuel Kant   <NA>  <NA>    A+        <NA>
## 5 Albert Einstein <NA>  <NA>    <NA>      A+

The rename_at() function works similarly to select_at(), and here, I use it to rename the last four columns.
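In newer dplyr versions, rename_at() still works but is superseded by rename_with(). A sketch of the equivalent renaming step, here abbreviated to two grade tables, assuming a recent dplyr:

```r
library(dplyr)
library(purrr)

# Two small grade tables, abbreviated from the chapter's examples
grades_maths   <- tribble(~name, ~grade, "Isaac Newton", "A+")
grades_biology <- tribble(~name, ~grade, "Isaac Newton", "D")

# rename_with() applies a function to the names of the selected columns;
# returning a fixed character vector renames them outright
renamed <- list(grades_maths, grades_biology) |>
    reduce(full_join, by = "name") |>
    rename_with(~ c("maths", "biology"), .cols = 2:3)
renamed
## # A tibble: 1 × 3
##   name         maths biology
##   <chr>        <chr> <chr>
## 1 Isaac Newton A+    D
```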

Income in Fictional Countries

We had an example with income in fictional countries in Chapter 5. There, we did not have the proper tools needed to deal with missing data. We wanted to replace NA with the mean income for a country in rows with missing data, but the functions from tidyr didn’t suffice. With dplyr we have the tools.

Recall the data:
mean_income <- tribble(
    ~country, ~`2002`, ~`2003`, ~`2004`, ~`2005`,
    "Numenor",  123456,  132654,     NA,  324156,
    "Westeros", 314256,  NA,         NA,  465321,
    "Narnia",   432156,  NA,         NA,      NA,
    "Gondor",   531426,  321465, 235461,  463521,
    "Laputa",    14235,   34125,  45123,   51234,
)
The following pipeline does what we want. I will explain the steps below, but you can first try to work them out yourself:
mean_income                                                     |>
    pivot_longer(
        names_to = "year",
        values_to = "mean_income",
        !country
    )                                                           |>
    group_by(
        country
    )                                                           |>
    mutate(
        mean_per_country = mean(mean_income, na.rm = TRUE),
        mean_income = ifelse(
            is.na(mean_income),
            mean_per_country,
            mean_income
        )
    )                                                           |>
    pivot_wider(
        names_from = "year",
        values_from = "mean_income"
    )
## # A tibble: 5 × 6
## # Groups: country [5]
##   country mean_per_country `2002`  `2003`  `2004`
##   <chr>              <dbl>  <dbl>   <dbl>   <dbl>
## 1 Numenor          193422  123456 132654  193422
## 2 Westeros         389788. 314256 389788. 389788.
## 3 Narnia           432156  432156 432156  432156
## 4 Gondor           387968. 531426 321465  235461
## 5 Laputa            36179.  14235  34125   45123
## # . . . with 1 more variable: `2005` <dbl>
The first step in the pipeline is reformatting the data. We saw how this works in Chapter 5.
mean_income |>
    pivot_longer(
        names_to = "year",
        values_to = "mean_income",
        !country
    )
## # A tibble: 20 × 3
##   country   year   mean_income
##   <chr>     <chr>        <dbl>
## 1 Numenor   2002        123456
## 2 Numenor   2003        132654
## 3 Numenor   2004            NA
## 4 Numenor   2005        324156
## 5 Westeros  2002        314256
## 6 Westeros  2003            NA
## 7 Westeros  2004            NA
## 8 Westeros  2005        465321
## 9 Narnia    2002        432156
## 10 Narnia   2003            NA
## 11 Narnia   2004            NA
## 12 Narnia   2005            NA
## 13 Gondor   2002        531426
## 14 Gondor   2003        321465
## 15 Gondor   2004        235461
## 16 Gondor   2005        463521
## 17 Laputa   2002         14235
## 18 Laputa   2003         34125
## 19 Laputa   2004         45123
## 20 Laputa   2005         51234

We use pivot_longer() to get all the income data from separate columns into a single mean_income column, where the year column will hold the information about which original column the data came from.

In the next two steps, we group by countries, so the means we calculate are per country and not for the full data set; then we compute the means and replace NA cells with them. To get a glimpse into the mutate() call, you can replace it with summarise() and see the mean for each country.
mean_income |>
    pivot_longer(
        names_to = "year",
        values_to = "mean_income",
        !country
    )                                                         |>
    group_by(
        country
    )                                                         |>
    summarise(
        per_country_mean = mean(mean_income, na.rm = TRUE)
    )
## # A tibble: 5 × 2
##   country  per_country_mean
##   <chr>               <dbl>
## 1 Gondor            387968.
## 2 Laputa             36179.
## 3 Narnia             432156
## 4 Numenor            193422
## 5 Westeros          389788.
In the next step, we can replace the missing data with their country mean in the mutate() call because we can refer to earlier columns when we create later columns in mutate().
mean_income                                                   |>
    pivot_longer(
        names_to = "year",
        values_to = "mean_income",
        !country
    )                                                          |>
    group_by(
        country
    )                                                          |>
    mutate(
        mean_per_country = mean(mean_income, na.rm = TRUE),
        mean_income = ifelse(
            is.na(mean_income),
            mean_per_country,
            mean_income
       )
    )
## # A tibble: 20 × 4
## # Groups:  country [5]
##   country  year mean_income mean_per_country
##   <chr>    <chr>      <dbl>            <dbl>
## 1 Numenor  2002     123456           193422
## 2 Numenor  2003     132654           193422
## 3 Numenor  2004     193422           193422
## 4 Numenor  2005     324156           193422
## 5 Westeros 2002     314256           389788.
## 6 Westeros 2003     389788.          389788.
## 7 Westeros 2004     389788.          389788.
## 8 Westeros 2005     465321           389788.
## 9 Narnia   2002     432156           432156
## 10 Narnia  2003     432156           432156
## 11 Narnia  2004     432156           432156
## 12 Narnia  2005     432156           432156
## 13 Gondor  2002     531426           387968.
## 14 Gondor  2003     321465           387968.
## 15 Gondor  2004     235461           387968.
## 16 Gondor  2005     463521           387968.
## 17 Laputa  2002      14235            36179.
## 18 Laputa  2003      34125            36179.
## 19 Laputa  2004      45123            36179.
## 20 Laputa  2005      51234            36179.
If you had a pipeline where you had steps between the group_by() call and the mutate() call that sets the mean_income, you have to be careful. Do not put an ungroup() in there. The pipeline works because the mean_income is per country when we call mutate() . If we ungrouped, we would get the mean over all countries.
mean_income |>
    pivot_longer(
        names_to = "year",
        values_to = "mean_income",
        !country
    )                                                            |>
    group_by(
        country
    )                                                            |>
    # Imagine we are doing something here that removes the group...
    ungroup()                                                    |>
    # The mutate that follows is not grouped so the mean is global...
    mutate(
        mean_per_country = mean(mean_income, na.rm = TRUE),
        mean_income = ifelse(
            is.na(mean_income),
            mean_per_country,
            mean_income
        )
    )
## # A tibble: 20 × 4
##   country   year   mean_income   mean_per_country
##   <chr>     <chr>        <dbl>              <dbl>
## 1 Numenor   2002       123456             249185.
## 2 Numenor   2003       132654             249185.
## 3 Numenor   2004       249185.            249185.
## 4 Numenor   2005       324156             249185.
## 5 Westeros  2002       314256             249185.
## 6 Westeros  2003       249185.            249185.
## 7 Westeros  2004       249185.            249185.
## 8 Westeros  2005       465321             249185.
## 9 Narnia    2002       432156             249185.
## 10 Narnia   2003       249185.            249185.
## 11 Narnia   2004       249185.            249185.
## 12 Narnia   2005       249185.            249185.
## 13 Gondor   2002       531426             249185.
## 14 Gondor   2003       321465             249185.
## 15 Gondor   2004       235461             249185.
## 16 Gondor   2005       463521             249185.
## 17 Laputa   2002        14235             249185.
## 18 Laputa   2003        34125             249185.
## 19 Laputa   2004        45123             249185.
## 20 Laputa   2005        51234             249185.

In the last step, we just use pivot_wider() to transform the long table back to the original format; there is nothing special here that you haven’t seen in Chapter 5.

This is how we replace all missing data with the mean for the corresponding countries.
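The ifelse(is.na(...), ...) pattern is common enough that dplyr has a shorthand for it: coalesce() returns its first argument with NAs replaced by values from the following arguments. A sketch of the grouped replacement step using coalesce(), on an abbreviated version of the mean_income table:

```r
library(dplyr)
library(tidyr)

# An abbreviated, made-up version of the chapter's mean_income table
mean_income <- tribble(
    ~country, ~`2002`, ~`2003`,
    "Numenor",     100,      NA,
    "Laputa",       10,      30
)

filled <- mean_income |>
    pivot_longer(!country, names_to = "year", values_to = "mean_income") |>
    group_by(country) |>
    # coalesce() fills each NA with the per-country mean computed in the group
    mutate(mean_income = coalesce(mean_income, mean(mean_income, na.rm = TRUE))) |>
    ungroup()
filled
## Numenor's NA for 2003 is replaced by 100, its per-country mean
```

The grouped mean is a length-one value per group, and coalesce() recycles it across the group's rows.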
