© Thomas Mailund 2019
Thomas Mailund, R Data Science Quick Reference, https://doi.org/10.1007/978-1-4842-4894-2_7

7. Manipulating Data Frames: dplyr

Thomas Mailund1 
(1)
Aarhus, Denmark
 
The dplyr package resembles purrr in its functionality, but it is designed for manipulating data frames rather than sequences. It is loaded with the tidyverse package, but you can also load it on its own using
library(dplyr)

The usual way that you use dplyr is similar to how you use purrr. You string together a sequence of actions in a pipeline, with the actions separated by the %>% operator. The difference between the two packages is that the purrr functions work on sequences, while the dplyr functions work on data frames.

This package is huge and more functionality is added in each new release, so it would be impossible for me to describe everything you can do with dplyr. I can only give you a flavor of the package functionality and refer you to the documentation for more information.

Selecting Columns

One of the simplest operations you can do on a data frame is selecting one or more of its columns. You can do this by indexing with $:
iris_df <- as_tibble(iris)
print(iris_df, n = 3)
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## # ... with 147 more rows, and 2 more variables:
## #  Petal.Width <dbl>, Species <fct>
head(iris_df$Species)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
This, however, does not work well with pipelines where a data frame is flowing through from action to action. In dplyr, you have the function select() for picking out columns. You can give it one or more column names, and it will give you a data frame containing only those columns.
iris_df %>%
    select(Sepal.Length, Species) %>%
    print(n = 3)
## # A tibble: 150 x 2
##   Sepal.Length Species
##          <dbl> <fct>
## 1          5.1 setosa
## 2          4.9 setosa
## 3          4.7 setosa
## # ... with 147 more rows
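Related to selecting a single column: if you want the column out of a pipeline as a plain vector, the way $ gives it to you, dplyr has the pull() function. A small sketch:
```r
library(dplyr)

iris_df <- as_tibble(iris)

# pull() extracts one column as a vector, so it plays the role of $
# inside a pipeline.
lengths <- iris_df %>% pull(Sepal.Length)
head(lengths, 3)  # 5.1 4.9 4.7
```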
You can also use complements when selecting columns; just put a minus in front of the column names.
iris_df %>%
    select(-Species) %>%
    print(n = 3)
## # A tibble: 150 x 4
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## # ... with 147 more rows, and 1 more variable:
## #  Petal.Width <dbl>
iris_df %>%
    select(-Species, -Sepal.Length) %>%
    print(n = 3)
## # A tibble: 150 x 3
##   Sepal.Width Petal.Length Petal.Width
##         <dbl>        <dbl>       <dbl>
## 1         3.5          1.4         0.2
## 2         3            1.4         0.2
## 3         3.2          1.3         0.2
## # ... with 147 more rows
You do not need to use column names. You can also use indices.
iris_df %>%
    select(1) %>%
    print(n = 3)
## # A tibble: 150 x 1
##   Sepal.Length
##          <dbl>
## 1          5.1
## 2          4.9
## 3          4.7
## # ... with 147 more rows
iris_df %>%
    select(1:3) %>%
    print(n = 3)
## # A tibble: 150 x 3
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## # ... with 147 more rows
This, of course, is less informative about what is being selected when you read the code. The second selection in the preceding code shows that you can extract ranges. You can also do this with column names.
iris_df %>%
    select(Petal.Length:Species) %>%
    print(n = 3)
## # A tibble: 150 x 3
##   Petal.Length Petal.Width Species
##          <dbl>       <dbl> <fct>
## 1          1.4         0.2 setosa
## 2          1.4         0.2 setosa
## 3          1.3         0.2 setosa
## # ... with 147 more rows
With select() you can also use various predicates for selecting columns. For example, you can pick columns based on properties of their names. The starts_with() function lets you select columns based on a prefix of the column names, and the ends_with() function based on a suffix:
iris_df %>%
    select(starts_with("Petal")) %>%
    print(n = 3)
## # A tibble: 150 x 2
##   Petal.Length Petal.Width
##          <dbl>       <dbl>
## 1          1.4         0.2
## 2          1.4         0.2
## 3          1.3         0.2
## # ... with 147 more rows
iris_df %>%
    select(-starts_with("Petal")) %>%
    print(n = 3)
## # A tibble: 150 x 3
##   Sepal.Length Sepal.Width Species
##          <dbl>       <dbl> <fct>
## 1          5.1         3.5 setosa
## 2          4.9         3   setosa
## 3          4.7         3.2 setosa
## # ... with 147 more rows
iris_df %>%
    select(starts_with("Petal"), Species) %>%
    print(n = 3)
## # A tibble: 150 x 3
##   Petal.Length Petal.Width Species
##          <dbl>       <dbl> <fct>
## 1          1.4         0.2 setosa
## 2          1.4         0.2 setosa
## 3          1.3         0.2 setosa
## # ... with 147 more rows
iris_df %>%
    select(starts_with("PETAL", ignore.case = TRUE)) %>%
    print(n = 3)
## # A tibble: 150 x 2
##   Petal.Length Petal.Width
##          <dbl>       <dbl>
## 1          1.4         0.2
## 2          1.4         0.2
## 3          1.3         0.2
## # ... with 147 more rows
iris_df %>%
    select(starts_with("S")) %>%
    print(n = 3)
## # A tibble: 150 x 3
##   Sepal.Length Sepal.Width Species
##          <dbl>       <dbl> <fct>
## 1          5.1         3.5 setosa
## 2          4.9         3   setosa
## 3          4.7         3.2 setosa
## # ... with 147 more rows
iris_df %>%
    select(ends_with("Length")) %>%
    print(n = 3)
## # A tibble: 150 x 2
##   Sepal.Length Petal.Length
##          <dbl>        <dbl>
## 1          5.1          1.4
## 2          4.9          1.4
## 3          4.7          1.3
## # ... with 147 more rows
If you use the contains() function, you select the columns whose names contain a given string.
iris_df %>%
    select(contains("ng")) %>%
    print(n = 3)
## # A tibble: 150 x 2
##   Sepal.Length Petal.Length
##          <dbl>        <dbl>
## 1          5.1          1.4
## 2          4.9          1.4
## 3          4.7          1.3
## # ... with 147 more rows
The most powerful predicate function is matches(). It lets you select columns using regular expressions. For the iris_df example, there is not much you can select with a regular expression that you cannot select with one of the simpler functions, but consider this table:
df <- tribble(
    ~a1x, ~ax, ~a2x, ~b2, ~b2y, ~by,
    1, 2, 3, 4, 5, 6
)
We can pick the columns whose names contain a digit using the regular expression .*\d.*. Notice that the backslash must be escaped as \\ inside an R string:
df %>%
    select(matches(".*\\d.*")) %>%
    print(n = 3)
## # A tibble: 1 x 4
##     a1x   a2x    b2    b2y
##   <dbl> <dbl> <dbl>  <dbl>
## 1     1     3     4      5
If we need the names to have non-empty strings before and after the digit, we can use this expression:
df %>%
    select(matches(".+\\d.+")) %>%
    print(n = 3)
## # A tibble: 1 x 3
##     a1x   a2x   b2y
##   <dbl> <dbl> <dbl>
## 1     1     3     5
You can also use select() to rename columns. If you use a named parameter, the columns you select will get the names you use for the function parameters.
iris_df %>%
    select(sepal_length = Sepal.Length,
           sepal_width = Sepal.Width) %>%
    print(n = 3)
## # A tibble: 150 x 2
##   sepal_length sepal_width
##          <dbl>       <dbl>
## 1          5.1         3.5
## 2          4.9         3
## 3          4.7         3.2
## # ... with 147 more rows
Since select() removes the columns you do not select, it might not always be the best approach. A similar function to select(), where you do not lose columns, is rename(). With this function, you can rename columns and keep the remaining columns as well.
iris_df %>%
    rename(sepal_length = Sepal.Length,
           sepal_width = Sepal.Width) %>%
    print(n = 3)
## # A tibble: 150 x 5
##   sepal_length sepal_width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## # ... with 147 more rows, and 2 more variables:
## #  Petal.Width <dbl>, Species <fct>
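As an aside, the tidyselect helper everything() gives you a middle ground between the two: you can rename a few columns with select() and keep the rest by selecting everything() after them. A small sketch:
```r
library(dplyr)

iris_df <- as_tibble(iris)

# everything() selects all columns not yet selected, so the renamed
# columns come first and the remaining columns follow unchanged.
renamed <- iris_df %>%
    select(sepal_length = Sepal.Length, everything())
print(renamed, n = 3)
```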

Filter

While select() extracts a subset of columns, the filter() function does the same for rows.

The iris_df contains three different species. We can see this using the distinct() function. This function is also from the dplyr package, and it gives you all the unique rows from selected columns.
iris_df %>%
    distinct(Species)
## # A tibble: 3 x 1
##   Species
##   <fct>
## 1 setosa
## 2 versicolor
## 3 virginica
We can select the rows where the species is “setosa” using filter():
iris_df %>%
    filter(Species == "setosa") %>%
    print(n = 3)
## # A tibble: 50 x 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## # ... with 47 more rows, and 2 more variables:
## #  Petal.Width <dbl>, Species <fct>
We can combine filter() and select() in a pipeline to get a subset of the columns as well as a subset of the rows.
iris_df %>%
    filter(Species == "setosa") %>%
    select(ends_with("Length"), Species) %>%
    print(n = 3)
## # A tibble: 50 x 3
##   Sepal.Length Petal.Length Species
##          <dbl>        <dbl> <fct>
## 1          5.1          1.4 setosa
## 2          4.9          1.4 setosa
## 3          4.7          1.3 setosa
## # ... with 47 more rows
Generally, we can string together as many dplyr (or purrr) functions as we desire in pipelines.
iris_df %>%
    filter(Species != "setosa") %>%
    distinct(Species) %>%
    print(n = 3)
## # A tibble: 2 x 1
##   Species
##   <fct>
## 1 versicolor
## 2 virginica
We can also filter by more than one column:
iris_df %>%
    filter(Sepal.Length > 5, Petal.Width < 0.4) %>%
    print(n = 3)
## # A tibble: 15 x 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          5.4         3.7          1.5
## 3          5.8         4            1.2
## # ... with 12 more rows, and 2 more variables:
## #  Petal.Width <dbl>, Species <fct>
We can also use functions as predicates, for example, the between() function to select numbers in a given range.
iris_df %>%
    filter(between(Sepal.Width, 2, 2.5)) %>%
    print(n = 3)
## # A tibble: 19 x 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          4.5         2.3          1.3
## 2          5.5         2.3          4
## 3          4.9         2.4          3.3
## # ... with 16 more rows, and 2 more variables:
## #  Petal.Width <dbl>, Species <fct>
We cannot use the functions starts_with(), ends_with(), and so on, that we use with select(). This does not mean, however, that we cannot filter rows using string patterns; we just need different functions. For example, we can use str_starts() from the stringr package. We return to stringr in Chapter 8.
iris_df %>%
    filter(str_starts(Species, "v")) %>%
    print(n = 3)
## # A tibble: 100 x 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          7           3.2          4.7
## 2          6.4         3.2          4.5
## 3          6.9         3.1          4.9
## # ... with 97 more rows, and 2 more variables:
## #  Petal.Width <dbl>, Species <fct>
iris_df %>%
    filter(str_ends(Species, "r")) %>%
    print(n = 3)
## # A tibble: 50 x 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          7           3.2          4.7
## 2          6.4         3.2          4.5
## 3          6.9         3.1          4.9
## # ... with 47 more rows, and 2 more variables:
## #  Petal.Width <dbl>, Species <fct>
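Another handy vectorized predicate is %in%, which you can use in filter() to keep the rows that match any value in a set. A small sketch:
```r
library(dplyr)

iris_df <- as_tibble(iris)

# Keep the rows whose Species is one of the listed values.
two_species <- iris_df %>%
    filter(Species %in% c("versicolor", "virginica"))
print(two_species, n = 3)
```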

While filter() selects rows by applying predicates on individual columns, other filter variants will use predicates over more than one column.

The filter_all() function applies the predicate over all columns. You provide an expression in which . (dot) stands for the table cell value and wrap the expression in one of two functions: any_vars() or all_vars(). If you use any_vars(), it suffices that one column satisfies the predicate for the row to be included; if you use all_vars(), then all columns must satisfy the predicate.

We can require that at least one value in each row of the iris_df data is larger than five:
iris_df %>%
    select(-Species) %>%
    filter_all(any_vars(. > 5)) %>%
    print(n = 3)
## # A tibble: 118 x 4
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          5.4         3.9          1.7
## 3          5.4         3.7          1.5
## # ... with 115 more rows, and 1 more variable:
## #  Petal.Width <dbl>
The expression . > 5 is not meaningful if . is a string, so I had to remove the Species column before filtering. We rarely want to lose an informative variable, though. If you want to keep a column but not apply the predicate to its values, you can use the filter_at() function.
iris_df %>%
    filter_at(vars(-Species), any_vars(. > 5)) %>%
    print(n = 3)
## # A tibble: 118 x 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          5.4         3.9          1.7
## 3          5.4         3.7          1.5
## # ... with 115 more rows, and 2 more variables:
## #  Petal.Width <dbl>, Species <fct>
You can give filter_at() a vector of strings or a selection of variables (unquoted column names wrapped in vars()). Because I need to negate Species to select all the other columns, I used vars() here. In other cases, you can use either option.
iris_df %>%
    filter_at(c("Petal.Length", "Sepal.Length"),
               any_vars(. > 0)) %>%
    print(n = 3)
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## # ... with 147 more rows, and 2 more variables:
## #  Petal.Width <dbl>, Species <fct>
iris_df %>%
    filter_at(vars(Petal.Length, Sepal.Length),
              any_vars(. > 0)) %>%
    print(n = 3)
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## # ... with 147 more rows, and 2 more variables:
## #  Petal.Width <dbl>, Species <fct>
You can use filter_if() to apply the filter only to the columns that satisfy a predicate:
iris_df %>%
    filter_if(is.numeric, all_vars(. < 5)) %>%
    print(n = 3)
## # A tibble: 22 x 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          4.9         3            1.4
## 2          4.7         3.2          1.3
## 3          4.6         3.1          1.5
## # ... with 19 more rows, and 2 more variables:
## #  Petal.Width <dbl>, Species <fct>

You can use functions or lambda expressions as the predicate.

Consider a case where we have numbers with missing data in a table. If we want to keep the rows where all values are greater than three, we can use filter_all():
df <- tribble(
    ~A, ~B, ~C,
     1,  2,  3,
     4,  5, NA,
    11, 12, 13,
    22, 22,  1
)
df %>% filter_all(all_vars(. > 3))
## # A tibble: 1 x 3
##       A     B     C
##   <dbl> <dbl> <dbl>
## 1    11    12    13
It removes the first and the last rows because they contain values smaller than or equal to three. It also removes the second row because NA is not considered greater than three. We can restrict the tests to the columns that do not contain missing values:
df %>%
    filter_if(~ all(!is.na(.)), all_vars(. > 3))
## # A tibble: 3 x 3
##       A     B     C
##   <dbl> <dbl> <dbl>
## 1     4     5    NA
## 2    11    12    13
## 3    22    22     1

This removes column C from the tests. It means that we also keep the last row even though C has a value smaller than three.

If you want to keep the two middle rows but not the first or last, you can write a more complex predicate and use filter_all().
df %>% filter_all(all_vars(is.na(.) | . > 3))
## # A tibble: 2 x 3
##       A     B     C
##   <dbl> <dbl> <dbl>
## 1     4     5    NA
## 2    11    12    13

You need to use |, the vector-or, here. The expression must be a vector expression, so you cannot use ||.
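To see the difference, a quick sketch outside of dplyr: | combines its operands elementwise, while || collapses to a single logical value (and, in current versions of R, raises an error if its operands are not of length one).
```r
# | is the elementwise "or" that filter() needs; each row gets its
# own TRUE/FALSE.
x <- c(1, NA, 5)
keep <- is.na(x) | x > 3
keep  # FALSE TRUE TRUE
```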

Many dplyr functions have _all, _at, and _if variants, but I will only describe the frequently used functions here. For the others, I refer to the package documentation.

Sorting

If you want to sort rows by values in selected columns, the function you want is called arrange(). You can sort by one or more columns:
iris_df %>%
    arrange(Petal.Length) %>%
    print(n = 5)
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          4.6         3.6          1
## 2          4.3         3            1.1
## 3          5.8         4            1.2
## 4          5           3.2          1.2
## 5          4.7         3.2          1.3
## # ... with 145 more rows, and 2 more variables:
## #  Petal.Width <dbl>, Species <fct>
iris_df %>%
    arrange(Sepal.Length, Petal.Length) %>%
    print(n = 5)
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          4.3         3            1.1
## 2          4.4         3            1.3
## 3          4.4         3.2          1.3
## 4          4.4         2.9          1.4
## 5          4.5         2.3          1.3
## # ... with 145 more rows, and 2 more variables:
## #  Petal.Width <dbl>, Species <fct>
If you want to sort in descending order, you can use the function desc().
iris_df %>%
    arrange(desc(Petal.Length)) %>%
    print(n = 5)
## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          7.7         2.6          6.9
## 2          7.7         3.8          6.7
## 3          7.7         2.8          6.7
## 4          7.6         3            6.6
## 5          7.9         3.8          6.4
## # ... with 145 more rows, and 2 more variables:
## #  Petal.Width <dbl>, Species <fct>

Modifying Data Frames

When we work with data frames, we usually want to compute values based on the variables (columns) in our tables. Filtering rows and columns will only get us so far.

The mutate() function lets us add columns to a data frame based on expressions that can involve any of the existing columns. Consider a table of widths and heights.
df <- tribble(
    ~height, ~width,
         10,     12,
         42,     24,
         14,     12
)
We can add an area column using mutate() and this expression:
df %>% mutate(area = height * width)
## # A tibble: 3 x 3
##   height width  area
##    <dbl> <dbl> <dbl>
## 1     10    12   120
## 2     42    24  1008
## 3     14    12   168
If you add columns to a data frame, you can refer to variables you have already added. For example, if your height and width data are in inches and you want them in centimeters, and you also want the area in centimeters squared, you can do this:
cm_per_inch = 2.54
df %>% mutate(
    height_cm = cm_per_inch * height,
    width_cm = cm_per_inch * width,
    area_cm = height_cm * width_cm
)
## # A tibble: 3 x 5
##   height width height_cm width_cm area_cm
##    <dbl> <dbl>     <dbl>    <dbl>   <dbl>
## 1     10    12      25.4     30.5    774.
## 2     42    24     107.      61.0   6503.
## 3     14    12      35.6     30.5   1084.
The following expression, however, will not work since you cannot refer to variables you are adding later in the mutate() call.
df %>% mutate(
    area_cm   = height_cm * width_cm,
    height_cm = cm_per_inch * height,
    width_cm  = cm_per_inch * width
)
If you do not give the new variable a name, that is, you call mutate() without naming the parameter, then the expression itself becomes the variable name:
df %>% mutate(cm_per_inch * height)
## # A tibble: 3 x 3
##   height width `cm_per_inch * height`
##    <dbl> <dbl>                  <dbl>
## 1     10    12                   25.4
## 2     42    24                  107.
## 3     14    12                   35.6
In this example, the units (inches and centimeters) are not encoded with the data, and in future computations you might end up mixing the wrong units. If you use the units package (which is not part of the Tidyverse), you can make the units explicit, automatically convert between units where that makes sense, and you get unit type safety as a bonus.
library(units)
df %>% mutate(
    height_in = units::as_units(height, "in"),
    width_in  = units::as_units(width, "in"),
    area_in   = height_in * width_in,
    height_cm = units::set_units(height_in, "cm"),
    width_cm  = units::set_units(width_in, "cm"),
    area_cm   = units::set_units(area_in, "cm^2")
)
## # A tibble: 3 x 8
##   height width height_in width_in area_in
##    <dbl> <dbl>      [in]     [in]  [in^2]
## 1     10    12        10       12     120
## 2     42    24        42       24    1008
## 3     14    12        14       12     168
## # ... with 3 more variables: height_cm [cm],
## #  width_cm [cm], area_cm [cm^2]

This is possible because tibbles can contain columns of various classes, which the units package exploits. Tibbles provide an interface that lets you define how your own classes should be displayed, but this is beyond the scope of this book.

In the preceding example, we created unit-carrying values from the original data table, but the original height and width columns are still around. We could use select() to remove them, but the function transmute() already does this for us.
df %>% transmute(
    height_in = units::as_units(height, "in"),
    width_in  = units::as_units(width, "in"),
    area_in   = height_in * width_in,
    height_cm = units::set_units(height_in, "cm"),
    width_cm  = units::set_units(width_in, "cm"),
    area_cm   = units::set_units(area_in, "cm^2")
)
## # A tibble: 3 x 6
##   height_in width_in area_in height_cm width_cm
##        [in]     [in]  [in^2]      [cm]     [cm]
## 1        10       12     120     25.40    30.48
## 2        42       24    1008    106.68    60.96
## 3        14       12     168     35.56    30.48
## # ... with 1 more variable: area_cm [cm^2]

When you use transmute() instead of mutate(), the variables that you are not assigning to will be removed.

As with filtering, the expressions you use when mutating a data frame must be vector expressions. Consider this example:
df <- tibble(
    x = rnorm(3, mean = 12, sd = 5),
)
my_abs <- function(x) if (x < 0) -x else x
df %>% mutate(my_abs(x))
## Warning in if (x < 0) -x else x: the condition has
## length > 1 and only the first element will be used
## # A tibble: 3 x 2
##       x `my_abs(x)`
##   <dbl>       <dbl>
## 1 16.1        16.1
## 2 -2.07       -2.07
## 3 14.9        14.9

It fails because the function my_abs() (my version of the abs() function) is not a vector expression: the condition x < 0 in the if expression expects a single value, not a vector.

If we use the built-in abs() function , then there is no problem with the expression; it handles vectors.
df %>% mutate(abs(x))
## # A tibble: 3 x 2
##       x `abs(x)`
##   <dbl>    <dbl>
## 1 16.1     16.1
## 2 -2.07     2.07
## 3 14.9     14.9
If you use ifelse(), the vectorized version of if-else, you get a vector expression.
ifelse_abs <- function(x) ifelse(x < 0, -x, x)
df %>% mutate(ifelse_abs(x))
## # A tibble: 3 x 2
##       x `ifelse_abs(x)`
##   <dbl>           <dbl>
## 1 16.1            16.1
## 2 -2.07            2.07
## 3 14.9            14.9
Or you can use the Vectorize() function to make a vector expression out of a function that does not handle vectors.
my_abs <- Vectorize(my_abs)
df %>% mutate(my_abs(x))
## # A tibble: 3 x 2
##       x `my_abs(x)`
##   <dbl>       <dbl>
## 1 16.1        16.1
## 2 -2.07        2.07
## 3 14.9        14.9

If you need to map the input to several different output values, you can nest ifelse() expressions arbitrarily deep, but that quickly becomes difficult to read. A function that alleviates this problem substantially is case_when(). You give it a sequence of predicates with matching expressions, and it returns the expression for the first matching predicate.

Consider this example where we use case_when() to categorize the variable x based on its value:
df <- tibble(x = rnorm(100))
df %>%
    mutate(
        x_category = case_when(
            x - mean(x) < -2 * sd(x) ~ "small",
            x - mean(x) > 2 * sd(x) ~ "large",
            TRUE ~ "medium"
        )
    ) %>%
    print(n = 3)
## # A tibble: 100 x 2
##         x x_category
##     <dbl> <chr>
## 1 -0.311  medium
## 2  0.0992 medium
## 3 -1.26   medium
## # ... with 97 more rows

The TRUE line corresponds to the else part of an if-else statement.
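The first-match behavior is easy to see in a small standalone sketch:
```r
library(dplyr)

# case_when() evaluates the formulas in order and, elementwise,
# returns the right-hand side of the first left-hand side that is TRUE.
x <- c(-5, 0, 7)
labels <- case_when(
    x < 0  ~ "negative",
    x == 0 ~ "zero",
    TRUE   ~ "positive"
)
labels  # "negative" "zero" "positive"
```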

Grouping and Summarizing

In many analyses, we need summary statistics of our data. If you can map one or more of your columns into a single summary value, you can use the summarise() function.
df <- tibble(x = rnorm(100), y = rnorm(100))
df %>% summarise(mean_x = mean(x), mean_y = mean(y))
## # A tibble: 1 x 2
##    mean_x mean_y
##     <dbl>  <dbl>
## 1 -0.0529 0.0141

In this example, we summarized our data as the mean of variables x and y.
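You are not limited to one statistic per call; each named argument to summarise() becomes a column in the result. A small sketch with known values:
```r
library(dplyr)

df <- tibble(x = c(1, 2, 3, 4))

# Several summaries at once; n() counts the rows in the (here,
# ungrouped) data.
s <- df %>% summarise(
    mean_x = mean(x),
    sd_x   = sd(x),
    n      = n()
)
s
```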

If you split your data into different classes and want to work on the data per group, you can use the function group_by().
classify <- function(x) {
    case_when(
        x - mean(x) < -2 * sd(x) ~ "small",
        x - mean(x) > 2 * sd(x) ~ "large",
        TRUE ~ "medium"
    )
}
df %>%
    mutate(x_category = classify(x)) %>%
    group_by(x_category) %>%
    print(n = 3)
## # A tibble: 100 x 3
## # Groups: x_category [2]
##        x     y x_category
##    <dbl> <dbl> <chr>
## 1 -0.702 0.959 medium
## 2  1.10  0.557 medium
## 3 -0.667 0.399 medium
## # ... with 97 more rows
The result is a data frame that, when we print it, doesn’t look different from before we grouped the data.
df %>%
    mutate(x_category = classify(x)) %>%
    print(n = 3)
## # A tibble: 100 x 3
##        x     y x_category
##    <dbl> <dbl> <chr>
## 1 -0.702 0.959 medium
## 2  1.10  0.557 medium
## 3 -0.667 0.399 medium
## # ... with 97 more rows
In the header, however, you will notice the line
# Groups:   x_category [2]
This tells you that something is different when you have grouped your data. This is apparent when you combine grouping with summarizing.
df %>%
    mutate(x_category = classify(x)) %>%
    group_by(x_category) %>%
    summarise(mean_x = mean(x), no_x = n())
## # A tibble: 2 x 3
##   x_category mean_x  no_x
##   <chr>       <dbl> <int>
## 1 large       2.34      4
## 2 medium     -0.153    96

When you group your data and then summarize, you get per-group summary statistics. Here, we calculate the mean and the number of observations in each category (the function n() gives you the number of observations).

You can get the variables your data is grouped by using the group_vars() function.
df %>%
    mutate(x_category = classify(x)) %>%
    group_by(x_category) %>%
    group_vars()
## [1] "x_category"
You can group data by more than one variable.
df <- tibble(x = rnorm(100), y = rnorm(100))
df %>%
    mutate(
        x_category = classify(x),
        y_category = classify(y)
    ) %>%
    group_by(x_category, y_category) %>%
    group_vars()
## [1] "x_category" "y_category"
If you do, you will get summaries for each combination of the grouping variables.
df %>%
    mutate(
        x_category = classify(x),
        y_category = classify(y)
    ) %>%
    group_by(x_category, y_category) %>%
    summarise(mean_x = mean(x), mean_y = mean(y))
## # A tibble: 6 x 4
## # Groups:   x_category [3]
##   x_category y_category mean_x  mean_y
##   <chr>      <chr>       <dbl>   <dbl>
## 1 large      medium      2.02  -0.836
## 2 medium     large       0.871  1.98
## 3 medium     medium     -0.118  0.0213
## 4 medium     small      -0.679 -2.69
## 5 small      large      -3.48   1.96
## 6 small      medium     -2.74  -0.0693
If you group by one variable and summarize another, you get the summaries of the second variable computed for each level of the grouping variable.
df %>%
    mutate(
        x_category = classify(x),
        y_category = classify(y)
    ) %>%
    group_by(x_category) %>%
    summarise(mean_x = mean(x), mean_y = mean(y))
## # A tibble: 3 x 3
##   x_category mean_x  mean_y
##   <chr>       <dbl>   <dbl>
## 1 large       2.02  -0.836
## 2 medium     -0.114  0.0133
## 3 small      -2.98   0.607

Here, we calculate the mean of x and y for each category of x value. In general, you can use any number of variables to group the data and any number of variables to summarize, and the summaries will be computed per group.

When you have grouped your data by more than one variable, a summary removes the last group you added, that is, the last variable you grouped by. A summary gives you a new data frame containing the grouping variables and the summaries. If you have grouped by more than one variable, this data frame is grouped by all but the last grouping variable.

For example, if we group by variables A and B and then summarise another variable, C, the result will be grouped by A and not B. If you group by A, B, and D, the result will be a data frame grouped by A and B.
df2 <- tribble(
    ~A,      ~B,    ~C, ~D,
    "left",  "up",   2, "yes",
    "right", "up",   5, "no",
    "left",  "down", 2, "yes",
    "left",  "down", 7, "no",
    "left",  "down", 3, "no",
    "right", "up",   8, "yes",
    "right", "up",   2, "yes",
    "right", "up",   8, "no"
)
df2 %>% group_by(A, B) %>%
    summarise(min_c = min(C), max_c = max(C))
## # A tibble: 3 x 4
## # Groups:   A [2]
##   A     B     min_c max_c
##   <chr> <chr> <dbl> <dbl>
## 1 left  down      2     7
## 2 left  up        2     2
## 3 right up        2     8
df2 %>% group_by(A, B) %>%
    summarise(min_c = min(C), max_c = max(C)) %>%
    summarise(max_diff = max(max_c - min_c))
## # A tibble: 2 x 2
##   A     max_diff
##   <chr>    <dbl>
## 1 left         5
## 2 right        6
df2 %>% group_by(A, B, D) %>%
    summarise(min_c = min(C), max_c = max(C))
## # A tibble: 5 x 5
## # Groups:   A, B [3]
##   A     B     D     min_c max_c
##   <chr> <chr> <chr> <dbl> <dbl>
## 1 left  down  no        3     7
## 2 left  down  yes       2     2
## 3 left  up    yes       2     2
## 4 right up    no        5     8
## 5 right up    yes       2     8
df2 %>% group_by(A, B, D) %>%
    summarise(min_c = min(C), max_c = max(C)) %>%
    summarise(max_diff = max(max_c - min_c))
## # A tibble: 3 x 3
## # Groups:   A [2]
##   A     B     max_diff
##   <chr> <chr>    <dbl>
## 1 left  down         4
## 2 left  up           0
## 3 right up           6
The order you use in group_by() matters here. If you flip the variables, you get a data frame that is grouped by B instead.
df2 %>% group_by(B, A) %>%
    summarise(min_c = min(C), max_c = max(C))
## # A tibble: 3 x 4
## # Groups:   B [2]
##   B     A     min_c max_c
##   <chr> <chr> <dbl> <dbl>
## 1 down  left      2     7
## 2 up    left      2     2
## 3 up    right     2     8

The way grouping variables are removed one by one each time you call summarise() is a potential source of errors. Say you have variables A, B, C, and D, grouped by A, B, and D. Within each combination of these, you define a distance as the maximum value of C minus its minimum value. You want to know the smallest difference between maximum and minimum in the overall data. You might, reasonably, try this:

You compute the minimum and maximum per group:
df2 %>% group_by(A, B, D) %>%
    summarise(min_c = min(C), max_c = max(C))
## # A tibble: 5 x 5
## # Groups:   A, B [3]
##   A     B     D     min_c max_c
##   <chr> <chr> <chr> <dbl> <dbl>
## 1 left  down  no        3     7
## 2 left  down  yes       2     2
## 3 left  up    yes       2     2
## 4 right up    no        5     8
## 5 right up    yes       2     8
Then you try another summarise() to find the minimum between max_c and min_c for each row in the result:
df2 %>% group_by(A, B, D) %>%
    summarise(min_c = min(C), max_c = max(C)) %>%
    summarise(min_diff = min(max_c - min_c))
## # A tibble: 3 x 3
## # Groups:   A [2]
##   A     B     min_diff
##   <chr> <chr>    <dbl>
## 1 left  down         0
## 2 left  up           0
## 3 right up           3

This, however, gives you the smallest difference for each combination of A and B, but not the overall smallest. The first summarise only removes the D group (and the second removes B).

If you have grouped your data, or you get grouped data from one summary as in the preceding example, you can remove the groups again using ungroup(). Once you have done this, you can compute global summaries again.
df2 %>% group_by(A, B, D) %>%
    summarise(min_c = min(C), max_c = max(C)) %>%
    ungroup() %>%
    summarise(min_diff = min(max_c - min_c))
## # A tibble: 1 x 1
##   min_diff
##      <dbl>
## 1        0
Grouping is not only useful when collecting summaries of your data. You can also add columns to your data frame based on per-group computations using mutate():
df %>%
    mutate(
        x_category = classify(x),
        y_category = classify(y)
    ) %>%
    group_by(x_category) %>%
    mutate(mean_x = mean(x), mean_y = mean(y)) %>%
    print(n = 5)
## # A tibble: 100 x 6
## # Groups:   x_category [3]
##        x       y x_category y_category mean_x
##    <dbl>   <dbl> <chr>      <chr>       <dbl>
## 1  0.352  1.23   medium     medium     -0.114
## 2 -2.84  -0.705  small      medium     -2.98
## 3  0.332 -0.0964 medium     medium     -0.114
## 4  1.37  -1.28   medium     medium     -0.114
## 5 -0.893 -0.373  medium     medium     -0.114
## # ... with 95 more rows, and 1 more variable:
## #  mean_y <dbl>

When you calculate a summary inside mutate(), you create columns whose values are computed per group, but they are combined with the existing data rather than collapsed into a new, smaller data frame as summarise() would do.

A mutate() without a group_by() will give you summaries for the entire data.
df %>%
    mutate(
        x_category = classify(x),
        y_category = classify(y)
    ) %>%
    mutate(mean_x = mean(x), mean_y = mean(y))
## # A tibble: 100 x 6
##         x       y x_category y_category mean_x
##     <dbl>   <dbl> <chr>      <chr>       <dbl>
##  1  0.352  1.23   medium     medium     -0.157
##  2 -2.84  -0.705  small      medium     -0.157
##  3  0.332 -0.0964 medium     medium     -0.157
##  4  1.37  -1.28   medium     medium     -0.157
##  5 -0.893 -0.373  medium     medium     -0.157
##  6 -0.561  0.465  medium     medium     -0.157
##  7 -1.14  -0.874  medium     medium     -0.157
##  8  0.812 -0.897  medium     medium     -0.157
##  9  0.162  0.634  medium     medium     -0.157
## 10 -1.12  -0.311  medium     medium     -0.157
## # ...  with 90 more rows, and 1 more variable:
## #  mean_y <dbl>
In contrast, if you group before adding variables to the data, then the summaries are per group.
df %>%
    mutate(
        x_category = classify(x),
        y_category = classify(y)
    ) %>%
    group_by(x_category) %>%
    mutate(mean_x = mean(x), mean_y = mean(y))
## # A tibble: 100 x 6
## # Groups:   x_category [3]
##         x       y x_category y_category mean_x
##     <dbl>   <dbl> <chr>      <chr>       <dbl>
##  1  0.352  1.23   medium     medium     -0.114
##  2 -2.84  -0.705  small      medium     -2.98
##  3  0.332 -0.0964 medium     medium     -0.114
##  4  1.37  -1.28   medium     medium     -0.114
##  5 -0.893 -0.373  medium     medium     -0.114
##  6 -0.561  0.465  medium     medium     -0.114
##  7 -1.14  -0.874  medium     medium     -0.114
##  8  0.812 -0.897  medium     medium     -0.114
##  9  0.162  0.634  medium     medium     -0.114
## 10 -1.12  -0.311  medium     medium     -0.114
## # ...  with 90 more rows, and 1 more variable:
## #  mean_y <dbl>
You can combine data-wide summaries with grouped summaries if you calculate the global summaries before you group.
df %>%
    mutate(
        x_category = classify(x),
        y_category = classify(y)
    ) %>%
    mutate(mean_y = mean(y)) %>%
    group_by(x_category) %>%
    mutate(mean_x = mean(x)) %>%
    distinct(mean_x, mean_y)
## # A tibble: 3 x 3
## # Groups:   x_category [3]
##   mean_x mean_y x_category
##    <dbl>  <dbl> <chr>
## 1 -0.114 0.0142 medium
## 2 -2.98  0.0142 small
## 3  2.02  0.0142 large

Here I used mutate() + distinct() rather than summarise() to keep all the variables in the output. With summarise(), I would only keep the grouping variables and the summaries.
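For comparison, here is a sketch of the summarise() version of the same computation, using the same df and classify() as earlier; it collapses the data down to one row per group and keeps only the grouping variable and the summary:

```r
df %>%
    mutate(x_category = classify(x)) %>%
    group_by(x_category) %>%
    summarise(mean_x = mean(x))
## # A tibble: 3 x 2
##   x_category mean_x
##   <chr>       <dbl>
## 1 large       2.02
## 2 medium     -0.114
## 3 small      -2.98
```

The y and y_category columns are gone, along with the individual x values.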

If you need to compute summaries with different groupings, for example, the mean of x for each x_category as well as the mean of y for each y_category, then you can call group_by(x_category) before summarizing x, followed by group_by(y_category) to change the grouping before you summarize y.
df %>%
    mutate(
        x_category = classify(x),
        y_category = classify(y)
    ) %>%
    group_by(x_category) %>%
    mutate(mean_x = mean(x)) %>%
    group_by(y_category) %>%
    mutate(mean_y = mean(y)) %>%
    distinct(
        x_category, mean_x,
        y_category, mean_y
)
## # A tibble: 6 x 4
## # Groups:   y_category [3]
##   x_category mean_x y_category  mean_y
##   <chr>       <dbl> <chr>        <dbl>
## 1 medium     -0.114 medium     0.00175
## 2 small      -2.98  medium     0.00175
## 3 medium     -0.114 large      1.97
## 4 medium     -0.114 small     -2.69
## 5 small      -2.98  large      1.97
## 6 large       2.02  medium     0.00175

Joining Tables

It is not uncommon to have your data in more than one table. This could be because the tables are created from different calculations, for example, different kinds of summaries, or it can be because you got the data from different files.

If you merely need to combine tables by row or column, then you can use the bind_rows() and bind_cols() functions, which do precisely what you would expect.
df1 <- tibble(
    A = paste0("a", 1:2),
    B = paste0("b", 1:2)
)
df2 <- tibble(
    A = paste0("a", 3:4),
    B = paste0("b", 3:4)
)
df3 <- tibble(
    C = paste0("c", 1:2),
    D = paste0("d", 1:2)
)
bind_rows(df1, df2)
## # A tibble: 4 x 2
##   A     B
##   <chr> <chr>
## 1 a1    b1
## 2 a2    b2
## 3 a3    b3
## 4 a4    b4
bind_cols(df1, df3)
## # A tibble: 2 x 4
##   A     B     C     D
##   <chr> <chr> <chr> <chr>
## 1 a1    b1    c1    d1
## 2 a2    b2    c2    d2

When you combine rows, the tables should have the same columns (bind_rows() matches columns by name and fills any missing ones with NA), and when you combine by column, the tables must have the same number of rows.
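If the tables do not have precisely the same columns, bind_rows() will still combine them, matching columns by name and filling missing entries with NA. A small sketch, reusing df1 from above together with a hypothetical df4:

```r
df4 <- tibble(A = "a3", C = "c1") # shares A with df1, adds a new column C
bind_rows(df1, df4)
## # A tibble: 3 x 3
##   A     B     C
##   <chr> <chr> <chr>
## 1 a1    b1    <NA>
## 2 a2    b2    <NA>
## 3 a3    <NA>  c1
```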

If you have tables that represent different relations between variables—the underlying principle of relational databases aimed at avoiding duplicated data—then you can combine them using join functions.

Say you have a table that maps students to grades for each class you teach. You can join tables from two different classes using inner_join(). You join them on a key, specified by the by argument.
grades_maths <- tribble(
    ~name,            ~grade,
    "Marko Polo",     "D",
    "Isaac Newton",   "A+",
    "Charles Darwin", "B"
)
grades_biology <- tribble(
    ~name,            ~grade,
    "Marko Polo",     "F",
    "Isaac Newton",   "D",
    "Charles Darwin", "A+"
)
inner_join(grades_maths, grades_biology, by = "name")
## # A tibble: 3 x 3
##   name           grade.x grade.y
##   <chr>          <chr>   <chr>
## 1 Marko Polo     D       F
## 2 Isaac Newton   A+      D
## 3 Charles Darwin B       A+

This tells inner_join() that you want to combine all rows in the first table with all rows in the second, where the two rows have the same name. You can use more than one key in a join if you give by a vector of variable names.
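As a hypothetical sketch of joining on more than one key, imagine the grades were recorded per name and per year; joining on both columns only pairs up rows where name and year both match:

```r
spring <- tribble(
    ~name,          ~year, ~grade,
    "Isaac Newton", 1665,  "A+",
    "Isaac Newton", 1666,  "A"
)
autumn <- tribble(
    ~name,          ~year, ~grade,
    "Isaac Newton", 1665,  "B",
    "Isaac Newton", 1667,  "A+"
)
inner_join(spring, autumn, by = c("name", "year"))
## # A tibble: 1 x 4
##   name          year grade.x grade.y
##   <chr>        <dbl> <chr>   <chr>
## 1 Isaac Newton  1665 A+      B
```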

In the previous example, each name appears once per table. If a key appears more than once, then the result of an inner join will contain all combinations of the rows that share that key.
grades_maths2 <- tribble(
    ~name,            ~grade,
    "Marko Polo",     "D",
    "Isaac Newton",   "A+", # so good at physics
    "Isaac Newton",   "A+", # that he got an A+ twice
    "Charles Darwin", "B"
)
grades_biology2 <- tribble(
    ~name,            ~grade,
    "Marko Polo",     "F",
    "Isaac Newton",   "D",
    "Charles Darwin", "A+", # so good at biology that we
    "Charles Darwin", "A+" # listed him twice
)
inner_join(grades_maths2, grades_biology2, by = "name")
## # A tibble: 5 x 3
##   name           grade.x grade.y
##   <chr>          <chr>   <chr>
## 1 Marko Polo     D       F
## 2 Isaac Newton   A+      D
## 3 Isaac Newton   A+      D
## 4 Charles Darwin B       A+
## 5 Charles Darwin B       A+
inner_join(grades_maths2, grades_biology2, by = "grade")
## # A tibble: 5 x 3
##   name.x       grade name.y
##   <chr>        <chr> <chr>
## 1 Marko Polo   D     Isaac Newton
## 2 Isaac Newton A+    Charles Darwin
## 3 Isaac Newton A+    Charles Darwin
## 4 Isaac Newton A+    Charles Darwin
## 5 Isaac Newton A+    Charles Darwin
In the last join, you see that you can get the same line multiple times from an inner join. Combine the join with distinct() if you want to avoid this.
inner_join(grades_maths2, grades_biology2, by = "grade") %>%
    distinct()
## # A tibble: 2 x 3
##   name.x       grade name.y
##   <chr>        <chr> <chr>
## 1 Marko Polo   D     Isaac Newton
## 2 Isaac Newton A+    Charles Darwin

The tables can have different variables—if they represent different relations—or they can share several or all variables. In the grades example, each table contains the same name/grade relationships, just for different classes. Therefore, they also share variables.

With shared variables, the data from the two tables can be differentiated by a suffix. The default (see the previous example) is ".x" and ".y". You can change that using the suffix argument.
inner_join(
    grades_maths, grades_biology,
    by = "name", suffix = c(".maths", ".biology")
)
## # A tibble: 3 x 3
##   name           grade.maths grade.biology
##   <chr>          <chr>       <chr>
## 1 Marko Polo     D           F
## 2 Isaac Newton   A+          D
## 3 Charles Darwin B           A+

Students might take different classes, and you have several choices on how to combine tables with different keys.

An inner_join() will only give you the rows where a key is in both tables.
grades_geography <- tribble(
    ~name,            ~grade,
    "Marko Polo",     "A",
    "Charles Darwin", "A",
    "Immanuel Kant",  "A+"
)
grades_physics <- tribble(
    ~name,            ~grade,
    "Isaac Newton",    "A+",
    "Albert Einstein", "A+",
    "Charles Darwin",  "C"
)
inner_join(
    grades_geography, grades_physics,
    by = "name", suffix = c(".geography", ".physics")
)
## # A tibble: 1 x 3
##   name           grade.geography grade.physics
##   <chr>          <chr>           <chr>
## 1 Charles Darwin A               C
The full_join() function, in contrast, gives you a table containing all keys. If a key is only found in one of the tables, the variables from the other table will be set to NA.
full_join(
    grades_geography, grades_physics,
    by = "name", suffix = c(".geography", ".physics")
)
## # A tibble: 5 x 3
##   name            grade.geography grade.physics
##   <chr>           <chr>           <chr>
## 1 Marko Polo      A               <NA>
## 2 Charles Darwin  A               C
## 3 Immanuel Kant   A+              <NA>
## 4 Isaac Newton    <NA>            A+
## 5 Albert Einstein <NA>            A+
If you want all keys from the left or right table (but not both left and right)—potentially with NA if the other table does not have the key—then you need left_join() or right_join().
left_join(
    grades_geography, grades_physics,
    by = "name", suffix = c(".geography", ".physics")
)
## # A tibble: 3 x 3
##   name            grade.geography grade.physics
##   <chr>           <chr>           <chr>
## 1 Marko Polo      A               <NA>
## 2 Charles Darwin  A               C
## 3 Immanuel Kant   A+              <NA>
right_join(
    grades_maths, grades_physics,
    by = "name", suffix = c(".maths", ".physics")
)
## # A tibble: 3 x 3
##   name            grade.maths grade.physics
##   <chr>           <chr>       <chr>
## 1 Isaac Newton    A+          A+
## 2 Albert Einstein <NA>        A+
## 3 Charles Darwin  B           C
A semi_join() gives you all the rows in the first table whose key appears in the second table. It only returns columns from the first table, so it has no suffix argument.
semi_join(grades_maths2, grades_biology2, by = "name")
## # A tibble: 4 x 2
##   name           grade
##   <chr>          <chr>
## 1 Marko Polo     D
## 2 Isaac Newton   A+
## 3 Isaac Newton   A+
## 4 Charles Darwin B

You only get one copy per row in the first table when a key matches, regardless of how many times the key appears in the second table.

This is unlike inner_join(), where you get all combinations of rows from the two tables. A semi_join() is, therefore, not equivalent to an inner_join() combined with a select().
inner_join(
    grades_maths2, grades_biology2,
    by = "name", suffix = c(".maths", ".biology")
) %>% select(1:2)
## # A tibble: 5 x 2
##   name           grade.maths
##   <chr>          <chr>
## 1 Marko Polo     D
## 2 Isaac Newton   A+
## 3 Isaac Newton   A+
## 4 Charles Darwin B
## 5 Charles Darwin B

You can still get multiple identical rows from a semi_join(). If the first table has duplicated rows, you will get the same duplicated rows in the result. If you do not want that, you can combine the join with distinct().
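Continuing the preceding example, distinct() collapses the duplicated Isaac Newton row:

```r
semi_join(grades_maths2, grades_biology2, by = "name") %>%
    distinct()
## # A tibble: 3 x 2
##   name           grade
##   <chr>          <chr>
## 1 Marko Polo     D
## 2 Isaac Newton   A+
## 3 Charles Darwin B
```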

An anti_join() gives you all the rows in the first table where the key or keys are not found in the second table. Think of it as the complement of semi_join().
anti_join(grades_maths2, grades_physics, by = "name")
## # A tibble: 1 x 2
##   name       grade
##   <chr>      <chr>
## 1 Marko Polo D
The join functions only take two tables as input, so you might wonder how you can combine multiple tables. One solution is to use purrr’s reduce() function:
grades <- list(
    grades_maths, grades_biology,
    grades_geography, grades_physics
)
grades %>%
    reduce(full_join, by = "name") %>%
    rename_at(2:5, ~ c("maths", "biology", "geography", "physics"))
## # A tibble: 5 x 5
##   name            maths biology geography physics
##   <chr>           <chr> <chr>   <chr>     <chr>
## 1 Marko Polo      D     F       A         <NA>
## 2 Isaac Newton    A+    D       <NA>      A+
## 3 Charles Darwin  B     A+      A         C
## 4 Immanuel Kant   <NA>  <NA>    A+        <NA>
## 5 Albert Einstein <NA>  <NA>    <NA>      A+

The rename_at() function works similarly to select_at(), and here, I use it to rename the last four columns.
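If you are on dplyr 1.0 or later, where the _at verbs are superseded, rename_with() can do the same job. A sketch: the function you give it receives the selected column names and must return the new ones.

```r
grades %>%
    reduce(full_join, by = "name") %>%
    rename_with(~ c("maths", "biology", "geography", "physics"), 2:5)
```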

Income in Fictional Countries

We had an example with income in fictional countries in Chapter 4. There, we did not have the proper tools needed to deal with missing data. We wanted to replace NA with the mean income for a country in rows with missing data, but the functions from tidyr didn’t suffice. With dplyr we have the tools.

Recall the data:
mean_income <- tribble(
    ~country,  ~`2002`, ~`2003`, ~`2004`,  ~`2005`,
    "Numenor",  123456,  132654,      NA,   324156,
    "Westeros", 314256,  NA,          NA,   465321,
    "Narnia",   432156,  NA,          NA,       NA,
    "Gondor",   531426,  321465,  235461,   463521,
    "Laputa",    14235,   34125,   45123,    51234,
)
The following pipeline does what we want. I will explain the steps below, but you can first try to work them out yourself.
mean_income %>%
    gather(
        key = "year",
        value = "mean_income",
        -country
    ) %>% group_by(
        country
    ) %>% mutate(
        mean_per_country = mean(mean_income, na.rm = TRUE),
        mean_income = ifelse(
            is.na(mean_income),
            mean_per_country,
            mean_income
        )
    ) %>% spread(key = "year", value = "mean_income")
## # A tibble: 5 x 6
## # Groups:   country [5]
##   country mean_per_country `2002` `2003` `2004`
##   <chr>              <dbl>  <dbl>  <dbl>  <dbl>
## 1 Gondor           387968. 531426 3.21e5 2.35e5
## 2 Laputa            36179.  14235 3.41e4 4.51e4
## 3 Narnia           432156  432156 4.32e5 4.32e5
## 4 Numenor          193422  123456 1.33e5 1.93e5
## 5 Wester...        389788. 314256 3.90e5 3.90e5
## # ... with 1 more variable: `2005` <dbl>
The first step in the pipeline is reformatting the data. We saw how this works in Chapter 4.
mean_income %>%
    gather(
        key = "year",
        value = "mean_income",
        -country
    )
## # A tibble: 20 x 3
##    country  year  mean_income
##    <chr>    <chr>       <dbl>
##  1 Numenor  2002       123456
##  2 Westeros 2002       314256
##  3 Narnia   2002       432156
##  4 Gondor   2002       531426
##  5 Laputa   2002        14235
##  6 Numenor  2003       132654
##  7 Westeros 2003           NA
##  8 Narnia   2003           NA
##  9 Gondor   2003       321465
## 10 Laputa   2003        34125
## 11 Numenor  2004           NA
## 12 Westeros 2004           NA
## 13 Narnia   2004           NA
## 14 Gondor   2004       235461
## 15 Laputa   2004        45123
## 16 Numenor  2005       324156
## 17 Westeros 2005       465321
## 18 Narnia   2005           NA
## 19 Gondor   2005       463521
## 20 Laputa   2005        51234
In the next two steps, we group by country, so the means we calculate are per country rather than over the full data set, and then we compute the means and replace the NA cells with them. To get a glimpse of what happens inside the mutate() call, you can replace it with summarise() and see the mean for each country.
mean_income %>%
    gather(
        key = "year",
        value = "mean_income",
        -country
    ) %>% group_by(
        country
    ) %>% summarise(
        per_country_mean = mean(mean_income, na.rm = TRUE)
    )
## # A tibble: 5 x 2
##   country per_country_mean
##   <chr>              <dbl>
## 1 Gondor           387968.
## 2 Laputa            36179.
## 3 Narnia           432156
## 4 Numenor          193422
## 5 Westeros         389788.

In the next step, we can replace the missing data with their country means in the mutate() call, because later expressions in mutate() can refer to columns created by earlier ones.
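A minimal sketch of this sequential evaluation, using a throwaway tibble:

```r
tibble(x = c(1, NA, 3)) %>%
    mutate(
        m = mean(x, na.rm = TRUE), # computed first
        x = ifelse(is.na(x), m, x) # can already refer to m here
    )
## # A tibble: 3 x 2
##       x     m
##   <dbl> <dbl>
## 1     1     2
## 2     2     2
## 3     3     2
```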

If your pipeline had additional steps between the group_by() call and the mutate() call that sets mean_income, you would have to be careful: do not put an ungroup() in there. The pipeline works because mean_income is computed per country when we call mutate(). If we ungrouped first, we would get the mean over all countries.
mean_income %>%
    gather(
        key = "year",
        value = "mean_income",
        -country
    ) %>% group_by(
        country
    ) %>% mutate(
        mean_per_country = mean(mean_income, na.rm = TRUE)
    )
## # A tibble: 20 x 4
## # Groups:   country [5]
##    country  year  mean_income mean_per_country
##    <chr>    <chr>       <dbl>            <dbl>
##  1 Numenor  2002       123456          193422
##  2 Westeros 2002       314256          389788.
##  3 Narnia   2002       432156          432156
##  4 Gondor   2002       531426          387968.
##  5 Laputa   2002        14235           36179.
##  6 Numenor  2003       132654          193422
##  7 Westeros 2003           NA          389788.
##  8 Narnia   2003           NA          432156
##  9 Gondor   2003       321465          387968.
## 10 Laputa   2003        34125           36179.
## 11 Numenor  2004           NA          193422
## 12 Westeros 2004           NA          389788.
## 13 Narnia   2004           NA          432156
## 14 Gondor   2004       235461          387968.
## 15 Laputa   2004        45123           36179.
## 16 Numenor  2005       324156          193422
## 17 Westeros 2005       465321          389788.
## 18 Narnia   2005           NA          432156
## 19 Gondor   2005       463521          387968.
## 20 Laputa   2005        51234           36179.
In the last step, we ungroup and update the table before reformatting the tibble again.
mean_income %>%
    gather(
        key = "year",
        value = "mean_income",
        -country
    ) %>% group_by(
        country
    ) %>% mutate(
        mean_per_country = mean(mean_income, na.rm = TRUE)
    ) %>% ungroup(
    ) %>% mutate(
        mean_income = ifelse(
            is.na(mean_income),
            mean_per_country,
            mean_income
        )
    ) %>% spread(key = "year", value = "mean_income")
## # A tibble: 5 x 6
##   country mean_per_country `2002` `2003` `2004`
##   <chr>              <dbl>  <dbl>  <dbl>  <dbl>
## 1 Gondor           387968. 531426 3.21e5 2.35e5
## 2 Laputa            36179.  14235 3.41e4 4.51e4
## 3 Narnia           432156  432156 4.32e5 4.32e5
## 4 Numenor          193422  123456 1.33e5 1.93e5
## 5 Wester...        389788. 314256 3.90e5 3.90e5
## # ...  with 1 more variable: `2005` <dbl>

This is how we replace all missing data with the mean for the corresponding countries.
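The NA replacement can also be written with dplyr’s coalesce(), which picks the first non-missing value at each position. A sketch of the same pipeline using it:

```r
mean_income %>%
    gather(key = "year", value = "mean_income", -country) %>%
    group_by(country) %>%
    mutate(
        mean_income = coalesce(
            mean_income,
            mean(mean_income, na.rm = TRUE)
        )
    ) %>%
    spread(key = "year", value = "mean_income")
```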
