The dplyr package provides functionality resembling that of the purrr package, but it is designed for manipulating data frames. It is loaded with the tidyverse package, but you can also load it on its own using library(dplyr).
The usual way that you use dplyr is similar to how you use purrr. You string together a sequence of actions in a pipeline, with the actions separated by the %>% operator. The difference between the two packages is that the purrr functions work on sequences, while the dplyr functions work on data frames.
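To make the parallel concrete, here is a minimal sketch (assuming both packages are installed) of a purrr pipeline over a vector next to a dplyr pipeline over a tibble:

```r
library(purrr)
library(dplyr)

# purrr: map over a sequence, element by element
squares <- 1:3 %>% map_dbl(~ .x^2)

# dplyr: transform a data frame, column by column
df <- tibble(x = 1:3) %>% mutate(x_squared = x^2)
```

Both pipelines read top to bottom, with each step receiving the result of the previous one; only the kind of object flowing through differs.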
This package is huge and more functionality is added in each new release, so it would be impossible for me to describe everything you can do with dplyr. I can only give you a flavor of the package functionality and refer you to the documentation for more information.
Selecting Columns
One of the simplest operations you can do on a data frame is selecting one or more of its columns. You can do this by indexing with
$:
iris_df <- as_tibble(iris)
print(iris_df, n = 3)
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3 1.4
## 3 4.7 3.2 1.3
## # ... with 147 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
head(iris_df$Species)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
This, however, does not work well with pipelines where a data frame is flowing through from action to action. In
dplyr, you have the function
select()
for picking out columns. You can give it one or more column names, and it will give you a data frame containing only those columns.
iris_df %>%
select(Sepal.Length, Species) %>%
print(n = 3)
## # A tibble: 150 x 2
## Sepal.Length Species
## <dbl> <fct>
## 1 5.1 setosa
## 2 4.9 setosa
## 3 4.7 setosa
## # ... with 147 more rows
You can also use complements when selecting columns; just put a minus in front of the column names.
iris_df %>%
select(-Species) %>%
print(n = 3)
## # A tibble: 150 x 4
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3 1.4
## 3 4.7 3.2 1.3
## # ... with 147 more rows, and 1 more variable:
## # Petal.Width <dbl>
iris_df %>%
select(-Species, -Sepal.Length) %>%
print(n = 3)
## # A tibble: 150 x 3
## Sepal.Width Petal.Length Petal.Width
## <dbl> <dbl> <dbl>
## 1 3.5 1.4 0.2
## 2 3 1.4 0.2
## 3 3.2 1.3 0.2
## # ... with 147 more rows
You do not need to use column names. You can also use
indices.
iris_df %>%
select(1) %>%
print(n = 3)
## # A tibble: 150 x 1
## Sepal.Length
## <dbl>
## 1 5.1
## 2 4.9
## 3 4.7
## # ... with 147 more rows
iris_df %>%
select(1:3) %>%
print(n = 3)
## # A tibble: 150 x 3
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3 1.4
## 3 4.7 3.2 1.3
## # ... with 147 more rows
This, of course, is less informative about what is being selected when you read the code. The second selection in the preceding code shows that you can extract
ranges. You can also do this with column names.
iris_df %>%
select(Petal.Length:Species) %>%
print(n = 3)
## # A tibble: 150 x 3
## Petal.Length Petal.Width Species
## <dbl> <dbl> <fct>
## 1 1.4 0.2 setosa
## 2 1.4 0.2 setosa
## 3 1.3 0.2 setosa
## # ... with 147 more rows
With
select() you can also use various predicates for selecting columns. For example, you can pick columns based on the properties of the names. The
starts_with() function
lets you select columns based on the prefix of the column names and the
ends_with() function
based on the suffix of the column names:
iris_df %>%
select(starts_with("Petal")) %>%
print(n = 3)
## # A tibble: 150 x 2
## Petal.Length Petal.Width
## <dbl> <dbl>
## 1 1.4 0.2
## 2 1.4 0.2
## 3 1.3 0.2
## # ... with 147 more rows
iris_df %>%
select(-starts_with("Petal")) %>%
print(n = 3)
## # A tibble: 150 x 3
## Sepal.Length Sepal.Width Species
## <dbl> <dbl> <fct>
## 1 5.1 3.5 setosa
## 2 4.9 3 setosa
## 3 4.7 3.2 setosa
## # ... with 147 more rows
iris_df %>%
select(starts_with("Petal"), Species) %>%
print(n = 3)
## # A tibble: 150 x 3
## Petal.Length Petal.Width Species
## <dbl> <dbl> <fct>
## 1 1.4 0.2 setosa
## 2 1.4 0.2 setosa
## 3 1.3 0.2 setosa
## # ... with 147 more rows
iris_df %>%
select(starts_with("PETAL", ignore.case = TRUE)) %>%
print(n = 3)
## # A tibble: 150 x 2
## Petal.Length Petal.Width
## <dbl> <dbl>
## 1 1.4 0.2
## 2 1.4 0.2
## 3 1.3 0.2
## # ... with 147 more rows
iris_df %>%
select(starts_with("S")) %>%
print(n = 3)
## # A tibble: 150 x 3
## Sepal.Length Sepal.Width Species
## <dbl> <dbl> <fct>
## 1 5.1 3.5 setosa
## 2 4.9 3 setosa
## 3 4.7 3.2 setosa
## # ... with 147 more rows
iris_df %>%
select(ends_with("Length")) %>%
print(n = 3)
## # A tibble: 150 x 2
## Sepal.Length Petal.Length
## <dbl> <dbl>
## 1 5.1 1.4
## 2 4.9 1.4
## 3 4.7 1.3
## # ... with 147 more rows
If you use the
contains() function
, you select columns whose names contain a string.
iris_df %>%
select(contains("ng")) %>%
print(n = 3)
## # A tibble: 150 x 2
## Sepal.Length Petal.Length
## <dbl> <dbl>
## 1 5.1 1.4
## 2 4.9 1.4
## 3 4.7 1.3
## # ... with 147 more rows
The most powerful predicate function is
matches()
. It lets you select columns using regular expressions. For the
iris_df example, there is not much that you can select using a regular expression that you cannot select using one of the simpler functions, but consider this table:
df <- tribble(
~a1x, ~ax, ~a2x, ~b2, ~b2y, ~by,
1, 2, 3, 4, 5, 6
)
We can pick the columns whose names contain a digit using the regular expression .*\d.* (the backslash must be escaped in an R string):
df %>%
select(matches(".*\\d.*")) %>%
print(n = 3)
## # A tibble: 1 x 4
## a1x a2x b2 b2y
## <dbl> <dbl> <dbl> <dbl>
## 1 1 3 4 5
If we need the names to have
non-empty strings before and after the digit, we can use this expression:
df %>%
select(matches(".+\\d.+")) %>%
print(n = 3)
## # A tibble: 1 x 3
## a1x a2x b2y
## <dbl> <dbl> <dbl>
## 1 1 3 5
You can also use
select() to rename columns. If you use a named parameter, the columns you select will get the names you use for the function parameters.
iris_df %>%
select(sepal_length = Sepal.Length,
sepal_width = Sepal.Width) %>%
print(n = 3)
## # A tibble: 150 x 2
## sepal_length sepal_width
## <dbl> <dbl>
## 1 5.1 3.5
## 2 4.9 3
## 3 4.7 3.2
## # ... with 147 more rows
Since
select()
removes the columns you do
not select, it might not always be the best approach. A similar function to
select(), where you do not lose columns, is
rename(). With this function, you can
rename columns and keep the remaining columns as well.
iris_df %>%
rename(sepal_length = Sepal.Length,
sepal_width = Sepal.Width) %>%
print(n = 3)
## # A tibble: 150 x 5
## sepal_length sepal_width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3 1.4
## 3 4.7 3.2 1.3
## # ... with 147 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
Filter
While select() extracts a subset of columns, the filter() function
does the same for rows.
The
iris_df contains three different species. We can see this using the
distinct() function. This function is also from the
dplyr package, and it gives you all the unique rows from selected columns.
iris_df %>%
distinct(Species)
## # A tibble: 3 x 1
## Species
## <fct>
## 1 setosa
## 2 versicolor
## 3 virginica
We can select the rows where the species is “setosa” using filter():
iris_df %>%
filter(Species == "setosa") %>%
print(n = 3)
## # A tibble: 50 x 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3 1.4
## 3 4.7 3.2 1.3
## # ... with 47 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
We can combine
filter() and
select()
in a pipeline to get a subset of the columns as well as a subset of the rows.
iris_df %>%
filter(Species == "setosa") %>%
select(ends_with("Length"), Species) %>%
print(n = 3)
## # A tibble: 50 x 3
## Sepal.Length Petal.Length Species
## <dbl> <dbl> <fct>
## 1 5.1 1.4 setosa
## 2 4.9 1.4 setosa
## 3 4.7 1.3 setosa
## # ... with 47 more rows
Generally, we can string together as many
dplyr (or
purrr) functions as we desire in pipelines.
iris_df %>%
filter(Species != "setosa") %>%
distinct(Species) %>%
print(n = 3)
## # A tibble: 2 x 1
## Species
## <fct>
## 1 versicolor
## 2 virginica
We can filter by more than one column:
iris_df %>%
filter(Sepal.Length > 5, Petal.Width < 0.4) %>%
print(n = 3)
## # A tibble: 15 x 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 5.4 3.7 1.5
## 3 5.8 4 1.2
## # ... with 12 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
We can also use functions as a predicate, for example, the
between() function
to select numbers in a given range.
iris_df %>%
filter(between(Sepal.Width, 2, 2.5)) %>%
print(n = 3)
## # A tibble: 19 x 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 4.5 2.3 1.3
## 2 5.5 2.3 4
## 3 4.9 2.4 3.3
## # ... with 16 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
We cannot use the functions starts_with(), ends_with(), and so on that we use with select(). This does not mean, however, that we cannot filter rows using string patterns. We just need different functions. For example, we can use
str_starts() from the
stringr package. We return to
stringr in Chapter
.
iris_df %>%
filter(str_starts(Species, "v")) %>%
print(n = 3)
## # A tibble: 100 x 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 7 3.2 4.7
## 2 6.4 3.2 4.5
## 3 6.9 3.1 4.9
## # ... with 97 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
iris_df %>%
filter(str_ends(Species, "r")) %>%
print(n = 3)
## # A tibble: 50 x 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 7 3.2 4.7
## 2 6.4 3.2 4.5
## 3 6.9 3.1 4.9
## # ... with 47 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
While filter() selects rows by applying predicates on individual columns, other filter variants will use predicates over more than one column.
The filter_all() function
will apply the predicate over all columns. You must provide an expression where you use . (dot) for the table cell value and wrap the expression in one of two functions: any_vars() or all_vars(). If you use any_vars(), it suffices that one column satisfies the predicate for the row to be included; if you use all_vars(), then all columns must satisfy the predicate.
We can require that any value in the
iris_df data must be larger than five:
iris_df %>%
select(-Species) %>%
filter_all(any_vars(. > 5)) %>%
print(n = 3)
## # A tibble: 118 x 4
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 5.4 3.9 1.7
## 3 5.4 3.7 1.5
## # ... with 115 more rows, and 1 more variable:
## # Petal.Width <dbl>
The expression . > 5 is not meaningful if . is a string, so I had to remove the Species column before I filtered. We rarely want to lose an informative variable like this, however. If you want to keep a column but not apply the predicate to its values, you can use the filter_at() function.
iris_df %>%
filter_at(vars(-Species), any_vars(. > 5)) %>%
print(n = 3)
## # A tibble: 118 x 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 5.4 3.9 1.7
## 3 5.4 3.7 1.5
## # ... with 115 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
You can give
filter_at()
a vector of strings or a vector of variables (unquoted column names). Because I need to negate
Species to select all other columns, I used
vars() here. In other cases, you can use either option.
iris_df %>%
filter_at(c("Petal.Length", "Sepal.Length"),
any_vars(. > 0)) %>%
print(n = 3)
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3 1.4
## 3 4.7 3.2 1.3
## # ... with 147 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
iris_df %>%
filter_at(vars(Petal.Length, Sepal.Length),
any_vars(. > 0)) %>%
print(n = 3)
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4
## 2 4.9 3 1.4
## 3 4.7 3.2 1.3
## # ... with 147 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
You can use filter_if() to choose which columns to filter on, using a predicate applied to the columns themselves:
iris_df %>%
filter_if(is.numeric, all_vars(. < 5)) %>%
print(n = 3)
## # A tibble: 22 x 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 4.9 3 1.4
## 2 4.7 3.2 1.3
## 3 4.6 3.1 1.5
## # ... with 19 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
You can use functions or lambda expressions as the predicate.
Consider a case where we have numbers with missing data in a table. If we want to filter the rows where all values are greater than three, we can use filter_all():
df <- tribble(
~A, ~B, ~C,
1, 2, 3,
4, 5, NA,
11, 12, 13,
22, 22, 1
)
df %>% filter_all(all_vars(. > 3))
## # A tibble: 1 x 3
## A B C
## <dbl> <dbl> <dbl>
## 1 11 12 13
It removes the first and the last row because they contain values smaller than or equal to three. It also deletes the second row because
NA is not considered greater than three. We can restrict the tests to the columns that do not contain missing values:
df %>%
filter_if(~ all(!is.na(.)), all_vars(. > 3))
## # A tibble: 3 x 3
## A B C
## <dbl> <dbl> <dbl>
## 1 4 5 NA
## 2 11 12 13
## 3 22 22 1
This removes column C from the tests. It means that we also keep the last row even though C has a value smaller than three.
If you want to keep the two middle rows but not the first or last, you can write a more complex predicate and use filter_all().
df %>% filter_all(all_vars(is.na(.) | . > 3))
## # A tibble: 2 x 3
## A B C
## <dbl> <dbl> <dbl>
## 1 4 5 NA
## 2 11 12 13
You need the vectorized or-operator | here; the expression must be a vector expression, so you cannot use ||.
Many dplyr functions have _all, _at, and _if variants, but I will only describe the frequently used functions here. For the others, I refer to the package documentation.
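As a flavor of these variants for another verb, here is a sketch of mutate_if() and mutate_at(), which follow the same pattern as the filter variants (this assumes a dplyr version that still exports the _if/_at variants; newer releases express the same idea with across()):

```r
library(dplyr)

iris_df <- as_tibble(iris)

# mutate_if(): transform every column that satisfies a predicate
scaled <- iris_df %>% mutate_if(is.numeric, ~ . / 10)

# mutate_at(): transform explicitly selected columns
rounded <- iris_df %>% mutate_at(vars(starts_with("Sepal")), round)
```

As with filter_at(), vars() lets you use the selection helpers to pick the columns to transform.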
Sorting
If you want to
sort rows by values in selected columns, the function you want is called
arrange()
. You can sort by one or more columns:
iris_df %>%
arrange(Petal.Length) %>%
print(n = 5)
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 4.6 3.6 1
## 2 4.3 3 1.1
## 3 5.8 4 1.2
## 4 5 3.2 1.2
## 5 4.7 3.2 1.3
## # ... with 145 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
iris_df %>%
arrange(Sepal.Length, Petal.Length) %>%
print(n = 5)
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 4.3 3 1.1
## 2 4.4 3 1.3
## 3 4.4 3.2 1.3
## 4 4.4 2.9 1.4
## 5 4.5 2.3 1.3
## # ... with 145 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
If you want to
sort in descending order, you can use the function
desc()
.
iris_df %>%
arrange(desc(Petal.Length)) %>%
print(n = 5)
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length
## <dbl> <dbl> <dbl>
## 1 7.7 2.6 6.9
## 2 7.7 3.8 6.7
## 3 7.7 2.8 6.7
## 4 7.6 3 6.6
## 5 7.9 3.8 6.4
## # ... with 145 more rows, and 2 more variables:
## # Petal.Width <dbl>, Species <fct>
Modifying Data Frames
When we work with data frames, we usually want to compute values based on the variables (columns) in our tables. Filtering rows and columns will only get us so far.
The
mutate() function
lets us add columns to a data frame based on expressions that can involve any of the existing columns. Consider a table of widths and heights.
df <- tribble(
~height, ~width,
10, 12,
42, 24,
14, 12
)
We can add an area column using
mutate()
and this expression:
df %>% mutate(area = height * width)
## # A tibble: 3 x 3
## height width area
## <dbl> <dbl> <dbl>
## 1 10 12 120
## 2 42 24 1008
## 3 14 12 168
If you
add columns to a data frame, you can refer to variables you have already added. For example, if your height and width data are in inches and you want them in centimeters, and you also want the area in centimeters squared, you can do this:
cm_per_inch = 2.54
df %>% mutate(
height_cm = cm_per_inch * height,
width_cm = cm_per_inch * width,
area_cm = height_cm * width_cm
)
## # A tibble: 3 x 5
## height width height_cm width_cm area_cm
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10 12 25.4 30.5 774.
## 2 42 24 107. 61.0 6503.
## 3 14 12 35.6 30.5 1084.
The following expression, however, will not work since you cannot refer to variables you are adding later in the
mutate() call.
df %>% mutate(
area_cm = height_cm * width_cm,
height_cm = cm_per_inch * height,
width_cm = cm_per_inch * width
)
If you do not give the new variable a name, that is, if you call mutate() without named parameters, then the expression itself becomes the variable name:
df %>% mutate(cm_per_inch * height)
## # A tibble: 3 x 3
## height width `cm_per_inch * height`
## <dbl> <dbl> <dbl>
## 1 10 12 25.4
## 2 42 24 107.
## 3 14 12 35.6
In this example, the units (inches and centimeters) are not encoded with the data, and in future computations, you might end up mixing the wrong units. If you use the
units package
—which is not part of the Tidyverse—you can make the units explicit, automatically convert between units where that makes sense, and you will get unit type safety as a bonus.
df %>% mutate(
height_in = units::as_units(height, "in"),
width_in = units::as_units(width, "in"),
area_in = height_in * width_in,
height_cm = units::set_units(height_in, "cm"),
width_cm = units::set_units(width_in, "cm"),
area_cm = units::set_units(area_in, "cm^2")
)
## # A tibble: 3 x 8
## height width height_in width_in area_in
## <dbl> <dbl> [in] [in] [in^2]
## 1 10 12 10 12 120
## 2 42 24 42 24 1008
## 3 14 12 14 12 168
## # ... with 3 more variables: height_cm [cm],
## # width_cm [cm], area_cm [cm^2]
This is possible because tibbles can contain classes of various types, which the units package exploits. Tibbles use an interface that lets you define how your own classes should be displayed, but this is beyond the scope of this book.
In the preceding example, we created unit-carrying values for the original data table, but the original height and weight columns are still around. We could use
select() to remove them, but the function
transmute()
already does this for us.
df %>% transmute(
height_in = units::as_units(height, "in"),
width_in = units::as_units(width, "in"),
area_in = height_in * width_in,
height_cm = units::set_units(height_in, "cm"),
width_cm = units::set_units(width_in, "cm"),
area_cm = units::set_units(area_in, "cm^2")
)
## # A tibble: 3 x 6
## height_in width_in area_in height_cm width_cm
## [in] [in] [in^2] [cm] [cm]
## 1 10 12 120 25.40 30.48
## 2 42 24 1008 106.68 60.96
## 3 14 12 168 35.56 30.48
## # ... with 1 more variable: area_cm [cm^2]
When you use
transmute()
instead of mutate(), the variables that you are not assigning to will be removed.
As with filtering, the expressions you use when mutating a data frame must be
vector expressions. Consider this example:
df <- tibble(
x = rnorm(3, mean = 12, sd = 5),
)
my_abs <- function(x) if (x < 0) -x else x
df %>% mutate(my_abs(x))
## Warning in if (x < 0) -x else x: the condition has
## length > 1 and only the first element will be used
## # A tibble: 3 x 2
## x `my_abs(x)`
## <dbl> <dbl>
## 1 16.1 16.1
## 2 -2.07 -2.07
## 3 14.9 14.9
It goes wrong because the function my_abs() (my version of the built-in abs() function) is not a vector expression: the if condition treats x as a single value, so only the first element decides which branch is taken.
If we use the built-in
abs() function
, then there is no problem with the expression; it handles vectors.
df %>% mutate(abs(x))
## # A tibble: 3 x 2
## x `abs(x)`
## <dbl> <dbl>
## 1 16.1 16.1
## 2 -2.07 2.07
## 3 14.9 14.9
If you use the
ifelse() vector version of
if expressions, you get a vector expression.
ifelse_abs <- function(x) ifelse(x < 0, -x, x)
df %>% mutate(ifelse_abs(x))
## # A tibble: 3 x 2
## x `ifelse_abs(x)`
## <dbl> <dbl>
## 1 16.1 16.1
## 2 -2.07 2.07
## 3 14.9 14.9
Or you can use the
Vectorize() function
to make a vector expression out of a function that does not handle vectors.
my_abs <- Vectorize(my_abs)
df %>% mutate(my_abs(x))
## # A tibble: 3 x 2
## x `my_abs(x)`
## <dbl> <dbl>
## 1 16.1 16.1
## 2 -2.07 2.07
## 3 14.9 14.9
If you need to map the input to several different output values, you can nest ifelse() expressions arbitrarily deep, but it gets difficult to read. A function that alleviates this problem substantially is case_when(). You can give it a sequence of predicates with matching expressions, and it will return the expression for the first matching predicate.
Consider this example where we use
case_when()
to categorize the variable
x based on its value:
df <- tibble(x = rnorm(100))
df %>%
mutate(
x_category = case_when(
x - mean(x) < -2 * sd(x) ~ "small",
x - mean(x) > 2 * sd(x) ~ "large",
TRUE ~ "medium"
)
) %>%
print(n = 3)
## # A tibble: 100 x 2
## x x_category
## <dbl> <chr>
## 1 -0.311 medium
## 2 0.0992 medium
## 3 -1.26 medium
## # ... with 97 more rows
The TRUE line corresponds to the else part of an if-else statement.
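If no predicate matches an element and you leave out the TRUE fallback, case_when() returns NA for that element, which is a common source of surprises. A small sketch:

```r
library(dplyr)

x <- c(-1, 0, 2)

# Without a final TRUE clause, unmatched values (here 0) become NA
signs <- case_when(
  x < 0 ~ "negative",
  x > 0 ~ "positive"
)
```

The result is "negative", NA, "positive", so it is usually safest to end with a TRUE clause unless you want NA for the leftover cases.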
Grouping and Summarizing
In many analyses, we need summary statistics of our data. If you can map one or more of your columns into a single summary, you can use the
summarise() function
.
df <- tibble(x = rnorm(100), y = rnorm(100))
df %>% summarise(mean_x = mean(x), mean_y = mean(y))
## # A tibble: 1 x 2
## mean_x mean_y
## <dbl> <dbl>
## 1 -0.0529 0.0141
In this example, we summarized our data as the mean of variables x and y.
If you split your data into different classes and want to work on the data per group, you can use the function
group_by()
.
classify <- function(x) {
case_when(
x - mean(x) < -2 * sd(x) ~ "small",
x - mean(x) > 2 * sd(x) ~ "large",
TRUE ~ "medium"
)
}
df %>%
mutate(x_category = classify(x)) %>%
group_by(x_category) %>%
print(n = 3)
## # A tibble: 100 x 3
## # Groups: x_category [2]
## x y x_category
## <dbl> <dbl> <chr>
## 1 -0.702 0.959 medium
## 2 1.10 0.557 medium
## 3 -0.667 0.399 medium
## # ... with 97 more rows
The result is a data frame that, when we print it, doesn’t look different from before we grouped the data.
df %>%
mutate(x_category = classify(x)) %>%
print(n = 3)
## # A tibble: 100 x 3
## x y x_category
## <dbl> <dbl> <chr>
## 1 -0.702 0.959 medium
## 2 1.10 0.557 medium
## 3 -0.667 0.399 medium
## # ... with 97 more rows
In the header, however, you will notice the line # Groups: x_category [2]. This tells you that something is different when you have grouped your data. The difference becomes apparent when you combine grouping with summarizing.
df %>%
mutate(x_category = classify(x)) %>%
group_by(x_category) %>%
summarise(mean_x = mean(x), no_x = n())
## # A tibble: 2 x 3
## x_category mean_x no_x
## <chr> <dbl> <int>
## 1 large 2.34 4
## 2 medium -0.153 96
When you group your data and then summarize, you get a per-group summary statistics. Here, we calculate the mean and the number of observations in each category (the function n() gives you the number of observations).
You can get the variables your data is grouped by using the
group_vars() function
.
df %>%
mutate(x_category = classify(x)) %>%
group_by(x_category) %>%
group_vars()
## [1] "x_category"
You can group data by more than one
variable.
df <- tibble(x = rnorm(100), y = rnorm(100))
df %>%
mutate(
x_category = classify(x),
y_category = classify(y)
) %>%
group_by(x_category, y_category) %>%
group_vars()
## [1] "x_category" "y_category"
If you do, you will get summaries for each combination of the grouping variables.
df %>%
mutate(
x_category = classify(x),
y_category = classify(y)
) %>%
group_by(x_category, y_category) %>%
summarise(mean_x = mean(x), mean_y = mean(y))
## # A tibble: 6 x 4
## # Groups: x_category [3]
## x_category y_category mean_x mean_y
## <chr> <chr> <dbl> <dbl>
## 1 large medium 2.02 -0.836
## 2 medium large 0.871 1.98
## 3 medium medium -0.118 0.0213
## 4 medium small -0.679 -2.69
## 5 small large -3.48 1.96
## 6 small medium -2.74 -0.0693
If you group by one variable and summarize others, you get a summary of each summarized variable for each level of the grouping variable.
df %>%
mutate(
x_category = classify(x),
y_category = classify(y)
) %>%
group_by(x_category) %>%
summarise(mean_x = mean(x), mean_y = mean(y))
## # A tibble: 3 x 3
## x_category mean_x mean_y
## <chr> <dbl> <dbl>
## 1 large 2.02 -0.836
## 2 medium -0.114 0.0133
## 3 small -2.98 0.607
Here, we calculate the mean of x and y for each category of x value. In general, you can use any number of variables to group the data and any number of variables to summarize, and the summaries will be computed per group.
When you have grouped your data by more than one variable, a summary will remove the last group you added, that is, the last variable you grouped by. A summary gives you a new data frame containing the grouping variable and the summaries. If you have grouped with more than one variable, this data frame will be grouped by all but the last grouping variables.
For example, if we group by variables
A and
B and then summarise another variable,
C, the result will be grouped by
A and not
B. If you group by
A,
B, and
D, the result will be a data frame grouped by
A and
B.
df2 <- tribble(
~A, ~B, ~C, ~D,
"left", "up", 2, "yes",
"right", "up", 5, "no",
"left", "down", 2, "yes",
"left", "down", 7, "no",
"left", "down", 3, "no",
"right", "up", 8, "yes",
"right", "up", 2, "yes",
"right", "up", 8, "no"
)
df2 %>% group_by(A, B) %>%
summarise(min_c = min(C), max_c = max(C))
## # A tibble: 3 x 4
## # Groups: A [2]
## A B min_c max_c
## <chr> <chr> <dbl> <dbl>
## 1 left down 2 7
## 2 left up 2 2
## 3 right up 2 8
df2 %>% group_by(A, B) %>%
summarise(min_c = min(C), max_c = max(C)) %>%
summarise(max_diff = max(max_c - min_c))
## # A tibble: 2 x 2
## A max_diff
## <chr> <dbl>
## 1 left 5
## 2 right 6
df2 %>% group_by(A, B, D) %>%
summarise(min_c = min(C), max_c = max(C))
## # A tibble: 5 x 5
## # Groups: A, B [3]
## A B D min_c max_c
## <chr> <chr> <chr> <dbl> <dbl>
## 1 left down no 3 7
## 2 left down yes 2 2
## 3 left up yes 2 2
## 4 right up no 5 8
## 5 right up yes 2 8
df2 %>% group_by(A, B, D) %>%
summarise(min_c = min(C), max_c = max(C)) %>%
summarise(max_diff = max(max_c - min_c))
## # A tibble: 3 x 3
## # Groups: A [2]
## A B max_diff
## <chr> <chr> <dbl>
## 1 left down 4
## 2 left up 0
## 3 right up 6
The order you use in
group_by()
is important here. If you flip the variables, you will get a data frame that is still grouped by
B instead.
df2 %>% group_by(B, A) %>%
summarise(min_c = min(C), max_c = max(C))
## # A tibble: 3 x 4
## # Groups: B [2]
## B A min_c max_c
## <chr> <chr> <dbl> <dbl>
## 1 down left 2 7
## 2 up left 2 2
## 3 up right 2 8
The way grouping variables are removed one by one each time you call
summarise()
is a potential source of errors. Say you have variables A, B, C, and D, grouped by A, B, and D. Within each combination of these, you define a distance as the maximum value of C minus its minimum value. You want to know the smallest difference between maximum and minimum in the overall data. You might, reasonably, try this:
First, you compute the minimum and maximum per group:
df2 %>% group_by(A, B, D) %>%
summarise(min_c = min(C), max_c = max(C))
Then you try another
summarise()
to find the minimum between
max_c and
min_c for each row in the result:
df2 %>% group_by(A, B, D) %>%
summarise(min_c = min(C), max_c = max(C)) %>%
summarise(min_diff = min(max_c - min_c))
## # A tibble: 3 x 3
## # Groups: A [2]
## A B min_diff
## <chr> <chr> <dbl>
## 1 left down 0
## 2 left up 0
## 3 right up 3
This, however, gives you the smallest difference for each combination of A and B, but not the overall smallest. The first summarise only removes the D group (and the second removes B).
If you have grouped your data, or you get grouped data from one summary as in the preceding example, you can remove the groups again using
ungroup()
. Once you have done this, you can compute global summaries again.
df2 %>% group_by(A, B, D) %>%
summarise(min_c = min(C), max_c = max(C)) %>%
ungroup() %>%
summarise(min_diff = min(max_c - min_c))
## # A tibble: 1 x 1
## min_diff
## <dbl>
## 1 0
Grouping is not only useful when collecting summaries of your data. You can also add columns to your data frame based on per-group computations using mutate():
df %>%
mutate(
x_category = classify(x),
y_category = classify(y)
) %>%
group_by(x_category) %>%
mutate(mean_x = mean(x), mean_y = mean(y)) %>%
print(n = 5)
## # A tibble: 100 x 6
## # Groups: x_category [3]
## x y x_category y_category mean_x
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 0.352 1.23 medium medium -0.114
## 2 -2.84 -0.705 small medium -2.98
## 3 0.332 -0.0964 medium medium -0.114
## 4 1.37 -1.28 medium medium -0.114
## 5 -0.893 -0.373 medium medium -0.114
## # ... with 95 more rows, and 1 more variable:
## # mean_y <dbl>
When you group your data and then mutate, the new columns hold per-group summary values, but these are combined with the existing data rather than extracted into a new data frame as summarise() would do.
A
mutate()
without a
group_by() will give you summaries for the entire data.
df %>%
mutate(
x_category = classify(x),
y_category = classify(y)
) %>%
mutate(mean_x = mean(x), mean_y = mean(y))
## # A tibble: 100 x 6
## x y x_category y_category mean_x
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 0.352 1.23 medium medium -0.157
## 2 -2.84 -0.705 small medium -0.157
## 3 0.332 -0.0964 medium medium -0.157
## 4 1.37 -1.28 medium medium -0.157
## 5 -0.893 -0.373 medium medium -0.157
## 6 -0.561 0.465 medium medium -0.157
## 7 -1.14 -0.874 medium medium -0.157
## 8 0.812 -0.897 medium medium -0.157
## 9 0.162 0.634 medium medium -0.157
## 10 -1.12 -0.311 medium medium -0.157
## # ... with 90 more rows, and 1 more variable:
## # mean_y <dbl>
In contrast, if you group before adding variables to the data, then the summaries are per group.
df %>%
mutate(
x_category = classify(x),
y_category = classify(y)
) %>%
group_by(x_category) %>%
mutate(mean_x = mean(x), mean_y = mean(y))
## # A tibble: 100 x 6
## # Groups: x_category [3]
## x y x_category y_category mean_x
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 0.352 1.23 medium medium -0.114
## 2 -2.84 -0.705 small medium -2.98
## 3 0.332 -0.0964 medium medium -0.114
## 4 1.37 -1.28 medium medium -0.114
## 5 -0.893 -0.373 medium medium -0.114
## 6 -0.561 0.465 medium medium -0.114
## 7 -1.14 -0.874 medium medium -0.114
## 8 0.812 -0.897 medium medium -0.114
## 9 0.162 0.634 medium medium -0.114
## 10 -1.12 -0.311 medium medium -0.114
## # ... with 90 more rows, and 1 more variable:
## # mean_y <dbl>
You can combine data-wide summaries with grouped summaries if you calculate the
global summaries before you group.
df %>%
mutate(
x_category = classify(x),
y_category = classify(y)
) %>%
mutate(mean_y = mean(y)) %>%
group_by(x_category) %>%
mutate(mean_x = mean(x)) %>%
distinct(mean_x, mean_y)
## # A tibble: 3 x 3
## # Groups: x_category [3]
## mean_x mean_y x_category
## <dbl> <dbl> <chr>
## 1 -0.114 0.0142 medium
## 2 -2.98 0.0142 small
## 3 2.02 0.0142 large
Here I used
mutate() + distinct()
rather than summarise() to keep all the variables in the output. With
summarise()
, I would only keep the grouping variables and the summaries.
If you need to compute summaries with different groups, for example, the mean for
x for each
x_category as well as the mean of
y for each
y_category, then you can call
group_by(x_category) for summarizing
x, followed by
group_by(y_category) for changing the grouping so you can summarize
y.
df %>%
mutate(
x_category = classify(x),
y_category = classify(y)
) %>%
group_by(x_category) %>%
mutate(mean_x = mean(x)) %>%
group_by(y_category) %>%
mutate(mean_y = mean(y)) %>%
distinct(
x_category, mean_x,
y_category, mean_y
)
## # A tibble: 6 x 4
## # Groups: y_category [3]
## x_category mean_x y_category mean_y
## <chr> <dbl> <chr> <dbl>
## 1 medium -0.114 medium 0.00175
## 2 small -2.98 medium 0.00175
## 3 medium -0.114 large 1.97
## 4 medium -0.114 small -2.69
## 5 small -2.98 large 1.97
## 6 large 2.02 medium 0.00175
Joining Tables
It is not uncommon to have your data in more than one table. This could be because the tables are created from different calculations, for example, different kinds of summaries, or because you got the data from different files.
If you merely need to combine tables by row or column, then you can use the
bind_rows()
and
bind_cols()
functions, which do precisely what you would expect.
df1 <- tibble(
A = paste0("a", 1:2),
B = paste0("b", 1:2)
)
df2 <- tibble(
A = paste0("a", 3:4),
B = paste0("b", 3:4)
)
df3 <- tibble(
C = paste0("c", 1:2),
D = paste0("d", 1:2)
)
bind_rows(df1, df2)
## # A tibble: 4 x 2
## A B
## <chr> <chr>
## 1 a1 b1
## 2 a2 b2
## 3 a3 b3
## 4 a4 b4
bind_cols(df1, df3)
## # A tibble: 2 x 4
## A B C D
## <chr> <chr> <chr> <chr>
## 1 a1 b1 c1 d1
## 2 a2 b2 c2 d2
When you combine rows, the tables must have the same columns, and when you combine by column, the tables must have the same number of rows.
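One detail worth knowing, and a difference from base R's rbind(): if the tables you give bind_rows() do not have exactly the same columns, it does not fail but fills the missing cells with NA. A quick sketch with the df1 and df3 from above:

```r
library(dplyr)

df1 <- tibble(A = paste0("a", 1:2), B = paste0("b", 1:2))
df3 <- tibble(C = paste0("c", 1:2), D = paste0("d", 1:2))

bind_rows(df1, df3)
# A 4 x 4 tibble with columns A, B, C, D; the cells a table
# does not supply are NA.
```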
If you have tables that represent different relations between variables—the underlying principle of relational databases aimed at avoiding duplicated data—then you can combine them using join functions.
Say you have a table that maps students to grades for each class you teach. You can join tables from two different classes using
inner_join()
. You use a key to join them on, specified by argument
by.
grades_maths <- tribble(
~name, ~grade,
"Marko Polo", "D",
"Isaac Newton", "A+",
"Charles Darwin", "B"
)
grades_biology <- tribble(
~name, ~grade,
"Marko Polo", "F",
"Isaac Newton", "D",
"Charles Darwin", "A+"
)
inner_join(grades_maths, grades_biology, by = "name")
## # A tibble: 3 x 3
## name grade.x grade.y
## <chr> <chr> <chr>
## 1 Marko Polo D F
## 2 Isaac Newton A+ D
## 3 Charles Darwin B A+
This tells
inner_join()
that you want to combine all rows in the first table with all rows in the second, where the two rows have the same name. You can use more than one key in a join if you give by a vector of variable names.
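As a sketch of a multi-key join, here are two hypothetical tables keyed by both name and year; joining on c("name", "year") makes sure a row from one year is never matched against a row from another:

```r
library(dplyr)

# Hypothetical tables keyed by (name, year):
enrolment <- tribble(
  ~name,          ~year, ~class,
  "Marko Polo",   2019,  "maths",
  "Marko Polo",   2020,  "biology",
  "Isaac Newton", 2019,  "physics"
)
payments <- tribble(
  ~name,          ~year, ~fee,
  "Marko Polo",   2019,  100,
  "Marko Polo",   2020,  110,
  "Isaac Newton", 2020,  120
)

inner_join(enrolment, payments, by = c("name", "year"))
# Only the (name, year) pairs found in both tables survive:
# the two Marko Polo rows. Isaac Newton's years do not match.
```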
In the previous example, each name appears once per table. If a key appears more than once, then the result of an inner join will have a row for every combination of the rows sharing that key.
grades_maths2 <- tribble(
~name, ~grade,
"Marko Polo", "D",
"Isaac Newton", "A+", # so good at physics
"Isaac Newton", "A+", # that he got an A+ twice
"Charles Darwin", "B"
)
grades_biology2 <- tribble(
~name, ~grade,
"Marko Polo", "F",
"Isaac Newton", "D",
"Charles Darwin", "A+", # so good at biology that we
"Charles Darwin", "A+" # listed him twice
)
inner_join(grades_maths2, grades_biology2, by = "name")
## # A tibble: 5 x 3
## name grade.x grade.y
## <chr> <chr> <chr>
## 1 Marko Polo D F
## 2 Isaac Newton A+ D
## 3 Isaac Newton A+ D
## 4 Charles Darwin B A+
## 5 Charles Darwin B A+
inner_join(grades_maths2, grades_biology2, by = "grade")
## # A tibble: 5 x 3
## name.x grade name.y
## <chr> <chr> <chr>
## 1 Marko Polo D Isaac Newton
## 2 Isaac Newton A+ Charles Darwin
## 3 Isaac Newton A+ Charles Darwin
## 4 Isaac Newton A+ Charles Darwin
## 5 Isaac Newton A+ Charles Darwin
In the last join, you see that you can get the same line multiple times from an inner join. Combine the join with
distinct()
if you want to avoid this.
inner_join(grades_maths2, grades_biology2, by = "grade") %>%
distinct()
## # A tibble: 2 x 3
## name.x grade name.y
## <chr> <chr> <chr>
## 1 Marko Polo D Isaac Newton
## 2 Isaac Newton A+ Charles Darwin
The tables can have different variables—if they represent different relations—or they can share several or all variables. In the grades example, each table contains the same name/grade relationships, just for different classes. Therefore, they also share variables.
With shared variables, the data from the two tables can be differentiated by a suffix. The default (see previous example) is “.x” and “.y”. You can change that using the
suffix argument
.
inner_join(
grades_maths, grades_biology,
by = "name", suffix = c(".maths", ".biology")
)
## # A tibble: 3 x 3
## name grade.maths grade.biology
## <chr> <chr> <chr>
## 1 Marko Polo D F
## 2 Isaac Newton A+ D
## 3 Charles Darwin B A+
Students might take different classes, and you have several choices on how to combine tables with different keys.
An
inner_join()
will only give you the rows where a key is in both tables.
grades_geography <- tribble(
~name, ~grade,
"Marko Polo", "A",
"Charles Darwin", "A",
"Immanuel Kant", "A+"
)
grades_physics <- tribble(
~name, ~grade,
"Isaac Newton", "A+",
"Albert Einstein", "A+",
"Charles Darwin", "C"
)
inner_join(
grades_geography, grades_physics,
by = "name", suffix = c(".geography", ".physics")
)
## # A tibble: 1 x 3
## name grade.geography grade.physics
## <chr> <chr> <chr>
## 1 Charles Darwin A C
The
full_join() function
, in contrast, gives you a table containing all keys. If a key is only found in one of the tables, the variables from the other table will be set to
NA.
full_join(
grades_geography, grades_physics,
by = "name", suffix = c(".geography", ".physics")
)
## # A tibble: 5 x 3
## name grade.geography grade.physics
## <chr> <chr> <chr>
## 1 Marko Polo A <NA>
## 2 Charles Darwin A C
## 3 Immanuel Kant A+ <NA>
## 4 Isaac Newton <NA> A+
## 5 Albert Einstein <NA> A+
If you want all keys from the left or right table (but not both left and right)—potentially with
NA if the other table does not have the key—then you need
left_join() or
right_join().
left_join(
grades_geography, grades_physics,
by = "name", suffix = c(".geography", ".physics")
)
## # A tibble: 3 x 3
## name grade.geography grade.physics
## <chr> <chr> <chr>
## 1 Marko Polo A <NA>
## 2 Charles Darwin A C
## 3 Immanuel Kant A+ <NA>
right_join(
grades_maths, grades_physics,
by = "name", suffix = c(".maths", ".physics")
)
## # A tibble: 3 x 3
## name grade.maths grade.physics
## <chr> <chr> <chr>
## 1 Isaac Newton A+ A+
## 2 Albert Einstein <NA> A+
## 3 Charles Darwin B C
A
semi_join()
will give you all the rows in the first table that contain a key found in the second table.
semi_join(grades_maths2, grades_biology2, by = "name")
## # A tibble: 4 x 2
## name grade
## <chr> <chr>
## 1 Marko Polo D
## 2 Isaac Newton A+
## 3 Isaac Newton A+
## 4 Charles Darwin B
You only get one copy per row in the first table when a key matches, regardless of how many times the key appears in the second table.
This is unlike
inner_join()
, where you get all combinations of the matching rows from the two tables. A
semi_join() is, therefore, not equivalent to an
inner_join() combined with a
select().
inner_join(
grades_maths2, grades_biology2,
by = "name", suffix = c(".maths", ".biology")
) %>% select(1:2)
## # A tibble: 5 x 2
## name grade.maths
## <chr> <chr>
## 1 Marko Polo D
## 2 Isaac Newton A+
## 3 Isaac Newton A+
## 4 Charles Darwin B
## 5 Charles Darwin B
You can still get multiple identical rows from a
semi_join()
. If the first table has duplicated rows, you will get the same duplications of the rows. If you do not want that, you can combine the join with distinct().
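With the grades_maths2 table from above, which lists Isaac Newton twice, the duplicated row survives a semi_join(), and a trailing distinct() collapses it:

```r
library(dplyr)

grades_maths2 <- tribble(
  ~name, ~grade,
  "Marko Polo", "D",
  "Isaac Newton", "A+",
  "Isaac Newton", "A+",
  "Charles Darwin", "B"
)
grades_biology2 <- tribble(
  ~name, ~grade,
  "Marko Polo", "F",
  "Isaac Newton", "D",
  "Charles Darwin", "A+",
  "Charles Darwin", "A+"
)

semi_join(grades_maths2, grades_biology2, by = "name") %>%
  distinct()
# Three rows: one per student, with the duplicated
# Isaac Newton row collapsed into one.
```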
An
anti_join()
gives you all the rows in the first table where the key or keys are not found in the second table. Think of it as the complement of
semi_join().
anti_join(grades_maths2, grades_physics, by = "name")
## # A tibble: 1 x 2
## name grade
## <chr> <chr>
## 1 Marko Polo D
The join functions only take two tables as input, so you might wonder how you can combine multiple tables. One solution is to use
purrr’s
reduce() function
:
grades <- list(
grades_maths, grades_biology,
grades_geography, grades_physics
)
grades %>%
reduce(full_join, by = "name") %>%
rename_at(2:5, ~ c("maths", "biology", "geography", "physics"))
## # A tibble: 5 x 5
## name maths biology geography physics
## <chr> <chr> <chr> <chr> <chr>
## 1 Marko Polo D F A <NA>
## 2 Isaac Newton A+ D <NA> A+
## 3 Charles Darwin B A+ A C
## 4 Immanuel Kant <NA> <NA> A+ <NA>
## 5 Albert Einstein <NA> <NA> <NA> A+
The rename_at() function
works similarly to select_at(), and here, I use it to rename the last four columns.
Income in Fictional Countries
We had an example with income in fictional countries in Chapter . There, we did not have the proper tools needed to deal with missing data. We wanted to replace NA with the mean income for a country in rows with missing data, but the functions from tidyr didn’t suffice. With
dplyr
we have the tools.
Recall the data:
mean_income <- tribble(
~country, ~`2002`, ~`2003`, ~`2004`, ~`2005`,
"Numenor", 123456, 132654, NA, 324156,
"Westeros", 314256, NA, NA, 465321,
"Narnia", 432156, NA, NA, NA,
"Gondor", 531426, 321465, 235461, 463521,
"Laputa", 14235, 34125, 45123, 51234,
)
The following pipeline does what we want. I will explain it step by step below, but first you can try to work out what it does yourself.
mean_income %>%
gather(
key = "year",
value = "mean_income",
-country
) %>% group_by(
country
) %>% mutate(
mean_per_country = mean(mean_income, na.rm = TRUE),
mean_income = ifelse(
is.na(mean_income),
mean_per_country,
mean_income
)
) %>% spread(key = "year", value = "mean_income")
## # A tibble: 5 x 6
## # Groups: country [5]
## country mean_per_country `2002` `2003` `2004`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Gondor 387968. 531426 3.21e5 2.35e5
## 2 Laputa 36179. 14235 3.41e4 4.51e4
## 3 Narnia 432156 432156 4.32e5 4.32e5
## 4 Numenor 193422 123456 1.33e5 1.93e5
## 5 Wester... 389788. 314256 3.90e5 3.90e5
## # ... with 1 more variable: `2005` <dbl>
The first step in the pipeline is reformatting the data. We saw how this works in Chapter .
mean_income %>%
gather(
key = "year",
value = "mean_income",
-country
)
## # A tibble: 20 x 3
## country year mean_income
## <chr> <chr> <dbl>
## 1 Numenor 2002 123456
## 2 Westeros 2002 314256
## 3 Narnia 2002 432156
## 4 Gondor 2002 531426
## 5 Laputa 2002 14235
## 6 Numenor 2003 132654
## 7 Westeros 2003 NA
## 8 Narnia 2003 NA
## 9 Gondor 2003 321465
## 10 Laputa 2003 34125
## 11 Numenor 2004 NA
## 12 Westeros 2004 NA
## 13 Narnia 2004 NA
## 14 Gondor 2004 235461
## 15 Laputa 2004 45123
## 16 Numenor 2005 324156
## 17 Westeros 2005 465321
## 18 Narnia 2005 NA
## 19 Gondor 2005 463521
## 20 Laputa 2005 51234
In the next two steps, we group by country, so the means we calculate are per country rather than over the full data set, and then we compute the means and replace the NA cells with them. To get a glimpse into the
mutate()
call, you can replace it with
summarise() and see the mean for each country.
mean_income %>%
gather(
key = "year",
value = "mean_income",
-country
) %>% group_by(
country
) %>% summarise(
per_country_mean = mean(mean_income, na.rm = TRUE)
)
## # A tibble: 5 x 2
## country per_country_mean
## <chr> <dbl>
## 1 Gondor 387968.
## 2 Laputa 36179.
## 3 Narnia 432156
## 4 Numenor 193422
## 5 Westeros 389788.
In the next step, we can replace the missing data with their country mean in the mutate() call because we can refer to earlier columns when we create later columns in mutate().
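This sequential behavior of mutate() is easy to demonstrate with a toy tibble; later expressions in the same call can use columns created by earlier ones:

```r
library(dplyr)

tibble(x = c(1, 2, 3)) %>%
  mutate(
    double_x = 2 * x,
    plus_one = double_x + 1  # uses double_x, created just above
  )
# double_x is 2, 4, 6 and plus_one is 3, 5, 7.
```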
If you had a pipeline with additional steps between the
group_by() call and the
mutate() call that sets
mean_income
, you would have to be careful: do
not put an
ungroup() in there. The pipeline works because
mean_income is computed per country when we call
mutate(). If we ungrouped, we would get the mean over
all countries.
mean_income %>%
gather(
key = "year",
value = "mean_income",
-country
) %>% group_by(
country
) %>% mutate(
mean_per_country = mean(mean_income, na.rm = TRUE)
)
## # A tibble: 20 x 4
## # Groups: country [5]
## country year mean_income mean_per_country
## <chr> <chr> <dbl> <dbl>
## 1 Numenor 2002 123456 193422
## 2 Westeros 2002 314256 389788.
## 3 Narnia 2002 432156 432156
## 4 Gondor 2002 531426 387968.
## 5 Laputa 2002 14235 36179.
## 6 Numenor 2003 132654 193422
## 7 Westeros 2003 NA 389788.
## 8 Narnia 2003 NA 432156
## 9 Gondor 2003 321465 387968.
## 10 Laputa 2003 34125 36179.
## 11 Numenor 2004 NA 193422
## 12 Westeros 2004 NA 389788.
## 13 Narnia 2004 NA 432156
## 14 Gondor 2004 235461 387968.
## 15 Laputa 2004 45123 36179.
## 16 Numenor 2005 324156 193422
## 17 Westeros 2005 465321 389788.
## 18 Narnia 2005 NA 432156
## 19 Gondor 2005 463521 387968.
## 20 Laputa 2005 51234 36179.
In the last step, we
ungroup and update the table before reformatting the tibble again.
mean_income %>%
gather(
key = "year",
value = "mean_income",
-country
) %>% group_by(
country
) %>% mutate(
mean_per_country = mean(mean_income, na.rm = TRUE)
) %>% ungroup(
) %>% mutate(
mean_income = ifelse(
is.na(mean_income),
mean_per_country,
mean_income
)
) %>% spread(key = "year", value = "mean_income")
## # A tibble: 5 x 6
## country mean_per_country `2002` `2003` `2004`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Gondor 387968. 531426 3.21e5 2.35e5
## 2 Laputa 36179. 14235 3.41e4 4.51e4
## 3 Narnia 432156 432156 4.32e5 4.32e5
## 4 Numenor 193422 123456 1.33e5 1.93e5
## 5 Wester... 389788. 314256 3.90e5 3.90e5
## # ... with 1 more variable: `2005` <dbl>
This is how we replace all missing data with the mean for the corresponding countries.
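As a side note, dplyr's coalesce() function, which picks the first non-NA value, gives a more compact way to write the replacement step than ifelse(). A sketch on a cut-down version of the data:

```r
library(dplyr)
library(tidyr)

mean_income <- tribble(
  ~country,  ~`2002`, ~`2003`,
  "Numenor", 123456,  NA,
  "Laputa",  14235,   34125
)

mean_income %>%
  gather(key = "year", value = "mean_income", -country) %>%
  group_by(country) %>%
  mutate(
    mean_income = coalesce(mean_income, mean(mean_income, na.rm = TRUE))
  ) %>%
  ungroup() %>%
  spread(key = "year", value = "mean_income")
# Numenor's missing 2003 value becomes its country mean, 123456.
```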