Even if we only consider tabular data, there are still many different ways to format this data. The packages in the Tidyverse
expect that data is represented
as so-called “tidy data”1 (which is where the name Tidyverse comes from). The tidyr package
helps you with formatting your data into tidy data.
You can load
tidyr as part of the tidyverse:
Tidy Data
The fundamental properties that characterize tidy data are that each variable is in a column, and each observation is a row. The terms like variables and observations should be familiar from statistics, but what the tidy data properties
say is that you should not put the values of a variable in different columns and you should not put more than one observation in the same row.
Consider data such as this:
mean_income<-tribble(
~country,~ '2001',~ '2002',~ '2003',~ '2004',~ '2005',
"Numenor", 123456, 132654, 321646, 324156, 325416,
"Westeros", 314256, 432165, 546123, 465321, 561423,
"Narnia", 432156, 342165, 564123, 543216, 465321,
"Gondor", 531426, 321465, 235461, 463521, 561423,
"Laputa", 14235, 34125, 45123, 51234, 54321
)
We could imagine that this table contains the mean income in five (fictional) countries in the first five years of the 21st century. Let us say that you are interested in knowing if there is a pattern in income over time when taking the different countries into account. To do this, you might consider a formula such as income ~ country + year, but the data is not formatted in a way that makes this easy. If year is a variable that you can use in the model, it is a problem that different observations, the actual years, are split into different columns. Similarly, if the mean income per year is a variable, it is a problem that all the observations for a country are in the same row. You cannot use income ~ country + year because the data is not tidy.
Pivoting
The function
pivot_longer() can merge
several columns into two—a key and a value column. The key column will be a discrete variable with values taken from the column names, and the value column will contain the data from the original columns.
Consider the mean income data frame. We want the actual years to be different outcomes of a year variable and have the income for each year to be another variable,
income. We can get such a tibble by calling
pivot_longer() with these parameters:
mean_income |> pivot_longer(
names_to = "year",
values_to = "income",
cols = c(`2001`, `2001`, `2002`,
`2003`, `2004`, `2005`)
)
## # A tibble: 25 × 3
## country year income
## <chr> <chr> <dbl>
## 1 Numenor 2001 123456
## 2 Numenor 2002 132654
## 3 Numenor 2003 321646
## 4 Numenor 2004 324156
## 5 Numenor 2005 325416
## 6 Westeros 2001 314256
## 7 Westeros 2002 432165
## 8 Westeros 2003 546123
## 9 Westeros 2004 465321
## 10 Westeros 2005 561423
## # . . . with 15 more rows
The first argument is the input data from mean_income, and I provide it using the |> operator. This works exactly as if I had given
mean_income
as the first argument to the function call, but using the operator signals to the reader that mean_income is data flowing into the function and different from other arguments because of it.
The names_to and values_to arguments
are the names of two new columns. In the names_to column, we get the names of the columns we get the values from, and in the value column, we get the matching values from those columns. The remaining parameters specify which columns we wish to merge.
In the preceding example, I explicitly listed all the columns to include, but
tidyr’s functions support tidy select (see Chapter
4), so you can also use a range of columns, like this:
mean_income|>pivot_longer(
names_to ="year",
values_to ="income",
cols = '2001':'2005'
)
## # A tibble: 25 × 3
## country year income
## <chr> <chr> <dbl>
## 1 Numenor 2001 123456
## 2 Numenor 2002 132654
## 3 Numenor 2003 321646
## 4 Numenor 2004 324156
## 5 Numenor 2005 325416
## 6 Westeros 2001 314256
## 7 Westeros 2002 432165
## 8 Westeros 2003 546123
## 9 Westeros 2004 465321
## 10 Westeros 2005 561423
## # . . . with 15 more rows
You can use more than one range:
mean_income|>pivot_longer(
names_to ="year",
values_to ="income",
cols =c( '2001':'2002', '2004':'2005')
)
but the
range specification
is only useful when the relevant columns are contiguous. You can also specify the complement of the columns you want to merge using !. For example, if you want to merge the years, you can pick all columns except for
country rather than pick all the years:
mean_income|>pivot_longer(
names_to ="year",
values_to ="income",
cols =!country
)
## # A tibble: 25 × 3
## country year income
## <chr> <chr> <dbl>
## 1 Numenor 2001 123456
## 2 Numenor 2002 132654
## 3 Numenor 2003 321646
## 4 Numenor 2004 324156
## 5 Numenor 2005 325416
## 6 Westeros 2001 314256
## 7 Westeros 2002 432165
## 8 Westeros 2003 546123
## 9 Westeros 2004 465321
## 10 Westeros 2005 561423
## # . . . with 15 more rows
You can specify more than one column in a complement, and you can combine complements with ranges.
Generally, you can combine
pivot_longer() with all the tidy selection functions to select columns, for example, you can use
starts_with() or combine
starts_with() and complement,
!, or anything else the
tidy selection language
allows.
mean_income|>pivot_longer(
names_to ="year",
values_to ="income",
starts_with("2")
)
## # A tibble: 25 × 3
## country year income
## <chr> <chr> <dbl>
## 1 Numenor 2001 123456
## 2 Numenor 2002 132654
## 3 Numenor 2003 321646
## 4 Numenor 2004 324156
## 5 Numenor 2005 325416
## 6 Westeros 2001 314256
## 7 Westeros 2002 432165
## 8 Westeros 2003 546123
## 9 Westeros 2004 465321
## 10 Westeros 2005 561423
## # . . . with 15 more rows
mean_income|>pivot_longer(
names_to ="year",
values_to ="income",
!starts_with("c")
)
## # A tibble: 25 × 3
## country year income
## <chr> <chr> <dbl>
## 1 Numenor 2001 123456
## 2 Numenor 2002 132654
## 3 Numenor 2003 321646
## 4 Numenor 2004 324156
## 5 Numenor 2005 325416
## 6 Westeros 2001 314256
## 7 Westeros 2002 432165
## 8 Westeros 2003 546123
## 9 Westeros 2004 465321
## 10 Westeros 2005 561423
## # . . . with 15 more rows
The column names that go into the
names_to column will be
strings
. If you need another type, such as logical or numeric values, you can achieve that by setting the
names_transform argument. If we want all the years to be represented as integers, we could do
this:
mean_income|>pivot_longer(
names_to ="year",
values_to ="income",
!country,
names_transform =as.integer
)
## # A tibble: 25 × 3
## country year income
## <chr> <int> <dbl>
## 1 Numenor 2001 123456
## 2 Numenor 2002 132654
## 3 Numenor 2003 321646
## 4 Numenor 2004 324156
## 5 Numenor 2005 325416
## 6 Westeros 2001 314256
## 7 Westeros 2002 432165
## 8 Westeros 2003 546123
## 9 Westeros 2004 465321
## 10 Westeros 2005 561423
## # . . . with 15 more rows
Because all the years are integers, the year column now contains integers
. If we want to analyze this data set, we most likely want to interpret the years as integers.
If we wanted factors instead, though, we could also do that:
mean_income|>pivot_longer(
names_to ="year",
values_to ="income",
!country,
names_transform =as.factor
)
There is a similar values_transform argument if you need to transform the values that go into the values_to column.
If you work with functions that do not expect tidy data and rather expect to find related data in different columns, you can translate tidy data into that format using the
pivot_wider() function
. It is the reverse of
pivot_longer().
2
tidy_income<-mean_income|>pivot_longer(
names_to ="year",
values_to ="income",
!country
)
tidy_income|>pivot_wider(
names_from ="year",
values_from ="income"
)
## # A tibble: 5 × 6
## country `2001` `2002` `2003` `2004` `2005`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Numenor 123456 132654 321646 324156 325416
## 2 Westeros 314256 432165 546123 465321 561423
## 3 Narnia 432156 342165 564123 543216 465321
## 4 Gondor 531426 321465 235461 463521 561423
## 5 Laputa 14235 34125 45123 51234 54321
The names_from and values_from arguments pick the values to use as column names and the values to put in those columns, respectively. They are the same arguments that you would use for pivot_longer()’s names_to and values_to arguments to translate the result back again.
Complex Column Encodings
Different variables are not always represented in separated columns. For example, it is not uncommon to have a column that contains a date, but that is really a day and a month and possibly a year. The best representation of a date is, of course, a date object, but for the sake of the example, let us say that we want to split a date into a day and a month column. You can do this using the
separate() function
.
tbl<-tribble(
~date,
"11/5",
"4/7",
"21/12"
)
separate(tbl, date,into =c("day","month"))
## # A tibble: 3 × 2
## day month
## <chr> <chr>
## 1 11 5
## 2 4 7
## 3 21 12
The first argument is the data table, the second the column you want to split, and the into argument is a list of the columns the original date will be put into.
By default, the original column is removed, but if you want to keep it, you can use the
remove argument
.
separate(tbl, date,into =c("day","month"),remove =FALSE)
## # A tibble: 3 × 3
## date day month
## <chr> <chr> <chr>
## 1 11/5 11 5
## 2 4/7 4 7
## 3 21/12 21 12
After separating the fields in the
data column, which is a column of strings, you get a
day and a
month column, and these also contain
strings
. To convert them into numbers, you can use the
convert argument
.
separate(tbl, date,into =c("day","month"),convert =TRUE)
## # A tibble: 3 × 2
## day month
## <int> <int>
## 1 11 5
## 2 4 7
## 3 21 12
A column will be split by default on any non-alphanumeric character. This is why the / in the data is correctly used as the separator between the day and the month.
3 If your separator is something else, you can use the
sep argument. The argument takes a regular expression, and the column will be split on this. For example, we can split on
alphabet characters and spaces
like this:
tbl<-tribble(
~date,
"11th of month 5",
"4th of month 7",
"21st of month 12"
)
separate(tbl, date,into =c("day","month"),
sep ="[[:alpha:][:space:]]+")
## # A tibble: 3 × 2
## day month
## <chr> <chr>
## 1 11 5
## 2 4 7
## 3 21 12
The regular expression
([[:alpha:]]|[[:space:]])+ specifies any nonzero (specified by +) sequence of the character classes in (the outermost) square brackets, where we have put alphabet characters [:alpha:] and spaces [:space:].
The date format in this example is not something you would find in the wild, but if we have the date in this format
tbl<-tribble(
~date,
"11th of May",
"4th of July",
"21st of December"
)
we cannot easily distinguish the separator string
from the months. If the delimiter you need is not a single regular expression, you might be able to use the extract() function
instead of separate(). With extract(), you can use regular expression groups to identify the fields to extract.
You need an expression with groups—those are subexpressions in parentheses—and what those groups match will be the value put in the new columns. For our dates, it could look like this:
tbl|>extract(
col =date,
into =c("day","month"),
regex =paste0(
"([[:digit:]]+)", # First group, the day
"[[:alpha:][:space:]]+", # Stuff we ignore; will match "th of" and such
"[[:space:]]", # The final space before the month name
"([[:alpha:]]+)" # Second group, the month
)
)
## # A tibble: 3 × 1
## date
## <chr>
## 1 11th of May
## 2 4th of July
## 3 21st of December
For both
separate() and
extract(), you can have more than two resulting columns. You can provide more than two names to the
into parameter, and then the expression you use for separating and extracting should provide the right number of values, of course.
tbl|>extract(
col =date,
into =c("day","fluf","month"),
regex =paste0(
"([[:digit:]]+)", # First group, the day
"([[:alpha:]]+)", # st, nd, rd, th, that kind of stuff goes in fluf
"[[:alpha:][:space:]]+",# stuff between day and month; not a group so we throw it away
"([[:alpha:]]+)" # Third group, the month
)
)
## # A tibble: 3 × 1
## date
## <chr>
## 1 11th of May
## 2 4th of July
## 3 21st of December
The reverse of separating is uniting. For example, let
tbl2 contain a
day and a
month column. I’m building such a table in the following code:
tbl<-tribble(
~date,
"11/5",
"4/7",
"21/12"
)
tbl2<-tbl|>separate(col =date,into =c("day","month"))
Having created such a tibble in tbl2, we can use the
unite() function
to undo
separate() much as we found
pivot_longer/wider() to be able to undo each other (with the caveats that things can get lost of course). The col argument should be the column we create and the following columns those we create it from.
tbl2|>unite(col ="date", day, month)
## # A tibble: 3 × 1
## date
## <chr>
## 1 11_5
## 2 4_7
## 3 21_12
By default, the separator in the new column will be
_. This is not what we want for dates, but we can use the
sep argument to change it.
tbl2|>unite(col ="date", day, month,sep ="/")
## # A tibble: 3 × 1
## date
## <chr>
## 1 11/5
## 2 4/7
## 3 21/12
If we want to keep the original columns around, we can set the
remove argument
to
FALSE.
tbl2|>unite(
col ="date", day, month,
sep ="/",remove =FALSE
)
## # A tibble: 3 × 3
## date day month
## <chr> <chr> <chr>
## 1 11/5 11 5
## 2 4/7 4 7
## 3 21/12 21 12
Columns that contain more than one value do not always contain the same number of values. For example, we could have data such as the number of casualties per major group in WW1 and WW2.
military_casualties<-tribble(
~war,~groups,~deaths,
'WW1',
"Allied Powers/Central Powers",
"5.7,4.0",
'WW2',
"Germany/Japan/USSR/British Empire/USA",
"5.3,2.1,10.7,0.6,0.4"
)
The groupings were not the same in the two wars, so we cannot split the data into different columns. We can, however, divide it into more rows using
separate_rows()
.
military_casualties|>separate_rows(
groups, deaths,
sep ="/|,"
)
## # A tibble: 7 × 3
## war groups deaths
## <chr> <chr> <chr>
## 1 WW1 Allied Powers 5.7
## 2 WW1 Central Powers 4.0
## 3 WW2 Germany 5.3
## 4 WW2 Japan 2.1
## 5 WW2 USSR 10.7
## 6 WW2 British Empire 0.6
## 7 WW2 USA 0.4
The expression we use for sep says we should split on / or , (where the | in the string is regular expression speak for “or”).
We might not want the deaths here to be strings, but with
convert = TRUE we can turn them into numbers:
military_casualties|>separate_rows(
groups, deaths,
sep ="/|,",
convert =TRUE
)
## # A tibble: 7 × 3
## war groups deaths
## <chr> <chr> <dbl>
## 1 WW1 Allied Powers 5.7
## 2 WW1 Central Powers 4
## 3 WW2 Germany 5.3
## 4 WW2 Japan 2.1
## 5 WW2 USSR 10.7
## 6 WW2 British Empire 0.6
## 7 WW2 USA 0.4
Expanding, Crossing, and Completing
For some applications, it is useful to create all combinations of values from two or more columns—even those combinations that are missing from the data. For this, you can use the function
expand()
:
tbl<-tribble(
~A,~B,~C,
1,11,21,
2,11,22,
4,13,32
)
tbl|>expand(A, B)
## # A tibble: 6 × 2
## A B
## <dbl> <dbl>
## 1 1 11
## 2 1 13
## 3 2 11
## 4 2 13
## 5 4 11
## 6 4 13
Here, we get all combinations of values from columns A and B while column C is ignored. If a column contains values from a larger set than what is found in the data, you can give
expand()
a vector of values:
tbl |> expand(A = 1:4, B)
## # A tibble: 8 × 2
## A B
## <int> <dbl>
## 1 1 11
## 2 1 13
## 3 2 11
## 4 2 13
## 5 3 11
## 6 3 13
## 7 4 11
## 8 4 13
If you have vectors of values you want to combine this way, you can also use the
crossing() function
.
crossing(A = 1:3, B = 11:13)
## # A tibble: 9 × 2
## A B
## <int> <int>
## 1 1 11
## 2 1 12
## 3 1 13
## 4 2 11
## 5 2 12
## 6 2 13
## 7 3 11
## 8 3 12
## 9 3 13
If you only want the combinations found in the data, you can combine
expand() with the
nesting() function
:
tbl <- tribble(
~A, ~B, ~C,
1, 11, 21,
2, 11, 22,
2, 11, 12,
4, 13, 42,
4, 13, 32
)
expand(tbl, nesting(A, B))
## # A tibble: 3 × 2
## A B
## <dbl> <dbl>
## 1 1 11
## 2 2 11
## 3 4 13
This combination gives you all the combinations of the specified columns—all other columns are ignored—and only the unique combinations, unlike if you extracted the columns by indexing. This is useful if you have data with lots of duplications, but you only need to compute some statistics for the unique combinations.
If you need all combinations in a data frame, but accept missing values for those that are not present, you can use the
complete() function. It makes sure you have all combinations and put missing values in the remaining columns for the combinations that are not in the original data.
complete(tbl, A = 1:4)
## # A tibble: 6 × 3
## A B C
## <dbl> <dbl> <dbl>
## 1 1 11 21
## 2 2 11 22
## 3 2 11 12
## 4 3 NA NA
## 5 4 13 42
## 6 4 13 32
## # A tibble: 6 × 3
## B A C
## <dbl> <dbl> <dbl>
## 1 11 1 21
## 2 11 2 22
## 3 11 2 12
## 4 12 NA NA
## 5 13 4 42
## 6 13 4 32
complete(tbl, A = 1:4, B = 11:13)
## # A tibble: 14 × 3
## A B C
## <dbl> <dbl> <dbl>
## 1 1 11 21
## 2 1 12 NA
## 3 1 13 NA
## 4 2 11 22
## 5 2 11 12
## 6 2 12 NA
## 7 2 13 NA
## 8 3 11 NA
## 9 3 12 NA
## 10 3 13 NA
## 11 4 11 NA
## 12 4 12 NA
## 13 4 13 42
## 14 4 13 32
Missing Values
When data has missing values
, it often requires application domain knowledge to deal with it correctly. The tidyr package has some rudimentary support for cleaning data with missing values; you might need more features to deal with missing values properly, and you are likely to find those in the dplyr package (see Chapter 8).
The simplest way to handle missing values is to get rid of it. A crude approach is to remove all observations with one or more missing variable values. The
drop_na() function does exactly that.
mean_income <- tribble(
~country, ~`2002`, ~`2003`, ~`2004`, ~`2005`,
"Numenor", 123456, 132654, NA, 324156,
"Westeros", 314256, NA, NA, 465321,
"Narnia", 432156, NA, NA, NA,
"Gondor", 531426, 321465, 235461, 463521,
"Laputa", 14235, 34125, 45123, 51234,
)
drop_na(mean_income)
## # A tibble: 2 × 5
## country `2002` `2003` `2004` `2005`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Gondor 531426 321465 235461 463521
## 2 Laputa 14235 34125 45123 51234
In this example, we lose
a country entirely if we have missing data for a single year. This is unlikely to be acceptable and is another good argument for tidy data. If we reformat the table and drop missing values, we will only remove the observations where the income is missing for a given year.
mean_income |> pivot_longer(
names_to = "year",
values_to = "mean_income",
-country
) |> drop_na()
## # A tibble: 14 × 3
## country year mean_income
## <chr> <chr> <dbl>
## 1 Numenor 2002 123456
## 2 Numenor 2003 132654
## 3 Numenor 2005 324156
## 4 Westeros 2002 314256
## 5 Westeros 2005 465321
## 6 Narnia 2002 432156
## 7 Gondor 2002 531426
## 8 Gondor 2003 321465
## 9 Gondor 2004 235461
## 10 Gondor 2005 463521
## 11 Laputa 2002 14235
## 12 Laputa 2003 34125
## 13 Laputa 2004 45123
## 14 Laputa 2005 51234
Sometimes, you can replace missing data with some appropriate data. For example, we could replace missing data in the
mean_income table
by setting each value to the mean of its column. The
replace_na() function does this. It takes a data frame as its first argument and a list specifying replacements as its second argument. This list should have a name for each column you want to replace missing values in, and the value assigned to these names should be the values with which you wish to replace.
replace_na(mean_income, list(
`2003` = mean(mean_income$`2003`, na.rm = TRUE)
))
## # A tibble: 5 × 5
## country `2002` `2003` `2004` `2005`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Numenor 123456 132654 NA 324156
## 2 Westeros 314256 162748 NA 465321
## 3 Narnia 432156 162748 NA NA
## 4 Gondor 531426 321465 235461 463521
## 5 Laputa 14235 34125 45123 51234
replace_na(mean_income, list(
`2003` = mean(mean_income$`2003`, na.rm = TRUE),
`2004` = mean(mean_income$`2004`, na.rm = TRUE),
`2005` = mean(mean_income$`2005`, na.rm = TRUE)
))
## # A tibble: 5 × 5
## country `2002` `2003` `2004` `2005`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Numenor 123456 132654 140292 324156
## 2 Westeros 314256 162748 140292 465321
## 3 Narnia 432156 162748 140292 326058
## 4 Gondor 531426 321465 235461 463521
## 5 Laputa 14235 34125 45123 51234
In this example, using the mean per year is unlikely to be useful; it would make more sense to replace missing data with the mean for each country. This is not immediately possible with tidyr functions, though. With dplyr, we have the tools for it, and I return to it in Chapter 8.
A final function for managing missing data is the
fill() function
. It replaces missing values with the value above them in their column (or below them if you set the argument
.direction to "
up". Imagine that we have a table of income per quarter of each year, but the year is only mentioned for the first quarter.
tbl <- read_csv(
"year, quarter, income
2011, Q1, 13
, Q2, 12
, Q3, 14
, Q4, 11
2012, Q1, 12
, Q2, 14
, Q3, 15
, Q4, 17",
show_col_types = FALSE
)
spec(tbl)
## cols(
## year = col_double(),
## quarter = col_character(),
## income = col_double()
## )
We can repeat the years downward through the quarters with
fill():
## # A tibble: 8 × 3
## year quarter income
## <dbl> <chr> <dbl>
## 1 2011 Q1 13
## 2 2011 Q2 12
## 3 2011 Q3 14
## 4 2011 Q4 11
## 5 2012 Q1 12
## 6 2012 Q2 14
## 7 2012 Q3 15
## 8 2012 Q4 17
Nesting Data
A final functionality that tidyr provides is nesting data
. A tibble entry usually contains simple data, such as numbers or strings
, but it can also hold complex data, such as other tables.
Consider this table:
tbl<-tribble(
~A,~B,~C,
1,11,21,
1,11,12,
2,42,22,
2,15,22,
2,15,32,
4,13,32
)
From this table, we can create one that, for each value in the A column, contains a table of B and C for that A value.
nested_tbl <- tbl |> nest(BC = c(B, C))
nested_tbl
## # A tibble: 3 × 2
## A BC
## <dbl> <list>
## 1 1 <tibble [2 × 2]>
## 2 2 <tibble [3 × 2]>
## 3 4 <tibble [1 × 2]>
## [[1]]
## # A tibble: 2 × 2
## B C
## <dbl> <dbl>
## 1 11 21
## 2 11 12
The column that contains the nested table is named BC because we used a named argument with that name, BC = c(B, C).
The reverse operation is
unnest():
nested_tbl |> unnest(cols = c(BC))
## # A tibble: 6 × 3
## A B C
## <dbl> <dbl> <dbl>
## 1 1 11 21
## 2 1 11 12
## 3 2 42 22
## 4 2 15 22
## 5 2 15 32
## 6 4 13 32
Nesting is not a typical operation in data analysis, but it is there if you need it, and in the packages in Chapter 12, we see some usages.