© Thomas Mailund 2019
Thomas MailundR Data Science Quick Referencehttps://doi.org/10.1007/978-1-4842-4894-2_4

4. Reformatting Tables: tidyr

Thomas Mailund1 
(1)
Aarhus, Denmark
 

Even if we only consider tabular data, there are still many different ways to format this data. The packages in the Tidyverse expect that data is represented as so-called “tidy data”1 (which is where the name Tidyverse comes from). The tidyr package helps you with formatting your data into tidy data.

You can load tidyr as part of the Tidyverse
library(tidyverse)
or on its own
library(tidyr)

Tidy Data

The fundamental properties that characterize tidy data are that each variable is in a column, and each observation is in a row. The terms like variables and observations should be familiar from statistics, but what the tidy data properties say is that you should not put the values of a variable in different columns and you should not put more than one observation in the same row.

Consider data such as this:
mean_income <- tribble(
    ~country,  ~`2001`,  ~`2002`,  ~`2003`,  ~`2004`,  ~`2005`,
    "Numenor",  123456,   132654,   321646,   324156,   325416,
    "Westeros", 314256,   432165,   546123,   465321,   561423,
    "Narnia",   432156,   342165,   564123,   543216,   465321,
    "Gondor",   531426,   321465,   235461,   463521,   561423,
    "Laputa",    14235,    34125,    45123,    51234,    54321
)

We could imagine that this table contains the mean income in five (fictional) countries in the first 5 years of the twenty-first century. Let us say that you are interested in knowing if there is a pattern in income over time when taking the different countries into account. To do this, you might consider a formula such as income ~ country + year, but the data is not formatted in a way that makes this easy. If year is a variable that you can use in the model, it is a problem that different observations, the actual years, are split into different columns. Similarly, if the mean income per year is a variable, it is a problem that all the observations for a country are in the same row. You cannot use income ~ country + year because the data is not tidy.

Gather and Spread

The function gather() can merge several columns into two—a key and a value column. The key column will be a discrete variable with values taken from the column names, and the value column will contain the data from the original columns.

Consider the mean income data frame. We want the actual years to be different outcomes of a year variable and have the income for each year to be another variable, income. We can get such a tibble by calling gather() with these parameters:
gather(
    data = mean_income,
    key = "year",
    value = "income",
    `2001`, `2001`, `2002`,
    `2003`, `2004`, `2005`
)
## # A tibble: 25 x 3
##    country  year  income
##    <chr>    <chr>  <dbl>
##  1 Numenor  2001  123456
##  2 Westeros 2001  314256
##  3 Narnia   2001  432156
##  4 Gondor   2001  531426
##  5 Laputa   2001   14235
##  6 Numenor  2002  132654
##  7 Westeros 2002  432165
##  8 Narnia   2002  342165
##  9 Gondor   2002  321465
## 10 Laputa   2002   34125
## # ...  with 15 more rows

The data argument is the input data, and the key and value arguments are the names of two new columns. In the key column, we get the names of the columns we gather, and in the value column, we get the matching values from those columns. The remaining parameters specify which columns we wish to gather.

In the preceding example, I explicitly listed all the columns to gather. You can also use a range of columns, like this:
gather(
    data = mean_income,
    key = "year",
    value = "income",
    `2001`:`2005`
)
You can use more than one range:
gather(
    data = mean_income,
    key = "year",
    value = "income",
    `2001`:`2002`,
    `2004`:`2005`
)
but the range specification is only useful when the relevant columns are contiguous. You can also specify the complement of the columns you want to gather using minus. For example, if you want to gather the years, you can pick all columns except for country rather than pick all the years:
gather(
    data = mean_income,
    key = "year",
    value = "income",
    -country
)

You can specify more than one column in a complement, and you can combine complements with ranges.

Finally, you can combine gather() with tidy selection (see Chapter 3) to select columns.
gather(
    data = mean_income,
    key = "year",
    value = "income",
    tidyselect::starts_with("2")
)
gather(
    data = mean_income,
    key = "year",
    value = "income",
    -tidyselect::starts_with("c")
)
The column names that go into the key column will be strings. If they can be translated into logical or numeric values, you can do so by setting the argument convert to TRUE .
gather(
    data = mean_income,
    key = "year",
    value = "income",
    -country,
    convert = TRUE
)
## # A tibble: 25 x 3
##    country   year income
##    <chr>    <int>  <dbl>
##  1 Numenor   2001 123456
##  2 Westeros  2001 314256
##  3 Narnia    2001 432156
##  4 Gondor    2001 531426
##  5 Laputa    2001  14235
##  6 Numenor   2002 132654
##  7 Westeros  2002 432165
##  8 Narnia    2002 342165
##  9 Gondor    2002 321465
## 10 Laputa    2002  34125
## # ...  with 15 more rows

Because all the years are integers, the year column now contains integers. If we want to analyze this data set, we most likely want to interpret the years as integers.

If you want to translate strings into factors, you have to do this explicitly after the transformation.

If you work with functions that do not expect tidy data and rather expect to find related data in different columns, you can translate tidy data into that format using the spread() function . It is the reverse of gather().2
tidy_income <- gather(
    data = mean_income,
    key = "year",
    value = "income",
    -country,
    convert = TRUE
)
spread(
    data = tidy_income,
    key = "year",
    value = "income"
)
## # A tibble: 5 x 6
##   country  `2001` `2002` `2003` `2004` `2005`
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Gondor   531426 321465 235461 463521 561423
## 2 Laputa    14235  34125  45123  51234  54321
## 3 Narnia   432156 342165 564123 543216 465321
## 4 Numenor  123456 132654 321646 324156 325416
## 5 Westeros 314256 432165 546123 465321 561423

The key and value arguments pick the values to use as column names and the values to put in those columns, respectively. They are the same arguments that you would use for gather()’s key and value arguments to translate the result back again.

Complex Column Encodings

Different variables are not always represented in separated columns. For example, it is not uncommon to have a column that contains a date, but that is really a day and a month and possibly a year. The best representation of a date is, of course, a date object, but for the sake of the example, let us say that we want to split a date into a day and a month column. You can do this using the separate() function .
tbl <- tribble(
    ~date,
    "11/5",
    "4/7",
    "21/12"
)
separate(tbl, date, into = c("day", "month"))
## # A tibble: 3 x 2
##   day   month
##   <chr> <chr>
## 1 11    5
## 2 4     7
## 3 21    12

The first argument is the data table, the second the column you want to split, and the into argument is a list of the columns the original date will be put into.

By default, the original column is removed, but if you want to keep it, you can use the remove argument .
separate(tbl, date, into = c("day", "month"), remove = FALSE)
## # A tibble: 3 x 3
##   date  day   month
##   <chr> <chr> <chr>
## 1 11/5  11    5
## 2 4/7   4     7
## 3 21/12 21    12
After separating the fields in the data column, which is a column of strings, you get a day and a month column, and these also contain strings. To convert them into numbers, you can use the convert argument .
separate(tbl, date, into = c("day", "month"), convert = TRUE)
## # A tibble: 3 x 2
##     day month
##   <int> <int>
## 1    11     5
## 2     4     7
## 3    21    12
By default a column will be split on any non-alphanumeric character. This is why / in the data is correctly used as the separator between the day and the month. If your separator is something else, you can use the sep argument . The argument takes a regular expression, and the column will be split on this. For example, we can split on alphabet characters and spaces like this:
tbl <- tribble(
    ~date,
    "11th of month 5",
    "4th of month 7",
    "21st of month 12"
)
separate(tbl, date, into = c("day", "month"),
         sep = "([:alpha:]|[:space:])+")
## # A tibble: 3 x 2
##   day   month
##   <chr> <chr>
## 1 11    5
## 2 4     7
## 3 21    12
The date format in this example is not something you would find in the wild, but if we have the date in this format
tbl <- tribble(
    ~date,
    "11th of May",
    "4th of July",
    "21st of December"
)
we cannot easily distinguish the separator string from the months. If the delimiter you need is not a single regular expression, you might be able to use the extract() function instead of separate(). With extract() you can use regular expression groups to identify the fields to extract.
regexp = "([:digit:]+)"
         "(?:[:alpha:]|[:space:])+"
         "[:space:]([:alpha:]+)"
extract(
    tbl, date,
    into = c("day", "month"),
    regex = regexp
)
## # A tibble: 3 x 2
##   day   month
##   <chr> <chr>
## 1 11    May
## 2 4     July
## 3 21    December
The reverse of separating is uniting. For example, let tbl2 contain a day and a month column.
tbl <- tribble(
    ~date,
    "11/5",
    "4/7",
    "21/12"
)
tbl2 <- separate(tbl, date, into = c("day", "month"))
To create a date column from the day and the month, we use unite(). The col argument should be the column we create and the following columns those we create it from.
unite(tbl2, col = "date", day, month)
## # A tibble: 3 x 1
##   date
##   <chr>
## 1 11_5
## 2 4_7
## 3 21_12
By default, the separator in the new column will be _. This is not what we want for dates, but we can use the sep argument to change it.
unite(tbl2, col = "date", day, month, sep = "/")
## # A tibble: 3 x 1
##   date
##   <chr>
## 1 11/5
## 2 4/7
## 3 21/12
If we want to keep the original columns around, we can set the remove argument to FALSE.
unite(
   tbl2,
   col = "date", day, month,
   sep = "/", remove = FALSE
)
## # A tibble: 3 x 3
##   date  day   month
##   <chr> <chr> <chr>
## 1 11/5  11    5
## 2 4/7   4     7
## 3 21/12 21    12
Columns that contain more than one value do not always contain the same number of values. For example, we could have data such as the number of casualties per major group in WW1 and WW2.
military_casualties <- tribble(
    ~war, ~groups, ~deaths,
    'WW1',
    "Allied Powers/Central Powers",
    "5.7,4.0",
    'WW2',
    "Germany/Japan/USSR/British Empire/USA",
    "5.3,2.1,10.7,0.6,0.4"
)
The groupings were not the same in the two wars, so we cannot split the data into different columns. We can, however, divide it into more rows using separate_rows() .
separate_rows(
    military_casualties,
    groups, deaths,
    sep = "/|,"
)
## # A tibble: 7 x 3
##   war   groups         deaths
##   <chr> <chr>          <chr>
## 1 WW1   Allied Powers  5.7
## 2 WW1   Central Powers 4.0
## 3 WW2   Germany        5.3
## 4 WW2   Japan          2.1
## 5 WW2   USSR           10.7
## 6 WW2   British Empire 0.6
## 7 WW2   USA            0.4

Expanding, Crossing, and Completing

For some applications, it is useful to create all combinations of values from two or more columns—even those combinations that are missing from the data. For this, you can use the function expand() :
tbl <- tribble(
    ~A, ~B, ~C,
     1, 11, 21,
     2, 11, 22,
     4, 13, 32
)
expand(tbl, A, B)
## # A tibble: 6 x 2
##       A     B
##   <dbl> <dbl>
## 1     1    11
## 2     1    13
## 3     2    11
## 4     2    13
## 5     4    11
## 6     4    13
Here, we get all combinations of values from columns A and B, while column C is ignored. If a column contains values from a larger set than what is found in the data, you can give expand() a vector of values:
expand(tbl, A = 1:4, B)
## # A tibble: 8 x 2
##       A     B
##   <int> <dbl>
## 1     1    11
## 2     1    13
## 3     2    11
## 4     2    13
## 5     3    11
## 6     3    13
## 7     4    11
## 8     4    13
If you have vectors of values you want to combine this way, you can also use the crossing() function .
crossing(A = 1:3, B = 11:13)
## # A tibble: 9 x 2
##       A     B
##   <int> <int>
## 1     1    11
## 2     1    12
## 3     1    13
## 4     2    11
## 5     2    12
## 6     2    13
## 7     3    11
## 8     3    12
## 9     3    13
If you only want the combinations found in the data, you can combine expand() with the nesting() function :
tbl <- tribble(
    ~A, ~B, ~C,
    1, 11, 21,
    2, 11, 22,
    2, 11, 12,
    4, 13, 42,
    4, 13, 32
)
expand(tbl, nesting(A, B))
## # A tibble: 3 x 2
##       A     B
##   <dbl> <dbl>
## 1     1    11
## 2     2    11
## 3     4    13

This combination gives you all the combinations of the specified columns—all other columns are ignored—and only the unique combinations, unlike if you extracted the columns by indexing.

If you need all combinations in a data frame, but accept missing values for those that are not present, you can use the complete() function . It makes sure you have all combinations and put missing values in the remaining columns for the combinations that are not in the original data.
complete(tbl, A = 1:4)
## # A tibble: 6 x 3
##       A     B     C
##   <dbl> <dbl> <dbl>
## 1     1    11    21
## 2     2    11    22
## 3     2    11    12
## 4     3    NA    NA
## 5     4    13    42
## 6     4    13    32
complete(tbl, B = 11:13)
## # A tibble: 6 x 3
##       B     A     C
##   <dbl> <dbl> <dbl>
## 1    11     1    21
## 2    11     2    22
## 3    11     2    12
## 4    12    NA    NA
## 5    13     4    42
## 6    13     4    32
complete(tbl, A = 1:4, B = 11:13)
## # A tibble: 14 x 3
##        A     B     C
##    <dbl> <dbl> <dbl>
##  1     1    11    21
##  2     1    12    NA
##  3     1    13    NA
##  4     2    11    22
##  5     2    11    12
##  6     2    12    NA
##  7     2    13    NA
##  8     3    11    NA
##  9     3    12    NA
## 10     3    13    NA
## 11     4    11    NA
## 12     4    12    NA
## 13     4    13    42
## 14     4    13    32

Missing Values

When data has missing values, it often requires application domain knowledge to deal with it correctly. The tidyr package has some rudimentary support for cleaning data with missing values; you might need more features to deal with missing values properly, and you are likely to find those in the dplyr package; see Chapter 7.

The simplest way to handle missing values is to get rid of it. A crude approach is to remove all observations with one or more missing variable values. The drop_na() function does exactly that.
mean_income <- tribble(
    ~country,  ~`2002`,  ~`2003`,  ~`2004`,  ~`2005`,
    "Numenor",  123456,   132654,       NA,   324156,
    "Westeros", 314256,   NA,           NA,   465321,
    "Narnia",   432156,   NA,           NA,       NA,
    "Gondor",   531426,   321465,   235461,   463521,
    "Laputa",    14235,    34125,    45123,    51234,
)
drop_na(mean_income)
## # A tibble: 2 x 5
##   country `2002` `2003` `2004` `2005`
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl>
## 1 Gondor  531426 321465 235461 463521
## 2 Laputa   14235  34125  45123  51234
In this example, we lose a country entirely if we have missing data for a single year. This is unlikely to be acceptable and is another good argument for tidy data. If we reformat the table and drop missing values, we will only remove the observations where the income is missing for a given year.
tidy_mean_income <- gather(
    data = mean_income,
    key = "year",
    value = "mean_income",
    -country
)
drop_na(tidy_mean_income)
## # A tibble: 14 x 3
##    country  year  mean_income
##    <chr>    <chr>       <dbl>
##  1 Numenor  2002       123456
##  2 Westeros 2002       314256
##  3 Narnia   2002       432156
##  4 Gondor   2002       531426
##  5 Laputa   2002        14235
##  6 Numenor  2003       132654
##  7 Gondor   2003       321465
##  8 Laputa   2003        34125
##  9 Gondor   2004       235461
## 10 Laputa   2004        45123
## 11 Numenor  2005       324156
## 12 Westeros 2005       465321
## 13 Gondor   2005       463521
## 14 Laputa   2005        51234
Sometimes, you can replace missing data with some appropriate data. For example, we could replace missing data in the mean_income table by setting each value to the mean of its column. The replace_na() function does this. It takes a data frame as its first argument and a list specifying replacements as its second argument. This list should have a name for each column you want to replace missing values in, and the value assigned to these names should be the values with which you wish you replace.
replace_na(mean_income, list(
    `2003` = mean(mean_income$`2003`, na.rm = TRUE)
))
## # A tibble: 5 x 5
##   country  `2002` `2003` `2004` `2005`
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl>
## 1 Numenor  123456 132654     NA 324156
## 2 Westeros 314256 162748     NA 465321
## 3 Narnia   432156 162748     NA     NA
## 4 Gondor   531426 321465 235461 463521
## 5 Laputa    14235  34125  45123  51234
replace_na(mean_income, list(
    `2003` = mean(mean_income$`2003`, na.rm = TRUE),
    `2004` = mean(mean_income$`2004`, na.rm = TRUE),
    `2005` = mean(mean_income$`2005`, na.rm = TRUE)
))
## # A tibble: 5 x 5
##   country  `2002` `2003` `2004` `2005`
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl>
## 1 Numenor  123456 132654 140292 324156
## 2 Westeros 314256 162748 140292 465321
## 3 Narnia   432156 162748 140292 326058
## 4 Gondor   531426 321465 235461 463521
## 5 Laputa    14235  34125  45123  51234

In this example, using the mean per year is unlikely to be useful; it would make more sense to replace missing data with the mean for each country. This is not immediately possible with tidyr functions, though. With dplyr we have the tools for it and I return to it in Chapter 7.

A final function for managing missing data is the fill() function . It replaces missing values with the value above them in their column (or below them if you set the argument .direction to "up". Imagine that we have a table of income per quarter of each year, but the year is only mentioned for the first quarter.
tbl <- read_csv(
    "year, quarter, income
    2011, Q1, 13
    , Q2, 12
    , Q3, 14
    , Q4, 11
    2012, Q1, 12
    , Q2, 14
    , Q3, 15
    , Q4, 17"
)
We can repeat the years downward through the quarters with fill() :
fill(tbl, year)
## # A tibble: 8 x 3
##    year quarter income
##   <dbl> <chr>    <dbl>
## 1  2011 Q1          13
## 2  2011 Q2          12
## 3  2011 Q3          14
## 4  2011 Q4          11
## 5  2012 Q1          12
## 6  2012 Q2          14
## 7  2012 Q3          15
## 8  2012 Q4          17

Nesting Data

A final functionality that tidyr provides is nesting data. A tibble entry usually contains simple data, such as numbers or strings, but it can also hold complex data, such as other tables.

Consider this table:
tbl <- tribble(
    ~A, ~B, ~C,
    1, 11, 21,
    1, 11, 12,
    2, 42, 22,
    2, 15, 22,
    2, 15, 32,
    4, 13, 32
)
From this table, we can create one that, for each value in the A column, contains a table of B and C for that A value.
nested_tbl <- nest(tbl, B, C)
nested_tbl
## # A tibble: 3 x 2
##       A data
##   <dbl> <list>
## 1     1 <tibble [2 × 2]>
## 2     2 <tibble [3 × 2]>
## 3     4 <tibble [1 × 2]>
nested_tbl[[1,2]]
## # A tibble: 2 x 2
##       B     C
##   <dbl> <dbl>
## 1    11    21
## 2    11    12
The column that contains the nested table is named data, but you can change this using the .key argument .
nested_tbl <- nest(tbl, B, C, .key = "BC")
nested_tbl
## # A tibble: 3 x 2
##       A BC
##   <dbl> <list>
## 1     1 <tibble [2 × 2]>
## 2     2 <tibble [3 × 2]>
## 3     4 <tibble [1 × 2]>
nested_tbl$BC
## [[1]]
## # A tibble: 2 x 2
##       B     C
##   <dbl> <dbl>
## 1    11    21
## 2    11    12
##
## [[2]]
## # A tibble: 3 x 2
##       B     C
##   <dbl> <dbl>
## 1    42    22
## 2    15    22
## 3    15    32
##
## [[3]]
## # A tibble: 1 x 2
##       B     C
##   <dbl> <dbl>
## 1    13    32
The arguments after the table pick the columns to be nested. You can use any number of columns here, including no columns at all. In the latter case, you will nest the entire original table in the data column .
nest(tbl, B)
## # A tibble: 5 x 3
##       A     C data
##   <dbl> <dbl> <list>
## 1     1    21 <tibble [1 × 1]>
## 2     1    12 <tibble [1 × 1]>
## 3     2    22 <tibble [2 × 1]>
## 4     2    32 <tibble [1 × 1]>
## 5     4    32 <tibble [1 × 1]>
nest(tbl)
## # A tibble: 1 x 1
##   data
##   <list>
## 1 <tibble [6 × 3]>
The reverse operation is unnest() :
nested_tbl
## # A tibble: 3 x 2
##       A BC
##   <dbl> <list>
## 1     1 <tibble [2 × 2]>
## 2     2 <tibble [3 × 2]>
## 3     4 <tibble [1 × 2]>
unnest(nested_tbl)
## # A tibble: 6 x 3
##       A     B     C
##   <dbl> <dbl> <dbl>
## 1     1    11    21
## 2     1    11    12
## 3     2    42    22
## 4     2    15    22
## 5     2    15    32
## 6     4    13    32

Nesting is not a typical operation in data analysis, but it is there if you need it, and in the packages in Chapter 11 we see some usages.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.12.161.165