© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
T. MailundR 4 Data Science Quick Referencehttps://doi.org/10.1007/978-1-4842-8780-4_5

5. Reformatting Tables: tidyr

Thomas Mailund1  
(1)
Aarhus, Denmark
 

Even if we only consider tabular data, there are still many different ways to format this data. The packages in the Tidyverse expect that data is represented as so-called “tidy data”1 (which is where the name Tidyverse comes from). The tidyr package helps you with formatting your data into tidy data.

You can load tidyr as part of the tidyverse:
library(tidyverse)
or on its own
library(tidyr)

Tidy Data

The fundamental properties that characterize tidy data are that each variable is in a column, and each observation is a row. The terms like variables and observations should be familiar from statistics, but what the tidy data properties say is that you should not put the values of a variable in different columns and you should not put more than one observation in the same row.

Consider data such as this:
mean_income<-tribble(
   ~country,~  '2001',~ '2002',~ '2003',~ '2004',~ '2005',
   "Numenor",  123456,  132654,  321646,  324156,  325416,
   "Westeros", 314256,  432165,  546123,  465321,  561423,
   "Narnia",   432156,  342165,  564123,  543216,  465321,
   "Gondor",   531426,  321465,  235461,  463521,  561423,
   "Laputa",   14235,   34125,   45123,   51234,    54321
)

We could imagine that this table contains the mean income in five (fictional) countries in the first five years of the 21st century. Let us say that you are interested in knowing if there is a pattern in income over time when taking the different countries into account. To do this, you might consider a formula such as income ~ country + year, but the data is not formatted in a way that makes this easy. If year is a variable that you can use in the model, it is a problem that different observations, the actual years, are split into different columns. Similarly, if the mean income per year is a variable, it is a problem that all the observations for a country are in the same row. You cannot use income ~ country + year because the data is not tidy.

Pivoting

The function pivot_longer() can merge several columns into two—a key and a value column. The key column will be a discrete variable with values taken from the column names, and the value column will contain the data from the original columns.

Consider the mean income data frame. We want the actual years to be different outcomes of a year variable and have the income for each year to be another variable, income. We can get such a tibble by calling pivot_longer() with these parameters:
mean_income |> pivot_longer(
    names_to = "year",
    values_to = "income",
    cols = c(`2001`, `2001`, `2002`,
    `2003`, `2004`, `2005`)
)
## # A tibble: 25 × 3
##    country  year  income
##    <chr>    <chr>  <dbl>
##  1 Numenor  2001  123456
##  2 Numenor  2002  132654
##  3 Numenor  2003  321646
##  4 Numenor  2004  324156
##  5 Numenor  2005  325416
##  6 Westeros 2001  314256
##  7 Westeros 2002  432165
##  8 Westeros 2003  546123
##  9 Westeros 2004  465321
## 10 Westeros 2005  561423
## # . . . with 15 more rows

The first argument is the input data from mean_income, and I provide it using the |> operator. This works exactly as if I had given mean_income as the first argument to the function call, but using the operator signals to the reader that mean_income is data flowing into the function and different from other arguments because of it.

The names_to and values_to arguments are the names of two new columns. In the names_to column, we get the names of the columns we get the values from, and in the value column, we get the matching values from those columns. The remaining parameters specify which columns we wish to merge.

In the preceding example, I explicitly listed all the columns to include, but tidyr’s functions support tidy select (see Chapter 4), so you can also use a range of columns, like this:
mean_income|>pivot_longer(
    names_to ="year",
    values_to ="income",
    cols = '2001':'2005'
)
## # A tibble: 25 × 3
##    country  year  income
##    <chr>    <chr>  <dbl>
##  1 Numenor  2001  123456
##  2 Numenor  2002  132654
##  3 Numenor  2003  321646
##  4 Numenor  2004  324156
##  5 Numenor  2005  325416
##  6 Westeros 2001  314256
##  7 Westeros 2002  432165
##  8 Westeros 2003  546123
##  9 Westeros 2004  465321
## 10 Westeros 2005  561423
## # . . . with 15 more rows
You can use more than one range:
mean_income|>pivot_longer(
    names_to ="year",
    values_to ="income",
    cols =c( '2001':'2002', '2004':'2005')
)
but the range specification is only useful when the relevant columns are contiguous. You can also specify the complement of the columns you want to merge using !. For example, if you want to merge the years, you can pick all columns except for country rather than pick all the years:
mean_income|>pivot_longer(
    names_to ="year",
    values_to ="income",
    cols =!country
)
## # A tibble: 25 × 3
##    country  year  income
##    <chr>    <chr>  <dbl>
##  1 Numenor  2001  123456
##  2 Numenor  2002  132654
##  3 Numenor  2003  321646
##  4 Numenor  2004  324156
##  5 Numenor  2005  325416
##  6 Westeros 2001  314256
##  7 Westeros 2002  432165
##  8 Westeros 2003  546123
##  9 Westeros 2004  465321
## 10 Westeros 2005  561423
## # . . . with 15 more rows

You can specify more than one column in a complement, and you can combine complements with ranges.

Generally, you can combine pivot_longer() with all the tidy selection functions to select columns, for example, you can use starts_with() or combine starts_with() and complement, !, or anything else the tidy selection language allows.
mean_income|>pivot_longer(
    names_to ="year",
    values_to ="income",
    starts_with("2")
)
## # A tibble: 25 × 3
##    country  year  income
##    <chr>    <chr>  <dbl>
##  1 Numenor  2001  123456
##  2 Numenor  2002  132654
##  3 Numenor  2003  321646
##  4 Numenor  2004  324156
##  5 Numenor  2005  325416
##  6 Westeros 2001  314256
##  7 Westeros 2002  432165
##  8 Westeros 2003  546123
##  9 Westeros 2004  465321
## 10 Westeros 2005  561423
## # . . . with 15 more rows
mean_income|>pivot_longer(
    names_to ="year",
    values_to ="income",
    !starts_with("c")
)
## # A tibble: 25 × 3
##    country  year  income
##    <chr>    <chr>  <dbl>
##  1 Numenor  2001  123456
##  2 Numenor  2002  132654
##  3 Numenor  2003  321646
##  4 Numenor  2004  324156
##  5 Numenor  2005  325416
##  6 Westeros 2001  314256
##  7 Westeros 2002  432165
##  8 Westeros 2003  546123
##  9 Westeros 2004  465321
## 10 Westeros 2005  561423
## # . . . with 15 more rows
The column names that go into the names_to column will be strings . If you need another type, such as logical or numeric values, you can achieve that by setting the names_transform argument. If we want all the years to be represented as integers, we could do this:
mean_income|>pivot_longer(
    names_to ="year",
    values_to ="income",
    !country,
    names_transform =as.integer
)
## # A tibble: 25 × 3
##    country  year  income
##    <chr>    <int>  <dbl>
##  1 Numenor  2001  123456
##  2 Numenor  2002  132654
##  3 Numenor  2003  321646
##  4 Numenor  2004  324156
##  5 Numenor  2005  325416
##  6 Westeros 2001  314256
##  7 Westeros 2002  432165
##  8 Westeros 2003  546123
##  9 Westeros 2004  465321
## 10 Westeros 2005  561423
## # . . . with 15 more rows

Because all the years are integers, the year column now contains integers . If we want to analyze this data set, we most likely want to interpret the years as integers.

If we wanted factors instead, though, we could also do that:
mean_income|>pivot_longer(
    names_to ="year",
    values_to ="income",
    !country,
    names_transform =as.factor
)

There is a similar values_transform argument if you need to transform the values that go into the values_to column.

If you work with functions that do not expect tidy data and rather expect to find related data in different columns, you can translate tidy data into that format using the pivot_wider() function . It is the reverse of pivot_longer().2
tidy_income<-mean_income|>pivot_longer(
    names_to ="year",
    values_to ="income",
    !country
)
tidy_income|>pivot_wider(
    names_from ="year",
    values_from ="income"
)
## # A tibble: 5 × 6
##   country `2001` `2002` `2003` `2004` `2005`
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Numenor  123456 132654 321646 324156 325416
## 2 Westeros 314256 432165 546123 465321 561423
## 3 Narnia   432156 342165 564123 543216 465321
## 4 Gondor   531426 321465 235461 463521 561423
## 5 Laputa    14235  34125  45123  51234  54321

The names_from and values_from arguments pick the values to use as column names and the values to put in those columns, respectively. They are the same arguments that you would use for pivot_longer()’s names_to and values_to arguments to translate the result back again.

Complex Column Encodings

Different variables are not always represented in separated columns. For example, it is not uncommon to have a column that contains a date, but that is really a day and a month and possibly a year. The best representation of a date is, of course, a date object, but for the sake of the example, let us say that we want to split a date into a day and a month column. You can do this using the separate() function .
tbl<-tribble(
   ~date,
   "11/5",
   "4/7",
   "21/12"
)
separate(tbl, date,into =c("day","month"))
## # A tibble: 3 × 2
##    day    month
##    <chr>  <chr>
## 1  11     5
## 2  4      7
## 3  21     12

The first argument is the data table, the second the column you want to split, and the into argument is a list of the columns the original date will be put into.

By default, the original column is removed, but if you want to keep it, you can use the remove argument .
separate(tbl, date,into =c("day","month"),remove =FALSE)
## # A tibble: 3 × 3
##   date  day   month
##   <chr> <chr> <chr>
## 1 11/5  11    5
## 2 4/7   4     7
## 3 21/12 21    12
After separating the fields in the data column, which is a column of strings, you get a day and a month column, and these also contain strings . To convert them into numbers, you can use the convert argument .
separate(tbl, date,into =c("day","month"),convert =TRUE)
## # A tibble: 3 × 2
##    day   month
##    <int> <int>
## 1  11        5
## 2  4         7
## 3  21       12
A column will be split by default on any non-alphanumeric character. This is why the / in the data is correctly used as the separator between the day and the month.3 If your separator is something else, you can use the sep argument. The argument takes a regular expression, and the column will be split on this. For example, we can split on alphabet characters and spaces like this:
tbl<-tribble(
   ~date,
   "11th of month 5",
   "4th of month 7",
   "21st of month 12"
)
separate(tbl, date,into =c("day","month"),
         sep ="[[:alpha:][:space:]]+")
## # A tibble: 3 × 2
##   day   month
##   <chr> <chr>
## 1 11    5
## 2 4     7
## 3 21    12

The regular expression ([[:alpha:]]|[[:space:]])+ specifies any nonzero (specified by +) sequence of the character classes in (the outermost) square brackets, where we have put alphabet characters [:alpha:] and spaces [:space:].

The date format in this example is not something you would find in the wild, but if we have the date in this format
tbl<-tribble(
   ~date,
   "11th of May",
   "4th of July",
   "21st of December"
)

we cannot easily distinguish the separator string from the months. If the delimiter you need is not a single regular expression, you might be able to use the extract() function instead of separate(). With extract(), you can use regular expression groups to identify the fields to extract.

You need an expression with groups—those are subexpressions in parentheses—and what those groups match will be the value put in the new columns. For our dates, it could look like this:
tbl|>extract(
   col =date,
   into =c("day","month"),
   regex =paste0(
       "([[:digit:]]+)",        # First group, the day
       "[[:alpha:][:space:]]+", # Stuff we ignore; will match "th of" and such
       "[[:space:]]",           # The final space before the month name
      "([[:alpha:]]+)"          # Second group, the month
   )
)
## # A tibble: 3 × 1
##   date
##   <chr>
## 1 11th of May
## 2 4th of July
## 3 21st of December
For both separate() and extract(), you can have more than two resulting columns. You can provide more than two names to the into parameter, and then the expression you use for separating and extracting should provide the right number of values, of course.
tbl|>extract(
   col =date,
   into =c("day","fluf","month"),
   regex =paste0(
       "([[:digit:]]+)",       # First group, the day
       "([[:alpha:]]+)",       # st, nd, rd, th, that kind of stuff goes in fluf
       "[[:alpha:][:space:]]+",# stuff between day and month; not a group so we throw it away
       "([[:alpha:]]+)"        # Third group, the month
   )
)
## # A tibble: 3 × 1
##   date
##   <chr>
## 1 11th of May
## 2 4th of July
## 3 21st of December
The reverse of separating is uniting. For example, let tbl2 contain a day and a month column. I’m building such a table in the following code:
tbl<-tribble(
   ~date,
   "11/5",
   "4/7",
   "21/12"
)
tbl2<-tbl|>separate(col =date,into =c("day","month"))
Having created such a tibble in tbl2, we can use the unite() function to undo separate() much as we found pivot_longer/wider() to be able to undo each other (with the caveats that things can get lost of course). The col argument should be the column we create and the following columns those we create it from.
tbl2|>unite(col ="date", day, month)
## # A tibble: 3 × 1
##   date
##   <chr>
## 1 11_5
## 2 4_7
## 3 21_12
By default, the separator in the new column will be _. This is not what we want for dates, but we can use the sep argument to change it.
tbl2|>unite(col ="date", day, month,sep ="/")
## # A tibble: 3 × 1
##   date
##   <chr>
## 1 11/5
## 2 4/7
## 3 21/12
If we want to keep the original columns around, we can set the remove argument to FALSE.
tbl2|>unite(
   col ="date", day, month,
   sep ="/",remove =FALSE
)
## # A tibble: 3 × 3
##   date   day   month
##   <chr> <chr>  <chr>
## 1 11/5   11    5
## 2 4/7    4     7
## 3 21/12  21    12
Columns that contain more than one value do not always contain the same number of values. For example, we could have data such as the number of casualties per major group in WW1 and WW2.
military_casualties<-tribble(
    ~war,~groups,~deaths,
    'WW1',
    "Allied Powers/Central Powers",
    "5.7,4.0",
    'WW2',
    "Germany/Japan/USSR/British Empire/USA",
    "5.3,2.1,10.7,0.6,0.4"
)
The groupings were not the same in the two wars, so we cannot split the data into different columns. We can, however, divide it into more rows using separate_rows() .
military_casualties|>separate_rows(
    groups, deaths,
    sep ="/|,"
)
## # A tibble: 7 × 3
##   war   groups          deaths
##   <chr> <chr>           <chr>
## 1 WW1   Allied Powers   5.7
## 2 WW1   Central Powers  4.0
## 3 WW2   Germany         5.3
## 4 WW2   Japan           2.1
## 5 WW2   USSR            10.7
## 6 WW2   British Empire  0.6
## 7 WW2   USA             0.4

The expression we use for sep says we should split on / or , (where the | in the string is regular expression speak for “or”).

We might not want the deaths here to be strings, but with convert = TRUE we can turn them into numbers:
military_casualties|>separate_rows(
    groups, deaths,
    sep ="/|,",
    convert =TRUE
)
## # A tibble: 7 × 3
##   war   groups         deaths
##   <chr> <chr>          <dbl>
## 1 WW1   Allied Powers  5.7
## 2 WW1   Central Powers 4
## 3 WW2   Germany        5.3
## 4 WW2   Japan          2.1
## 5 WW2   USSR           10.7
## 6 WW2   British Empire 0.6
## 7 WW2   USA            0.4

Expanding, Crossing, and Completing

For some applications, it is useful to create all combinations of values from two or more columns—even those combinations that are missing from the data. For this, you can use the function expand() :
tbl<-tribble(
   ~A,~B,~C,
    1,11,21,
    2,11,22,
    4,13,32
)
tbl|>expand(A, B)
## # A tibble: 6 × 2
##       A     B
##   <dbl> <dbl>
## 1     1    11
## 2     1    13
## 3     2    11
## 4     2    13
## 5     4    11
## 6     4    13
Here, we get all combinations of values from columns A and B while column C is ignored. If a column contains values from a larger set than what is found in the data, you can give expand() a vector of values:
tbl |> expand(A = 1:4, B)
## # A tibble: 8 × 2
##       A     B
##   <int> <dbl>
## 1     1    11
## 2     1    13
## 3     2    11
## 4     2    13
## 5     3    11
## 6     3    13
## 7     4    11
## 8     4    13
If you have vectors of values you want to combine this way, you can also use the crossing() function .
crossing(A = 1:3, B = 11:13)
## # A tibble: 9 × 2
##       A     B
##   <int> <int>
## 1     1    11
## 2     1    12
## 3     1    13
## 4     2    11
## 5     2    12
## 6     2    13
## 7     3    11
## 8     3    12
## 9     3    13
If you only want the combinations found in the data, you can combine expand() with the nesting() function :
tbl <- tribble(
    ~A, ~B, ~C,
    1, 11, 21,
    2, 11, 22,
    2, 11, 12,
    4, 13, 42,
    4, 13, 32
)
expand(tbl, nesting(A, B))
## # A tibble: 3 × 2
##       A     B
##   <dbl> <dbl>
## 1     1    11
## 2     2    11
## 3     4    13

This combination gives you all the combinations of the specified columns—all other columns are ignored—and only the unique combinations, unlike if you extracted the columns by indexing. This is useful if you have data with lots of duplications, but you only need to compute some statistics for the unique combinations.

If you need all combinations in a data frame, but accept missing values for those that are not present, you can use the complete() function. It makes sure you have all combinations and put missing values in the remaining columns for the combinations that are not in the original data.
complete(tbl, A = 1:4)
## # A tibble: 6 × 3
##       A     B     C
##   <dbl> <dbl> <dbl>
## 1     1    11    21
## 2     2    11    22
## 3     2    11    12
## 4     3    NA    NA
## 5     4    13    42
## 6     4    13    32
complete(tbl, B = 11:13)
## # A tibble: 6 × 3
##       B     A     C
##   <dbl> <dbl> <dbl>
## 1    11     1    21
## 2    11     2    22
## 3    11     2    12
## 4    12    NA    NA
## 5    13     4    42
## 6    13     4    32
complete(tbl, A = 1:4, B = 11:13)
## # A tibble: 14 × 3
##       A     B     C
##   <dbl> <dbl> <dbl>
## 1     1    11    21
## 2     1    12    NA
## 3     1    13    NA
## 4     2    11    22
## 5     2    11    12
## 6     2    12    NA
## 7     2    13    NA
## 8     3    11    NA
## 9     3    12    NA
## 10    3    13    NA
## 11    4    11    NA
## 12    4    12    NA
## 13    4    13    42
## 14    4    13    32

Missing Values

When data has missing values , it often requires application domain knowledge to deal with it correctly. The tidyr package has some rudimentary support for cleaning data with missing values; you might need more features to deal with missing values properly, and you are likely to find those in the dplyr package (see Chapter 8).

The simplest way to handle missing values is to get rid of it. A crude approach is to remove all observations with one or more missing variable values. The drop_na() function does exactly that.
mean_income <- tribble(
    ~country, ~`2002`, ~`2003`, ~`2004`, ~`2005`,
    "Numenor", 123456, 132654, NA, 324156,
    "Westeros", 314256, NA, NA, 465321,
    "Narnia", 432156, NA, NA, NA,
    "Gondor", 531426, 321465, 235461, 463521,
    "Laputa", 14235, 34125, 45123, 51234,
)
drop_na(mean_income)
## # A tibble: 2 × 5
##   country `2002` `2003` `2004` `2005`
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl>
## 1 Gondor  531426 321465 235461 463521
## 2 Laputa   14235  34125  45123  51234
In this example, we lose a country entirely if we have missing data for a single year. This is unlikely to be acceptable and is another good argument for tidy data. If we reformat the table and drop missing values, we will only remove the observations where the income is missing for a given year.
mean_income |> pivot_longer(
    names_to = "year",
    values_to = "mean_income",
    -country
) |> drop_na()
## # A tibble: 14 × 3
##    country  year mean_income
##    <chr>    <chr>      <dbl>
## 1  Numenor  2002      123456
## 2  Numenor  2003      132654
## 3  Numenor  2005      324156
## 4  Westeros 2002      314256
## 5  Westeros 2005      465321
## 6  Narnia   2002      432156
## 7  Gondor   2002      531426
## 8  Gondor   2003      321465
## 9  Gondor   2004      235461
## 10 Gondor   2005      463521
## 11 Laputa   2002       14235
## 12 Laputa   2003       34125
## 13 Laputa   2004       45123
## 14 Laputa   2005       51234
Sometimes, you can replace missing data with some appropriate data. For example, we could replace missing data in the mean_income table by setting each value to the mean of its column. The replace_na() function does this. It takes a data frame as its first argument and a list specifying replacements as its second argument. This list should have a name for each column you want to replace missing values in, and the value assigned to these names should be the values with which you wish to replace.
replace_na(mean_income, list(
    `2003` = mean(mean_income$`2003`, na.rm = TRUE)
))
## # A tibble: 5 × 5
##   country  `2002` `2003`  `2004` `2005`
##   <chr>     <dbl>  <dbl>  <dbl>   <dbl>
## 1 Numenor  123456 132654     NA  324156
## 2 Westeros 314256 162748     NA  465321
## 3 Narnia   432156 162748     NA      NA
## 4 Gondor   531426 321465  235461 463521
## 5 Laputa    14235  34125   45123  51234
replace_na(mean_income, list(
    `2003` = mean(mean_income$`2003`, na.rm = TRUE),
    `2004` = mean(mean_income$`2004`, na.rm = TRUE),
    `2005` = mean(mean_income$`2005`, na.rm = TRUE)
))
## # A tibble: 5 × 5
##   country  `2002` `2003` `2004` `2005`
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl>
## 1 Numenor  123456 132654 140292 324156
## 2 Westeros 314256 162748 140292 465321
## 3 Narnia   432156 162748 140292 326058
## 4 Gondor   531426 321465 235461 463521
## 5 Laputa    14235  34125  45123  51234

In this example, using the mean per year is unlikely to be useful; it would make more sense to replace missing data with the mean for each country. This is not immediately possible with tidyr functions, though. With dplyr, we have the tools for it, and I return to it in Chapter 8.

A final function for managing missing data is the fill() function . It replaces missing values with the value above them in their column (or below them if you set the argument .direction to "up". Imagine that we have a table of income per quarter of each year, but the year is only mentioned for the first quarter.
tbl <- read_csv(
    "year, quarter, income
    2011, Q1, 13
    , Q2, 12
    , Q3, 14
    , Q4, 11
    2012, Q1, 12
    , Q2, 14
    , Q3, 15
    , Q4, 17",
show_col_types = FALSE
)
spec(tbl)
## cols(
##   year = col_double(),
##   quarter = col_character(),
##   income = col_double()
## )
We can repeat the years downward through the quarters with fill():
fill(tbl, year)
## # A tibble: 8 × 3
##    year quarter income
##   <dbl> <chr>    <dbl>
## 1  2011 Q1          13
## 2  2011 Q2          12
## 3  2011 Q3          14
## 4  2011 Q4          11
## 5  2012 Q1          12
## 6  2012 Q2          14
## 7  2012 Q3          15
## 8  2012 Q4          17

Nesting Data

A final functionality that tidyr provides is nesting data . A tibble entry usually contains simple data, such as numbers or strings , but it can also hold complex data, such as other tables.

Consider this table:
tbl<-tribble(
   ~A,~B,~C,
   1,11,21,
   1,11,12,
   2,42,22,
   2,15,22,
   2,15,32,
   4,13,32
)
From this table, we can create one that, for each value in the A column, contains a table of B and C for that A value.
nested_tbl <- tbl |> nest(BC = c(B, C))
nested_tbl
## # A tibble: 3 × 2
##       A BC
##   <dbl> <list>
## 1     1 <tibble [2 × 2]>
## 2     2 <tibble [3 × 2]>
## 3     4 <tibble [1 × 2]>
nested_tbl[[1,2]]
## [[1]]
## # A tibble: 2 × 2
##       B     C
##   <dbl> <dbl>
## 1     11   21
## 2     11   12

The column that contains the nested table is named BC because we used a named argument with that name, BC = c(B, C).

The reverse operation is unnest():
nested_tbl |> unnest(cols = c(BC))
## # A tibble: 6 × 3
##       A     B     C
##   <dbl> <dbl> <dbl>
## 1     1    11    21
## 2     1    11    12
## 3     2    42    22
## 4     2    15    22
## 5     2    15    32
## 6     4    13    32

Nesting is not a typical operation in data analysis, but it is there if you need it, and in the packages in Chapter 12, we see some usages.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.223.134.29