Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Thomas MailundR Data Science Quick Referencehttps://doi.org/10.1007/978-1-4842-4894-2_4

4. Reformatting Tables: tidyr

Thomas Mailund¹

(1)

Aarhus, Denmark

Even if we only consider tabular data, there are still many different ways to format this data. The packages in the Tidyverse expect that data is represented as so-called “tidy data”¹ (which is where the name Tidyverse comes from). The tidyr package helps you with formatting your data into tidy data.

You can load tidyr as part of the Tidyverse

library(tidyverse)

or on its own

library(tidyr)

Tidy Data

The fundamental properties that characterize tidy data are that each variable is in a column, and each observation is in a row. The terms like variables and observations should be familiar from statistics, but what the tidy data properties say is that you should not put the values of a variable in different columns and you should not put more than one observation in the same row.

Consider data such as this:

mean_income <- tribble(

~country, ~`2001`, ~`2002`, ~`2003`, ~`2004`, ~`2005`,

"Numenor", 123456, 132654, 321646, 324156, 325416,

"Westeros", 314256, 432165, 546123, 465321, 561423,

"Narnia", 432156, 342165, 564123, 543216, 465321,

"Gondor", 531426, 321465, 235461, 463521, 561423,

"Laputa", 14235, 34125, 45123, 51234, 54321

)

We could imagine that this table contains the mean income in five (fictional) countries in the first 5 years of the twenty-first century. Let us say that you are interested in knowing if there is a pattern in income over time when taking the different countries into account. To do this, you might consider a formula such as income ~ country + year, but the data is not formatted in a way that makes this easy. If year is a variable that you can use in the model, it is a problem that different observations, the actual years, are split into different columns. Similarly, if the mean income per year is a variable, it is a problem that all the observations for a country are in the same row. You cannot use income ~ country + year because the data is not tidy.

Gather and Spread

The function gather() can merge several columns into two—a key and a value column. The key column will be a discrete variable with values taken from the column names, and the value column will contain the data from the original columns.

Consider the mean income data frame. We want the actual years to be different outcomes of a year variable and have the income for each year to be another variable, income. We can get such a tibble by calling gather() with these parameters:

gather(

data = mean_income,

key = "year",

value = "income",

`2001`, `2001`, `2002`,

`2003`, `2004`, `2005`

)

## # A tibble: 25 x 3

## country year income

## <chr> <chr> <dbl>

## 1 Numenor 2001 123456

## 2 Westeros 2001 314256

## 3 Narnia 2001 432156

## 4 Gondor 2001 531426

## 5 Laputa 2001 14235

## 6 Numenor 2002 132654

## 7 Westeros 2002 432165

## 8 Narnia 2002 342165

## 9 Gondor 2002 321465

## 10 Laputa 2002 34125

## # ... with 15 more rows

The data argument is the input data, and the key and value arguments are the names of two new columns. In the key column, we get the names of the columns we gather, and in the value column, we get the matching values from those columns. The remaining parameters specify which columns we wish to gather.

In the preceding example, I explicitly listed all the columns to gather. You can also use a range of columns, like this:

gather(

data = mean_income,

key = "year",

value = "income",

`2001`:`2005`

)

You can use more than one range:

gather(

data = mean_income,

key = "year",

value = "income",

`2001`:`2002`,

`2004`:`2005`

)

but the range specification is only useful when the relevant columns are contiguous. You can also specify the complement of the columns you want to gather using minus. For example, if you want to gather the years, you can pick all columns except for country rather than pick all the years:

gather(

data = mean_income,

key = "year",

value = "income",

-country

)

You can specify more than one column in a complement, and you can combine complements with ranges.

Finally, you can combine gather() with tidy selection (see Chapter 3) to select columns.

gather(

data = mean_income,

key = "year",

value = "income",

tidyselect::starts_with("2")

)

gather(

data = mean_income,

key = "year",

value = "income",

-tidyselect::starts_with("c")

)

The column names that go into the key column will be strings. If they can be translated into logical or numeric values, you can do so by setting the argument convert to TRUE .

gather(

data = mean_income,

key = "year",

value = "income",

-country,

convert = TRUE

)

## # A tibble: 25 x 3

## country year income

## <chr> <int> <dbl>

## 1 Numenor 2001 123456

## 2 Westeros 2001 314256

## 3 Narnia 2001 432156

## 4 Gondor 2001 531426

## 5 Laputa 2001 14235

## 6 Numenor 2002 132654

## 7 Westeros 2002 432165

## 8 Narnia 2002 342165

## 9 Gondor 2002 321465

## 10 Laputa 2002 34125

## # ... with 15 more rows

Because all the years are integers, the year column now contains integers. If we want to analyze this data set, we most likely want to interpret the years as integers.

If you want to translate strings into factors, you have to do this explicitly after the transformation.

If you work with functions that do not expect tidy data and rather expect to find related data in different columns, you can translate tidy data into that format using the spread() function . It is the reverse of gather().²

tidy_income <- gather(

data = mean_income,

key = "year",

value = "income",

-country,

convert = TRUE

)

spread(

data = tidy_income,

key = "year",

value = "income"

)

## # A tibble: 5 x 6

## country `2001` `2002` `2003` `2004` `2005`

## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>

## 1 Gondor 531426 321465 235461 463521 561423

## 2 Laputa 14235 34125 45123 51234 54321

## 3 Narnia 432156 342165 564123 543216 465321

## 4 Numenor 123456 132654 321646 324156 325416

## 5 Westeros 314256 432165 546123 465321 561423

The key and value arguments pick the values to use as column names and the values to put in those columns, respectively. They are the same arguments that you would use for gather()’s key and value arguments to translate the result back again.

Complex Column Encodings

Different variables are not always represented in separated columns. For example, it is not uncommon to have a column that contains a date, but that is really a day and a month and possibly a year. The best representation of a date is, of course, a date object, but for the sake of the example, let us say that we want to split a date into a day and a month column. You can do this using the separate() function .

tbl <- tribble(

~date,

"11/5",

"4/7",

"21/12"

)

separate(tbl, date, into = c("day", "month"))

## # A tibble: 3 x 2

## day month

## <chr> <chr>

## 1 11 5

## 2 4 7

## 3 21 12

The first argument is the data table, the second the column you want to split, and the into argument is a list of the columns the original date will be put into.

By default, the original column is removed, but if you want to keep it, you can use the remove argument .

separate(tbl, date, into = c("day", "month"), remove = FALSE)

## # A tibble: 3 x 3

## date day month

## <chr> <chr> <chr>

## 1 11/5 11 5

## 2 4/7 4 7

## 3 21/12 21 12

After separating the fields in the data column, which is a column of strings, you get a day and a month column, and these also contain strings. To convert them into numbers, you can use the convert argument .

separate(tbl, date, into = c("day", "month"), convert = TRUE)

## # A tibble: 3 x 2

## day month

## <int> <int>

## 1 11 5

## 2 4 7

## 3 21 12

By default a column will be split on any non-alphanumeric character. This is why / in the data is correctly used as the separator between the day and the month. If your separator is something else, you can use the sep argument . The argument takes a regular expression, and the column will be split on this. For example, we can split on alphabet characters and spaces like this:

tbl <- tribble(

~date,

"11th of month 5",

"4th of month 7",

"21st of month 12"

)

separate(tbl, date, into = c("day", "month"),

sep = "([:alpha:]|[:space:])+")

## # A tibble: 3 x 2

## day month

## <chr> <chr>

## 1 11 5

## 2 4 7

## 3 21 12

The date format in this example is not something you would find in the wild, but if we have the date in this format

tbl <- tribble(

~date,

"11th of May",

"4th of July",

"21st of December"

)

we cannot easily distinguish the separator string from the months. If the delimiter you need is not a single regular expression, you might be able to use the extract() function instead of separate(). With extract() you can use regular expression groups to identify the fields to extract.

regexp = "([:digit:]+)"

"(?:[:alpha:]|[:space:])+"

"[:space:]([:alpha:]+)"

extract(

tbl, date,

into = c("day", "month"),

regex = regexp

)

## # A tibble: 3 x 2

## day month

## <chr> <chr>

## 1 11 May

## 2 4 July

## 3 21 December

The reverse of separating is uniting. For example, let tbl2 contain a day and a month column.

tbl <- tribble(

~date,

"11/5",

"4/7",

"21/12"

)

tbl2 <- separate(tbl, date, into = c("day", "month"))

To create a date column from the day and the month, we use unite(). The col argument should be the column we create and the following columns those we create it from.

unite(tbl2, col = "date", day, month)

## # A tibble: 3 x 1

## date

## <chr>

## 1 11_5

## 2 4_7

## 3 21_12

By default, the separator in the new column will be _. This is not what we want for dates, but we can use the sep argument to change it.

unite(tbl2, col = "date", day, month, sep = "/")

## # A tibble: 3 x 1

## date

## <chr>

## 1 11/5

## 2 4/7

## 3 21/12

If we want to keep the original columns around, we can set the remove argument to FALSE.

unite(

tbl2,

col = "date", day, month,

sep = "/", remove = FALSE

)

## # A tibble: 3 x 3

## date day month

## <chr> <chr> <chr>

## 1 11/5 11 5

## 2 4/7 4 7

## 3 21/12 21 12

Columns that contain more than one value do not always contain the same number of values. For example, we could have data such as the number of casualties per major group in WW1 and WW2.

military_casualties <- tribble(

~war, ~groups, ~deaths,

'WW1',

"Allied Powers/Central Powers",

"5.7,4.0",

'WW2',

"Germany/Japan/USSR/British Empire/USA",

"5.3,2.1,10.7,0.6,0.4"

)

The groupings were not the same in the two wars, so we cannot split the data into different columns. We can, however, divide it into more rows using separate_rows() .

separate_rows(

military_casualties,

groups, deaths,

sep = "/|,"

)

## # A tibble: 7 x 3

## war groups deaths

## <chr> <chr> <chr>

## 1 WW1 Allied Powers 5.7

## 2 WW1 Central Powers 4.0

## 3 WW2 Germany 5.3

## 4 WW2 Japan 2.1

## 5 WW2 USSR 10.7

## 6 WW2 British Empire 0.6

## 7 WW2 USA 0.4

Expanding, Crossing, and Completing

For some applications, it is useful to create all combinations of values from two or more columns—even those combinations that are missing from the data. For this, you can use the function expand() :

tbl <- tribble(

~A, ~B, ~C,

1, 11, 21,

2, 11, 22,

4, 13, 32

)

expand(tbl, A, B)

## # A tibble: 6 x 2

## A B

## <dbl> <dbl>

## 1 1 11

## 2 1 13

## 3 2 11

## 4 2 13

## 5 4 11

## 6 4 13

Here, we get all combinations of values from columns A and B, while column C is ignored. If a column contains values from a larger set than what is found in the data, you can give expand() a vector of values:

expand(tbl, A = 1:4, B)

## # A tibble: 8 x 2

## A B

## <int> <dbl>

## 1 1 11

## 2 1 13

## 3 2 11

## 4 2 13

## 5 3 11

## 6 3 13

## 7 4 11

## 8 4 13

If you have vectors of values you want to combine this way, you can also use the crossing() function .

crossing(A = 1:3, B = 11:13)

## # A tibble: 9 x 2

## A B

## <int> <int>

## 1 1 11

## 2 1 12

## 3 1 13

## 4 2 11

## 5 2 12

## 6 2 13

## 7 3 11

## 8 3 12

## 9 3 13

If you only want the combinations found in the data, you can combine expand() with the nesting() function :

tbl <- tribble(

~A, ~B, ~C,

1, 11, 21,

2, 11, 22,

2, 11, 12,

4, 13, 42,

4, 13, 32

)

expand(tbl, nesting(A, B))

## # A tibble: 3 x 2

## A B

## <dbl> <dbl>

## 1 1 11

## 2 2 11

## 3 4 13

This combination gives you all the combinations of the specified columns—all other columns are ignored—and only the unique combinations, unlike if you extracted the columns by indexing.

If you need all combinations in a data frame, but accept missing values for those that are not present, you can use the complete() function . It makes sure you have all combinations and put missing values in the remaining columns for the combinations that are not in the original data.

complete(tbl, A = 1:4)

## # A tibble: 6 x 3

## A B C

## <dbl> <dbl> <dbl>

## 1 1 11 21

## 2 2 11 22

## 3 2 11 12

## 4 3 NA NA

## 5 4 13 42

## 6 4 13 32

complete(tbl, B = 11:13)

## # A tibble: 6 x 3

## B A C

## <dbl> <dbl> <dbl>

## 1 11 1 21

## 2 11 2 22

## 3 11 2 12

## 4 12 NA NA

## 5 13 4 42

## 6 13 4 32

complete(tbl, A = 1:4, B = 11:13)

## # A tibble: 14 x 3

## A B C

## <dbl> <dbl> <dbl>

## 1 1 11 21

## 2 1 12 NA

## 3 1 13 NA

## 4 2 11 22

## 5 2 11 12

## 6 2 12 NA

## 7 2 13 NA

## 8 3 11 NA

## 9 3 12 NA

## 10 3 13 NA

## 11 4 11 NA

## 12 4 12 NA

## 13 4 13 42

## 14 4 13 32

Missing Values

When data has missing values, it often requires application domain knowledge to deal with it correctly. The tidyr package has some rudimentary support for cleaning data with missing values; you might need more features to deal with missing values properly, and you are likely to find those in the dplyr package; see Chapter 7.

The simplest way to handle missing values is to get rid of it. A crude approach is to remove all observations with one or more missing variable values. The drop_na() function does exactly that.

mean_income <- tribble(

~country, ~`2002`, ~`2003`, ~`2004`, ~`2005`,

"Numenor", 123456, 132654, NA, 324156,

"Westeros", 314256, NA, NA, 465321,

"Narnia", 432156, NA, NA, NA,

"Gondor", 531426, 321465, 235461, 463521,

"Laputa", 14235, 34125, 45123, 51234,

)

drop_na(mean_income)

## # A tibble: 2 x 5

## country `2002` `2003` `2004` `2005`

## <chr> <dbl> <dbl> <dbl> <dbl>

## 1 Gondor 531426 321465 235461 463521

## 2 Laputa 14235 34125 45123 51234

In this example, we lose a country entirely if we have missing data for a single year. This is unlikely to be acceptable and is another good argument for tidy data. If we reformat the table and drop missing values, we will only remove the observations where the income is missing for a given year.

tidy_mean_income <- gather(

data = mean_income,

key = "year",

value = "mean_income",

-country

)

drop_na(tidy_mean_income)

## # A tibble: 14 x 3

## country year mean_income

## <chr> <chr> <dbl>

## 1 Numenor 2002 123456

## 2 Westeros 2002 314256

## 3 Narnia 2002 432156

## 4 Gondor 2002 531426

## 5 Laputa 2002 14235

## 6 Numenor 2003 132654

## 7 Gondor 2003 321465

## 8 Laputa 2003 34125

## 9 Gondor 2004 235461

## 10 Laputa 2004 45123

## 11 Numenor 2005 324156

## 12 Westeros 2005 465321

## 13 Gondor 2005 463521

## 14 Laputa 2005 51234

Sometimes, you can replace missing data with some appropriate data. For example, we could replace missing data in the mean_income table by setting each value to the mean of its column. The replace_na() function does this. It takes a data frame as its first argument and a list specifying replacements as its second argument. This list should have a name for each column you want to replace missing values in, and the value assigned to these names should be the values with which you wish you replace.

replace_na(mean_income, list(

`2003` = mean(mean_income$`2003`, na.rm = TRUE)

))

## # A tibble: 5 x 5

## country `2002` `2003` `2004` `2005`

## <chr> <dbl> <dbl> <dbl> <dbl>

## 1 Numenor 123456 132654 NA 324156

## 2 Westeros 314256 162748 NA 465321

## 3 Narnia 432156 162748 NA NA

## 4 Gondor 531426 321465 235461 463521

## 5 Laputa 14235 34125 45123 51234

replace_na(mean_income, list(

`2003` = mean(mean_income$`2003`, na.rm = TRUE),

`2004` = mean(mean_income$`2004`, na.rm = TRUE),

`2005` = mean(mean_income$`2005`, na.rm = TRUE)

))

## # A tibble: 5 x 5

## country `2002` `2003` `2004` `2005`

## <chr> <dbl> <dbl> <dbl> <dbl>

## 1 Numenor 123456 132654 140292 324156

## 2 Westeros 314256 162748 140292 465321

## 3 Narnia 432156 162748 140292 326058

## 4 Gondor 531426 321465 235461 463521

## 5 Laputa 14235 34125 45123 51234

In this example, using the mean per year is unlikely to be useful; it would make more sense to replace missing data with the mean for each country. This is not immediately possible with tidyr functions, though. With dplyr we have the tools for it and I return to it in Chapter 7.

A final function for managing missing data is the fill() function . It replaces missing values with the value above them in their column (or below them if you set the argument .direction to "up". Imagine that we have a table of income per quarter of each year, but the year is only mentioned for the first quarter.

tbl <- read_csv(

"year, quarter, income

2011, Q1, 13

, Q2, 12

, Q3, 14

, Q4, 11

2012, Q1, 12

, Q2, 14

, Q3, 15

, Q4, 17"

)

We can repeat the years downward through the quarters with fill() :

fill(tbl, year)

## # A tibble: 8 x 3

## year quarter income

## <dbl> <chr> <dbl>

## 1 2011 Q1 13

## 2 2011 Q2 12

## 3 2011 Q3 14

## 4 2011 Q4 11

## 5 2012 Q1 12

## 6 2012 Q2 14

## 7 2012 Q3 15

## 8 2012 Q4 17

Nesting Data

A final functionality that tidyr provides is nesting data. A tibble entry usually contains simple data, such as numbers or strings, but it can also hold complex data, such as other tables.

Consider this table:

tbl <- tribble(

~A, ~B, ~C,

1, 11, 21,

1, 11, 12,

2, 42, 22,

2, 15, 22,

2, 15, 32,

4, 13, 32

)

From this table, we can create one that, for each value in the A column, contains a table of B and C for that A value.

nested_tbl <- nest(tbl, B, C)

nested_tbl

## # A tibble: 3 x 2

## A data

## <dbl> <list>

## 1 1 <tibble [2 × 2]>

## 2 2 <tibble [3 × 2]>

## 3 4 <tibble [1 × 2]>

nested_tbl[[1,2]]

## # A tibble: 2 x 2

## B C

## <dbl> <dbl>

## 1 11 21

## 2 11 12

The column that contains the nested table is named data, but you can change this using the .key argument .

nested_tbl <- nest(tbl, B, C, .key = "BC")

nested_tbl

## # A tibble: 3 x 2

## A BC

## <dbl> <list>

## 1 1 <tibble [2 × 2]>

## 2 2 <tibble [3 × 2]>

## 3 4 <tibble [1 × 2]>

nested_tbl$BC

## [[1]]

## # A tibble: 2 x 2

## B C

## <dbl> <dbl>

## 1 11 21

## 2 11 12

## [[2]]

## # A tibble: 3 x 2

## B C

## <dbl> <dbl>

## 1 42 22

## 2 15 22

## 3 15 32

## [[3]]

## # A tibble: 1 x 2

## B C

## <dbl> <dbl>

## 1 13 32

The arguments after the table pick the columns to be nested. You can use any number of columns here, including no columns at all. In the latter case, you will nest the entire original table in the data column .

nest(tbl, B)

## # A tibble: 5 x 3

## A C data

## <dbl> <dbl> <list>

## 1 1 21 <tibble [1 × 1]>

## 2 1 12 <tibble [1 × 1]>

## 3 2 22 <tibble [2 × 1]>

## 4 2 32 <tibble [1 × 1]>

## 5 4 32 <tibble [1 × 1]>

nest(tbl)

## # A tibble: 1 x 1

## data

## <list>

## 1 <tibble [6 × 3]>

The reverse operation is unnest() :

nested_tbl

## # A tibble: 3 x 2

## A BC

## <dbl> <list>

## 1 1 <tibble [2 × 2]>

## 2 2 <tibble [3 × 2]>

## 3 4 <tibble [1 × 2]>

unnest(nested_tbl)

## # A tibble: 6 x 3

## A B C

## <dbl> <dbl> <dbl>

## 1 1 11 21

## 2 1 11 12

## 3 2 42 22

## 4 2 15 22

## 5 2 15 32

## 6 4 13 32

Nesting is not a typical operation in data analysis, but it is there if you need it, and in the packages in Chapter 11 we see some usages.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 4. Reformatting Tables: tidyr

Create new playlist

Sign In

Sign Up

4. Reformatting Tables: tidyr

Tidy Data

Gather and Spread

Complex Column Encodings

Expanding, Crossing, and Completing

Missing Values

Nesting Data

Table of Contents for
4. Reformatting Tables: tidyr