12 Reshaping Data with tidyr

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

12
Reshaping Data with `tidyr`

One of the most common data wrangling challenges is adjusting how exactly row and columns are used to represent your data. Structuring (or restructuring) data frames to have the desired shape can be the most difficult part of creating a visualization, running a statistical model, or implementing a machine learning algorithm.

This chapter describes how you can use the tidyr (“tidy-er”) package to effectively transform your data into an appropriate shape for analysis and visualization.

12.1 What Is “Tidy” Data?

When wrangling data into a data frame for your analysis, you need to decide on the desired structure of that data frame. You need to determine what each row and column will represent, so that you can consistently and clearly manipulate that data (e.g., you know what you will be selecting and what you will be filtering). The tidyr package is used to structure and work with data fames that follow three principles of tidy data (as described by the package’s documentation¹):

¹tidyr: https://tidyr.tidyverse.org

Each variable is in a column.
Each observation is a row.
Each value is a cell.

Indeed, these principles lead to the data structuring described in Chapter 9: rows represent observations, and columns represent features of that data.

However, asking different questions of a data set may involve different interpretations of what constitutes an “observation.” For example, Section 11.6 described working with the flights data set from the nycflights13 package, in which each observation is a flight. However, the analysis made comparisons between airlines, airports, and months. Each question worked with a different unit of analysis, implying a different data structure (e.g., what should be represented by each row). While the example somewhat changed the nature of these rows by grouping and joining different data sets, having a more specific data structure where each row represented a specific unit of analysis (e.g., an airline or a month) may have made much of the wrangling and analysis more straightforward.

To use multiple different definitions of an “observation” when investigating your data, you will need to create multiple representations (i.e., data frames) of the same data set—each with its own configuration of rows and columns.

To demonstrate how you may need to adjust what each observation represents, consider the (fabricated) data set of music concert prices shown in Table 12.1. In this table, each observation (row) represents a city, with each city having features (columns) of the ticket price for a specific band.

Table 12.1 A “wide” data set of concert ticket price in different cities. Each observation (i.e., unit of analysis) is a city, and each feature is the concert ticket price for a given band.

city	greensky_bluegrass	trampled_by_turtles	billy_strings	fruition
Seattle	40	30	15	30
Portland	40	20	25	50
Denver	20	40	25	40
Minneapolis	30	100	15	20

But consider if you wanted to analyze the ticket price across all concerts. You could not do this easily with the data in its current form, since the data is organized by city (not by concert)! You would prefer instead that all of the prices were listed in a single column, as a feature of a row representing a single concert (a city-and-band combination), as in Table 12.2.

Table 12.2 A “long” data set of concert ticket price by city and band. Each observation (i.e., unit of analysis) is a city–band combination, and each has a single feature that is the ticket price.

city	band	price
Seattle	greensky_bluegrass	40
Portland	greensky_bluegrass	40
Denver	greensky_bluegrass	20
Minneapolis	greensky_bluegrass	30
Seattle	trampled_by_turtles	30
Portland	trampled_by_turtles	20
Denver	trampled_by_turtles	40
Minneapolis	trampled_by_turtles	100
Seattle	billy_strings	15
Portland	billy_strings	25
Denver	billy_strings	25
Minneapolis	billy_strings	15
Seattle	fruition	30
Portland	fruition	50
Denver	fruition	40
Minneapolis	fruition	20

Both Table 12.1 and Table 12.2 represent the same set of data—they both have prices for 16 different concerts. But by representing that data in terms of different observations, they may better support different analyses. These data tables are said to be in a different orientation: the price data in Table 12.1 is often referred to being in wide format (because it is spread wide across multiple columns), while the price data in Table 12.2 is in long format (because it is in one long column). Note that the long format table includes some duplicated data (the names of the cities and bands are repeated), which is part of why the data might instead be stored in wide format in the first place!

12.2 From Columns to Rows: `gather()`

Sometimes you may want to change the structure of your data—how your data is organized in terms of observations and features. To help you do so, the tidyr package provides elegant functions for transforming between orientations.

For example, to move from wide format (Table 12.1) to long format (Table 12.2), you need to gather all of the prices into a single column. You can do this using the gather() function, which collects data values stored across multiple columns into a single new feature (e.g., “price” in Table 12.2), along with an additional new column representing which feature that value was gathered from (e.g., “band” in Table 12.2). In effect, it creates two columns representing key–value pairs of the feature and its value from the original data frame.

Click here to view code image

# Reshape by gathering prices into a single feature
band_data_long <- gather(
band_data_wide,   # data frame to gather from
  key = band,     # name for new column listing the gathered features
  value = price,  # name for new column listing the gathered values
  -city           # columns to gather data from, as in dplyr's `select()`
)

The gather() function takes in a number of arguments, starting with the data frame to gather from. It then takes in a key argument giving a name for a column that will contain as values the column names the data was gathered from—for example, a new band column that will contains the values "greensky_bluegrass", "trampled_by_turtles", and so on. The third argument is a value, which is the name for the column that will contain the gathered values—for example, price to contain the price numbers. Finally, the function takes in arguments representing which columns to gather data from, using syntax similar to using dplyr to select() those columns (in the preceding example, -city indicates that it should gather from all columns except city). Again, any columns provided as this final set of arguments will have their names listed in the key column, and their values listed in the value column. This process is illustrated in Figure 12.1. The gather() function’s syntax can be hard to intuit and remember; try tracing where each value “moves” in the table and diagram.

A screenshot to explain the "gather" function. — Figure 12.1 The `gather()` function takes values from multiple columns (`greensky_bluegrass`, `trampled_by_turtles`, etc.) and *gathers* them into a (new) single column (`price`). In doing so, it also creates a new column (`band`) that stores the names of the columns that were gathered (i.e., the column name in which each value was stored prior to gathering).

A sample data frame with five columns: city, greensky_bluegrass, trampled_by_turtles, billy_strings, and fruition with four rows is shown. After using the function, "gather(band_data, key = band, value = price, -city)" two new columns: band and price are formed. The columns, greensky_bluegrass, trampled_by_turtles, billy_strings, and fruition of the data frame are grouped under "band" where the entire data under these columns are grouped under "price". Thus the resultant data frame has three columns: city, band, and price, and 16 rows.

Note that once data is in long format, you can continue to analyze an individual feature (e.g., a specific band) by filtering for that value. For example, filter(band_data_long, band == "greensky_bluegrass") would produce just the prices for a single band.

12.3 From Rows to Columns: `spread()`

It is also possible to transform a data table from long format into wide format—that is, to spread out the prices into multiple columns. Thus, while the gather() function collects multiple features into two columns, the spread() function creates multiple features from two existing columns. For example, you can take the long format data shown in Table 12.2 and spread it out so that each observation is a band, as in Table 12.3:

Click here to view code image

# Reshape long data (Table 12.2), spreading prices out among multiple features
price_by_band <- spread(
  band_data_long, # data frame to spread from
  key = city,     # column indicating where to get new feature names
  value = price   # column indicating where to get new feature values
)

Table 12.3 A “wide” data set of concert ticket prices for a set of bands. Each observation (i.e., unit of analysis) is a band, and each feature is the ticket price in a given city.

band	Denver	Minneapolis	Portland	Seattle
billy_strings	25	15	25	15
fruition	40	20	50	30
greensky_bluegrass	20	30	40	40
trampled_by_turtles	40	100	20	30

The spread() function takes arguments similar to those passed to the gather() function, but applies them in the opposite direction. In this case, the key and value arguments are where to get the column names and values, respectively. The spread() function will create a new column for each unique value in the provided key column, with values taken from the value feature. In the preceding example, the new column names (e.g., "Denver", "Minneapolis") were taken from the city feature in the long format table, and the values for those columns were taken from the price feature. This process is illustrated in Figure 12.2.

A screenshot to explain the "spread" function. — Figure 12.2 The `spread()` function *spreads out* a single column into multiple columns. It creates a new column for each unique value in the provided `key` column (`city`). The values in each new column will be populated with the provided `value` column (`price`).

A sample data frame with three columns: city, band, and price, and 16 rows is shown. After using the function, "spread(band_data_long, key = city, value = price)," the city column is divided into four new columns, based on the entries in the city column: Denver, Minneapolis, Portland, and Seattle. The newly formed columns are populated with the values from the price column. Thus the resultant data frame has five columns: band, Denver, Minneapolis, Portland, and Seattle, and 4 rows.

By combining gather() and spread(), you can effectively change the “shape” of your data and what concept is represented by an observation.

Tip

Before spreading or gathering your data, you will often need to unite multiple columns into a single column, or to separate a single column into multiple columns. The tidyr functions unite()^a and separate()^b provide a specific syntax for these common data preparation tasks.

^a https://tidyr.tidyverse.org/reference/unite.html

^b https://tidyr.tidyverse.org/reference/separate.html

12.4 `tidyr` in Action: Exploring Educational Statistics

This section uses a real data set to demonstrate how reshaping your data with tidyr is an integral part of the data exploration process. The data in this example was downloaded from the World Bank Data Explorer,² which is a data collection of hundreds of indicators (measures) of different economic and social development factors. In particular, this example considers educational indicators³ that capture a relevant signal of a country’s level of (or investment in) education—for example, government expenditure on education, literacy rates, school enrollment rates, and dozens of other measures of educational attainment. The imperfections of this data set (unnecessary rows at the top of the .csv file, a substantial amount of missing data, long column names with special characters) are representative of the challenges involved in working with real data sets. All graphics in this section were built using the ggplot2 package, which is described in Chapter 16. The complete code for this analysis is also available online in the book’s code repository.⁴

²World Bank Data Explorer: https://data.worldbank.org

³World Bank education: http://datatopics.worldbank.org/education

⁴tidyr in Action: https://github.com/programming-for-data-science/in-action/tree/master/tidyr

After having downloaded the data, you will need to load it into your R environment:

Click here to view code image

# Load data, skipping the unnecessary first 4 rows
wb_data <- read.csv(
  "data/world_bank_data.csv",
  stringsAsFactors = F,
  skip = 4
)

When you first load the data, each observation (row) represents an indicator for a country, with features (columns) that are the values of that indicator in a given year (see Figure 12.3). Notice that many values, particularly for earlier years, are missing (NA). Also, because R does not allow column names to be numbers, the read.csv() function has prepended an X to each column name (which is just a number in the raw .csv file).

A screenshot shows a data frame representing the untransformed World Bank educational data. It has seven columns: Country.Name, Country.Code, Indicator.Name, Indicator.Code, X1960, X1961, and X1962, and seven rows. — Figure 12.3 Untransformed World Bank educational data used in Section 12.4.

While in terms of the indicator this data is in long format, in terms of the indicator and year the data is in wide format—a single column contains all the values for a single year. This structure allows you to make comparisons between years for the indicators by filtering for the indicator of interest. For example, you could compare each country’s educational expenditure in 1990 to its expenditure in 2014 as follows:

Click here to view code image

# Visually compare expenditures for 1990 and 2014

# Begin by filtering the rows for the indicator of interest
indicator <- "Government expenditure on education, total (% of GDP)"
expenditure_plot_data <- wb_data %>%
  filter(Indicator.Name == indicator)

# Plot the expenditure in 1990 against 2014 using the `ggplot2` package
# See Chapter 16 for details
expenditure_chart <- ggplot(data = expenditure_plot_data) +
  geom_text_repel(
    mapping = aes(x = X1990 / 100, y = X2014 / 100, label = Country.Code)
  ) +
  scale_x_continuous(labels = percent) +
  scale_y_continuous(labels = percent) +
  labs(title = indicator, x = "Expenditure 1990", y = "Expenditure 2014")

Figure 12.4 shows that the expenditure (relative to gross domestic product) is fairly correlated between the two time points: countries that spent more in 1990 also spent more in 2014 (specifically, the correlation—calculated in R using the cor() function—is .64).

A graph named "Government expenditure on education, total (percentage of GDP)" shows a comparison of each countrys education expenditures from 1990 to 2014. — Figure 12.4 A comparison of each country’s education expenditures in 1990 and 2014.

The horizontal axis of the graph represents "Expenditures in 1990" ranging from 0 to 12.5 in increments of 2.5. The vertical axis represents "Expenditures in 2014" ranging from 0 to 8 in increments of 2. The comparison involves 37 countries, where all the plotted values are less than 7.5 of the horizontal axis. In the region from (2.5, 6) to (7, 6), most countries are accumulated, hence forming a denser region. Other countries are randomly scattered around the denser region. For example, Sweden (SWE) is plotted at (5.2, 8), Finland (FIN) is at (5, 7.8), and so on, where the first value is for the year 1990 and the second value is for the year 2014. Note: All the values are marked approximately.

However, if you want to extend your analysis to visually compare how the expenditure across all years varies for a given country, you would need to reshape the data. Instead of having each observation be an indicator for a country, you want each observation to be an indicator for a country for a year—thereby having all of the values for all of the years in a single column and making the data long(er) format.

To do this, you can gather() the year columns together:

Click here to view code image

# Reshape the data to create a new column for the `year`
long_year_data <- wb_data %>%
  gather(
    key = year, # `year` will be the new key column
    value = value, # `value` will be the new value column
    X1960:X # all columns between `X1960` and `X` will be gathered
  )

As shown in Figure 12.5, this gather() statement creates a year column, so each observation (row) represents the value of an indicator in a particular country in a given year. The expenditure for each year is stored in the value column created (coincidentally, this column is given the name "value").

A screenshot shows a data frame representing the reshaped educational data. It has five columns: Country.Name, Country.Code, Indicator.Name, year, and value, and 11 rows. — Figure 12.5 Reshaped educational data (long format by `year`). This structure allows you to more easily create visualizations across multiple years.

This structure will now allow you to compare fluctuations in an indicator’s value over time (across all years):

Click here to view code image

# Filter the rows for the indicator and country of interest
indicator <- "Government expenditure on education, total (% of GDP)"
spain_plot_data <- long_year_data %>%
  filter(
    Indicator.Name == indicator,
    Country.Code == "ESP" # Spain
  ) %>%
  mutate(year = as.numeric(substr(year, 2, 5))) # remove "X" before each year

# Show the educational expenditure over time
chart_title <- paste(indicator, " in Spain")
spain_chart <- ggplot(data = spain_plot_data) +
  geom_line(mapping = aes(x = year, y = value / 100)) +
  scale_y_continuous(labels = percent) +
  labs(title = chart_title, x = "Year", y = "Percent of GDP Expenditure")

The resulting chart, shown in Figure 12.6, uses the available data to show a timeline of the fluctuations in government expenditures on education in Spain. This produces a more complete picture of the history of educational investment, and draws attention to major changes as well as the absence of data in particular years.

A discontinuous graph plot represents the fluctuations in government expenditures on education in Spain. — Figure 12.6 Education expenditures over time in Spain.

The horizontal axis of the plot represents "Year" ranging from 1960 to 2000 in increments of 20 years. The vertical axis reads "Percentage of GDP expenditure" ranging from 2 to 5 percent in increments of 1 percent. A trend line with smooth fluctuations is plotted with discontinuities from (1972, 2.1) to (1978, 1.8); (1978, 2.4) to (1984, 3.3); (1996, 4.5) to (1998, 4.3). The plot has an increasing trend overall with minor ups and downs. Note: All the values are marked approximately.

You may also want to compare two indicators to each other. For example, you may want to assess the relationship between each country’s literacy rate (a first indicator) and its unemployment rate (a second indicator). To do this, you would need to reshape the data again so that each observation is a particular country and each column is an indicator. Since indicators are currently in one column, you need to spread them out using the spread() function:

Click here to view code image

# Reshape the data to create columns for each indicator
wide_data <- long_year_data %>%
  select(-Indicator.Code) %>% # do not include the `Indicator.Code` column
  spread(
    key = Indicator.Name, # new column names are `Indicator.Name` values
    value = value # populate new columns with values from `value`
  )

This wide format data shape allows for comparisons between two different indicators. For example, you can explore the relationship between female unemployment and female literacy rates, as shown in Figure 12.7.

Click here to view code image

# Prepare data and filter for year of interest
x_var <- "Literacy rate, adult female (% of females ages 15 and above)"
y_var <- "Unemployment, female (% of female labor force) (modeled
  ILO estimate)"

lit_plot_data <- wide_data %>%
  mutate(
    lit_percent_2014 = wide_data[, x_var] / 100,
    employ_percent_2014 = wide_data[, y_var] / 100
  ) %>%
  filter(year == "X2014")

# Show the literacy vs. employment rates
lit_chart <- ggplot(data = lit_plot_data) +
  geom_point(mapping = aes(x = lit_percent_2014, y = employ_percent_2014)) +
  scale_x_continuous(labels = percent) +
  scale_y_continuous(labels = percent) +
  labs(
    x = x_var,
    y = "Unemployment, female (% of female labor force)",
    title = "Female Literacy Rate versus Female Unemployment Rate"
  )

A scatter plot represents the "Female Literacy Rate versus Female Unemployment Rate" in the year 2014. — Figure 12.7 Female literacy rate versus unemployment rate in 2014.

The horizontal axis of the plot represents "Literacy rate, adult female (percentage of females ages 15 and older)" ranging from 0 to 20 percent in increments of 20. The vertical axis represents "Unemployment, female (percentage of the female labor force)" ranging from 0 to 40 percent in increments of 10. The plots are randomly scattered. For literacy rate from 20 to 70, the labor force percent is in the range of 0 to 11. For literacy rate from 70 to 100, the labor force percent is in the range of 0 to 40. The points are dense in the corresponding region where literacy rate range is 80 to 100 and labor force range is 0 to 15.

Each comparison in this analysis—between two time points, over a full time-series, and between indicators—required a different representation of the data set. Mastering use of the tidyr functions will allow you to quickly transform the shape of your data set, allowing for rapid and effective data analysis. For practice reshaping data with the tidyr package, see the set of accompanying book exercises.⁵

⁵tidyr exercises: https://github.com/programming-for-data-science/chapter-12-exercises

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 12 Reshaping Data with tidyr

Create new playlist

Sign In

Sign Up

12Reshaping Data with tidyr

12.1 What Is “Tidy” Data?

12.2 From Columns to Rows: gather()

12.3 From Rows to Columns: spread()

12.4 tidyr in Action: Exploring Educational Statistics

Table of Contents for
12 Reshaping Data with tidyr

12
Reshaping Data with `tidyr`

12.2 From Columns to Rows: `gather()`

12.3 From Rows to Columns: `spread()`

12.4 `tidyr` in Action: Exploring Educational Statistics