Hadley Wickham, PhD,1 one of the more prominent members of the R community, introduced the concept of tidy data in a Journal of Statistical Software paper.2 Tidy data is a framework to structure data sets so they can be easily analyzed and visualized. It can be thought of as a goal one should aim for when cleaning data. Once you understand what tidy data is, that knowledge will make your data analysis, visualization, and collection much easier.
1. Hadley Wickham, PhD: http://hadley.nz
2. Tidy Data paper: http://vita.had.co.nz/papers/tidy-data.pdf
What is tidy data? Hadley Wickham’s paper defines it as meeting the following criteria: (1) Each row is an observation, (2) Each column is a variable, and (3) Each type of observational unit forms a table.
The newer definition from the R4DS book3 focuses on an individual data set (i.e., table):
3. R For Data Science Book: https://r4ds.had.co.nz/tidy-data.html
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
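The three criteria can be illustrated with a tiny made-up data set (the names and values here are hypothetical, invented just for this sketch). The messy version stores values of a variable (treatment type) as column names; the tidy version gives each variable its own column. The reshaping uses the `.melt()` method covered later in this chapter.

```python
import pandas as pd

# a hypothetical "messy" table: the treatment columns hold values
# of a variable (the treatment type), not variables themselves
messy = pd.DataFrame(
    {
        "name": ["John", "Jane"],
        "treatment_a": [10, 4],
        "treatment_b": [7, 12],
    }
)

# the tidy version: each variable (name, treatment, result) is a
# column, and each observation is a row
tidy = messy.melt(id_vars="name", var_name="treatment", value_name="result")
print(tidy)
```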
This chapter goes through the various ways to tidy data using examples from Wickham’s paper.
The concept map for this chapter can be found in Figure A.4.
Identify the components of tidy data
Identify common data errors
Use functions and methods to process and tidy data
Data used in this chapter will have NaN missing values when they are loaded into Pandas (Chapter 9). In the raw CSV files, they appear as empty values. I typically try to avoid forward references in this book, but tidy data is so fundamental to how we should be thinking about data technically (as opposed to ethically) that the chapter was moved toward the front of the book, before the more detailed data processing steps are covered. I could have changed the data sets so that they contained no missing values, but opted not to do so because (1) the data would no longer follow the data used in Wickham's "Tidy Data" paper, and (2) the result would be a less realistic data set.
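To see how empty CSV values become NaN on load, here is a minimal sketch using an in-memory CSV string (the snippet and its values are made up for illustration):

```python
import io

import pandas as pd

# a hypothetical CSV snippet with an empty field on the second row;
# pandas reads the empty value as NaN (missing)
csv_text = "religion,count\nAgnostic,27\nAtheist,\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df["count"].isna())
```

Note that the presence of NaN forces the `count` column to a float dtype, since the default integer dtype cannot hold missing values.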
Data can have columns that contain values instead of variables. This is usually a convenient format for data collection and presentation.
We’ll use data on income and religion in the United States from the Pew Research Center to illustrate how to work with columns that contain values, rather than variables.
import pandas as pd
pew = pd.read_csv('data/pew.csv')
When we look at this data set, we can see that not every column is a variable. The values that relate to income are spread across multiple columns. The format shown is a great choice when presenting data in a table, but for data analytics, the table should be reshaped so that we have religion, income, and count variables.
# show only the first few columns
print(pew.iloc[:, 0:5])
religion <$10k $10-20k $20-30k $30-40k
0 Agnostic 27 34 60 81
1 Atheist 12 27 37 52
2 Buddhist 27 21 30 34
3 Catholic 418 617 732 670
4 Don’t know/refused 15 14 15 11
.. ... ... ... ... ...
13 Orthodox 13 17 23 32
14 Other Christian 9 7 11 13
15 Other Faiths 20 33 40 46
16 Other World Religions 5 2 3 4
17 Unaffiliated 217 299 374 365
[18 rows x 5 columns]
This view of the data is also known as “wide” data. To turn it into the “long” tidy data format, we will have to unpivot/melt/gather (depending on which statistical programming language we use) our dataframe.
Pandas DataFrames have a method called .melt() that will reshape the dataframe into a tidy format, and it takes a few parameters:
id_vars is a container (list, tuple, ndarray) that represents the variables that will remain as is.
value_vars identifies the columns you want to melt down (or unpivot). By default, it will melt all the columns not specified in the id_vars parameter.
var_name is a string for the new column name when the value_vars is melted down. By default, it will be called variable.
value_name is a string for the new column name that represents the values for the var_name. By default, it will be called value.
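The chapter's examples rely on the value_vars default, so here is a quick sketch of passing it explicitly, using a small made-up slice of the Pew data (only two religions and three income columns, invented for this illustration). Columns listed in neither id_vars nor value_vars are dropped from the result.

```python
import pandas as pd

# a small made-up slice of the Pew data for illustration
pew_small = pd.DataFrame(
    {
        "religion": ["Agnostic", "Atheist"],
        "<$10k": [27, 12],
        "$10-20k": [34, 27],
        "$20-30k": [60, 37],
    }
)

# melt only two of the income columns by naming them in value_vars;
# the '$20-30k' column is not listed, so it is dropped
subset_long = pew_small.melt(
    id_vars="religion",
    value_vars=["<$10k", "$10-20k"],
    var_name="income",
    value_name="count",
)
print(subset_long)
```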
# we do not need to specify value_vars since we want to melt
# all the columns except for the 'religion' column
pew_long = pew.melt(id_vars='religion')
print(pew_long)
religion variable value
0 Agnostic <$10k 27
1 Atheist <$10k 12
2 Buddhist <$10k 27
3 Catholic <$10k 418
4 Don't know/refused <$10k 15
.. ... ... ...
175 Orthodox Don't know/refused 73
176 Other Christian Don't know/refused 18
177 Other Faiths Don't know/refused 71
178 Other World Religions Don't know/refused 8
179 Unaffiliated Don't know/refused 597
[180 rows x 3 columns]
We can change the defaults so that the melted/unpivoted columns are named.
pew_long = pew.melt(
id_vars="religion", var_name="income", value_name="count"
)
print(pew_long)
religion income count
0 Agnostic <$10k 27
1 Atheist <$10k 12
2 Buddhist <$10k 27
3 Catholic <$10k 418
4 Don't know/refused <$10k 15
.. ... ... ...
175 Orthodox Don't know/refused 73
176 Other Christian Don't know/refused 18
177 Other Faiths Don't know/refused 71
178 Other World Religions Don't know/refused 8
179 Unaffiliated Don't know/refused 597
[180 rows x 3 columns]
Not every data set will have one column to hold still while you unpivot the rest of the columns. As an example, consider the Billboard data set.
billboard = pd.read_csv('data/billboard.csv')
# look at the first few rows and columns
print(billboard.iloc[0:5, 0:16])
year artist track time date.entered
0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26
1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02
2 2000 3 Doors Down Kryptonite 3:53 2000-04-08
3 2000 3 Doors Down Loser 4:24 2000-10-21
4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15
wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9 wk10 wk11
0 87 82.0 72.0 77.0 87.0 94.0 99.0 NaN NaN NaN NaN
1 91 87.0 92.0 NaN NaN NaN NaN NaN NaN NaN NaN
2 81 70.0 68.0 67.0 66.0 57.0 54.0 53.0 51.0 51.0 51.0
3 76 76.0 72.0 69.0 67.0 65.0 55.0 59.0 62.0 61.0 61.0
4 57 34.0 25.0 17.0 17.0 31.0 36.0 49.0 53.0 57.0 64.0
You can see here that each week has its own column. Again, there is nothing wrong with this form of data. It may be easy to enter the data in this form, and it is much quicker to understand what it means when the data is presented in a table. However, there may be a time when you will need to melt the data. For example, if you wanted to create a faceted plot of the weekly ratings, the facet variable would need to be a column in the dataframe.
# use a list to reference more than 1 variable
billboard_long = billboard.melt(
id_vars=["year", "artist", "track", "time", "date.entered"],
var_name="week",
value_name="rating",
)
print(billboard_long)
year artist track time
0 2000 2 Pac Baby Don't Cry (Keep... 4:22
1 2000 2Ge+her The Hardest Part Of ... 3:15
2 2000 3 Doors Down Kryptonite 3:53
3 2000 3 Doors Down Loser 4:24
4 2000 504 Boyz Wobble Wobble 3:35
... ... ... ... ...
24087 2000 Yankee Grey Another Nine Minutes 3:10
24088 2000 Yearwood, Trisha Real Live Woman 3:55
24089 2000 Ying Yang Twins Whistle While You Tw... 4:19
24090 2000 Zombie Nation Kernkraft 400 3:30
24091 2000 matchbox twenty Bent 4:12
date.entered week rating
0 2000-02-26 wk1 87.0
1 2000-09-02 wk1 91.0
2 2000-04-08 wk1 81.0
3 2000-10-21 wk1 76.0
4 2000-04-15 wk1 57.0
... ... ... ...
24087 2000-04-29 wk76 NaN
24088 2000-04-01 wk76 NaN
24089 2000-03-18 wk76 NaN
24090 2000-09-02 wk76 NaN
24091 2000-04-29 wk76 NaN
[24092 rows x 7 columns]
Sometimes columns in a data set may represent multiple variables. This format is commonly seen when working with health data, for example. To illustrate this situation, let’s look at the Ebola data set.
ebola = pd.read_csv('data/country_timeseries.csv')
print(ebola.columns)
Index(['Date', 'Day', 'Cases_Guinea', 'Cases_Liberia',
'Cases_SierraLeone', 'Cases_Nigeria', 'Cases_Senegal',
'Cases_UnitedStates', 'Cases_Spain', 'Cases_Mali',
'Deaths_Guinea', 'Deaths_Liberia', 'Deaths_SierraLeone',
'Deaths_Nigeria', 'Deaths_Senegal', 'Deaths_UnitedStates',
'Deaths_Spain', 'Deaths_Mali'],
dtype='object')
# print select rows and columns
print(ebola.iloc[:5, [0, 1, 2,10]])
Date Day Cases_Guinea Deaths_Guinea
0 1/5/2015 289 2776.0 1786.0
1 1/4/2015 288 2775.0 1781.0
2 1/3/2015 287 2769.0 1767.0
3 1/2/2015 286 NaN NaN
4 12/31/2014 284 2730.0 1739.0
The column names Cases_Guinea and Deaths_Guinea actually contain two variables: the individual status (cases and deaths, respectively) and the country name, Guinea. The data is also arranged in a wide format that needs to be reshaped (with the .melt() method).
First, let’s fix the problem we know how to fix, by melting the data into long format.
ebola_long = ebola.melt(id_vars=['Date', 'Day'])
print(ebola_long)
Date Day variable value
0 1/5/2015 289 Cases_Guinea 2776.0
1 1/4/2015 288 Cases_Guinea 2775.0
2 1/3/2015 287 Cases_Guinea 2769.0
3 1/2/2015 286 Cases_Guinea NaN
4 12/31/2014 284 Cases_Guinea 2730.0
... ... ... ... ...
1947 3/27/2014 5 Deaths_Mali NaN
1948 3/26/2014 4 Deaths_Mali NaN
1949 3/25/2014 3 Deaths_Mali NaN
1950 3/24/2014 2 Deaths_Mali NaN
1951 3/22/2014 0 Deaths_Mali NaN
[1952 rows x 4 columns]
Conceptually, the column of interest can be split based on the underscore, _, in the column name. The first part will be the new status column, and the second part will be the new country column. This will require some string parsing and splitting in Python (more on this in Chapter 11). In Python, a string is an object, similar to how Pandas has Series and DataFrame objects. Chapter 2 showed how a Series can have methods such as .mean(), and a DataFrame can have methods such as .to_csv(). Strings have methods as well. In this case, we will use the .split() method, which takes a string and "splits" it up based on a given delimiter. By default, .split() will split the string based on a space, but we can pass in the underscore, _, in our example. To get access to the string methods, we need to use the .str accessor. .str is a special type of attribute that Pandas calls an "accessor" because it can "access" string methods (see Chapter 11 for more on strings). It gives us access to the Python string methods and allows us to work across the entire column. This will be the key to parsing out the multiple bits of information stored in each value.
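On a single plain Python string, .split() behaves like this:

```python
# .split() on a plain Python string: the default delimiter is
# whitespace, but we can pass any delimiter, such as an underscore
column_name = "Cases_Guinea"
print(column_name.split("_"))  # ['Cases', 'Guinea']
print("hello world".split())   # ['hello', 'world']
```

The .str accessor lets us apply this same method to every value in a Series at once, instead of looping over the strings one at a time.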
We can use the .str accessor to make a call to the .split() method and pass in the underscore, _.
# get the variable column
# access the string methods
# and split the column based on a delimiter
variable_split = ebola_long.variable.str.split('_')
print(variable_split[:5])
0 [Cases, Guinea]
1 [Cases, Guinea]
2 [Cases, Guinea]
3 [Cases, Guinea]
4 [Cases, Guinea]
Name: variable, dtype: object
After we split on the underscore, the values are returned in a list. We can tell it’s a list by:
Knowing about the .split() method on base Python string objects4
Visually seeing the square brackets, [ ], in the output
Getting the type() of one of the items in the Series
4. String .split() documentation: https://docs.python.org/3/library/stdtypes.html#str.split
# the entire container
print(type(variable_split))
<class 'pandas.core.series.Series'>
# the first element in the container
print(type(variable_split[0]))
<class 'list'>
Now that the column has been split into various pieces, the next step is to assign those pieces to new columns. First, however, we need to extract all the 0-index elements for the status column and the 1-index elements for the country column. To do so, we need to access the string methods again, and then use the .get() method to "get" the index we want for each row.
status_values = variable_split.str.get(0)
country_values = variable_split.str.get(1)
print(status_values)
0 Cases
1 Cases
2 Cases
3 Cases
4 Cases
...
1947 Deaths
1948 Deaths
1949 Deaths
1950 Deaths
1951 Deaths
Name: variable, Length: 1952, dtype: object
Now that we have the vectors we want, we can add them to our dataframe.
ebola_long['status'] = status_values
ebola_long['country'] = country_values
print(ebola_long)
Date Day variable value status country
0 1/5/2015 289 Cases_Guinea 2776.0 Cases Guinea
1 1/4/2015 288 Cases_Guinea 2775.0 Cases Guinea
2 1/3/2015 287 Cases_Guinea 2769.0 Cases Guinea
3 1/2/2015 286 Cases_Guinea NaN Cases Guinea
4 12/31/2014 284 Cases_Guinea 2730.0 Cases Guinea
... ... ... ... ... ... ...
1947 3/27/2014 5 Deaths_Mali NaN Deaths Mali
1948 3/26/2014 4 Deaths_Mali NaN Deaths Mali
1949 3/25/2014 3 Deaths_Mali NaN Deaths Mali
1950 3/24/2014 2 Deaths_Mali NaN Deaths Mali
1951 3/22/2014 0 Deaths_Mali NaN Deaths Mali
[1952 rows x 6 columns]
We can actually do the above steps in a single step. If we look at the .str.split() method documentation (you can find this by going to the Pandas API documentation > Series > String handling (.str) > .split() method5), there is a parameter named expand that defaults to False. When we set it to True, it returns a DataFrame where each result of the split is in a separate column, instead of a Series of list containers.
5. Series.str.split() method documentation: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html#pandas.Series.str.split
# reset our ebola_long data
ebola_long = ebola.melt(id_vars=['Date', 'Day'])
# split the column by _ into a dataframe using expand
variable_split = ebola_long.variable.str.split('_', expand=True)
print(variable_split)
0 1
0 Cases Guinea
1 Cases Guinea
2 Cases Guinea
3 Cases Guinea
4 Cases Guinea
... ... ...
1947 Deaths Mali
1948 Deaths Mali
1949 Deaths Mali
1950 Deaths Mali
1951 Deaths Mali
[1952 rows x 2 columns]
From here, we can actually use the Python and Pandas multiple assignment feature (Appendix Q) to directly assign the newly split columns into the original DataFrame. Since our variable_split output is a DataFrame with two columns, we can assign two new columns to our ebola_long DataFrame.
ebola_long[['status', 'country']] = variable_split
print(ebola_long)
Date Day variable value status country
0 1/5/2015 289 Cases_Guinea 2776.0 Cases Guinea
1 1/4/2015 288 Cases_Guinea 2775.0 Cases Guinea
2 1/3/2015 287 Cases_Guinea 2769.0 Cases Guinea
3 1/2/2015 286 Cases_Guinea NaN Cases Guinea
4 12/31/2014 284 Cases_Guinea 2730.0 Cases Guinea
... ... ... ... ... ... ...
1947 3/27/2014 5 Deaths_Mali NaN Deaths Mali
1948 3/26/2014 4 Deaths_Mali NaN Deaths Mali
1949 3/25/2014 3 Deaths_Mali NaN Deaths Mali
1950 3/24/2014 2 Deaths_Mali NaN Deaths Mali
1951 3/22/2014 0 Deaths_Mali NaN Deaths Mali
[1952 rows x 6 columns]
You can also opt to do this with a concatenation (pd.concat()) function call (Chapter 6).
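Here is a minimal sketch of that pd.concat() alternative, using a small made-up frame with the same shape as ebola_long (the values are invented for this illustration): name the split columns first, then concatenate column-wise.

```python
import pandas as pd

# a small made-up frame in the same shape as ebola_long
ebola_long = pd.DataFrame(
    {
        "Date": ["1/5/2015", "1/5/2015"],
        "Day": [289, 289],
        "variable": ["Cases_Guinea", "Deaths_Guinea"],
        "value": [2776.0, 1786.0],
    }
)

# split into a two-column DataFrame and give the columns names
variable_split = ebola_long["variable"].str.split("_", expand=True)
variable_split.columns = ["status", "country"]

# column-wise concatenation (axis="columns") lines up on the index
ebola_parsed = pd.concat([ebola_long, variable_split], axis="columns")
print(ebola_parsed)
```

Because pd.concat() aligns on the row index, this produces the same result as the multiple-assignment approach shown above.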
At times, data will be formatted so that variables are in both rows and columns, that is, in some combination of the formats described in the previous sections of this chapter. Most of the methods needed to tidy up such data have already been presented (.melt() and some string parsing with the .str accessor attribute). What is left to show is what happens when a column of data actually holds two variables instead of one. In this case, we will have to "pivot" the variable into separate columns, i.e., go from long data to wide data.
weather = pd.read_csv('data/weather.csv')
print(weather.iloc[:5, :11])
id year month element d1 d2 d3 d4 d5 d6 d7
0 MX17004 2010 1 tmax NaN NaN NaN NaN NaN NaN NaN
1 MX17004 2010 1 tmin NaN NaN NaN NaN NaN NaN NaN
2 MX17004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN NaN
3 MX17004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN NaN
4 MX17004 2010 3 tmax NaN NaN NaN NaN 32.1 NaN NaN
The weather data include minimum (tmin) and maximum (tmax) temperatures recorded for each day (d1, d2, ..., d31) of the month (month). The element column contains variables that need to be pivoted wider to become new columns, and the day columns need to be melted into row values.
Again, there is nothing wrong with the data in the current format. It is simply not in a shape amenable to analysis, although this kind of formatting can be helpful when presenting data in reports. Let's first fix the day values.
weather_melt = weather.melt(
id_vars=["id", "year", "month", "element"],
var_name="day",
value_name="temp",
)
print(weather_melt)
id year month element day temp
0 MX17004 2010 1 tmax d1 NaN
1 MX17004 2010 1 tmin d1 NaN
2 MX17004 2010 2 tmax d1 NaN
3 MX17004 2010 2 tmin d1 NaN
4 MX17004 2010 3 tmax d1 NaN
.. ... ... ... ... ... ...
677 MX17004 2010 10 tmin d31 NaN
678 MX17004 2010 11 tmax d31 NaN
679 MX17004 2010 11 tmin d31 NaN
680 MX17004 2010 12 tmax d31 NaN
681 MX17004 2010 12 tmin d31 NaN
[682 rows x 6 columns]
Next, we need to pivot up the variables stored in the element column.
weather_tidy = weather_melt.pivot_table(
index=['id', 'year', 'month', 'day'],
columns='element',
values='temp'
)
print(weather_tidy)
element tmax tmin
id year month day
MX17004 2010 1 d30 27.8 14.5
2 d11 29.7 13.4
d2 27.3 14.4
d23 29.9 10.7
d3 24.1 14.4
... ... ...
11 d27 27.7 14.2
d26 28.1 12.1
d4 27.2 12.0
12 d1 29.9 13.8
d6 27.8 10.5
[33 rows x 2 columns]
Looking at the pivoted table, we notice that each value in the element column is now a separate column. We can leave the table in this state, but we can also flatten the hierarchical columns.
weather_tidy_flat = weather_tidy.reset_index()
print(weather_tidy_flat)
element id year month day tmax tmin
0 MX17004 2010 1 d30 27.8 14.5
1 MX17004 2010 2 d11 29.7 13.4
2 MX17004 2010 2 d2 27.3 14.4
3 MX17004 2010 2 d23 29.9 10.7
4 MX17004 2010 2 d3 24.1 14.4
.. ... ... ... ... ... ...
28 MX17004 2010 11 d27 27.7 14.2
29 MX17004 2010 11 d26 28.1 12.1
30 MX17004 2010 11 d4 27.2 12.0
31 MX17004 2010 12 d1 29.9 13.8
32 MX17004 2010 12 d6 27.8 10.5
[33 rows x 6 columns]
Alternatively, we can chain these methods together and skip the intermediate dataframe:
weather_tidy = (
weather_melt
.pivot_table(
index=['id', 'year', 'month', 'day'],
columns='element',
values='temp')
.reset_index()
)
print(weather_tidy)
element id year month day tmax tmin
0 MX17004 2010 1 d30 27.8 14.5
1 MX17004 2010 2 d11 29.7 13.4
2 MX17004 2010 2 d2 27.3 14.4
3 MX17004 2010 2 d23 29.9 10.7
4 MX17004 2010 2 d3 24.1 14.4
.. ... ... ... ... ... ...
28 MX17004 2010 11 d27 27.7 14.2
29 MX17004 2010 11 d26 28.1 12.1
30 MX17004 2010 11 d4 27.2 12.0
31 MX17004 2010 12 d1 29.9 13.8
32 MX17004 2010 12 d6 27.8 10.5
[33 rows x 6 columns]
This chapter explored how we can reshape data into a format that is conducive to data analysis, visualization, and collection. We applied the concepts in Hadley Wickham’s “Tidy Data” paper to show the various functions and methods to reshape our data. This is an important skill because some functions need data to be organized into a certain shape, tidy or not, to work. Knowing how to reshape your data is an important skill for both the data scientist and the analyst.