Hadley Wickham, PhD,1 one of the more prominent members of the R community, introduced the concept of tidy data in a Journal of Statistical Software paper.2 Tidy data is a framework to structure data sets so they can be easily analyzed and visualized. It can be thought of as a goal one should aim for when cleaning data. Once you understand what tidy data is, that knowledge will make your data analysis, visualization, and collection much easier.
1. Hadley Wickham, PhD: http://hadley.nz
2. Tidy Data paper: http://vita.had.co.nz/papers/tidy-data.pdf
What is tidy data? Hadley Wickham’s paper defines it as meeting the following criteria: (1) Each row is an observation, (2) Each column is a variable, and (3) Each type of observational unit forms a table.
The newer definition from the R4DS book3 focuses on an individual data set (i.e., table):
3. R For Data Science Book: https://r4ds.had.co.nz/tidy-data.html
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
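The three criteria can be illustrated with a tiny made-up data set (the names and values here are hypothetical, invented just for this sketch). The messy version stores values of a variable (treatment type) as column names; the tidy version gives each variable its own column. The reshaping uses the `.melt()` method covered later in this chapter.

```python
import pandas as pd

# a hypothetical "messy" table: the treatment columns hold values
# of a variable (the treatment type), not variables themselves
messy = pd.DataFrame(
    {
        "name": ["John", "Jane"],
        "treatment_a": [10, 4],
        "treatment_b": [7, 12],
    }
)

# the tidy version: each variable (name, treatment, result) is a
# column, and each observation is a row
tidy = messy.melt(id_vars="name", var_name="treatment", value_name="result")
print(tidy)
```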
This chapter goes through the various ways to tidy data using examples from Wickham’s paper.
The concept map for this chapter can be found in Figure A.4.
Identify the components of tidy data
Identify common data errors
Use functions and methods to process and tidy data
Data used in this chapter will have NaN missing values when they are loaded into Pandas (Chapter 9). In the raw CSV files, they appear as empty values. I typically try to avoid forward references in this book, but tidy data is so fundamental to how we should be thinking about data technically (as opposed to ethically) that the chapter was moved toward the front of the book, before the more detailed data processing steps are covered. I could have changed the data sets so that they contained no missing values, but opted not to do so because (1) the data would no longer follow the data used in Wickham's "Tidy Data" paper, and (2) the result would be a less realistic data set.
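To see how empty CSV values become NaN on load, here is a minimal sketch using an in-memory CSV string (the snippet and its values are made up for illustration):

```python
import io

import pandas as pd

# a hypothetical CSV snippet with an empty field on the second row;
# pandas reads the empty value as NaN (missing)
csv_text = "religion,count\nAgnostic,27\nAtheist,\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df["count"].isna())
```

Note that the presence of NaN forces the `count` column to a float dtype, since the default integer dtype cannot hold missing values.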
Data can have columns that contain values instead of variables. This is usually a convenient format for data collection and presentation.
We’ll use data on income and religion in the United States from the Pew Research Center to illustrate how to work with columns that contain values, rather than variables.
import pandas as pd
pew = pd.read_csv('data/pew.csv')
When we look at this data set, we can see that not every column is a variable. The values that relate to income are spread across multiple columns. The format shown is a great choice when presenting data in a table, but for data analytics, the table should be reshaped so that we have religion, income, and count variables.
# show only the first few columns
print(pew.iloc[:, 0:5])
religion <$10k $10-20k $20-30k $30-40k
0 Agnostic 27 34 60 81
1 Atheist 12 27 37 52
2 Buddhist 27 21 30 34
3 Catholic 418 617 732 670
4 Don’t know/refused 15 14 15 11
.. ... ... ... ... ...
13 Orthodox 13 17 23 32
14 Other Christian 9 7 11 13
15 Other Faiths 20 33 40 46
16 Other World Religions 5 2 3 4
17 Unaffiliated 217 299 374 365
[18 rows x 5 columns]
This view of the data is also known as “wide” data. To turn it into the “long” tidy data format, we will have to unpivot/melt/gather (depending on which statistical programming language we use) our dataframe.
Pandas DataFrames have a method called .melt() that will reshape the dataframe into a tidy format, and it takes a few parameters:
id_vars is a container (list, tuple, ndarray) that represents the variables that will remain as is.
value_vars identifies the columns you want to melt down (or unpivot). By default, it will melt all the columns not specified in the id_vars parameter.
var_name is a string for the new column name when the value_vars is melted down. By default, it will be called variable.
value_name is a string for the new column name that represents the values for the var_name. By default, it will be called value.
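The chapter's examples rely on the value_vars default, so here is a quick sketch of passing it explicitly, using a small made-up slice of the Pew data (only two religions and three income columns, invented for this illustration). Columns listed in neither id_vars nor value_vars are dropped from the result.

```python
import pandas as pd

# a small made-up slice of the Pew data for illustration
pew_small = pd.DataFrame(
    {
        "religion": ["Agnostic", "Atheist"],
        "<$10k": [27, 12],
        "$10-20k": [34, 27],
        "$20-30k": [60, 37],
    }
)

# melt only two of the income columns by naming them in value_vars;
# the '$20-30k' column is not listed, so it is dropped
subset_long = pew_small.melt(
    id_vars="religion",
    value_vars=["<$10k", "$10-20k"],
    var_name="income",
    value_name="count",
)
print(subset_long)
```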
# we do not need to specify value_vars since we want to melt
# all the columns except for the 'religion' column
pew_long = pew.melt(id_vars='religion')
print(pew_long)
religion variable value
0 Agnostic <$10k 27
1 Atheist <$10k 12
2 Buddhist <$10k 27
3 Catholic <$10k 418
4 Don't know/refused <$10k 15
.. ... ... ...
175 Orthodox Don't know/refused 73
176 Other Christian Don't know/refused 18
177 Other Faiths Don't know/refused 71
178 Other World Religions Don't know/refused 8
179 Unaffiliated Don't know/refused 597
[180 rows x 3 columns]
We can change the defaults so that the melted/unpivoted columns are named.
pew_long = pew.melt(
id_vars="religion", var_name="income", value_name="count"
)
print(pew_long)
religion income count
0 Agnostic <$10k 27
1 Atheist <$10k 12
2 Buddhist <$10k 27
3 Catholic <$10k 418
4 Don't know/refused <$10k 15
.. ... ... ...
175 Orthodox Don't know/refused 73
176 Other Christian Don't know/refused 18
177 Other Faiths Don't know/refused 71
178 Other World Religions Don't know/refused 8
179 Unaffiliated Don't know/refused 597
[180 rows x 3 columns]
Not every data set will have one column to hold still while you unpivot the rest of the columns. As an example, consider the Billboard data set.
billboard = pd.read_csv('data/billboard.csv')
# look at the first few rows and columns
print(billboard.iloc[0:5, 0:16])
year artist track time date.entered
0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26
1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02
2 2000 3 Doors Down Kryptonite 3:53 2000-04-08
3 2000 3 Doors Down Loser 4:24 2000-10-21
4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15
wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9 wk10 wk11
0 87 82.0 72.0 77.0 87.0 94.0 99.0 NaN NaN NaN NaN
1 91 87.0 92.0 NaN NaN NaN NaN NaN NaN NaN NaN
2 81 70.0 68.0 67.0 66.0 57.0 54.0 53.0 51.0 51.0 51.0
3 76 76.0 72.0 69.0 67.0 65.0 55.0 59.0 62.0 61.0 61.0
4 57 34.0 25.0 17.0 17.0 31.0 36.0 49.0 53.0 57.0 64.0
You can see here that each week has its own column. Again, there is nothing wrong with this form of data. It may be easy to enter the data in this form, and it is much quicker to understand what it means when the data is presented in a table. However, there may be a time when you will need to melt the data. For example, if you wanted to create a faceted plot of the weekly ratings, the facet variable would need to be a column in the dataframe.
# use a list to reference more than 1 variable
billboard_long = billboard.melt(
id_vars=["year", "artist", "track", "time", "date.entered"],
var_name="week",
value_name="rating",
)
print(billboard_long)
year artist track time
0 2000 2 Pac Baby Don't Cry (Keep... 4:22
1 2000 2Ge+her The Hardest Part Of ... 3:15
2 2000 3 Doors Down Kryptonite 3:53
3 2000 3 Doors Down Loser 4:24
4 2000 504 Boyz Wobble Wobble 3:35
... ... ... ... ...
24087 2000 Yankee Grey Another Nine Minutes 3:10
24088 2000 Yearwood, Trisha Real Live Woman 3:55
24089 2000 Ying Yang Twins Whistle While You Tw... 4:19
24090 2000 Zombie Nation Kernkraft 400 3:30
24091 2000 matchbox twenty Bent 4:12
date.entered week rating
0 2000-02-26 wk1 87.0
1 2000-09-02 wk1 91.0
2 2000-04-08 wk1 81.0
3 2000-10-21 wk1 76.0
4 2000-04-15 wk1 57.0
... ... ... ...
24087 2000-04-29 wk76 NaN
24088 2000-04-01 wk76 NaN
24089 2000-03-18 wk76 NaN
24090 2000-09-02 wk76 NaN
24091 2000-04-29 wk76 NaN
[24092 rows x 7 columns]
Sometimes columns in a data set may represent multiple variables. This format is commonly seen when working with health data, for example. To illustrate this situation, let’s look at the Ebola data set.
ebola = pd.read_csv('data/country_timeseries.csv')
print(ebola.columns)
Index(['Date', 'Day', 'Cases_Guinea', 'Cases_Liberia',
'Cases_SierraLeone', 'Cases_Nigeria', 'Cases_Senegal',
'Cases_UnitedStates', 'Cases_Spain', 'Cases_Mali',
'Deaths_Guinea', 'Deaths_Liberia', 'Deaths_SierraLeone',
'Deaths_Nigeria', 'Deaths_Senegal', 'Deaths_UnitedStates',
'Deaths_Spain', 'Deaths_Mali'],
dtype='object')
# print select rows and columns
print(ebola.iloc[:5, [0, 1, 2,10]])
Date Day Cases_Guinea Deaths_Guinea
0 1/5/2015 289 2776.0 1786.0
1 1/4/2015 288 2775.0 1781.0
2 1/3/2015 287 2769.0 1767.0
3 1/2/2015 286 NaN NaN
4 12/31/2014 284 2730.0 1739.0
The column names Cases_Guinea and Deaths_Guinea actually contain two variables: the individual status (cases and deaths, respectively) and the country name, Guinea. The data is also arranged in a wide format that needs to be reshaped (with the .melt() method).
First, let’s fix the problem we know how to fix, by melting the data into long format.
ebola_long = ebola.melt(id_vars=['Date', 'Day'])
print(ebola_long)
Date Day variable value
0 1/5/2015 289 Cases_Guinea 2776.0
1 1/4/2015 288 Cases_Guinea 2775.0
2 1/3/2015 287 Cases_Guinea 2769.0
3 1/2/2015 286 Cases_Guinea NaN
4 12/31/2014 284 Cases_Guinea 2730.0
... ... ... ... ...
1947 3/27/2014 5 Deaths_Mali NaN
1948 3/26/2014 4 Deaths_Mali NaN
1949 3/25/2014 3 Deaths_Mali NaN
1950 3/24/2014 2 Deaths_Mali NaN
1951 3/22/2014 0 Deaths_Mali NaN
[1952 rows x 4 columns]
Conceptually, the column of interest can be split based on the underscore, _, in the column name. The first part will be the new status column, and the second part will be the new country column. This will require some string parsing and splitting in Python (more on this in Chapter 11). In Python, a string is an object, similar to how Pandas has Series and DataFrame objects. Chapter 2 showed how a Series can have methods such as .mean(), and a DataFrame can have methods such as .to_csv(). Strings have methods as well. In this case, we will use the .split() method, which takes a string and "splits" it up based on a given delimiter. By default, .split() will split the string based on a space, but we can pass in the underscore, _, in our example. To get access to the string methods, we need to use the .str accessor. .str is a special type of attribute that Pandas calls an "accessor" because it can "access" string methods (see Chapter 11 for more on strings). It gives us access to the Python string methods and allows us to work across the entire column. This will be the key to parsing out the multiple bits of information stored in each value.
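On a single plain Python string, .split() behaves like this:

```python
# .split() on a plain Python string: the default delimiter is
# whitespace, but we can pass any delimiter, such as an underscore
column_name = "Cases_Guinea"
print(column_name.split("_"))  # ['Cases', 'Guinea']
print("hello world".split())   # ['hello', 'world']
```

The .str accessor lets us apply this same method to every value in a Series at once, instead of looping over the strings one at a time.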
We can use the .str accessor to make a call to the .split() method and pass in the underscore, _.
# get the variable column
# access the string methods
# and split the column based on a delimiter
variable_split = ebola_long.variable.str.split('_')
print(variable_split[:5])
0 [Cases, Guinea]
1 [Cases, Guinea]
2 [Cases, Guinea]
3 [Cases, Guinea]
4 [Cases, Guinea]
Name: variable, dtype: object
After we split on the underscore, the values are returned in a list. We can tell it’s a list by:
Knowing about the .split() method on base Python string objects4
Visually seeing the square brackets, [ ], in the output
Getting the type() of one of the items in the Series
4. String .split() documentation: https://docs.python.org/3/library/stdtypes.html#str.split
# the entire container
print(type(variable_split))
<class 'pandas.core.series.Series'>
# the first element in the container
print(type(variable_split[0]))
<class 'list'>
Now that the column has been split into various pieces, the next step is to assign those pieces to new columns. First, however, we need to extract all the 0-index elements for the status column and the 1-index elements for the country column. To do so, we need to access the string methods again, and then use the .get() method to "get" the index we want for each row.
status_values = variable_split.str.get(0)
country_values = variable_split.str.get(1)
print(status_values)
0 Cases
1 Cases
2 Cases
3 Cases
4 Cases
...
1947 Deaths
1948 Deaths
1949 Deaths
1950 Deaths
1951 Deaths
Name: variable, Length: 1952, dtype: object
Now that we have the vectors we want, we can add them to our dataframe.
ebola_long['status'] = status_values
ebola_long['country'] = country_values
print(ebola_long)
Date Day variable value status country
0 1/5/2015 289 Cases_Guinea 2776.0 Cases Guinea
1 1/4/2015 288 Cases_Guinea 2775.0 Cases Guinea
2 1/3/2015 287 Cases_Guinea 2769.0 Cases Guinea
3 1/2/2015 286 Cases_Guinea NaN Cases Guinea
4 12/31/2014 284 Cases_Guinea 2730.0 Cases Guinea
... ... ... ... ... ... ...
1947 3/27/2014 5 Deaths_Mali NaN Deaths Mali
1948 3/26/2014 4 Deaths_Mali NaN Deaths Mali
1949 3/25/2014 3 Deaths_Mali NaN Deaths Mali
1950 3/24/2014 2 Deaths_Mali NaN Deaths Mali
1951 3/22/2014 0 Deaths_Mali NaN Deaths Mali
[1952 rows x 6 columns]
We can actually do the above steps in a single step. If we look at the .str.split() method documentation (you can find this by going to the Pandas API documentation > Series > String handling (.str) > .split() method5), there is a parameter named expand that defaults to False. When we set it to True, it returns a DataFrame where each result of the split is in a separate column, instead of a Series of list containers.
5. Series.str.split() method documentation: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html#pandas.Series.str.split
# reset our ebola_long data
ebola_long = ebola.melt(id_vars=['Date', 'Day'])
# split the column by _ into a dataframe using expand
variable_split = ebola_long.variable.str.split('_', expand=True)
print(variable_split)
0 1
0 Cases Guinea
1 Cases Guinea
2 Cases Guinea
3 Cases Guinea
4 Cases Guinea
... ... ...
1947 Deaths Mali
1948 Deaths Mali
1949 Deaths Mali
1950 Deaths Mali
1951 Deaths Mali
[1952 rows x 2 columns]
From here, we can actually use the Python and Pandas multiple assignment feature (Appendix Q) to directly assign the newly split columns into the original DataFrame. Since our variable_split output is a DataFrame with two columns, we can assign two new columns to our ebola_long DataFrame.
ebola_long[['status', 'country']] = variable_split
print(ebola_long)
Date Day variable value status country
0 1/5/2015 289 Cases_Guinea 2776.0 Cases Guinea
1 1/4/2015 288 Cases_Guinea 2775.0 Cases Guinea
2 1/3/2015 287 Cases_Guinea 2769.0 Cases Guinea
3 1/2/2015 286 Cases_Guinea NaN Cases Guinea
4 12/31/2014 284 Cases_Guinea 2730.0 Cases Guinea
... ... ... ... ... ... ...
1947 3/27/2014 5 Deaths_Mali NaN Deaths Mali
1948 3/26/2014 4 Deaths_Mali NaN Deaths Mali
1949 3/25/2014 3 Deaths_Mali NaN Deaths Mali
1950 3/24/2014 2 Deaths_Mali NaN Deaths Mali
1951 3/22/2014 0 Deaths_Mali NaN Deaths Mali
[1952 rows x 6 columns]
You can also opt to do this with a concatenation (pd.concat()) function call (Chapter 6).
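Here is a minimal sketch of that pd.concat() alternative, using a small made-up frame with the same shape as ebola_long (the values are invented for this illustration): name the split columns first, then concatenate column-wise.

```python
import pandas as pd

# a small made-up frame in the same shape as ebola_long
ebola_long = pd.DataFrame(
    {
        "Date": ["1/5/2015", "1/5/2015"],
        "Day": [289, 289],
        "variable": ["Cases_Guinea", "Deaths_Guinea"],
        "value": [2776.0, 1786.0],
    }
)

# split into a two-column DataFrame and give the columns names
variable_split = ebola_long["variable"].str.split("_", expand=True)
variable_split.columns = ["status", "country"]

# column-wise concatenation (axis="columns") lines up on the index
ebola_parsed = pd.concat([ebola_long, variable_split], axis="columns")
print(ebola_parsed)
```

Because pd.concat() aligns on the row index, this produces the same result as the multiple-assignment approach shown above.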
At times, data will be formatted so that variables are in both rows and columns, that is, in some combination of the formats described in the previous sections of this chapter. Most of the methods needed to tidy up such data have already been presented (.melt() and some string parsing with the .str accessor attribute). What is left to show is what happens when a column of data actually holds two variables instead of one. In this case, we will have to "pivot" the variable into separate columns, i.e., go from long data to wide data.
weather = pd.read_csv('data/weather.csv')
print(weather.iloc[:5, :11])
id year month element d1 d2 d3 d4 d5 d6 d7
0 MX17004 2010 1 tmax NaN NaN NaN NaN NaN NaN NaN
1 MX17004 2010 1 tmin NaN NaN NaN NaN NaN NaN NaN
2 MX17004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN NaN
3 MX17004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN NaN
4 MX17004 2010 3 tmax NaN NaN NaN NaN 32.1 NaN NaN
The weather data include minimum (tmin) and maximum (tmax) temperatures recorded for each day (d1, d2, ..., d31) of the month (month). The element column contains variables that need to be pivoted wider to become new columns, and the day columns need to be melted into row values.
Again, there is nothing wrong with the data in the current format. It is simply not in a shape amenable to analysis, although this kind of formatting can be helpful when presenting data in reports. Let's first fix the day values.
weather_melt = weather.melt(
id_vars=["id", "year", "month", "element"],
var_name="day",
value_name="temp",
)
print(weather_melt)
id year month element day temp
0 MX17004 2010 1 tmax d1 NaN
1 MX17004 2010 1 tmin d1 NaN
2 MX17004 2010 2 tmax d1 NaN
3 MX17004 2010 2 tmin d1 NaN
4 MX17004 2010 3 tmax d1 NaN
.. ... ... ... ... ... ...
677 MX17004 2010 10 tmin d31 NaN
678 MX17004 2010 11 tmax d31 NaN
679 MX17004 2010 11 tmin d31 NaN
680 MX17004 2010 12 tmax d31 NaN
681 MX17004 2010 12 tmin d31 NaN
[682 rows x 6 columns]
Next, we need to pivot up the variables stored in the element column.
weather_tidy = weather_melt.pivot_table(
index=['id', 'year', 'month', 'day'],
columns='element',
values='temp'
)
print(weather_tidy)
element tmax tmin
id year month day
MX17004 2010 1 d30 27.8 14.5
2 d11 29.7 13.4
d2 27.3 14.4
d23 29.9 10.7
d3 24.1 14.4
... ... ...
11 d27 27.7 14.2
d26 28.1 12.1
d4 27.2 12.0
12 d1 29.9 13.8
d6 27.8 10.5
[33 rows x 2 columns]
Looking at the pivoted table, we notice that each value in the element column is now a separate column. We can leave the table in this state, but we can also flatten the hierarchical columns.
weather_tidy_flat = weather_tidy.reset_index()
print(weather_tidy_flat)
element id year month day tmax tmin
0 MX17004 2010 1 d30 27.8 14.5
1 MX17004 2010 2 d11 29.7 13.4
2 MX17004 2010 2 d2 27.3 14.4
3 MX17004 2010 2 d23 29.9 10.7
4 MX17004 2010 2 d3 24.1 14.4
.. ... ... ... ... ... ...
28 MX17004 2010 11 d27 27.7 14.2
29 MX17004 2010 11 d26 28.1 12.1
30 MX17004 2010 11 d4 27.2 12.0
31 MX17004 2010 12 d1 29.9 13.8
32 MX17004 2010 12 d6 27.8 10.5
[33 rows x 6 columns]
Alternatively, we can chain these methods together and skip the intermediate dataframe:
weather_tidy = (
weather_melt
.pivot_table(
index=['id', 'year', 'month', 'day'],
columns='element',
values='temp')
.reset_index()
)
print(weather_tidy)
element id year month day tmax tmin
0 MX17004 2010 1 d30 27.8 14.5
1 MX17004 2010 2 d11 29.7 13.4
2 MX17004 2010 2 d2 27.3 14.4
3 MX17004 2010 2 d23 29.9 10.7
4 MX17004 2010 2 d3 24.1 14.4
.. ... ... ... ... ... ...
28 MX17004 2010 11 d27 27.7 14.2
29 MX17004 2010 11 d26 28.1 12.1
30 MX17004 2010 11 d4 27.2 12.0
31 MX17004 2010 12 d1 29.9 13.8
32 MX17004 2010 12 d6 27.8 10.5
[33 rows x 6 columns]
This chapter explored how we can reshape data into a format that is conducive to data analysis, visualization, and collection. We applied the concepts in Hadley Wickham’s “Tidy Data” paper to show the various functions and methods to reshape our data. This is an important skill because some functions need data to be organized into a certain shape, tidy or not, to work. Knowing how to reshape your data is an important skill for both the data scientist and the analyst.