As mentioned in Chapter 4, Hadley Wickham,1 one of the more prominent members of the R community, introduced the concept of tidy data in a paper in the Journal of Statistical Software.2 Tidy data is a framework to structure data sets so they can be easily analyzed and visualized. It can be thought of as a goal one should aim for when cleaning data. Once you understand what tidy data is, that knowledge will make your data analysis, visualization, and collection much easier.
1. Hadley Wickham: http://hadley.nz/
2. Tidy data paper: http://vita.had.co.nz/papers/tidy-data.pdf
What is tidy data? Hadley Wickham’s paper defines it as meeting the following criteria:
■ Each row is an observation.
■ Each column is a variable.
■ Each type of observational unit forms a table.
This chapter goes through the various ways to tidy data as identified in Wickham’s paper.
Prior knowledge:
a. function and method calls
b. subsetting data
c. loops
d. list comprehension
This chapter:
■ Reshaping data
a. unpivot/melt/gather
b. pivot/cast/spread
c. combining
1. globbing
2. concatenation
This chapter will cover:
1. Unpivoting/melting/gathering columns into rows
2. Pivoting/casting/spreading rows into columns
3. Normalizing data by separating a dataframe into multiple tables
4. Assembling data from multiple parts
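To make the tidy data criteria concrete, here is a minimal sketch (using made-up religion/income counts in the spirit of the Pew data used below) of an untidy table and its tidy counterpart:

```python
import pandas as pd

# a small table where income brackets are spread across columns,
# i.e., values appear as column headers instead of as a variable
untidy = pd.DataFrame({
    'religion': ['Agnostic', 'Atheist'],
    '<$10k': [27, 12],
    '$10-20k': [34, 27],
})

# tidy form: each row is one observation, each column is one variable
tidy = untidy.melt(id_vars='religion', var_name='income', value_name='count')
print(tidy)
```

The tidy frame has one row per (religion, income) observation, which is the shape the rest of this chapter works toward.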
Data can have columns that contain values instead of variables. This is usually a convenient format for data collection and presentation.
We’ll use data on income and religion in the United States from the Pew Research Center to illustrate how to work with columns that contain values, rather than variables.
import pandas as pd
pew = pd.read_csv('../data/pew.csv')
When we look at this data set, we can see that not every column is a variable. The values that relate to income are spread across multiple columns. The format shown is a great choice when presenting data in a table, but for data analytics, the table needs to be reshaped so that we have religion, income, and count variables.
# show only the first few columns
print(pew.iloc[:, 0:6])
religion <$10k $10-20k $20-30k $30-40k
0 Agnostic 27 34 60 81
1 Atheist 12 27 37 52
2 Buddhist 27 21 30 34
3 Catholic 418 617 732 670
4 Don't know/refused 15 14 15 11
5 Evangelical Prot 575 869 1064 982
6 Hindu 1 9 7 9
7 Historically Black Prot 228 244 236 238
8 Jehovah's Witness 20 27 24 24
9 Jewish 19 19 25 25
10 Mainline Prot 289 495 619 655
11 Mormon 29 40 48 51
12 Muslim 6 7 9 10
13 Orthodox 13 17 23 32
14 Other Christian 9 7 11 13
15 Other Faiths 20 33 40 46
16 Other World Religions 5 2 3 4
17 Unaffiliated 217 299 374 365
$40-50k
0 76
1 35
2 33
3 638
4 10
5 881
6 11
7 197
8 21
9 30
10 651
11 56
12 9
13 32
14 13
15 49
16 2
17 341
This view of the data is also known as “wide” data. To turn it into the “long” tidy data format, we will have to unpivot/melt/gather (depending on which statistical programming language we use) our dataframe. Pandas has a function called melt that will reshape the dataframe into a tidy format. melt takes a few parameters:
■ id_vars is a container (list, tuple, ndarray) that represents the variables that will remain as is.
■ value_vars identifies the columns you want to melt down (or unpivot). By default, it will melt all the columns not specified in the id_vars parameter.
■ var_name is a string for the new column name when the value_vars is melted down. By default, it will be called variable.
■ value_name is a string for the new column name that represents the values for the var_name. By default, it will be called value.
# we do not need to specify value_vars, since we want to melt
# all the columns except for the 'religion' column
pew_long = pd.melt(pew, id_vars='religion')
print(pew_long.head())
religion variable value
0 Agnostic <$10k 27
1 Atheist <$10k 12
2 Buddhist <$10k 27
3 Catholic <$10k 418
4 Don't know/refused <$10k 15
print(pew_long.tail())
religion variable value
175 Orthodox Don't know/refused 73
176 Other Christian Don't know/refused 18
177 Other Faiths Don't know/refused 71
178 Other World Religions Don't know/refused 8
179 Unaffiliated Don't know/refused 597
We can change the defaults so that the melted/unpivoted columns are named.
pew_long = pd.melt(pew,
id_vars='religion',
var_name='income',
value_name='count')
print(pew_long.head())
religion income count
0 Agnostic <$10k 27
1 Atheist <$10k 12
2 Buddhist <$10k 27
3 Catholic <$10k 418
4 Don't know/refused <$10k 15
print(pew_long.tail())
religion income count
175 Orthodox Don't know/refused 73
176 Other Christian Don't know/refused 18
177 Other Faiths Don't know/refused 71
178 Other World Religions Don't know/refused 8
179 Unaffiliated Don't know/refused 597
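Melting is reversible. As a sketch (using a small hand-built frame mirroring pew_long), the pivot_table method covered later in this chapter can recover the wide layout:

```python
import pandas as pd

# a miniature long-format frame mirroring pew_long above
pew_long = pd.DataFrame({
    'religion': ['Agnostic', 'Agnostic', 'Atheist', 'Atheist'],
    'income': ['<$10k', '$10-20k', '<$10k', '$10-20k'],
    'count': [27, 34, 12, 27],
})

# pivot_table reverses the melt: income values become columns again
pew_wide = pew_long.pivot_table(index='religion',
                                columns='income',
                                values='count').reset_index()
print(pew_wide)
```

This round trip is a useful sanity check that no information is lost when reshaping between wide and long formats.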
Not every data set will have one column to hold still while you unpivot the rest of the columns. As an example, consider the Billboard data set.
billboard = pd.read_csv('../data/billboard.csv')
# look at the first few rows and columns
print(billboard.iloc[0:5, 0:16])
year artist track time date.entered
0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26
1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02
2 2000 3 Doors Down Kryptonite 3:53 2000-04-08
3 2000 3 Doors Down Loser 4:24 2000-10-21
4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15
wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9 wk10 wk11
0 87 82.0 72.0 77.0 87.0 94.0 99.0 NaN NaN NaN NaN
1 91 87.0 92.0 NaN NaN NaN NaN NaN NaN NaN NaN
2 81 70.0 68.0 67.0 66.0 57.0 54.0 53.0 51.0 51.0 51.0
3 76 76.0 72.0 69.0 67.0 65.0 55.0 59.0 62.0 61.0 61.0
4 57 34.0 25.0 17.0 17.0 31.0 36.0 49.0 53.0 57.0 64.0
You can see here that each week has its own column. Again, there is nothing wrong with this form of data. It may be easy to enter the data in this form, and it is much quicker to understand what it means when the data is presented in a table. However, there may be a time when you will need to melt the data. For example, if you wanted to create a faceted plot of the weekly ratings, the facet variable would need to be a column in the dataframe.
billboard_long = pd.melt(
billboard,
id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
var_name='week',
value_name='rating')
print(billboard_long.head())
year artist track time date.entered
0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26
1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02
2 2000 3 Doors Down Kryptonite 3:53 2000-04-08
3 2000 3 Doors Down Loser 4:24 2000-10-21
4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15
week rating
0 wk1 87.0
1 wk1 91.0
2 wk1 81.0
3 wk1 76.0
4 wk1 57.0
print(billboard_long.tail())
year artist track time
24087 2000 Yankee Grey Another Nine Minutes 3:10
24088 2000 Yearwood, Trisha Real Live Woman 3:55
24089 2000 Ying Yang Twins Whistle While You Tw... 4:19
24090 2000 Zombie Nation Kernkraft 400 3:30
24091 2000 matchbox twenty Bent 4:12
date.entered week rating
24087 2000-04-29 wk76 NaN
24088 2000-04-01 wk76 NaN
24089 2000-03-18 wk76 NaN
24090 2000-09-02 wk76 NaN
24091 2000-04-29 wk76 NaN
Sometimes columns in a data set may represent multiple variables. This format is commonly seen when working with health data, for example. To illustrate this situation, let’s look at the Ebola data set.
ebola = pd.read_csv('../data/country_timeseries.csv')
print(ebola.columns)
Index(['Date', 'Day', 'Cases_Guinea', 'Cases_Liberia',
'Cases_SierraLeone', 'Cases_Nigeria', 'Cases_Senegal',
'Cases_UnitedStates', 'Cases_Spain', 'Cases_Mali',
'Deaths_Guinea', 'Deaths_Liberia', 'Deaths_SierraLeone',
'Deaths_Nigeria', 'Deaths_Senegal', 'Deaths_UnitedStates',
'Deaths_Spain', 'Deaths_Mali'],
dtype='object')
# print select rows
print(ebola.iloc[:5, [0, 1, 2, 3, 10, 11]])
Date Day Cases_Guinea Cases_Liberia Deaths_Guinea
0 1/5/2015 289 2776.0 NaN 1786.0
1 1/4/2015 288 2775.0 NaN 1781.0
2 1/3/2015 287 2769.0 8166.0 1767.0
3 1/2/2015 286 NaN 8157.0 NaN
4 12/31/2014 284 2730.0 8115.0 1739.0
Deaths_Liberia
0 NaN
1 NaN
2 3496.0
3 3496.0
4 3471.0
The column names Cases_Guinea and Deaths_Guinea actually contain two variables: the individual status (cases and deaths, respectively) and the country name, Guinea. The data is also arranged in a wide format that needs to be unpivoted.
ebola_long = pd.melt(ebola, id_vars=['Date', 'Day'])
print(ebola_long.head())
Date Day variable value
0 1/5/2015 289 Cases_Guinea 2776.0
1 1/4/2015 288 Cases_Guinea 2775.0
2 1/3/2015 287 Cases_Guinea 2769.0
3 1/2/2015 286 Cases_Guinea NaN
4 12/31/2014 284 Cases_Guinea 2730.0
print(ebola_long.tail())
Date Day variable value
1947 3/27/2014 5 Deaths_Mali NaN
1948 3/26/2014 4 Deaths_Mali NaN
1949 3/25/2014 3 Deaths_Mali NaN
1950 3/24/2014 2 Deaths_Mali NaN
1951 3/22/2014 0 Deaths_Mali NaN
Conceptually, the column of interest can be split based on the underscore, _, in the column name. The first part will be the new status column, and the second part will be the new country column. This will require some string parsing and splitting in Python (more on this in Chapter 8). In Python, a string is an object, similar to how Pandas has Series and DataFrame objects. Chapter 2 showed how Series can have methods such as mean, and DataFrames can have methods such as to_csv. Strings have methods as well. In this case we will use the split method, which takes a string and splits it up based on a given delimiter. By default, split will split the string based on a space, but we can pass in the underscore, _, in our example. To get access to the string methods, we need to use the str accessor (see Chapter 8 for more on strings). This will give us access to the Python string methods and allow us to work across the entire column.
# get the variable column
# access the string methods
# and split the column based on a delimiter
variable_split = ebola_long.variable.str.split('_')
print(variable_split[:5])
0 [Cases, Guinea]
1 [Cases, Guinea]
2 [Cases, Guinea]
3 [Cases, Guinea]
4 [Cases, Guinea]
Name: variable, dtype: object
print(variable_split[-5:])
1947 [Deaths, Mali]
1948 [Deaths, Mali]
1949 [Deaths, Mali]
1950 [Deaths, Mali]
1951 [Deaths, Mali]
Name: variable, dtype: object
After we split on the underscore, the values are returned in a list. We know it’s a list because that’s how the split method works,3 but the visual cue is that the results are surrounded by square brackets.
3. String split documentation: https://docs.python.org/3.6/library/stdtypes.html#str.split
# the entire container
print(type(variable_split))
<class 'pandas.core.series.Series'>
# the first element in the container
print(type(variable_split[0]))
<class 'list'>
Now that the column has been split into the various pieces, the next step is to assign those pieces to a new column. First, however, we need to extract all the 0-index elements for the status column and the 1-index elements for the country column. To do so, we need to access the string methods again, and then use the get method to get the index we want for each row.
status_values = variable_split.str.get(0)
country_values = variable_split.str.get(1)
print(status_values[:5])
0 Cases
1 Cases
2 Cases
3 Cases
4 Cases
Name: variable, dtype: object
print(status_values[-5:])
1947 Deaths
1948 Deaths
1949 Deaths
1950 Deaths
1951 Deaths
Name: variable, dtype: object
print(country_values[:5])
0 Guinea
1 Guinea
2 Guinea
3 Guinea
4 Guinea
Name: variable, dtype: object
print(country_values[-5:])
1947 Mali
1948 Mali
1949 Mali
1950 Mali
1951 Mali
Name: variable, dtype: object
Now that we have the vectors we want, we can add them to our dataframe.
ebola_long['status'] = status_values
ebola_long['country'] = country_values
print(ebola_long.head())
Date Day variable value status country
0 1/5/2015 289 Cases_Guinea 2776.0 Cases Guinea
1 1/4/2015 288 Cases_Guinea 2775.0 Cases Guinea
2 1/3/2015 287 Cases_Guinea 2769.0 Cases Guinea
3 1/2/2015 286 Cases_Guinea NaN Cases Guinea
4 12/31/2014 284 Cases_Guinea 2730.0 Cases Guinea
In this subsection, we’ll exploit the fact that the vector returned is in the same order as our data. We can concatenate (see Chapter 4) the new vectors with our original data.
variable_split = ebola_long.variable.str.split('_', expand=True)
variable_split.columns = ['status', 'country']
ebola_parsed = pd.concat([ebola_long, variable_split], axis=1)
print(ebola_parsed.head())
Date Day variable value status country status
0 1/5/2015 289 Cases_Guinea 2776.0 Cases Guinea Cases
1 1/4/2015 288 Cases_Guinea 2775.0 Cases Guinea Cases
2 1/3/2015 287 Cases_Guinea 2769.0 Cases Guinea Cases
3 1/2/2015 286 Cases_Guinea NaN Cases Guinea Cases
4 12/31/2014 284 Cases_Guinea 2730.0 Cases Guinea Cases
country
0 Guinea
1 Guinea
2 Guinea
3 Guinea
4 Guinea
print(ebola_parsed.tail())
Date Day variable value status country status
1947 3/27/2014 5 Deaths_Mali NaN Deaths Mali Deaths
1948 3/26/2014 4 Deaths_Mali NaN Deaths Mali Deaths
1949 3/25/2014 3 Deaths_Mali NaN Deaths Mali Deaths
1950 3/24/2014 2 Deaths_Mali NaN Deaths Mali Deaths
1951 3/22/2014 0 Deaths_Mali NaN Deaths Mali Deaths
country
1947 Mali
1948 Mali
1949 Mali
1950 Mali
1951 Mali
We can accomplish the same result in a single step by taking advantage of the fact that the split results return a list of two elements, where each element is a new column. We can combine the list of split items with the built-in zip function. zip takes a set of iterators (e.g., lists, tuples) and creates a new container that is made of the input iterators, but each new container created has the same index as the input containers. For example, if we have two lists of values,
constants = ['pi', 'e']
values = ['3.14', '2.718']
we can zip the values together:
# we have to call list on the zip function
# to show the contents of the zip object
# in Python 3, zip returns an iterator
print(list(zip(constants, values)))
[('pi', '3.14'), ('e', '2.718')]
Each element now has the constant matched with its corresponding value. Conceptually, each container is like a side of a zipper. When we zip the containers, the indices are matched up and returned.
Another way to visualize what zip is doing is taking each container passed into zip and stacking the containers on top of each other (think about the row-wise concatenation described in Section 4.3.1), thereby creating a dataframe of sorts. zip then returns the values on a column-by-column basis in a tuple.
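This stack-then-return-columns behavior can be seen in a small sketch (with made-up status/country rows shaped like the split results above):

```python
# each inner list is a "row"; zip(*rows) returns the "columns" as tuples
rows = [['Cases', 'Guinea'], ['Cases', 'Liberia'], ['Deaths', 'Mali']]
statuses, countries = zip(*rows)
print(statuses)   # ('Cases', 'Cases', 'Deaths')
print(countries)  # ('Guinea', 'Liberia', 'Mali')
```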
We can use the same ebola_long.variable.str.split('_') to split the values in the column. However, since the result is already a container (a Series object), we need to unpack it so that we have the contents of the container (each status–country list), rather than the container itself (the series).
In Python, the asterisk operator, *, is used to unpack containers.4 When we zip the unpacked containers, the effect is the same as when we created the status values and the country values earlier. We can then assign the vectors to the columns simultaneously using multiple assignment (Appendix Q).
4. Unpacking argument lists: https://docs.python.org/3/tutorial/controlflow.html#unpacking-argument-lists
ebola_long['status'], ebola_long['country'] = (
    zip(*ebola_long.variable.str.split('_')))
print(ebola_long.head())
Date Day variable value status country
0 1/5/2015 289 Cases_Guinea 2776.0 Cases Guinea
1 1/4/2015 288 Cases_Guinea 2775.0 Cases Guinea
2 1/3/2015 287 Cases_Guinea 2769.0 Cases Guinea
3 1/2/2015 286 Cases_Guinea NaN Cases Guinea
4 12/31/2014 284 Cases_Guinea 2730.0 Cases Guinea
At times data will be formatted so that variables are in both rows and columns—that is, in some combination of the formats described in previous sections of this chapter. Most of the methods needed to tidy up such data have already been presented. What is left to show is what happens if a column of data actually holds two variables instead of one variable. In this case, we will have to pivot or cast the variable into separate columns.
weather = pd.read_csv('../data/weather.csv')
print(weather.iloc[:5, :11])
id year month element d1 d2 d3 d4 d5 d6 d7
0 MX17004 2010 1 tmax NaN NaN NaN NaN NaN NaN NaN
1 MX17004 2010 1 tmin NaN NaN NaN NaN NaN NaN NaN
2 MX17004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN NaN
3 MX17004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN NaN
4 MX17004 2010 3 tmax NaN NaN NaN NaN 32.1 NaN NaN
The weather data include minimum and maximum temperatures (the tmin and tmax values in the element column, respectively) recorded for each day (d1, d2, ..., d31) of the month (month). The element column contains variables that need to be casted/pivoted to become new columns, and the day variables need to be melted into row values. Again, there is nothing wrong with the data in the current format. It is simply not in a shape amenable to analysis, although this kind of formatting can be helpful when presenting data in reports. Let’s first melt/unpivot the day values.
weather_melt = pd.melt(weather,
id_vars=['id', 'year', 'month', 'element'],
var_name='day',
value_name='temp')
print(weather_melt.head())
id year month element day temp
0 MX17004 2010 1 tmax d1 NaN
1 MX17004 2010 1 tmin d1 NaN
2 MX17004 2010 2 tmax d1 NaN
3 MX17004 2010 2 tmin d1 NaN
4 MX17004 2010 3 tmax d1 NaN
print(weather_melt.tail())
id year month element day temp
677 MX17004 2010 10 tmin d31 NaN
678 MX17004 2010 11 tmax d31 NaN
679 MX17004 2010 11 tmin d31 NaN
680 MX17004 2010 12 tmax d31 NaN
681 MX17004 2010 12 tmin d31 NaN
Next, we need to pivot up the variables stored in the element column. This process is referred to as casting or spreading in other statistical languages. One of the main differences between pivot_table and melt is that melt is a function within Pandas, whereas pivot_table is a method we call on a DataFrame object.
weather_tidy = weather_melt.pivot_table(
index=['id', 'year', 'month', 'day'],
columns='element',
values='temp')
Looking at the pivoted table, we notice that each value in the element column is now a separate column. We can leave this table in its current state, but we can also flatten the hierarchical columns.
weather_tidy_flat = weather_tidy.reset_index()
print(weather_tidy_flat.head())
element id year month day tmax tmin
0 MX17004 2010 1 d1 NaN NaN
1 MX17004 2010 1 d10 NaN NaN
2 MX17004 2010 1 d11 NaN NaN
3 MX17004 2010 1 d12 NaN NaN
4 MX17004 2010 1 d13 NaN NaN
Likewise, we can apply these methods without the intermediate dataframe:
weather_tidy = (
    weather_melt
    .pivot_table(
        index=['id', 'year', 'month', 'day'],
        columns='element',
        values='temp')
    .reset_index()
)
print(weather_tidy.head())
element id year month day tmax tmin
0 MX17004 2010 1 d1 NaN NaN
1 MX17004 2010 1 d10 NaN NaN
2 MX17004 2010 1 d11 NaN NaN
3 MX17004 2010 1 d12 NaN NaN
4 MX17004 2010 1 d13 NaN NaN
One of the simplest ways of knowing whether multiple observational units are represented in a table is by looking at each of the rows, and taking note of any cells or values that are being repeated from row to row. This is very common in government education administration data, where student demographics are reported for each student for each year the student is enrolled.
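This row-to-row repetition can also be checked programmatically. As a sketch on a small hand-built frame (shaped like the Billboard data below), the pandas duplicated method flags rows whose values have already appeared:

```python
import pandas as pd

# a hypothetical frame where track information repeats across weekly rows
df = pd.DataFrame({
    'track': ['Loser', 'Loser', 'Kryptonite'],
    'week': ['wk1', 'wk2', 'wk1'],
    'rating': [76.0, 76.0, 81.0],
})

# True where the track value has already appeared in an earlier row
print(df[['track']].duplicated())
```

Repeated True values over the identifying columns are a hint that a separate observational unit is hiding in the table.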
Let’s look again at the Billboard data we cleaned in Section 6.2.2.
print(billboard_long.head())
year artist track time date.entered
0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26
1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02
2 2000 3 Doors Down Kryptonite 3:53 2000-04-08
3 2000 3 Doors Down Loser 4:24 2000-10-21
4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15
week rating
0 wk1 87.0
1 wk1 91.0
2 wk1 81.0
3 wk1 76.0
4 wk1 57.0
Suppose we subset (Section 2.4.1) the data based on a particular track:
print(billboard_long[billboard_long.track == 'Loser'].head())
year artist track time date.entered week rating
3 2000 3 Doors Down Loser 4:24 2000-10-21 wk1 76.0
320 2000 3 Doors Down Loser 4:24 2000-10-21 wk2 76.0
637 2000 3 Doors Down Loser 4:24 2000-10-21 wk3 72.0
954 2000 3 Doors Down Loser 4:24 2000-10-21 wk4 69.0
1271 2000 3 Doors Down Loser 4:24 2000-10-21 wk5 67.0
We can see that this table actually holds two types of data: the track information and the weekly ranking. It would be better to store the track information in a separate table. This way, the information stored in the year, artist, track, and time columns would not be repeated in the data set. This consideration is particularly important if the data is manually entered. Repeating the same values over and over during data entry increases the risk of inconsistent data.
What we should do in this case is to place the year, artist, track, time, and date.entered columns in a new dataframe, with each unique set of values being assigned a unique ID. We can then use this unique ID in a second dataframe that represents a song, date, week number, and ranking. This entire process can be thought of as reversing the steps in concatenating and merging data described in Chapter 4.
billboard_songs = billboard_long[['year', 'artist', 'track', 'time']]
print(billboard_songs.shape)
(24092, 4)
We know there are duplicate entries in this dataframe, so we need to drop the duplicate rows.
billboard_songs = billboard_songs.drop_duplicates()
print(billboard_songs.shape)
(317, 4)
We can then assign a unique value to each row of data.
billboard_songs['id'] = range(len(billboard_songs))
print(billboard_songs.head(n=10))
year artist track time id
0 2000 2 Pac Baby Don't Cry (Keep... 4:22 0
1 2000 2Ge+her The Hardest Part Of ... 3:15 1
2 2000 3 Doors Down Kryptonite 3:53 2
3 2000 3 Doors Down Loser 4:24 3
4 2000 504 Boyz Wobble Wobble 3:35 4
5 2000 98^0 Give Me Just One Nig... 3:24 5
6 2000 A*Teens Dancing Queen 3:44 6
7 2000 Aaliyah I Don't Wanna 4:15 7
8 2000 Aaliyah Try Again 4:03 8
9 2000 Adams, Yolanda Open My Heart 5:30 9
Now that we have a separate dataframe about songs, we can use the newly created id column to match a song to its weekly ranking.
# Merge the song dataframe to the original data set
billboard_ratings = billboard_long.merge(
billboard_songs, on=['year', 'artist', 'track', 'time'])
print(billboard_ratings.shape)
(24092, 8)
print(billboard_ratings.head())
year artist track time date.entered week
0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 wk1
1 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 wk2
2 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 wk3
3 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 wk4
4 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 wk5
rating id
0 87.0 0
1 82.0 0
2 72.0 0
3 77.0 0
4 87.0 0
Finally, we subset the columns to the ones we want in our ratings dataframe.
billboard_ratings = billboard_ratings[
    ['id', 'date.entered', 'week', 'rating']]
print(billboard_ratings.head())
id date.entered week rating
0 0 2000-02-26 wk1 87.0
1 0 2000-02-26 wk2 82.0
2 0 2000-02-26 wk3 72.0
3 0 2000-02-26 wk4 77.0
4 0 2000-02-26 wk5 87.0
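If we want to confirm that the normalization was lossless, one sketch (using miniature versions of the songs and ratings tables, with values taken from the output above) is to merge the two tables back together on id:

```python
import pandas as pd

# miniature versions of the normalized tables built above
songs = pd.DataFrame({'id': [0, 1],
                      'artist': ['2 Pac', '2Ge+her'],
                      'track': ["Baby Don't Cry (Keep...",
                                'The Hardest Part Of ...']})
ratings = pd.DataFrame({'id': [0, 0, 1],
                        'week': ['wk1', 'wk2', 'wk1'],
                        'rating': [87.0, 82.0, 91.0]})

# merging on id reconstructs the denormalized table:
# one row per rating, with the song columns attached
combined = ratings.merge(songs, on='id')
print(combined.shape)
```

The merged frame has the same number of rows as the ratings table, showing that splitting and re-merging the tables preserves the data.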
The last bit of data tidying relates to the situation in which the same type of data is spread across multiple data sets. This issue was also covered in Chapter 4, when we discussed data concatenation and merging. One reason why data might be split across multiple files would be the size of the files. By splitting up data into various parts, each part would be smaller. This may be good when we need to share data on the Internet or via email, since many services limit the size of a file that can be opened or shared. Another reason why a data set might be split into multiple parts would be to account for the data collection process. For example, a separate data set containing stock information could be created for each day.
Since merging and concatenation have already been covered, this section will focus on techniques for quickly loading multiple data sources and assembling them together.
The Unified New York City Taxi and Uber Data is a good choice to illustrate these processes. The entire data set contains data on more than 1.3 billion taxi and Uber trips from New York City, and is organized into more than 140 files. For illustration purposes, we will work with only five of these data files. When the same data is broken into multiple parts, those parts typically have a structured naming pattern associated with them.
First let’s download the data. Do not worry too much about the details in the following block of code. The raw_data_urls.txt file contains a list of URLs, where each URL is the download link to a part of the taxi data. We begin by opening and reading the file, and iterating through each line of the file (i.e., each data URL). We download only the first 5 data sets since the files are fairly large. We use some string manipulation (Chapter 8) to create the path where the data will be saved, and use the urllib library to download our data.
import os
import urllib
# code to download the data
# download only the first 5 data sets from the list of files
with open('../data/raw_data_urls.txt', 'r') as data_urls:
    for line, url in enumerate(data_urls):
        if line == 5:
            break
        fn = url.split('/')[-1].strip()
        fp = os.path.join('..', 'data', fn)
        print(url)
        print(fp)
        urllib.request.urlretrieve(url, fp)
In this example, all of the raw taxi trips have the pattern fhv_tripdata_YYYY-XX.csv, where YYYY represents the year (e.g., 2015), and XX represents the part number. We can use a simple pattern matching function from the glob library in Python to get a list of all the filenames that match a particular pattern.
import glob
# get a list of the csv files from the nyc-taxi data folder
nyc_taxi_data = glob.glob('../data/fhv_*')
print(nyc_taxi_data)
['../data/fhv_tripdata_2015-04.csv',
'../data/fhv_tripdata_2015-05.csv',
'../data/fhv_tripdata_2015-03.csv',
'../data/fhv_tripdata_2015-01.csv',
'../data/fhv_tripdata_2015-02.csv']
Now that we have a list of filenames we want to load, we can load each file into a dataframe. We can choose to load each file individually, as we have been doing so far.
taxi1 = pd.read_csv(nyc_taxi_data[0])
taxi2 = pd.read_csv(nyc_taxi_data[1])
taxi3 = pd.read_csv(nyc_taxi_data[2])
taxi4 = pd.read_csv(nyc_taxi_data[3])
taxi5 = pd.read_csv(nyc_taxi_data[4])
We can look at our data and see how they can be nicely stacked (concatenated) on top of each other.
print(taxi1.head(n=2))
print(taxi2.head(n=2))
print(taxi3.head(n=2))
print(taxi4.head(n=2))
print(taxi5.head(n=2))
Dispatching_base_num Pickup_date locationID
0 B00001 2015-04-01 04:30:00 NaN
1 B00001 2015-04-01 06:00:00 NaN
Dispatching_base_num Pickup_date locationID
0 B00001 2015-05-01 04:30:00 NaN
1 B00001 2015-05-01 05:00:00 NaN
Dispatching_base_num Pickup_date locationID
0 B00029 2015-03-01 00:02:00 213.0
1 B00029 2015-03-01 00:03:00 51.0
Dispatching_base_num Pickup_date locationID
0 B00013 2015-01-01 00:30:00 NaN
1 B00013 2015-01-01 01:22:00 NaN
Dispatching_base_num Pickup_date locationID
0 B00013 2015-02-01 00:00:00 NaN
1 B00013 2015-02-01 00:01:00 NaN
We can concatenate them just as we did in Chapter 4.
# shape of each dataframe
print(taxi1.shape)
print(taxi2.shape)
print(taxi3.shape)
print(taxi4.shape)
print(taxi5.shape)
(3917789, 3)
(4296067, 3)
(3281427, 3)
(2746033, 3)
(3126401, 3)
# concatenate the dataframes together
taxi = pd.concat([taxi1, taxi2, taxi3, taxi4, taxi5])
# shape of final concatenated taxi data
print(taxi.shape)
(17367717, 3)
However, manually loading each dataframe into its own variable will get tedious when the data is split into many parts. As an alternative, we can automate the process using loops and list comprehensions.
An easier way to load multiple files is to first create an empty list, use a loop to iterate through each of the CSV files, load the CSV files into a Pandas dataframe, and finally append the dataframe to the list. The final type of data we want is a list of dataframes because the concat function takes a list of dataframes to concatenate.
# create an empty list to append to
list_taxi_df = []

# loop through each CSV filename
for csv_filename in nyc_taxi_data:
    # you can choose to print the filename for debugging
    # print(csv_filename)

    # load the CSV file into a dataframe
    df = pd.read_csv(csv_filename)

    # append the dataframe to the list that will hold the dataframes
    list_taxi_df.append(df)
# print the number of dataframes in the list
print(len(list_taxi_df))
5
# type of the first element
print(type(list_taxi_df[0]))
<class 'pandas.core.frame.DataFrame'>
# look at the head of the first dataframe
print(list_taxi_df[0].head())
Dispatching_base_num Pickup_date locationID
0 B00001 2015-04-01 04:30:00 NaN
1 B00001 2015-04-01 06:00:00 NaN
2 B00001 2015-04-01 06:00:00 NaN
3 B00001 2015-04-01 06:00:00 NaN
4 B00001 2015-04-01 06:15:00 NaN
Now that we have a list of dataframes, we can concatenate them.
taxi_loop_concat = pd.concat(list_taxi_df)
print(taxi_loop_concat.shape)
(17367717, 3)
# Did we get the same results as the manual load and concatenation?
print(taxi.equals(taxi_loop_concat))
True
Python has an idiom for looping through something and adding it to a list, called a list comprehension. The loop given previously, which is shown here again without the comments, can be written as a list comprehension (Appendix N).
# the loop code without comments
list_taxi_df = []

for csv_filename in nyc_taxi_data:
    df = pd.read_csv(csv_filename)
    list_taxi_df.append(df)
# same code in a list comprehension
list_taxi_df_comp = [pd.read_csv(data) for data in nyc_taxi_data]
The result of our list comprehension is a list, just as in the earlier loop example.
print(type(list_taxi_df_comp))
<class 'list'>
Finally, we can concatenate the results just as we did earlier.
taxi_loop_concat_comp = pd.concat(list_taxi_df_comp)
# are the concatenated dataframes the same?
print(taxi_loop_concat_comp.equals(taxi_loop_concat))
True
This chapter explored how we can reshape data into a format that is conducive to data analysis, visualization, and collection. We applied the concepts in Hadley Wickham’s Tidy Data paper to show the various functions and methods used to reshape our data. This is an important skill because some functions need data to be organized into a certain shape, tidy or not, to work. Knowing how to reshape your data is an important skill for both the data scientist and the analyst.