The final point in the original “Tidy Data” paper stated that for data to be tidy, “… each type of observational unit forms a table.” In practice, we usually need to combine multiple data sets to do an analysis (Chapter 6). But when we think about how to store and manage data in a way that reduces duplication and the potential for errors, we should try to normalize our data into separate tables, so that a single fix propagates everywhere when the tables are combined again.
Identify the differences between tidy data and data normalization
Apply data subsetting to split data into normalized parts
One of the simplest ways of knowing whether multiple observational units are represented in a table is by looking at each of the rows and taking note of any cells or values that are being repeated from row to row. This is very common in government education administration data, where student demographics are reported for each student for each year the student is enrolled, and in other data sets that track a value over time.
Let’s look again at the Billboard data we cleaned in Section 4.1.2.
import pandas as pd
billboard = pd.read_csv('data/billboard.csv')
billboard_long = billboard.melt(
    id_vars=["year", "artist", "track", "time", "date.entered"],
    var_name="week",
    value_name="rating",
)
print(billboard_long)
year artist track time
0 2000 2 Pac Baby Don't Cry (Keep... 4:22
1 2000 2Ge+her The Hardest Part Of ... 3:15
2 2000 3 Doors Down Kryptonite 3:53
3 2000 3 Doors Down Loser 4:24
4 2000 504 Boyz Wobble Wobble 3:35
... ... ... ... ...
24087 2000 Yankee Grey Another Nine Minutes 3:10
24088 2000 Yearwood, Trisha Real Live Woman 3:55
24089 2000 Ying Yang Twins Whistle While You Tw... 4:19
24090 2000 Zombie Nation Kernkraft 400 3:30
24091 2000 matchbox twenty Bent 4:12
date.entered week rating
0 2000-02-26 wk1 87.0
1 2000-09-02 wk1 91.0
2 2000-04-08 wk1 81.0
3 2000-10-21 wk1 76.0
4 2000-04-15 wk1 57.0
... ... ... ...
24087 2000-04-29 wk76 NaN
24088 2000-04-01 wk76 NaN
24089 2000-03-18 wk76 NaN
24090 2000-09-02 wk76 NaN
24091 2000-04-29 wk76 NaN
[24092 rows x 7 columns]
Suppose we subset the data based on a particular track:
print(billboard_long.loc[billboard_long.track == 'Loser'])
year artist track time date.entered week rating
3 2000 3 Doors Down Loser 4:24 2000-10-21 wk1 76.0
320 2000 3 Doors Down Loser 4:24 2000-10-21 wk2 76.0
637 2000 3 Doors Down Loser 4:24 2000-10-21 wk3 72.0
954 2000 3 Doors Down Loser 4:24 2000-10-21 wk4 69.0
1271 2000 3 Doors Down Loser 4:24 2000-10-21 wk5 67.0
... ... ... ... ... ... ... ...
22510 2000 3 Doors Down Loser 4:24 2000-10-21 wk72 NaN
22827 2000 3 Doors Down Loser 4:24 2000-10-21 wk73 NaN
23144 2000 3 Doors Down Loser 4:24 2000-10-21 wk74 NaN
23461 2000 3 Doors Down Loser 4:24 2000-10-21 wk75 NaN
23778 2000 3 Doors Down Loser 4:24 2000-10-21 wk76 NaN
[76 rows x 7 columns]
We can see that this table actually holds two types of data: the track information and the weekly ranking. It would be better to store the track information in a separate table. This way, the information stored in the year, artist, track, and time columns would not be repeated in the data set. This consideration is particularly important if the data is entered manually: repeating the same values over and over during data entry increases the risk of inconsistent data.
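One quick way to see how much redundancy this repetition creates is to count duplicated rows in the candidate song columns with .duplicated(). The following is a sketch on a small stand-in dataframe (the values are illustrative, not the full Billboard data):

```python
import pandas as pd

# Stand-in for the long Billboard data: the song columns repeat
# once for every week of chart data.
long_df = pd.DataFrame({
    "artist": ["2 Pac", "2 Pac", "2 Pac", "3 Doors Down", "3 Doors Down"],
    "track": ["Baby Don't Cry", "Baby Don't Cry", "Baby Don't Cry",
              "Loser", "Loser"],
    "week": ["wk1", "wk2", "wk3", "wk1", "wk2"],
    "rating": [87.0, 82.0, 72.0, 76.0, 76.0],
})

song_cols = ["artist", "track"]

# .duplicated() marks every repeat of a row after its first occurrence,
# so the sum is the number of redundant copies of the song information.
n_redundant = long_df[song_cols].duplicated().sum()
print(n_redundant)  # 3 redundant rows out of 5
```

Every row flagged here is a row where a manual data-entry mistake could silently introduce an inconsistent spelling of the same song.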
We can place the year, artist, track, and time columns in a new dataframe, with each unique set of values assigned a unique ID. We can then use this unique ID in a second dataframe that stores the song ID, date entered, week number, and ranking. This entire process can be thought of as reversing the steps of concatenating and merging data described in Chapter 6.
billboard_songs = billboard_long[
    ["year", "artist", "track", "time"]
]
print(billboard_songs.shape)
(24092, 4)
We know there are duplicate entries in this dataframe, so we need to drop the duplicate rows.
billboard_songs = billboard_songs.drop_duplicates()
print(billboard_songs.shape)
(317, 4)
We can then assign a unique value to each row of data. There are many ways you could do this; here we take the index value and add 1 so the IDs don’t start at 0.
billboard_songs['id'] = billboard_songs.index + 1
print(billboard_songs)
year artist track time id
0 2000 2 Pac Baby Don't Cry (Keep... 4:22 1
1 2000 2Ge+her The Hardest Part Of ... 3:15 2
2 2000 3 Doors Down Kryptonite 3:53 3
3 2000 3 Doors Down Loser 4:24 4
4 2000 504 Boyz Wobble Wobble 3:35 5
.. ... ... ... ... ...
312 2000 Yankee Grey Another Nine Minutes 3:10 313
313 2000 Yearwood, Trisha Real Live Woman 3:55 314
314 2000 Ying Yang Twins Whistle While You Tw... 4:19 315
315 2000 Zombie Nation Kernkraft 400 3:30 316
316 2000 matchbox twenty Bent 4:12 317
[317 rows x 5 columns]
Now that we have a separate dataframe about songs, we can use the newly created id column to match a song to its weekly ranking.
# Merge the song dataframe to the original data set
billboard_ratings = billboard_long.merge(
    billboard_songs, on=["year", "artist", "track", "time"]
)
print(billboard_ratings.shape)
(24092, 8)
print(billboard_ratings)
year artist track time
0 2000 2 Pac Baby Don't Cry (Keep... 4:22
1 2000 2 Pac Baby Don't Cry (Keep... 4:22
2 2000 2 Pac Baby Don't Cry (Keep... 4:22
3 2000 2 Pac Baby Don't Cry (Keep... 4:22
4 2000 2 Pac Baby Don't Cry (Keep... 4:22
... ... ... ... ...
24087 2000 matchbox twenty Bent 4:12
24088 2000 matchbox twenty Bent 4:12
24089 2000 matchbox twenty Bent 4:12
24090 2000 matchbox twenty Bent 4:12
24091 2000 matchbox twenty Bent 4:12
date.entered week rating id
0 2000-02-26 wk1 87.0 1
1 2000-02-26 wk2 82.0 1
2 2000-02-26 wk3 72.0 1
3 2000-02-26 wk4 77.0 1
4 2000-02-26 wk5 87.0 1
... ... ... ... ...
24087 2000-04-29 wk72 NaN 317
24088 2000-04-29 wk73 NaN 317
24089 2000-04-29 wk74 NaN 317
24090 2000-04-29 wk75 NaN 317
24091 2000-04-29 wk76 NaN 317
[24092 rows x 8 columns]
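A merge like this should leave the row count unchanged: each rating row should pick up exactly one song ID. pandas can enforce that for you via merge's validate argument, which raises an error if the key columns on the "one" side contain duplicates. A sketch on small stand-in frames (the values are illustrative):

```python
import pandas as pd

# Stand-in normalized song table and long ratings table.
songs = pd.DataFrame({
    "artist": ["2 Pac", "3 Doors Down"],
    "track": ["Baby Don't Cry", "Kryptonite"],
    "id": [1, 2],
})
ratings_long = pd.DataFrame({
    "artist": ["2 Pac", "2 Pac", "3 Doors Down"],
    "track": ["Baby Don't Cry", "Baby Don't Cry", "Kryptonite"],
    "week": ["wk1", "wk2", "wk1"],
    "rating": [87.0, 82.0, 81.0],
})

# validate="many_to_one" raises a MergeError if the song table has
# duplicate keys, which would silently inflate the row count.
merged = ratings_long.merge(songs, on=["artist", "track"],
                            validate="many_to_one")
assert len(merged) == len(ratings_long)  # row count unchanged
print(merged[["id", "week", "rating"]])
```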
Finally, we subset the columns to the ones we want in our ratings dataframe.
billboard_ratings = billboard_ratings[
["id", "date.entered", "week", "rating"]
]
print(billboard_ratings)
id date.entered week rating
0 1 2000-02-26 wk1 87.0
1 1 2000-02-26 wk2 82.0
2 1 2000-02-26 wk3 72.0
3 1 2000-02-26 wk4 77.0
4 1 2000-02-26 wk5 87.0
... ... ... ... ...
24087 317 2000-04-29 wk72 NaN
24088 317 2000-04-29 wk73 NaN
24089 317 2000-04-29 wk74 NaN
24090 317 2000-04-29 wk75 NaN
24091 317 2000-04-29 wk76 NaN
[24092 rows x 4 columns]
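As a sanity check on the whole process, merging the two normalized tables back together on id should reproduce a table with the same number of rows as the original long data. Sketched here with small stand-in frames rather than the full Billboard tables:

```python
import pandas as pd

# Stand-in normalized tables (illustrative names and values).
songs = pd.DataFrame({
    "id": [1, 2],
    "artist": ["2 Pac", "3 Doors Down"],
    "track": ["Baby Don't Cry", "Kryptonite"],
})
ratings = pd.DataFrame({
    "id": [1, 1, 2],
    "week": ["wk1", "wk2", "wk1"],
    "rating": [87.0, 82.0, 81.0],
})

# Merging on the shared id column rebuilds the denormalized long table;
# every rating row picks up its song's information exactly once.
long_again = ratings.merge(songs, on="id")
print(long_again.shape)  # (3, 5)
```

No information was lost by the normalization; the duplication was simply moved out of storage and into a merge we can perform on demand.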
This chapter explored how we can reduce the amount of duplicated information in a data set for more efficient storage. Data normalization can be thought of as the reverse of preparing data for analysis, visualization, and model fitting: typically you will need to combine multiple normalized data sets back into a single tidy data set before you can work with them.