7 Data Normalization

The final point in the original “Tidy Data” paper states that for data to be tidy, “… each type of observational unit forms a table.” In practice, we usually need to combine multiple data sets together before we can do an analysis (Chapter 6). But when we think about how to store and manage data in a way that reduces duplication and the potential for errors, we should normalize our data into separate tables, so that a single fix propagates everywhere when the data are combined again.

Learning Objectives

  • Identify the differences between tidy data and data normalization

  • Apply data subsetting to split data into normalized parts

7.1 Multiple Observational Units in a Table (Normalization)

One of the simplest ways of knowing whether multiple observational units are represented in a table is by looking at each of the rows and taking note of any cells or values that are being repeated from row to row. This is very common in government education administration data, where student demographics are reported for each student for each year the student is enrolled, and in other data sets that track a value over time.

Let’s look again at the Billboard data we cleaned in Section 4.1.2.

import pandas as pd

billboard = pd.read_csv('data/billboard.csv')

# pivot the wk1-wk76 columns into a single pair of
# week/rating columns, one row per song per week
billboard_long = billboard.melt(
    id_vars=["year", "artist", "track", "time", "date.entered"],
    var_name="week",
    value_name="rating",
)

print(billboard_long)

       year            artist                    track  time  
0      2000             2 Pac  Baby Don't Cry (Keep...  4:22
1      2000           2Ge+her  The Hardest Part Of ...  3:15
2      2000      3 Doors Down               Kryptonite  3:53
3      2000      3 Doors Down                    Loser  4:24
4      2000          504 Boyz            Wobble Wobble  3:35
...     ...               ...                      ...  ...
24087  2000       Yankee Grey     Another Nine Minutes  3:10
24088  2000  Yearwood, Trisha          Real Live Woman  3:55
24089  2000   Ying Yang Twins  Whistle While You Tw...  4:19
24090  2000     Zombie Nation            Kernkraft 400  3:30
24091  2000   matchbox twenty                     Bent  4:12

      date.entered  week  rating
0       2000-02-26   wk1    87.0
1       2000-09-02   wk1    91.0
2       2000-04-08   wk1    81.0
3       2000-10-21   wk1    76.0
4       2000-04-15   wk1    57.0
...            ...   ...     ...
24087   2000-04-29  wk76     NaN
24088   2000-04-01  wk76     NaN
24089   2000-03-18  wk76     NaN
24090   2000-09-02  wk76     NaN
24091   2000-04-29  wk76     NaN

[24092 rows x 7 columns]

Suppose we subset the data based on a particular track:

print(billboard_long.loc[billboard_long.track == 'Loser'])
       year        artist  track  time date.entered  week  rating
3      2000  3 Doors Down  Loser  4:24   2000-10-21   wk1    76.0
320    2000  3 Doors Down  Loser  4:24   2000-10-21   wk2    76.0
637    2000  3 Doors Down  Loser  4:24   2000-10-21   wk3    72.0
954    2000  3 Doors Down  Loser  4:24   2000-10-21   wk4    69.0
1271   2000  3 Doors Down  Loser  4:24   2000-10-21   wk5    67.0
...     ...           ...  ...     ...          ...   ...     ...
22510  2000  3 Doors Down  Loser  4:24   2000-10-21   wk72    NaN
22827  2000  3 Doors Down  Loser  4:24   2000-10-21   wk73    NaN
23144  2000  3 Doors Down  Loser  4:24   2000-10-21   wk74    NaN
23461  2000  3 Doors Down  Loser  4:24   2000-10-21   wk75    NaN
23778  2000  3 Doors Down  Loser  4:24   2000-10-21   wk76    NaN

[76 rows x 7 columns]
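
Every one of these 76 rows repeats the same year, artist, track, and time values; only the week and rating change. One quick way to see how much of the full data set consists of this kind of repetition is to count the rows whose song information duplicates an earlier row:

# count rows that repeat the song information of an earlier row
print(
    billboard_long[["year", "artist", "track", "time"]]
    .duplicated()
    .sum()
)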

We can see that this table actually holds two types of data: the track information and the weekly ranking. It would be better to store the track information in a separate table. This way, the information stored in the year, artist, track, and time columns would not be repeated in the data set. This consideration is particularly important if the data is manually entered. Repeating the same values over and over during data entry increases the risk of inconsistent data.

We can place the year, artist, track, and time in a new dataframe, with each unique set of values assigned a unique ID. We can then use this unique ID in a second dataframe that holds the song ID, the date entered, the week number, and the rating. This entire process can be thought of as reversing the concatenation and merging steps described in Chapter 6.

billboard_songs = billboard_long[
    ["year", "artist", "track", "time"]
]
print(billboard_songs.shape)
(24092, 4)

We know there are duplicate entries in this dataframe, so we need to drop the duplicate rows.

billboard_songs = billboard_songs.drop_duplicates()
print(billboard_songs.shape)
(317, 4)

We can then assign a unique value to each row of data. There are many ways you could do this; here we take the index value and add 1 so the IDs don’t start at 0.

billboard_songs['id'] = billboard_songs.index + 1
print(billboard_songs)
     year            artist                    track  time   id
0    2000             2 Pac  Baby Don't Cry (Keep...  4:22    1
1    2000           2Ge+her  The Hardest Part Of ...  3:15    2
2    2000      3 Doors Down               Kryptonite  3:53    3
3    2000      3 Doors Down                    Loser  4:24    4
4    2000          504 Boyz            Wobble Wobble  3:35    5
..    ...               ...                      ...   ...  ...
312  2000       Yankee Grey     Another Nine Minutes  3:10  313
313  2000  Yearwood, Trisha          Real Live Woman  3:55  314
314  2000   Ying Yang Twins  Whistle While You Tw...  4:19  315
315  2000     Zombie Nation            Kernkraft 400  3:30  316
316  2000   matchbox twenty                     Bent  4:12  317

[317 rows x 5 columns]
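
Note that adding 1 to the index only works here because, after dropping duplicates, the index happens to run from 0 to 316. In general, drop_duplicates() leaves gaps in the index, so a more robust alternative is to number the rows explicitly:

# number the rows 1..n without relying on the index values
billboard_songs['id'] = range(1, len(billboard_songs) + 1)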

Now that we have a separate dataframe about songs, we can use the newly created id column to match a song to its weekly ranking.

# Merge the song dataframe to the original data set
billboard_ratings = billboard_long.merge(
    billboard_songs, on=["year", "artist", "track", "time"]
)
print(billboard_ratings.shape)
(24092, 8)
print(billboard_ratings)
       year           artist                    track  time  
0      2000            2 Pac  Baby Don't Cry (Keep...  4:22
1      2000            2 Pac  Baby Don't Cry (Keep...  4:22
2      2000            2 Pac  Baby Don't Cry (Keep...  4:22
3      2000            2 Pac  Baby Don't Cry (Keep...  4:22
4      2000            2 Pac  Baby Don't Cry (Keep...  4:22
...     ...              ...                      ...   ...
24087  2000  matchbox twenty                     Bent  4:12
24088  2000  matchbox twenty                     Bent  4:12
24089  2000  matchbox twenty                     Bent  4:12
24090  2000  matchbox twenty                     Bent  4:12
24091  2000  matchbox twenty                     Bent  4:12

      date.entered  week  rating  id
0       2000-02-26   wk1  87.0     1
1       2000-02-26   wk2  82.0     1
2       2000-02-26   wk3  72.0     1
3       2000-02-26   wk4  77.0     1
4       2000-02-26   wk5  87.0     1
...            ...   ...   ...   ...
24087   2000-04-29  wk72   NaN   317
24088   2000-04-29  wk73   NaN   317
24089   2000-04-29  wk74   NaN   317
24090   2000-04-29  wk75   NaN   317
24091   2000-04-29  wk76   NaN   317

[24092 rows x 8 columns]

Finally, we subset the columns to the ones we want in our ratings dataframe.

billboard_ratings = billboard_ratings[
    ["id", "date.entered", "week", "rating"]
]
print(billboard_ratings)
        id date.entered  week  rating
0        1   2000-02-26   wk1    87.0
1        1   2000-02-26   wk2    82.0
2        1   2000-02-26   wk3    72.0
3        1   2000-02-26   wk4    77.0
4        1   2000-02-26   wk5    87.0
...    ...          ...   ...     ...
24087  317   2000-04-29  wk72     NaN
24088  317   2000-04-29  wk73     NaN
24089  317   2000-04-29  wk74     NaN
24090  317   2000-04-29  wk75     NaN
24091  317   2000-04-29  wk76     NaN

[24092 rows x 4 columns]
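
Storing the data in two tables also gives us the benefit described at the start of the chapter: if we fix a typo in a single row of billboard_songs, the correction propagates to all 76 weekly rankings of that song the next time the tables are combined. We can verify that merging the two normalized tables back together on the id column recovers the tidy data set:

# re-combine the normalized tables; each ranking row looks up
# its song information through the shared id column
billboard_tidy = billboard_ratings.merge(billboard_songs, on="id")
print(billboard_tidy.shape)
(24092, 8)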

Conclusion

This chapter explored how we can reduce the amount of duplicated information in a data set for more reliable and efficient storage. Data normalization can be thought of as the reverse of preparing data for analysis, visualization, and model fitting: before those steps, you will typically need to combine multiple normalized data sets back into a single tidy data set.
