How to do it...

  1. Read in the altered movie dataset, and output the first five rows:
>>> movie = pd.read_csv('data/movie_altered.csv')
>>> movie.head()
  2. This dataset contains information on the movie itself, the director, and the actors. These three entities can be considered observational units. Before we start, let's use the insert method to create a column that uniquely identifies each movie:
>>> movie.insert(0, 'id', np.arange(len(movie)))
>>> movie.head()
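If it helps to see insert and np.arange in isolation, here is a minimal sketch on a hypothetical three-row frame (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the movie dataset (hypothetical data)
df = pd.DataFrame({'title': ['A', 'B', 'C'], 'year': [2000, 2001, 2002]})

# insert adds the column in place at position 0; np.arange supplies
# one unique integer per row, giving each movie a surrogate key
df.insert(0, 'id', np.arange(len(df)))
print(df.columns.tolist())  # ['id', 'title', 'year']
```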
  3. Let's attempt to tidy this dataset with the wide_to_long function to put all the actors in one column and their corresponding Facebook likes in another, and do the same for the director, even though there is only one per movie:
>>> stubnames = ['director', 'director_fb_likes',
                 'actor', 'actor_fb_likes']
>>> movie_long = pd.wide_to_long(movie,
                                 stubnames=stubnames,
                                 i='id',
                                 j='num',
                                 sep='_').reset_index()

>>> movie_long['num'] = movie_long['num'].astype(int)
>>> movie_long.head(9)
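wide_to_long can be opaque on first use, so here is a minimal sketch on a hypothetical two-movie frame showing how the suffixed stub columns fold into long form (the names and like counts are invented):

```python
import pandas as pd

# Hypothetical wide frame: numeric suffixes _1 and _2 follow each stub name
wide = pd.DataFrame({
    'id': [0, 1],
    'actor_1': ['Tom', 'Meryl'],
    'actor_2': ['Brad', 'Emma'],
    'actor_fb_likes_1': [1000, 2000],
    'actor_fb_likes_2': [500, 800],
})

# Each stub (e.g. 'actor') collects its suffixed columns into one long
# column; the numeric suffix becomes the new 'num' column
long_df = pd.wide_to_long(wide, stubnames=['actor', 'actor_fb_likes'],
                          i='id', j='num', sep='_').reset_index()
long_df['num'] = long_df['num'].astype(int)
print(long_df.sort_values(['id', 'num']))
```

Note that `sep='_'` plus the default numeric suffix is what keeps the stub `actor` from also swallowing the `actor_fb_likes_1` column.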
  4. The dataset is now ready to be split into multiple smaller tables:
>>> movie_table = movie_long[['id', 'year', 'duration', 'rating']]
>>> director_table = movie_long[['id', 'num',
                                 'director', 'director_fb_likes']]
>>> actor_table = movie_long[['id', 'num',
                              'actor', 'actor_fb_likes']]
   
  5. There are still several issues with these tables. The movie table duplicates each movie three times, the director table has two missing rows for each ID, and a few movies have missing values for some of the actors. Let's take care of these issues:
>>> movie_table = movie_table.drop_duplicates() \
                             .reset_index(drop=True)
>>> director_table = director_table.dropna() \
                                   .reset_index(drop=True)
>>> actor_table = actor_table.dropna() \
                             .reset_index(drop=True)
   
  6. Now that we have separated the observational units into their own tables, let's compare the memory usage of the original dataset with that of these three tables:
>>> movie.memory_usage(deep=True).sum()
2318234

>>> movie_table.memory_usage(deep=True).sum() + \
    director_table.memory_usage(deep=True).sum() + \
    actor_table.memory_usage(deep=True).sum()
2627306
  7. Our new, tidier data actually takes up a little more memory. This is to be expected, as all the data in the original columns is simply spread out into the new tables. The new tables also each have an index, and two of them have an extra num column, which accounts for the extra memory. We can, however, take advantage of the fact that the count of Facebook likes is independent of the movie: each actor and director has exactly one count of Facebook likes across all movies. Before we can do this, we need to create another table mapping each movie to each actor/director. Let's first create id columns specific to the actor and director tables, uniquely identifying each actor/director:
>>> director_cat = pd.Categorical(director_table['director'])
>>> director_table.insert(1, 'director_id', director_cat.codes)

>>> actor_cat = pd.Categorical(actor_table['actor'])
>>> actor_table.insert(1, 'actor_id', actor_cat.codes)
  
  8. We can use these tables to form our associative (intermediate) tables and unique actor/director tables. Let's first do this with the director table:
>>> director_associative = director_table[['id', 'director_id', 'num']]
>>> dcols = ['director_id', 'director', 'director_fb_likes']
>>> director_unique = director_table[dcols].drop_duplicates() \
                                           .reset_index(drop=True)
    
  9. Let's do the same thing with the actor table:
>>> actor_associative = actor_table[['id', 'actor_id', 'num']]
>>> acols = ['actor_id', 'actor', 'actor_fb_likes']
>>> actor_unique = actor_table[acols].drop_duplicates() \
                                     .reset_index(drop=True)
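To check that this split loses no information, the associative table can be merged back to the unique table to recover the long actor table. A sketch with hypothetical miniature tables (the ids, names, and like counts are invented):

```python
import pandas as pd

# Hypothetical miniature versions of the recipe's actor tables
actor_unique = pd.DataFrame({
    'actor_id': [0, 1],
    'actor': ['Emma', 'Tom'],
    'actor_fb_likes': [800, 1000],
})
actor_associative = pd.DataFrame({
    'id': [0, 0, 1],        # movie ids
    'actor_id': [1, 0, 1],  # which actor appears in which movie
    'num': [1, 2, 1],       # actor slot within the movie
})

# Joining the associative table back to the unique table reconstructs
# the long actor table, so the normalization is lossless
actor_rebuilt = actor_associative.merge(actor_unique, on='actor_id')
print(actor_rebuilt[['id', 'num', 'actor', 'actor_fb_likes']])
```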
   
  10. Let's find out how much memory our new tables consume:
>>> movie_table.memory_usage(deep=True).sum() + \
    director_associative.memory_usage(deep=True).sum() + \
    director_unique.memory_usage(deep=True).sum() + \
    actor_associative.memory_usage(deep=True).sum() + \
    actor_unique.memory_usage(deep=True).sum()
1833402
  11. Now that we have normalized our tables, we can build an entity-relationship diagram showing all the tables (entities), columns, and relationships. This diagram was created with the easy-to-use ERDPlus (https://erdplus.com):