How to do it...

  1. The simplest way to create a new column is to assign it a scalar value. Place the name of the new column as a string into the indexing operator. Let's create the has_seen column in the movie dataset to indicate whether or not we have seen the movie. We will assign zero for every value. By default, new columns are appended to the end:
>>> movie = pd.read_csv('data/movie.csv')
>>> movie['has_seen'] = 0
  1. There are several columns that contain data on the number of Facebook likes. Let's add up all the actor and director Facebook likes and assign them to the actor_director_facebook_likes column:
>>> movie['actor_director_facebook_likes'] = (
...     movie['actor_1_facebook_likes'] +
...     movie['actor_2_facebook_likes'] +
...     movie['actor_3_facebook_likes'] +
...     movie['director_facebook_likes'])
  1. From the Calling Series method recipe in this chapter, we know that this dataset contains missing values. When numeric columns are added to one another with the plus operator, as in the preceding step, pandas propagates missing values: if any of the values for a particular row is missing, the total for that row is missing as well. Let's check if there are missing values in our new column and fill them with 0:
>>> movie['actor_director_facebook_likes'].isnull().sum()
122
>>> movie['actor_director_facebook_likes'] = (
...     movie['actor_director_facebook_likes'].fillna(0))
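To see how missing values behave under addition, here is a small sketch on hypothetical toy data (the `likes` DataFrame and its column names are made up for illustration). It compares the plus operator with two alternatives that pandas provides, `Series.add` with `fill_value` and `DataFrame.sum` with `min_count`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from movie.csv
likes = pd.DataFrame({
    'actor_1': [100.0, 50.0, np.nan],
    'actor_2': [10.0, np.nan, np.nan],
})

# The + operator propagates NaN: any missing operand makes the total missing
plus_total = likes['actor_1'] + likes['actor_2']

# Series.add with fill_value=0 treats a single missing operand as 0,
# but the result stays missing when both operands are missing
add_total = likes['actor_1'].add(likes['actor_2'], fill_value=0)

# DataFrame.sum skips NaN by default; min_count=1 keeps all-missing rows as NaN
sum_total = likes.sum(axis='columns', min_count=1)

print(plus_total.isnull().sum())  # 2
print(add_total.isnull().sum())   # 1
print(sum_total.isnull().sum())   # 1
```

Which approach is appropriate depends on whether a partially missing row should count as a valid total; the recipe itself uses the plus operator followed by `fillna(0)`.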
  1. There is another column in the dataset named cast_total_facebook_likes. It would be interesting to see what percentage of this column comes from our newly created column, actor_director_facebook_likes. Before we create our percentage column, let's do some basic data validation. Let's ensure that cast_total_facebook_likes is greater than or equal to actor_director_facebook_likes:
>>> movie['is_cast_likes_more'] = (
...     movie['cast_total_facebook_likes'] >=
...     movie['actor_director_facebook_likes'])
  1. is_cast_likes_more is now a column of boolean values. We can check whether all the values of this column are True with the all Series method:
>>> movie['is_cast_likes_more'].all()
False
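When `all` returns False, it can help to look at the rows that fail the check. The following is a sketch on hypothetical toy data (the values in `toy` are invented for illustration) that inverts the boolean column to count and select the offending rows:

```python
import pandas as pd

# Hypothetical toy data standing in for the movie dataset
toy = pd.DataFrame({
    'movie_title': ['A', 'B', 'C'],
    'cast_total_facebook_likes': [1000, 200, 50],
    'actor_director_facebook_likes': [900.0, 250.0, 50.0],
})

toy['is_cast_likes_more'] = (toy['cast_total_facebook_likes'] >=
                             toy['actor_director_facebook_likes'])

# ~ inverts the boolean column, so True marks rows that fail the check
n_failures = (~toy['is_cast_likes_more']).sum()
failures = toy.loc[~toy['is_cast_likes_more']]

print(n_failures)
print(failures)
```

The same two lines applied to `movie` would show how many movies violate the assumption and which ones they are.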
  1. It turns out that there is at least one movie with more actor_director_facebook_likes than cast_total_facebook_likes. It could be that director Facebook likes are not part of the cast total likes. Let's backtrack and delete the actor_director_facebook_likes column:
>>> movie = movie.drop('actor_director_facebook_likes',
...                    axis='columns')
  1. Let's recreate a column of just the total actor likes:
>>> movie['actor_total_facebook_likes'] = (
...     movie['actor_1_facebook_likes'] +
...     movie['actor_2_facebook_likes'] +
...     movie['actor_3_facebook_likes'])

>>> movie['actor_total_facebook_likes'] = (
...     movie['actor_total_facebook_likes'].fillna(0))
  1. Check again whether all the values in cast_total_facebook_likes are greater than or equal to those in actor_total_facebook_likes:
>>> movie['is_cast_likes_more'] = (
...     movie['cast_total_facebook_likes'] >=
...     movie['actor_total_facebook_likes'])

>>> movie['is_cast_likes_more'].all()
True
  1. Finally, let's calculate the percentage of cast_total_facebook_likes that comes from actor_total_facebook_likes:
>>> movie['pct_actor_cast_like'] = (
...     movie['actor_total_facebook_likes'] /
...     movie['cast_total_facebook_likes'])
  1. Let's validate that the min and max of this column fall between 0 and 1:
>>> (movie['pct_actor_cast_like'].min(),
...  movie['pct_actor_cast_like'].max())
(0.0, 1.0)
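One caveat with this check is that `min` and `max` skip missing values by default, so a row where both totals are zero (0 divided by 0 yields NaN) would pass unnoticed. The sketch below uses hypothetical toy data to illustrate this; whether movie.csv actually contains such rows is an assumption not verified here:

```python
import pandas as pd

# Hypothetical toy data; the last row has zero likes in both columns
toy = pd.DataFrame({
    'actor_total_facebook_likes': [50.0, 0.0, 0.0],
    'cast_total_facebook_likes': [100, 40, 0],
})

pct = (toy['actor_total_facebook_likes'] /
       toy['cast_total_facebook_likes'])

# min and max ignore NaN by default, so the 0/0 row does not affect them
bounds = (pct.min(), pct.max())

# Count the missing ratios explicitly
n_missing = pct.isnull().sum()

print(bounds)
print(n_missing)
```

Counting the missing values alongside the bounds gives a more complete picture of the new column.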
  1. We can then output this column as a Series. First, we need to set the index to the movie title so we can properly identify each value:
>>> movie.set_index('movie_title')['pct_actor_cast_like'].head()
movie_title
Avatar                                        0.577369
Pirates of the Caribbean: At World's End      0.951396
Spectre                                       0.987521
The Dark Knight Rises                         0.683783
Star Wars: Episode VII - The Force Awakens    0.000000
Name: pct_actor_cast_like, dtype: float64