How to do it...

  1. Read the movie dataset, set the movie title as the index, and select all the values in the actor_1_facebook_likes column that are not missing:
>>> movie = pd.read_csv('data/movie.csv', index_col='movie_title')
>>> fb_likes = movie['actor_1_facebook_likes'].dropna()
>>> fb_likes.head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      40000.0
Spectre                                       11000.0
The Dark Knight Rises                         27000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64
  1. Let's use the describe method to get a sense of the distribution:
>>> fb_likes.describe(percentiles=[.1, .25, .5, .75, .9]).astype(int)
count      4909
mean       6494
std       15106
min           0
10%         240
25%         607
50%         982
75%       11000
90%       18000
max      640000
Name: actor_1_facebook_likes, dtype: int64
  1. Additionally, we may plot a histogram of this Series to visually inspect the distribution:
>>> fb_likes.hist()
  1. This histogram is a poor visualization, and it is very difficult to get a sense of the distribution from it. The summary statistics from step 2, on the other hand, suggest that the data is highly skewed to the right, with many observations more than an order of magnitude greater than the median. Let's create a criterion to test whether the number of likes is fewer than 20,000:
>>> criteria_high = fb_likes < 20000
>>> criteria_high.mean().round(2)
0.91
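Taking the mean of a boolean Series works because True counts as 1 and False as 0, so the mean is the share of True values. Here is a minimal sketch of that idea on a small, made-up Series; the values and the names `likes`, `below_cap`, `share_below`, and `count_below` are invented for illustration and are not part of the movie dataset:

```python
import pandas as pd

# Hypothetical like counts, invented for illustration (not read from movie.csv)
likes = pd.Series([1000.0, 40000.0, 11000.0, 27000.0, 131.0])

# Comparing a Series to a scalar yields a boolean Series of the same length
below_cap = likes < 20000

# True is treated as 1 and False as 0, so the mean is the fraction of True values
share_below = below_cap.mean()

# The sum gives the count of True values
count_below = below_cap.sum()
```

With these five values, three fall below 20,000, so the share is three out of five.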
  1. About 91% of the movies have an actor 1 with fewer than 20,000 likes. We will now use the where method, which accepts a boolean condition. By default, it returns a Series the same size as the original, with every location where the condition is False replaced by a missing value:
>>> fb_likes.where(criteria_high).head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End          NaN
Spectre                                       11000.0
The Dark Knight Rises                             NaN
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64
  1. The second parameter to the where method, other, allows you to control the replacement value. Let's change all the missing values to 20,000:
>>> fb_likes.where(criteria_high, other=20000).head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      20000.0
Spectre                                       11000.0
The Dark Knight Rises                         20000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64
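To see the two behaviors of where side by side without the movie dataset, here is a small sketch on a hand-made Series. The values, the index labels, and the names `likes`, `keep`, `masked`, and `capped` are assumptions made up for this example:

```python
import pandas as pd

# Hypothetical data, invented for illustration
likes = pd.Series([1000.0, 40000.0, 131.0], index=['a', 'b', 'c'])

# Boolean condition: True where the value should be kept
keep = likes < 20000

# Default behavior: positions where the condition is False become NaN
masked = likes.where(keep)

# With other=...: positions where the condition is False take that value instead
capped = likes.where(keep, other=20000)
```

In both cases the result has the same length and index as the original Series; only the values at the False positions change.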
  1. Similarly, we can create a criterion to put a floor on the minimum number of likes. Here, we chain another where method and replace the values that do not meet the condition with 300:
>>> criteria_low = fb_likes > 300
>>> fb_likes_cap = (fb_likes.where(criteria_high, other=20000)
...                         .where(criteria_low, 300))
>>> fb_likes_cap.head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      20000.0
Spectre                                       11000.0
The Dark Knight Rises                         20000.0
Star Wars: Episode VII - The Force Awakens      300.0
Name: actor_1_facebook_likes, dtype: float64
  1. The length of the original Series and modified Series is the same:
>>> len(fb_likes), len(fb_likes_cap)
(4909, 4909)
  1. Let's make a histogram with the modified Series. With the data in a much tighter range, it should produce a better plot:
>>> fb_likes_cap.hist()
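The steps above can be condensed into a short, self-contained sketch. It uses a small made-up Series rather than the movie dataset, so the values and the names `likes`, `floor`, `cap`, `capped`, `clipped`, and `same` are assumptions for illustration only. As a cross-check, it compares the chained where result against pandas' `Series.clip`, which bounds values between `lower` and `upper`; this comparison is our own addition and assumes the Series has no missing values, as is the case after dropna in step 1:

```python
import pandas as pd

# Hypothetical like counts, invented for illustration (no missing values)
likes = pd.Series([1000.0, 40000.0, 11000.0, 27000.0, 131.0])

floor, cap = 300, 20000

# Chain two where calls: first cap values at 20,000, then raise small values to 300.
# Both conditions are computed on the original Series and aligned by index.
capped = (likes.where(likes < cap, other=cap)
               .where(likes > floor, other=floor))

# Cross-check: clip bounds the values between lower and upper in one call
clipped = likes.clip(lower=floor, upper=cap)

# True when both approaches give identical values, index, and dtype
same = capped.equals(clipped)
```

On data without missing values the two approaches should agree. They would differ on NaN, which where replaces whenever the condition is False but clip leaves untouched.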