How to do it...

  1. Read the movie dataset, set the movie title as the index, and create the criteria:
>>> movie = pd.read_csv('data/movie.csv', index_col='movie_title')
>>> c1 = movie['title_year'] >= 2010
>>> c2 = movie['title_year'].isnull()
>>> criteria = c1 | c2
  2. Use the mask method on a DataFrame to replace every value with a missing value in the rows for movies made from 2010 onward. Any movie that originally had a missing value for title_year is also masked:
>>> movie.mask(criteria).head()
  3. Notice how all the values in the third, fourth, and fifth rows from the preceding DataFrame are missing. Chain the dropna method to remove rows that have all values missing:
>>> movie_mask = movie.mask(criteria).dropna(how='all')
>>> movie_mask.head()
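The mask-then-drop pattern can be easier to follow on a tiny frame. The following is a minimal sketch, assuming a made-up four-row DataFrame (the `toy` name, its index labels, and its values are hypothetical and are not taken from the movie dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for the movie dataset
toy = pd.DataFrame(
    {
        "title_year": [2009.0, 2015.0, None, 1999.0],
        "num_voted_users": [100, 200, 300, 400],
    },
    index=["A", "B", "C", "D"],
)

# Same criteria as in step 1: made in 2010 or later, or year missing
criteria = (toy["title_year"] >= 2010) | toy["title_year"].isnull()

# Rows where criteria is True have every value replaced with NaN
masked = toy.mask(criteria)

# Rows that are entirely missing are then dropped
kept = masked.dropna(how="all")

print(masked)
print(kept.index.tolist())
```

Rows B and C satisfy the criteria, so they are blanked out by mask and then removed by dropna, leaving only A and D.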
  4. The operation in step 3 is just a complex way of doing basic boolean indexing. We can check whether the two methods produce the same DataFrame:
>>> movie_boolean = movie[movie['title_year'] < 2010]
>>> movie_mask.equals(movie_boolean)
False
  5. The equals method tells us that they aren't equal. Something is wrong. Let's do a sanity check and see whether they have the same shape:
>>> movie_mask.shape == movie_boolean.shape
True
  6. The mask method created many missing values. Missing values are stored as floats, so any integer column that received one was converted to float. The equals method returns False if the data types of the columns are different, even if the values are the same. Let's check the equality of the data types to see whether this scenario happened:
>>> movie_mask.dtypes == movie_boolean.dtypes
color                         True
director_name                 True
num_critic_for_reviews        True
duration                      True
director_facebook_likes       True
actor_3_facebook_likes        True
actor_2_name                  True
actor_1_facebook_likes        True
gross                         True
genres                        True
actor_1_name                  True
num_voted_users              False
cast_total_facebook_likes    False
.....
dtype: bool
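The dtype change described in this step can be reproduced in isolation. This is a small sketch, assuming a hypothetical three-row frame with one float column and one integer column (`toy` and `votes` are illustrative names, not columns from the movie dataset):

```python
import pandas as pd

# Hypothetical data: title_year is float, votes is integer
toy = pd.DataFrame(
    {"title_year": [2009.0, 2015.0, 1999.0], "votes": [100, 200, 400]}
)
criteria = toy["title_year"] >= 2010

# Route 1: mask, then drop the rows that became entirely missing
via_mask = toy.mask(criteria).dropna(how="all")

# Route 2: plain boolean indexing
via_boolean = toy[toy["title_year"] < 2010]

print(via_boolean.dtypes)  # votes is still an integer column
print(via_mask.dtypes)     # votes was upcast to float when NaN was introduced
print(via_mask.equals(via_boolean))
```

Both routes keep the same rows and values, but the integer column that passed through mask comes back as float, which is enough for equals to report a difference.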
  7. It turns out that a couple of columns don't have the same data type. Pandas has an alternative for these situations. Its testing module, which is primarily used by developers, has a function, assert_frame_equal, that lets you check the equality of two DataFrames without also checking the equality of the data types. It raises an AssertionError if the DataFrames differ and returns nothing if they are equal:
>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(movie_boolean, movie_mask, check_dtype=False)
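To see what the check_dtype parameter changes, the following sketch compares two hypothetical one-column frames that hold the same values with different data types (`left`, `right`, and `dtype_check_raised` are illustrative names):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Same values, different dtypes (hypothetical data)
left = pd.DataFrame({"votes": [100, 400]})       # integer column
right = pd.DataFrame({"votes": [100.0, 400.0]})  # float column

# Passes silently: only the values, index, and columns are compared
assert_frame_equal(left, right, check_dtype=False)

# With the default check_dtype=True the dtype mismatch is reported
try:
    assert_frame_equal(left, right)
    dtype_check_raised = False
except AssertionError:
    dtype_check_raised = True

print(dtype_check_raised)
```

Because the function signals a mismatch by raising rather than by returning False, a call that prints nothing, as in step 7, means the two DataFrames matched.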