How it works...

Like our movie dataset, most DataFrames will not contain boolean columns. The most straightforward way to produce a boolean Series is to apply a condition to one of the columns using a comparison operator. In step 2, we use the greater-than operator to test whether the duration of each movie exceeds two hours (120 minutes). Steps 3 and 4 calculate two important quantities from a boolean Series: its sum and its mean. These calculations work because Python evaluates False/True as 0/1, so the sum is the number of True values and the mean is the fraction of values that are True.
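As a minimal sketch of the idea, the following builds a boolean Series from a comparison and takes its sum and mean. The five durations are made-up toy values standing in for the real movie dataset, and the names n_long and frac_long are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the real recipe uses the movie dataset's duration column
movie = pd.DataFrame(
    {"duration": [95.0, 130.0, 142.0, 88.0, 121.0]},
    index=["A", "B", "C", "D", "E"],
)

# Comparing a column to a scalar yields a boolean Series of the same length
movie_2_hours = movie["duration"] > 120

# True counts as 1 and False as 0, so sum() counts the True values
n_long = movie_2_hours.sum()

# mean() is therefore the fraction of True values
frac_long = movie_2_hours.mean()

print(movie_2_hours.dtype, n_long, frac_long)
```

With this toy data, three of the five durations exceed 120, so the sum is 3 and the mean is 0.6.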

You can prove to yourself that the mean of a boolean Series is the proportion of True values. To do this, call the value_counts method with the normalize parameter set to True to get the relative frequency of each value:

>>> movie_2_hours.value_counts(normalize=True)
False    0.788649
True     0.211351
Name: duration, dtype: float64

Step 5 alerts us to the incorrect result from step 4. The duration column has missing values, and the boolean condition evaluated every comparison against a missing value as False, so those movies were still counted in the total. Dropping the missing values first allows us to calculate the correct statistic. This is done in a single expression through method chaining.
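The effect of missing values can be sketched on a small Series. The values below are made up for illustration, and the chained expression uses dropna and gt as one plausible way to write the correction; it is not necessarily the exact code from step 5.

```python
import numpy as np
import pandas as pd

# Hypothetical toy durations with two missing values
duration = pd.Series([95.0, 130.0, np.nan, 142.0, np.nan], name="duration")

# Any comparison against NaN evaluates to False, so the missing
# rows stay in the denominator and pull the mean down
naive = (duration > 120).mean()

# Drop the missing values first, then compare and average in one chain
corrected = duration.dropna().gt(120).mean()

print(naive, corrected)
```

Here the naive result is 2 out of 5, while the corrected result is 2 out of the 3 movies whose duration is actually known.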

Step 6 shows that pandas treats boolean columns much as it treats object data types, by displaying frequency information. This is a natural way to summarize a boolean Series, rather than displaying quantiles as it does for numeric data.
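A short sketch of this behavior, using a made-up boolean Series and a made-up numeric Series rather than the movie data, compares the summary labels that describe returns for each:

```python
import pandas as pd

# Hypothetical toy Series for illustration
flags = pd.Series([False, True, False, False, True], name="duration")
numbers = pd.Series([95.0, 130.0, 142.0, 88.0, 121.0], name="duration")

# Boolean data is summarized by frequency: count, unique, top, freq
bool_summary = flags.describe()

# Numeric data is summarized by count, mean, std, min, quartiles, and max
num_summary = numbers.describe()

print(bool_summary)
print(num_summary)
```

For the boolean Series, top is the most common value (False here) and freq is how many times it occurs.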
