How to do it...

  1. Before we get started with the actual flights dataset, let's practice counting streaks of ones with a small sample Series:
>>> s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0])
>>> s
0 0
1 1
2 1 3 0 4 1 5 1 6 1 7 0 dtype: int64
  1. Our final representation of the streaks of ones will be a Series of the same length as the original with an independent count beginning from one for each streak. To get started, let's use the cumsum method:
>>> s1 = s.cumsum()
>>> s1
0 0 1 1 2 2 3 2 4 3 5 4 6 5 7 5 dtype: int64
  1. We have now accumulated all the ones going down the Series. Let's multiply this Series by the original:
>>> s.mul(s1)
0 0 1 1 2 2 3 0 4 3 5 4 6 5 7 0 dtype: int64
  1. We have only non-zero values where we originally had ones. This result is fairly close to what we desire. We just need to restart each streak at one instead of where the cumulative sum left off. Let's chain the diff method, which subtracts the previous value from the current:
>>> s.mul(s1).diff()
0 NaN 1 1.0 2 1.0 3 -2.0 4 3.0 5 1.0 6 1.0 7 -5.0 dtype: float64
  1. A negative value represents the end of a streak. We need to propagate the negative values down the Series and use them to subtract away the excess accumulation from step 2. To do this, we will make all non-negative values missing with the where method:
>>> s.mul(s1).diff().where(lambda x: x < 0)
0 NaN
1 NaN 2 NaN 3 -2.0 4 NaN 5 NaN 6 NaN 7 -5.0 dtype: float64
  1. We can now propagate these values down with the ffill method:
>>> s.mul(s1).diff().where(lambda x: x < 0).ffill()
0 NaN 1 NaN 2 NaN 3 -2.0 4 -2.0 5 -2.0 6 -2.0 7 -5.0 dtype: float64
  1. Finally, we can add this Series back to s1 to clear out the excess accumulation:
>>> s.mul(s1).diff().where(lambda x: x < 0).ffill() 
.add(s1, fill_value=0)
0 0.0 1 1.0 2 2.0 3 0.0 4 1.0 5 2.0 6 3.0 7 0.0 dtype: float64
  1. Now that we have a working consecutive streak finder, we can find the longest streak per airline and origin airport. Let's read in the flights dataset and create a column to represent on-time arrival:
>>> flights = pd.read_csv('data/flights.csv')
>>> flights['ON_TIME'] = flights['ARR_DELAY'].lt(15).astype(int)
>>> flights[['AIRLINE', 'ORG_AIR', 'ON_TIME']].head(10)
  1. Use our logic from the first seven steps to define a function that returns the maximum streak of ones for a given Series:
>>> def max_streak(s):
s1 = s.cumsum()
return s.mul(s1).diff().where(lambda x: x < 0)
.ffill().add(s1, fill_value=0).max()
  1. Find the maximum streak of on-time arrivals per airline and origin airport along with the total number of flights and percentage of on-time arrivals. First, sort the day of the year and scheduled departure time:
>>> flights.sort_values(['MONTH', 'DAY', 'SCHED_DEP']) 
.groupby(['AIRLINE', 'ORG_AIR'])['ON_TIME']
.agg(['mean', 'size', max_streak]).round(2).head()
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.227.102.50