How to do it...

  1. Read in the flights dataset, and output the first five rows:
>>> import numpy as np
>>> import pandas as pd
>>> flights = pd.read_csv('data/flights.csv')
>>> flights.head()
  1. If we want to find the distribution of airlines over a range of distances, we need to place the values of the DIST column into discrete bins. Let's use the pandas cut function to split the data into five bins:
>>> bins = [-np.inf, 200, 500, 1000, 2000, np.inf]
>>> cuts = pd.cut(flights['DIST'], bins=bins)
>>> cuts.head()
0     (500.0, 1000.0]
1    (1000.0, 2000.0]
2     (500.0, 1000.0]
3    (1000.0, 2000.0]
4    (1000.0, 2000.0]
Name: DIST, dtype: category
Categories (5, interval[float64]): [(-inf, 200.0] < (200.0, 500.0] < (500.0, 1000.0] < (1000.0, 2000.0] < (2000.0, inf]]
  1. An ordered categorical Series is created. To help get an idea of what happened, let's count the values of each category:
>>> cuts.value_counts()
(500.0, 1000.0]     20659
(200.0, 500.0]      15874
(1000.0, 2000.0]    14186
(2000.0, inf]        4054
(-inf, 200.0]        3719
Name: DIST, dtype: int64
  1. The cuts Series can now be used to form groups. The groupby method accepts any Series that aligns with the DataFrame's index, not only column names. Pass the cuts Series to the groupby method and then call the value_counts method on the AIRLINE column to find the distribution of airlines within each distance group. Notice that SkyWest (OO) makes up 33% of flights of 200 miles or less but only 16% of those between 200 and 500 miles:
>>> flights.groupby(cuts)['AIRLINE'].value_counts(normalize=True) \
...     .round(3).head(15)
DIST            AIRLINE
(-inf, 200.0]   OO         0.326
                EV         0.289
                MQ         0.211
                DL         0.086
                AA         0.052
                UA         0.027
                WN         0.009
(200.0, 500.0]  WN         0.194
                DL         0.189
                OO         0.159
                EV         0.156
                MQ         0.100
                AA         0.071
                UA         0.062
                VX         0.028
Name: AIRLINE, dtype: float64
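
The steps above can be tried without the flights file. The following is a minimal sketch on a small made-up DataFrame; the name toy, its nine rows, and the variable names toy_cuts and shares are assumptions for illustration and do not come from the flights dataset. The observed=True argument is also an addition here, intended only to limit the result to bins that actually contain data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the flights data: nine made-up flights.
toy = pd.DataFrame({
    'AIRLINE': ['OO', 'OO', 'EV', 'WN', 'WN', 'DL', 'WN', 'AA', 'DL'],
    'DIST': [150, 180, 200, 250, 400, 450, 900, 1500, 2500],
})

# Same bin edges as in the recipe. Intervals are closed on the right by
# default, so a distance of exactly 200 falls in (-inf, 200.0].
bins = [-np.inf, 200, 500, 1000, 2000, np.inf]
toy_cuts = pd.cut(toy['DIST'], bins=bins)

# Group by the binned Series (aligned with toy on the index) and compute
# each airline's share of flights within its distance bin.
shares = (toy.groupby(toy_cuts, observed=True)['AIRLINE']
             .value_counts(normalize=True)
             .round(3))
print(shares)
```

In this toy data, OO accounts for two of the three flights in the shortest bin. Because toy_cuts is a separate Series rather than a column of toy, the original DataFrame is left unchanged, which mirrors how cuts is used in the recipe.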