How to do it...

  1. Read in the college dataset, and drop any rows that have missing values in any of the UGDS, SATMTMID, or SATVRMID columns. We must have non-missing values for each of these three columns:
>>> college = pd.read_csv('data/college.csv')
>>> subset = ['UGDS', 'SATMTMID', 'SATVRMID']
>>> college2 = college.dropna(subset=subset)
>>> college.shape
(7535, 27)

>>> college2.shape
(1184, 27)
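To see what the subset parameter of dropna does in isolation, here is a minimal sketch on a made-up four-row frame. Only the column names come from the recipe; the values and the toy and toy_complete names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the college dataset
toy = pd.DataFrame({
    'STABBR': ['AK', 'AK', 'AL', 'AL'],
    'UGDS': [100.0, np.nan, 300.0, 50.0],
    'SATMTMID': [500.0, 520.0, np.nan, 540.0],
    'SATVRMID': [480.0, 510.0, 530.0, 550.0],
})

# A row is dropped if ANY of the listed columns is missing (how='any' is the default).
# Missing values in columns outside the subset would not cause a drop.
toy_complete = toy.dropna(subset=['UGDS', 'SATMTMID', 'SATVRMID'])

print(toy.shape, toy_complete.shape)
```

Rows 1 and 2 each lack one of the three values, so only rows 0 and 3 survive.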
  2. The vast majority of institutions do not have data for all three required columns, but the 1,184 that remain are more than enough to continue. Next, create a user-defined function to calculate the average of just the SAT math scores, weighted by UGDS:
>>> def weighted_math_average(df):
        weighted_math = df['UGDS'] * df['SATMTMID']
        return int(weighted_math.sum() / df['UGDS'].sum())
  3. Group by state and pass this function to the apply method:
>>> college2.groupby('STABBR').apply(weighted_math_average).head()
STABBR
AK    503
AL    536
AR    529
AZ    569
CA    564
dtype: int64
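The arithmetic behind the weighted average is easy to verify by hand on a tiny example. The sketch below uses invented numbers and invented state codes ('AA', 'BB'), and cross-checks the result against np.average with its weights argument. Selecting the two needed columns before apply is a choice made here, not something the recipe does; it keeps the grouping column out of what the function receives.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two made-up states with two schools each
toy = pd.DataFrame({
    'STABBR': ['AA', 'AA', 'BB', 'BB'],
    'UGDS': [100, 300, 200, 200],
    'SATMTMID': [500, 600, 400, 500],
})

def weighted_math_average(df):
    # Sum of (enrollment * score) divided by total enrollment
    weighted_math = df['UGDS'] * df['SATMTMID']
    return int(weighted_math.sum() / df['UGDS'].sum())

cols = ['UGDS', 'SATMTMID']
result = toy.groupby('STABBR')[cols].apply(weighted_math_average)

# Cross-check with NumPy's built-in weighted mean
check = toy.groupby('STABBR')[cols].apply(
    lambda g: np.average(g['SATMTMID'], weights=g['UGDS'])
)
print(result)
print(check)
```

For 'AA' the calculation is (100 * 500 + 300 * 600) / 400 = 575, compared with a plain mean of 550, because the larger school pulls the average toward its score.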
  4. We successfully returned a scalar value for each group. Let's take a small detour and see what the outcome would have been by passing the same function to the agg method:
>>> college2.groupby('STABBR').agg(weighted_math_average).head()
  5. The weighted_math_average function gets applied to each non-aggregating column in the DataFrame. If you try to limit the columns to just SATMTMID, you will get an error, because the function then receives only the SATMTMID Series and has no access to UGDS. So, the best way to complete operations that act on multiple columns is with apply:
>>> college2.groupby('STABBR')['SATMTMID'] \
            .agg(weighted_math_average)
KeyError: 'UGDS'
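One way to see the difference is to record what type of object each method hands to the function. This is a small sketch on invented data; the spy_agg and spy_apply helpers are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical data with made-up state codes
toy = pd.DataFrame({
    'STABBR': ['AA', 'AA', 'BB', 'BB'],
    'UGDS': [100, 300, 200, 200],
    'SATMTMID': [500, 600, 400, 500],
})

seen_by_agg, seen_by_apply = [], []

def spy_agg(x):
    # Record the type of object that agg passes in
    seen_by_agg.append(type(x).__name__)
    return len(x)

def spy_apply(x):
    # Record the type of object that apply passes in
    seen_by_apply.append(type(x).__name__)
    return len(x)

# agg on a single selected column: the function sees one Series per group
toy.groupby('STABBR')['SATMTMID'].agg(spy_agg)

# apply on several columns: the function sees one DataFrame per group
toy.groupby('STABBR')[['UGDS', 'SATMTMID']].apply(spy_apply)

print(set(seen_by_agg), set(seen_by_apply))
```

Because the function called through agg only ever sees a single column, a lookup such as df['UGDS'] inside it has nothing to find, which is consistent with the KeyError above.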
  6. A nice feature of apply is that you can create multiple new columns by returning a Series. The index of this returned Series will be the new column names. Let's modify our function to calculate the weighted and arithmetic average for both SAT scores along with the count of the number of institutions from each group. We return these five values in a Series:
>>> from collections import OrderedDict
>>> def weighted_average(df):
        data = OrderedDict()
        weight_m = df['UGDS'] * df['SATMTMID']
        weight_v = df['UGDS'] * df['SATVRMID']

        wm_avg = weight_m.sum() / df['UGDS'].sum()
        wv_avg = weight_v.sum() / df['UGDS'].sum()

        data['weighted_math_avg'] = wm_avg
        data['weighted_verbal_avg'] = wv_avg
        data['math_avg'] = df['SATMTMID'].mean()
        data['verbal_avg'] = df['SATVRMID'].mean()
        data['count'] = len(df)
        # Truncate every value to an integer for display
        return pd.Series(data).astype(int)

>>> college2.groupby('STABBR').apply(weighted_average).head(10)
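The same pattern can be sketched on a tiny invented frame, which makes it easy to confirm that the keys of the returned Series become the column names of the result. This version uses a plain dict (insertion-ordered since Python 3.7) in place of OrderedDict and skips the integer cast; the summarize name, the state codes, and the data are all hypothetical.

```python
import pandas as pd

# Hypothetical data with made-up state codes
toy = pd.DataFrame({
    'STABBR': ['AA', 'AA', 'BB', 'BB'],
    'UGDS': [100, 300, 200, 200],
    'SATMTMID': [500, 600, 400, 500],
})

def summarize(g):
    total = g['UGDS'].sum()
    # Each key of the returned Series becomes one column of the result
    return pd.Series({
        'weighted_math_avg': (g['UGDS'] * g['SATMTMID']).sum() / total,
        'math_avg': g['SATMTMID'].mean(),
        'count': len(g),
    })

summary = toy.groupby('STABBR')[['UGDS', 'SATMTMID']].apply(summarize)
print(summary)
```

The result is a DataFrame with one row per group and one column per key. Since a Series holds a single dtype, the count is stored as a float alongside the averages unless the values are cast afterwards.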