How to do it...

Read in the college dataset, and drop any rows that have a missing value in any of the UGDS, SATMTMID, or SATVRMID columns. We must have non-missing values for all three of these columns:

>>> import pandas as pd
>>> college = pd.read_csv('data/college.csv')
>>> subset = ['UGDS', 'SATMTMID', 'SATVRMID']
>>> college2 = college.dropna(subset=subset)
>>> college.shape
(7535, 27)

>>> college2.shape
(1184, 27)

The vast majority of institutions do not have data for our three required columns, but this is still more than enough data to continue. Next, create a user-defined function to calculate the weighted average of just the SAT math scores:

>>> def weighted_math_average(df):
        weighted_math = df['UGDS'] * df['SATMTMID']
        return int(weighted_math.sum() / df['UGDS'].sum())
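To see what the function computes, here is a minimal sketch on a hypothetical three-school table (the numbers are made up for illustration, not taken from the college dataset). It checks the hand-rolled weighted average against NumPy's `np.average` with its `weights` argument and contrasts it with the plain mean:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: enrollment (the weights) and SAT math scores
toy = pd.DataFrame({
    'UGDS': [1000, 3000, 6000],
    'SATMTMID': [500, 600, 700],
})

def weighted_math_average(df):
    # Same logic as the recipe: sum(weight * value) / sum(weight)
    weighted_math = df['UGDS'] * df['SATMTMID']
    return int(weighted_math.sum() / df['UGDS'].sum())

manual = weighted_math_average(toy)
via_numpy = np.average(toy['SATMTMID'], weights=toy['UGDS'])
plain_mean = toy['SATMTMID'].mean()

print(manual, via_numpy, plain_mean)  # 650 650.0 600.0
```

The largest school pulls the weighted result above the unweighted mean, which is the effect the recipe is after when weighting by undergraduate population.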

Group by state and pass this function to the apply method:

>>> college2.groupby('STABBR').apply(weighted_math_average).head()
STABBR
AK    503
AL    536
AR    529
AZ    569
CA    564
dtype: int64

We successfully returned a scalar value for each group. Let's take a small detour and see what the outcome would have been by passing the same function to the agg method:

>>> college2.groupby('STABBR').agg(weighted_math_average).head()

The weighted_math_average function gets applied to each non-aggregating column in the DataFrame. If you try to limit the columns to just SATMTMID, you will get an error because the function no longer has access to UGDS. So, the best way to complete operations that act on multiple columns is with apply:

>>> college2.groupby('STABBR')['SATMTMID'] \
            .agg(weighted_math_average)
KeyError: 'UGDS'
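The reason behind the error can be sketched by recording what each method hands to the function. The toy data and the two helper functions below are hypothetical and exist only for illustration. The columns are selected explicitly before calling apply so that the sketch behaves the same way across pandas versions:

```python
import pandas as pd

# Hypothetical toy data; 'AA' and 'BB' are made-up group labels
toy = pd.DataFrame({
    'STABBR': ['AA', 'AA', 'BB'],
    'UGDS': [1000, 3000, 2000],
    'SATMTMID': [500, 600, 400],
})

agg_types, apply_types = [], []

def record_for_agg(obj):
    # Remember the type of object agg passes in, then return a scalar
    agg_types.append(type(obj).__name__)
    return len(obj)

def record_for_apply(obj):
    # Remember the type of object apply passes in, then return a scalar
    apply_types.append(type(obj).__name__)
    return len(obj)

# agg on a single selected column hands the function one column per group
toy.groupby('STABBR')['SATMTMID'].agg(record_for_agg)

# apply on several selected columns hands the function a whole sub-frame
toy.groupby('STABBR')[['UGDS', 'SATMTMID']].apply(record_for_apply)

print(set(agg_types))
print(set(apply_types))
```

Because agg on a single column passes only that column's values for each group, a lookup such as `df['UGDS']` inside the function has nothing to find, whereas apply passes every selected column of the group together.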

A nice feature of apply is that you can create multiple new columns by returning a Series. The index of this returned Series becomes the new column names. Let's modify our function to calculate the weighted and arithmetic averages for both SAT scores, along with the number of institutions in each group. We return these five values in a Series:

>>> from collections import OrderedDict
>>> def weighted_average(df):
        data = OrderedDict()
        weight_m = df['UGDS'] * df['SATMTMID']
        weight_v = df['UGDS'] * df['SATVRMID']
    
        wm_avg = weight_m.sum() / df['UGDS'].sum()
        wv_avg = weight_v.sum() / df['UGDS'].sum()

        data['weighted_math_avg'] = wm_avg
        data['weighted_verbal_avg'] = wv_avg
        data['math_avg'] = df['SATMTMID'].mean()
        data['verbal_avg'] = df['SATVRMID'].mean()
        data['count'] = len(df)
        return pd.Series(data, dtype='int')

>>> college2.groupby('STABBR').apply(weighted_average).head(10)
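The shape of the result is easier to see on a small table. The following is a minimal sketch with hypothetical data and a shortened, hypothetical `summarize` function (three values instead of five). It assumes that the needed columns are selected before calling apply:

```python
import pandas as pd

# Hypothetical toy data; 'AA' and 'BB' are made-up group labels
toy = pd.DataFrame({
    'STABBR': ['AA', 'AA', 'BB', 'BB'],
    'UGDS': [1000, 3000, 2000, 2000],
    'SATMTMID': [500, 600, 400, 600],
})

def summarize(df):
    # Each key of the returned Series becomes a column in the final result
    weighted = (df['UGDS'] * df['SATMTMID']).sum() / df['UGDS'].sum()
    return pd.Series({
        'weighted_math_avg': weighted,
        'math_avg': df['SATMTMID'].mean(),
        'count': len(df),
    })

result = toy.groupby('STABBR')[['UGDS', 'SATMTMID']].apply(summarize)
print(result)
```

The result has one row per group label and one column per key of the returned Series. For group AA, the weighted average is 575 while the plain mean is 550, because the larger school carries more weight.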
