How to do it...

Read in the college dataset, and find the mean and standard deviation of the undergraduate population by state:

>>> import pandas as pd
>>> college = pd.read_csv('data/college.csv')
>>> (college.groupby('STABBR')['UGDS']
...         .agg(['mean', 'std'])
...         .round(0).head())

This output isn't quite what we want. We are not looking for the mean and standard deviation of each group as a whole, but for the maximum number of standard deviations that any single institution lies from its state's mean. To calculate this, we subtract the state's mean undergraduate population from each institution's undergraduate population and then divide by the state's standard deviation. This standardizes the undergraduate population within each group. We can then take the maximum of the absolute values of these scores to find the institution that is farthest from the mean. pandas does not provide a built-in aggregation function that does this, so we need to write a custom function:

>>> def max_deviation(s):
...     std_score = (s - s.mean()) / s.std()
...     return std_score.abs().max()
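
As a quick sanity check, the function can be tried on a small hand-built Series before it is used in a groupby. The sketch below assumes made-up values (they are not taken from the college dataset) and relies on the pandas default of a sample standard deviation (ddof=1):

```python
import pandas as pd

def max_deviation(s):
    # Standardize each value against the mean and sample std of s
    std_score = (s - s.mean()) / s.std()
    # Return the largest absolute standardized score
    return std_score.abs().max()

# Hypothetical undergraduate populations for one state (illustrative only)
ugds = pd.Series([1, 2, 3, 4, 10])

# mean = 4, sample std = sqrt(50 / 4), so the value 10 is farthest away
result = max_deviation(ugds)
print(round(result, 1))  # 1.7
```

The value 10 sits 6 units above the mean of 4, and dividing by the sample standard deviation of about 3.54 gives roughly 1.7.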

After defining the function, pass it directly to the agg method to complete the aggregation:

>>> (college.groupby('STABBR')['UGDS']
...         .agg(max_deviation)
...         .round(1).head())
STABBR
AK    2.6
AL    5.8
AR    6.3
AS    NaN
AZ    9.9
Name: UGDS, dtype: float64
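
The NaN for AS is worth a second look. One plausible cause, stated here as an assumption rather than something verified against the dataset, is that the group holds too few non-missing values for a sample standard deviation to be defined. The sketch below reproduces that behavior on a hypothetical two-state frame in which state 'B' has a single institution:

```python
import pandas as pd

def max_deviation(s):
    # Standardize within the group, then take the largest absolute score
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

# Hypothetical data: state 'A' has three institutions, state 'B' only one
toy = pd.DataFrame({
    'STABBR': ['A', 'A', 'A', 'B'],
    'UGDS': [10.0, 20.0, 30.0, 500.0],
})

by_state = toy.groupby('STABBR')['UGDS'].agg(max_deviation)
print(by_state)
```

With a single value, s.std() is NaN under the default ddof=1, so every standardized score in that group is NaN and the aggregation returns NaN for 'B', while 'A' returns 1.0.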
