How it works...

pandas has no predefined function for calculating the maximum number of standard deviations away from the mean, so we constructed a custom function in step 2. This custom function, max_deviation, accepts a single parameter, s. In step 3, the function name is passed to the agg method without being called, and the parameter s is never explicitly passed to max_deviation. Instead, pandas implicitly passes the UGDS values of each group as a Series to max_deviation.
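The implicit hand-off can be sketched on a small example. The DataFrame below is hypothetical toy data standing in for the college dataset, the body of max_deviation is reconstructed from the description in this section rather than copied from step 2, and the seen_types list is an illustrative helper added only to record what pandas passes in.

```python
import pandas as pd

# Hypothetical toy data standing in for the recipe's college dataset
college = pd.DataFrame({
    "STABBR": ["AK", "AK", "AK", "AL", "AL", "AL"],
    "UGDS": [50.0, 60.0, 70.0, 100.0, 200.0, 600.0],
})

seen_types = []  # illustrative helper: records the type of each argument

def max_deviation(s):
    # s is supplied by pandas; we never pass it ourselves
    seen_types.append(type(s).__name__)
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

# Pass the function object itself (no parentheses); pandas calls it per group
result = college.groupby("STABBR")["UGDS"].agg(max_deviation)
print(result)
print(set(seen_types))
```

Writing agg(max_deviation()) instead would call the function immediately with no argument and fail before the groupby ever ran.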

The max_deviation function is called once for each group. As s is a Series, all normal Series methods are available. The function subtracts the mean of that particular group from each value in the group and then divides by the group's standard deviation, a process called standardization.
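To see what happens inside a single call, the following sketch standardizes one group by hand. The three enrollment values are made up for illustration and do not come from the real dataset.

```python
import pandas as pd

# Hypothetical enrollment values for one group (not from the real dataset)
s = pd.Series([50.0, 60.0, 70.0], name="UGDS")

group_mean = s.mean()  # mean of this group only
group_std = s.std()    # sample standard deviation (ddof=1) of this group

# Ordinary Series arithmetic: subtract the mean, divide by the std
std_score = (s - group_mean) / group_std
print(std_score.tolist())
```

After standardization, the scores of a group have a mean of 0 and a sample standard deviation of 1, so each score reads directly as "number of standard deviations from the group mean".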

Standardization is a common statistical procedure for understanding how far individual values lie from the mean. For a normal distribution, about 99.7% of the data lies within three standard deviations of the mean.
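The 99.7% figure can be checked with the standard library alone. This is a small sketch that relies on the identity P(|Z| < k) = erf(k / sqrt(2)) for a standard normal variable; share_within is a hypothetical helper name.

```python
import math

def share_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The three printed shares are roughly 68%, 95%, and 99.7%. The enrollment data in this recipe need not be normally distributed, so these percentages are a reference point rather than a guarantee.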

As we are interested in the absolute deviation from the mean, we take the absolute value of all the standardized scores and return the maximum. The agg method requires our custom function to return a single scalar value; otherwise, an exception is raised. pandas defaults to the sample standard deviation, which is undefined for any group with just a single value. For instance, the state abbreviation AS (American Samoa) returns a missing value because it has only a single institution in the dataset.
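Both edge cases can be reproduced on toy data. In this sketch the values are hypothetical (the single AS row only mimics the one-institution situation), standardize is an illustrative function that deliberately returns a Series instead of a scalar, and the exact exception type and message are assumed to vary between pandas versions, so the code catches a generic Exception.

```python
import math
import pandas as pd

# Hypothetical toy data; "AS" has a single row, like the real dataset
college = pd.DataFrame({
    "STABBR": ["AK", "AK", "AK", "AS"],
    "UGDS": [50.0, 60.0, 70.0, 1000.0],
})

def max_deviation(s):
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

def standardize(s):
    # Returns a whole Series, not a scalar, so it is not a valid aggregation
    return (s - s.mean()) / s.std()

grouped = college.groupby("STABBR")["UGDS"]

# The sample standard deviation of one value is NaN, so AS comes back missing
result = grouped.agg(max_deviation)
print(result)
print(math.isnan(result["AS"]))

# A non-scalar return value makes agg raise
raised = False
try:
    grouped.agg(standardize)
except Exception as exc:
    raised = True
    print(type(exc).__name__, exc)
```

A function like standardize, which returns one value per row, belongs with the transform method rather than agg.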
