Grouping, aggregations, and transforms

One of the most appreciated features of Pandas is the simple and concise expression of data analysis pipelines that require grouping, transforming, and aggregating data. To demonstrate this concept, let's extend our dataset by adding two new patients to whom we didn't administer the treatment (this is usually called a control group). We also include a column, drug_admst, which records whether the patient was administered the treatment:

    import pandas as pd

    patients = ["a", "b", "c", "d", "e", "f"]

    columns = {
      "sys_initial": [120, 126, 130, 115, 150, 117],
      "dia_initial": [75, 85, 90, 87, 90, 74],
      "sys_final": [115, 123, 130, 118, 130, 121],
      "dia_final": [70, 82, 92, 87, 85, 74],
      "drug_admst": [True, True, True, True, False, False]
    }

    df = pd.DataFrame(columns, index=patients)

At this point, we may be interested in knowing how the blood pressure changed in each of the two groups. You can group the patients according to drug_admst using the pd.DataFrame.groupby method. The return value is a DataFrameGroupBy object, which can be iterated over to obtain a (value, pd.DataFrame) pair for each distinct value of the drug_admst column:

    df.groupby('drug_admst')
    for value, group in df.groupby('drug_admst'):
        print("Value: {}".format(value))
        print("Group DataFrame:")
        print(group)
    # Output:
    # Value: False
    # Group DataFrame:
    #    dia_final   dia_initial   drug_admst   sys_final   sys_initial
    # e         85            90        False         130           150
    # f         74            74        False         121           117
    # Value: True
    # Group DataFrame:
    #    dia_final   dia_initial   drug_admst   sys_final   sys_initial
    # a         70            75         True         115           120
    # b         82            85         True         123           126
    # c         92            90         True         130           130
    # d         87            87         True         118           115
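
When only one group is needed, iterating is not required. As a minimal sketch (it uses a reduced copy of the dataset, and the variable names grouped, group_labels, and control are purely illustrative), the groups attribute and the get_group method give direct access to the members of each group:

```python
import pandas as pd

patients = ["a", "b", "c", "d", "e", "f"]
df = pd.DataFrame(
    {
        "sys_initial": [120, 126, 130, 115, 150, 117],
        "sys_final": [115, 123, 130, 118, 130, 121],
        "drug_admst": [True, True, True, True, False, False],
    },
    index=patients,
)

grouped = df.groupby("drug_admst")

# Map each group key to the index labels (patients) that belong to it.
group_labels = {bool(key): list(labels) for key, labels in grouped.groups.items()}

# Extract a single group (here, the control group) as a regular DataFrame.
control = grouped.get_group(False)
print(group_labels)
print(control)
```

Here control contains only the rows for patients e and f, the two patients who did not receive the treatment.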

Iterating over the DataFrameGroupBy object is almost never necessary because, thanks to method chaining, it is possible to calculate group-related properties directly. For example, we may want to calculate the mean, max, or standard deviation for each group. Operations that summarize the data in this way are called aggregations and can be performed using the agg method. The result of agg is another pd.DataFrame, indexed by the values of the grouping variable, that holds the result of the aggregation for each column, as illustrated in the following code:

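    # The string "mean" selects the built-in group mean; recent Pandas
    # versions emit a warning when np.mean is passed instead.
    df.groupby('drug_admst').agg("mean")
    #              dia_final   dia_initial   sys_final   sys_initial
    # drug_admst
    # False            79.50         82.00       125.5        133.50
    # True             82.75         84.25       121.5        122.75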
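
To answer the earlier question of how blood pressure changed in each group, one possible approach is to compute the per-patient differences first and then aggregate them. The following sketch assumes two helper columns, sys_change and dia_change, which are not part of the original dataset, and passes a list of function names to agg to obtain several summaries at once:

```python
import pandas as pd

patients = ["a", "b", "c", "d", "e", "f"]
df = pd.DataFrame(
    {
        "sys_initial": [120, 126, 130, 115, 150, 117],
        "dia_initial": [75, 85, 90, 87, 90, 74],
        "sys_final": [115, 123, 130, 118, 130, 121],
        "dia_final": [70, 82, 92, 87, 85, 74],
        "drug_admst": [True, True, True, True, False, False],
    },
    index=patients,
)

# Hypothetical helper columns: final minus initial reading for each patient.
changes = df.assign(
    sys_change=df["sys_final"] - df["sys_initial"],
    dia_change=df["dia_final"] - df["dia_initial"],
)

# Several aggregations at once; the resulting columns form a two-level
# index of (column name, aggregation name).
summary = changes.groupby("drug_admst")[["sys_change", "dia_change"]].agg(["mean", "max"])
print(summary)
```

For this data, the mean systolic change is -1.25 for the treated group and -8.0 for the control group, which matches the difference between the sys_final and sys_initial means computed with agg.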

It is also possible to process the DataFrame groups in ways that do not summarize them. One common example of such an operation is filling in missing values. These intermediate steps, which return a result with the same number of rows as the input, are called transforms.

We can illustrate this concept with an example. Let's assume that we have a few missing values in our dataset, and we want to replace those values with the average of the other values in the same group. This can be accomplished using a transform, as follows:

    # Introduce a missing value for patient "a"
    df.loc['a', 'sys_initial'] = None
    # Within each group, replace missing values with that group's mean
    df.groupby('drug_admst').transform(lambda group: group.fillna(group.mean()))
    #     dia_final    dia_initial   sys_final   sys_initial
    # a          70             75         115    123.666667
    # b          82             85         123    126.000000
    # c          92             90         130    130.000000
    # d          87             87         118    115.000000
    # e          85             90         130    150.000000
    # f          74             74         121    117.000000
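
Because a transform returns a result aligned with the original index, it can be combined directly with the original rows, which is not the case for agg. As a small sketch (the column name sys_final_dev is an assumption made for illustration), the following code measures how far each patient's final systolic pressure is from the mean of that patient's group:

```python
import pandas as pd

patients = ["a", "b", "c", "d", "e", "f"]
df = pd.DataFrame(
    {
        "sys_final": [115, 123, 130, 118, 130, 121],
        "drug_admst": [True, True, True, True, False, False],
    },
    index=patients,
)

# transform("mean") broadcasts each group's mean back to every row of
# that group, so the result has one value per patient.
group_mean = df.groupby("drug_admst")["sys_final"].transform("mean")

# Hypothetical new column: deviation of each patient from the group mean.
df_dev = df.assign(sys_final_dev=df["sys_final"] - group_mean)
print(df_dev)
```

Each treated patient is compared against 121.5 and each control patient against 125.5, the same group means obtained earlier with agg.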
