Grouping, aggregations, and transforms

One of the most appreciated features of Pandas is the simple and concise expression of data analysis pipelines that require grouping, transforming, and aggregating the data. To demonstrate this concept, let's extend our dataset by adding two new patients to whom we didn't administer the treatment (this is usually called a control group). We also include a column, drug_admst, which records whether the patient was administered the treatment:

    import pandas as pd

    patients = ["a", "b", "c", "d", "e", "f"]

    columns = {
        "sys_initial": [120, 126, 130, 115, 150, 117],
        "dia_initial": [75, 85, 90, 87, 90, 74],
        "sys_final": [115, 123, 130, 118, 130, 121],
        "dia_final": [70, 82, 92, 87, 85, 74],
        "drug_admst": [True, True, True, True, False, False]
    }

    df = pd.DataFrame(columns, index=patients)

At this point, we may be interested to know how the blood pressure changed between the two groups. You can group the patients according to drug_admst using the pd.DataFrame.groupby function. The return value is a DataFrameGroupBy object, which can be iterated to obtain a new pd.DataFrame for each value of the drug_admst column:

    # Returns a DataFrameGroupBy object; nothing is computed yet
    df.groupby('drug_admst')

    # Iterating yields (group value, sub-DataFrame) pairs
    for value, group in df.groupby('drug_admst'):
        print("Value: {}".format(value))
        print("Group DataFrame:")
        print(group)
    # Output:
    # Value: False
    # Group DataFrame:
    #    dia_final  dia_initial  drug_admst  sys_final  sys_initial
    # e         85           90       False        130          150
    # f         74           74       False        121          117
    # Value: True
    # Group DataFrame:
    #    dia_final  dia_initial  drug_admst  sys_final  sys_initial
    # a         70           75        True        115          120
    # b         82           85        True        123          126
    # c         92           90        True        130          130
    # d         87           87        True        118          115
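When only one group is of interest, or only the group sizes, the loop can usually be avoided. The following is a minimal sketch, assuming the same dataset as above; the variable names grouped, group_sizes, and control are chosen here for illustration only:

```python
import pandas as pd

patients = ["a", "b", "c", "d", "e", "f"]
columns = {
    "sys_initial": [120, 126, 130, 115, 150, 117],
    "dia_initial": [75, 85, 90, 87, 90, 74],
    "sys_final": [115, 123, 130, 118, 130, 121],
    "dia_final": [70, 82, 92, 87, 85, 74],
    "drug_admst": [True, True, True, True, False, False],
}
df = pd.DataFrame(columns, index=patients)

grouped = df.groupby("drug_admst")

# Number of rows in each group, indexed by the value of drug_admst
group_sizes = grouped.size()

# Extract a single group (here, the control group) without writing a loop
control = grouped.get_group(False)
control_patients = list(control.index)

print(group_sizes)
print(control_patients)
```

Here, get_group returns the same sub-DataFrame that the loop would produce for the value False, so control_patients contains the two control patients, e and f.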

Iterating on the DataFrameGroupBy object is almost never necessary, because, thanks to method chaining, it is possible to calculate group-related properties directly. For example, we may want to calculate mean, max, or standard deviation for each group. All those operations that summarize the data in some way are called aggregations and can be performed using the agg method. The result of agg is another pd.DataFrame that relates the grouping variables and the result of the aggregation, as illustrated in the following code:

    # Passing the aggregation by name avoids the need to import NumPy
    df.groupby('drug_admst').agg('mean')
    #             dia_final  dia_initial  sys_final  sys_initial
    # drug_admst
    # False           79.50        82.00      125.5       133.50
    # True            82.75        84.25      121.5       122.75
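To answer the original question of how blood pressure changed in each group, one option is to compute the per-patient change first and then aggregate it. The sketch below assumes a pandas version that supports named aggregation (0.25 or later); the changes DataFrame and the output column names sys_mean, sys_std, and dia_mean are illustrative and not part of the original dataset:

```python
import pandas as pd

patients = ["a", "b", "c", "d", "e", "f"]
columns = {
    "sys_initial": [120, 126, 130, 115, 150, 117],
    "dia_initial": [75, 85, 90, 87, 90, 74],
    "sys_final": [115, 123, 130, 118, 130, 121],
    "dia_final": [70, 82, 92, 87, 85, 74],
    "drug_admst": [True, True, True, True, False, False],
}
df = pd.DataFrame(columns, index=patients)

# Per-patient change in blood pressure (negative means it went down)
changes = pd.DataFrame({
    "sys_change": df["sys_final"] - df["sys_initial"],
    "dia_change": df["dia_final"] - df["dia_initial"],
    "drug_admst": df["drug_admst"],
})

# Named aggregation: each keyword is an output column,
# each tuple is (input column, aggregation function)
summary = changes.groupby("drug_admst").agg(
    sys_mean=("sys_change", "mean"),
    sys_std=("sys_change", "std"),
    dia_mean=("dia_change", "mean"),
)
print(summary)
```

The group means of the changes equal the differences between the final and initial group means: for example, the mean systolic change is 125.5 - 133.5 = -8.0 for the control group and 121.5 - 122.75 = -1.25 for the treated group.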

It is also possible to perform processing on the DataFrame groups that does not represent a summarization. One common example of such an operation is filling in missing values. Those intermediate steps are called transforms.

We can illustrate this concept with an example. Let's assume that we have a few missing values in our dataset, and we want to replace those values with the average of the other values in the same group. This can be accomplished using a transform, as follows:

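    # Cast to float so that the column can hold a missing value (NaN)
    df['sys_initial'] = df['sys_initial'].astype(float)
    df.loc['a', 'sys_initial'] = None

    # Within each group, replace missing values with the group mean
    df.groupby('drug_admst').transform(lambda group: group.fillna(group.mean()))
    #    dia_final  dia_initial  sys_final  sys_initial
    # a         70           75        115   123.666667
    # b         82           85        123   126.000000
    # c         92           90        130   130.000000
    # d         87           87        118   115.000000
    # e         85           90        130   150.000000
    # f         74           74        121   117.000000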
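Unlike an aggregation, a transform returns one row per original row, so its result can be aligned with the original DataFrame. The following sketch shows an alternative way of writing the same fill, assuming the built-in "mean" transform is acceptable in place of the lambda; the names value_columns, group_means, and filled are introduced here for illustration, and the missing value for patient a is placed directly in the input data:

```python
import numpy as np
import pandas as pd

patients = ["a", "b", "c", "d", "e", "f"]
columns = {
    # Patient a's initial systolic reading is missing
    "sys_initial": [np.nan, 126, 130, 115, 150, 117],
    "dia_initial": [75, 85, 90, 87, 90, 74],
    "sys_final": [115, 123, 130, 118, 130, 121],
    "dia_final": [70, 82, 92, 87, 85, 74],
    "drug_admst": [True, True, True, True, False, False],
}
df = pd.DataFrame(columns, index=patients)

value_columns = ["sys_initial", "dia_initial", "sys_final", "dia_final"]

# One row per patient, holding the mean of that patient's group
# (missing values are skipped when the mean is computed)
group_means = df.groupby("drug_admst")[value_columns].transform("mean")

# fillna with a DataFrame aligns on index and columns,
# so only the missing entries are replaced
filled = df.copy()
filled[value_columns] = df[value_columns].fillna(group_means)
print(filled)
```

Because group_means has the same index as df, the filled values can be written back next to the drug_admst column, which the transform result on its own does not include.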