The groupby operation

Through a groupby function, a chain of actions gets executed: splitting, applying, and combining. Splitting segments each category from the desired grouping variable to perform further operations with it. Then, functions can be individually applied across each of these split groups. These functions might involve aggregation (sum across a group or mean across a group), transformation (filling NAs within a group or sorting), filtration (applying conditions within a group to drop rows), or even a combination of these three operations. Finally, the results obtained after the functions are applied across each of the split groups are combined together.

Let's use sample data from a fictitious global retailer. The data available as CSV is read as a pandas DataFrame:

sales_data = pd.read_csv("salesdata.csv", encoding = "ISO-8859-1")

The head function will give us a quick glimpse of the dataset we just imported:

sales_data.head()

The following will be the output:

Snapshot of sample sales data

While a sample of five rows has been shown in the preceding output, the data contains 51,290 rows and 15 columns.

Now, to understand how groupby splits the data, let's split it by the Category variable. The object created is not a DataFrame but rather an object type unique to the groupby function:

category_grouped = sales_data.groupby("Category")
type(category_grouped)
pandas.core.groupby. DataFrameGroupBy

The grouping object is referred to as the key. Here, Category is the key. The groups under the groupby object created in the previous step are shown here. You can see that each group in Category is mapped to the row-index labels covered by each category:

category_grouped.groups

The following will be the output:

Information for each group

The data has four quantitative variables: Quantity, Sales, Discount, and Profit. Using groupby, let's compute the sum of all these four variables across each Category. This is an application of aggregation with groupby:

sales_data.groupby("Category").sum()

The following will be the output:

Results of groupby and summing

Modify the code slightly, as shown here, to compute only the sum of sales. This involves subsetting the data right before applying groupby:

sales_data[["Category", "Sales"]].groupby("Category").sum()

The following will be the output:

groupby and sum across one variable

Aggregation need not be applied across only a quantitative variable. Now, using groupby, let's find the Country in which each category was first ordered:

sales_data[["Category", "Country"]].groupby("Category").first()

The following will be the output:

Using the aggregate first along with groupby

The size() function helps to find the number of occurrences of each Category. After computing size, let's explore the transformation ability of groupby by sorting the results:

sales_data.groupby("Category").size().sort_values(ascending = True)

The following will be the output:

Aggregation of size after sorting

The key or grouping object need not necessarily be an existing column; it can also be a function defining a grouping rule. For example, from OrderDate, we can extract the year and then groupby the year in which orders were placed. For this, the index is first set to OrderDate:

index_by_date = sales_data.set_index('OrderDate')
index_by_date.groupby(lambda OrderDate: OrderDate.split('-')[2]).sum()

The following will be the output:

Groupby to group variables created via a custom& function

It is also possible to group by more than one key. Here, let's group by ShipMode and Category to aggregate by the number of observations. The groupby function accepts multiple variables as a list:

sales_data.groupby(["ShipMode","Category"]).size()

The following will be the output:

Aggregate of size across two grouping variables

The get_group() attribute of the groupby function allows data to be filtered by one category out of all of the categories available in the group:

sales_data.groupby("ShipMode").get_group("Same Day")

The following will be the output:

The get_group attribute of groupby

The groupby object produced by the groupby function is iterable. Let's iterate over a simple groupby object:

for name, group in sales_data.groupby("ShipMode"):
print(name)
print(group.iloc[0:5,0:5])

The following will be the output:

Iterating through the groupby object

Instead of grouping by a column name, an index can also be used. When using an index, the level can be specified in place of the index name. Let's set Region as an index to demonstrate this:

region_index_df = sales_data.set_index("Region", drop = True)
region_index_df.groupby(level = 0).sum()

The following will be the output:

Grouping with index

groupby aggregations need not always occur along a column. If required, items can be grouped and aggregated along a row by changing the axis argument. The default setting of the axis argument is 0. Changing it to axis = 1 groups items along a row:

sales_data.groupby("ShipMode", axis = 0).size()

Table of Contents for The groupby operation

Create new playlist

Sign In

Sign Up

Table of Contents for
The groupby operation