The pandas library provides a flexible, high-performance "groupby" facility that enables you to slice, dice, and summarize datasets. This process follows a pattern known as split-apply-combine. In this pattern, data is first split into groups based on some criterion, such as index labels or the values in one or more columns. Each group is then processed with an aggregation or transformation function, which returns either a set of transformed values or a single aggregate summary for the group. pandas then combines all of these results and presents them in a single data structure.
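As a rough preview of the split-apply-combine pattern, the following minimal sketch groups a small DataFrame and computes one mean per group. The column names `sensor` and `reading` and the values are hypothetical, invented purely for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "sensor": ["a", "a", "b", "b", "b"],
    "reading": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Split: one group per distinct value in the "sensor" column
grouped = df.groupby("sensor")

# Apply and combine: compute the mean of each group; pandas assembles
# the per-group results into a single Series indexed by the group key
means = grouped["reading"].mean()
print(means)
```

Here the split step is `groupby`, the apply step is `mean`, and the combine step happens implicitly when pandas returns one Series holding a value per group.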
We will start by seeing how pandas is used to split data. We begin with a demonstration of how to group data using either categorical values in the columns of a DataFrame object or the levels in the index of a pandas object. Using the result of a grouping operation, we will then examine how to access the data in each group, as well as how to retrieve various basic statistical values for the groups.
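The sketch below hints at what grouping by an index level and inspecting the groups can look like. The level names `outer` and `inner`, the column name `value`, and the data are all assumptions made up for this example.

```python
import pandas as pd

# Hypothetical data with a two-level index, invented for illustration
df = pd.DataFrame(
    {"value": [10, 20, 30, 40]},
    index=pd.MultiIndex.from_tuples(
        [("x", 1), ("x", 2), ("y", 1), ("y", 2)],
        names=["outer", "inner"],
    ),
)

# Group by a level of the index rather than by a column
by_level = df.groupby(level="outer")

# Retrieve the rows that belong to a single group
group_x = by_level.get_group("x")

# Basic per-group statistics: number of rows and sum of values
sizes = by_level.size()
sums = by_level["value"].sum()

# Iterating over the grouping yields (key, sub-DataFrame) pairs
for key, frame in by_level:
    print(key, frame["value"].tolist())
```

Grouping by a column works the same way, with the column name passed in place of `level=`.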
The next section will focus on the apply portion of the pattern. This involves summarizing the groups with aggregation functions, transforming each row in a group into a new series of data, and filtering out entire groups based on various criteria so that they do not appear in the results.
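The three kinds of apply step can be sketched side by side as follows. This is only an illustrative outline: the column names `key` and `val`, the data, and the particular functions chosen (a sum, a mean-centering lambda, and a group-size test) are assumptions for the example.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    "key": ["a", "a", "b", "b", "c"],
    "val": [1.0, 3.0, 10.0, 20.0, 5.0],
})
grouped = df.groupby("key")

# Aggregation: one summary value per group
totals = grouped["val"].agg("sum")

# Transformation: one output value per input row, computed within
# each group (here, the distance of each value from its group mean)
centered = grouped["val"].transform(lambda s: s - s.mean())

# Filtering: drop whole groups that fail a test
# (here, groups with only a single row)
kept = grouped.filter(lambda g: len(g) > 1)

print(totals)
print(centered)
print(kept)
```

Note the difference in result shape: aggregation returns one row per group, transformation returns one row per original row, and filtering returns a subset of the original rows.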
The chapter will close with a look at performing discretization of data in pandas. Although not properly a grouping function of pandas, discretization allows data to be grouped into buckets, either based on ranges of values or so that the data is distributed evenly across a number of buckets.
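A minimal sketch of both kinds of bucketing, assuming `pd.cut` is used for range-based bins and `pd.qcut` for evenly populated bins. The values, bin edges, and labels are hypothetical.

```python
import pandas as pd

# Hypothetical ages, invented for illustration
ages = pd.Series([5, 17, 30, 45, 70, 88])

# Range-based buckets: explicit bin edges, right-inclusive by default,
# giving the intervals (0, 18], (18, 65], and (65, 100]
by_range = pd.cut(
    ages,
    bins=[0, 18, 65, 100],
    labels=["child", "adult", "senior"],
)

# Quantile-based buckets: bin edges are chosen so that each bucket
# receives roughly the same number of values
by_quantile = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(by_range.tolist())
print(by_quantile.value_counts())
```

With `cut` the bucket boundaries are fixed and the counts vary; with `qcut` the counts are balanced and the boundaries vary.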
Specifically, in this chapter, we will cover:
To utilize the examples in this chapter, we will need to include the following imports and settings:
In [1]: # import pandas and numpy
        import numpy as np
        import pandas as pd
        # Set some pandas options for controlling output
        pd.set_option('display.notebook_repr_html', False)
        pd.set_option('display.max_columns', 10)
        pd.set_option('display.max_rows', 10)
        # inline graphics
        %matplotlib inline