Chapter 9. Grouping and Aggregating Data

The pandas library provides a flexible, high-performance "groupby" facility that enables you to slice, dice, and summarize datasets. This process follows a pattern known as split-apply-combine. In this pattern, data is first categorized into groups based on criteria such as index values or the values within columns. Each group is then processed with an aggregation or transformation function, which returns either a set of transformed values or a single aggregate summary for each group. pandas then combines all of these results and presents them in a single data structure.
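As a minimal sketch of the split-apply-combine pattern, the following groups a small DataFrame by one column and computes a mean per group. The column names and values are hypothetical and were invented purely for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration
df = pd.DataFrame({
    "sensor": ["accel", "accel", "orientation", "orientation"],
    "reading": [1.0, 3.0, 10.0, 20.0],
})

# Split: partition the rows by the values in the 'sensor' column
grouped = df.groupby("sensor")

# Apply and combine: compute one mean per group; pandas assembles
# the per-group results into a single Series indexed by group key
means = grouped["reading"].mean()
print(means)
```

Note that the split step is lazy: `groupby` only records how the rows map to groups, and the work happens when a function such as `mean` is applied.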

We will start by seeing how pandas is used to split data. This begins with a demonstration of how to group data either by categorical values in the columns of a DataFrame object or by the levels in the index of a pandas object. Using the result of a grouping operation, we will then examine how to access the data in each group and how to retrieve various basic statistical values for the groups.
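The two ways of splitting, and a few ways of inspecting the result, might look like the following sketch. The data and the names `sensor`, `axis`, and `reading` are assumptions made for this example only.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration
df = pd.DataFrame({
    "sensor": ["accel", "accel", "gyro", "gyro"],
    "axis": ["X", "Y", "X", "Y"],
    "reading": [0.5, 1.5, 2.0, 4.0],
})

# Group by the categorical values in a column
by_column = df.groupby("sensor")
print(by_column.ngroups)   # number of groups found
print(by_column.size())    # number of rows in each group

# Retrieve the rows of a single group by its key
accel_rows = by_column.get_group("accel")

# Iterate over (key, sub-DataFrame) pairs
for name, group in by_column:
    print(name, len(group))

# Group by a level of a hierarchical index instead
indexed = df.set_index(["sensor", "axis"])
by_level = indexed.groupby(level="sensor")

# Basic descriptive statistics for each group
print(by_level["reading"].describe())
level_sums = by_level["reading"].sum()
```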

The next section will focus on the apply portion of the pattern. This involves summarizing the groups with aggregation functions, transforming each row in a group into a new series of data, and filtering out groups of data based on various criteria so that they are excluded from the results.
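The three kinds of apply step can be sketched side by side as follows. The data, the column names, and the specific functions chosen (sum and mean, mean-centering, and a minimum group size) are illustrative assumptions rather than the examples used later in the chapter.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration
df = pd.DataFrame({
    "label": ["a", "a", "b", "b", "c"],
    "value": [1.0, 3.0, 10.0, 30.0, 5.0],
})
grouped = df.groupby("label")

# Aggregation: one row of summary values per group
totals = grouped["value"].agg(["sum", "mean"])

# Transformation: one output value per input row, here each value
# minus the mean of its own group
centered = grouped["value"].transform(lambda s: s - s.mean())

# Filtering: drop whole groups that fail a test, here any group
# with fewer than two rows
kept = grouped.filter(lambda g: len(g) > 1)

print(totals)
print(centered)
print(kept)
```

Aggregation shrinks the result to one entry per group, transformation preserves the original shape and index, and filtering returns a subset of the original rows.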

The chapter will close with a look at performing discretization of data in pandas. Although not properly a grouping function of pandas, discretization allows data to be grouped into buckets, either based on ranges of values or so that the data is evenly distributed across a number of buckets.
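Both styles of discretization can be sketched with `pd.cut`, which bins by ranges of values, and `pd.qcut`, which bins by quantiles so that each bucket holds roughly the same number of items. The values and bin edges below are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample values, invented for this illustration
values = pd.Series([1, 5, 12, 18, 25, 33, 47, 60])

# Bin by explicit value ranges: (0, 20], (20, 40], (40, 60]
by_range = pd.cut(values, bins=[0, 20, 40, 60])
range_counts = by_range.value_counts().sort_index()

# Bin by quantiles so that each of the 4 buckets receives
# about the same number of values
by_quantile = pd.qcut(values, q=4)
quantile_counts = by_quantile.value_counts().sort_index()

print(range_counts)
print(quantile_counts)
```

With range-based bins the bucket sizes depend on how the data is distributed, whereas quantile-based bins trade fixed edges for balanced bucket sizes.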

Specifically, in this chapter, we will cover:

  • An overview of the split, apply, and combine pattern for data analysis
  • Grouping by column values
  • Accessing the results of grouping
  • Grouping using index levels
  • Applying functions to groups to create aggregate results
  • Transforming groups of data
  • Using filtering to selectively remove groups of data
  • The discretization of continuous data into bins

Setting up the IPython notebook

To utilize the examples in this chapter, we will need to include the following imports and settings:

In [1]:
   # import pandas and numpy
   import numpy as np
   import pandas as pd

   # Set some pandas options for controlling output
   pd.set_option('display.notebook_repr_html', False)
   pd.set_option('display.max_columns', 10)
   pd.set_option('display.max_rows', 10)

   # inline graphics
   %matplotlib inline