The split, apply, and combine (SAC) pattern

Many data analysis problems follow a processing pattern known as split-apply-combine. In this pattern, data is analyzed in three steps:

A data set is split into smaller pieces
Each of these pieces is operated on independently
All of the results are combined back together and presented as a single unit

The following diagram demonstrates a simple split-apply-combine process to sum groups of numbers:

Figure: The split, apply, and combine (SAC) pattern
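
The three steps can be sketched in plain Python before bringing in pandas. This is only an illustration of the pattern for summing groups of numbers: the `records` data and the variable names are made up for the example and do not come from the original diagram.

```python
from collections import defaultdict

# Hypothetical sample data: (group key, value) pairs.
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("c", 6)]

# Split: partition the values into one list per key.
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# Apply: sum each group independently of the others.
sums = {key: sum(values) for key, values in groups.items()}

# Combine: gather the per-group results into a single ordered result.
result = sorted(sums.items())
print(result)
```

Each stage only depends on the output of the previous one, which is what allows the apply step to run on every group independently.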

This process is very similar to MapReduce. In MapReduce, massive data sets that are too big for a single computer are divided into pieces and dispatched to many systems to spread the load (split). Each system then analyzes its piece of the data and calculates a result (apply). The results are then collected from each system and used for decision making (combine).

Split-apply-combine, as implemented in pandas, differs in the scope of the data and processing. In pandas, all of the data is held in the memory of a single system. This limits pandas to that system's processing capabilities, but it also makes analysis of data at that scale faster and more interactive.

Splitting in pandas is performed using the .groupby() method of a Series or DataFrame object. Given one or more index labels and/or column names, it divides the data based on the values present in the specified index labels and columns.
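
As a minimal sketch of the split step, the following groups a small DataFrame by one column, by two columns, and by an index level. The `sensor`, `axis`, and `reading` columns and their values are hypothetical sample data chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "sensor": ["accel", "accel", "gyro", "gyro", "accel"],
    "axis": ["X", "Y", "X", "Y", "X"],
    "reading": [0.1, 0.4, 1.2, 0.9, 0.3],
})

# Split on a single column: one group per distinct sensor value.
by_sensor = df.groupby("sensor")
print(by_sensor.ngroups)

# Split on several columns: one group per (sensor, axis) combination.
by_sensor_axis = df.groupby(["sensor", "axis"])
for key, group in by_sensor_axis:
    print(key, len(group))

# Split on an index level instead of a column.
indexed = df.set_index(["sensor", "axis"])
sizes = indexed.groupby(level="sensor").size()
print(sizes)
```

Calling .groupby() does not compute anything on its own. It returns a grouping object that describes how the rows are partitioned, and the work happens when an operation is applied to it.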

Once the data is split into groups, one or more of the following three broad classes of operations are applied:

Aggregation: This calculates a summary statistic, such as group means or counts of the items in each group
Transformation: This performs group- or item-specific calculations and returns a set of like-indexed results
Filtration: This removes entire groups of data based on a group-level computation
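
The three classes of operations can be illustrated on one small grouped object. This is a sketch on hypothetical data (the `group` and `value` columns are invented for the example), and the particular functions used here, a mean, a mean-centering lambda, and a minimum group size, are just one choice for each class.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c"],
    "value": [1.0, 3.0, 10.0, 20.0, 5.0],
})
grouped = df.groupby("group")["value"]

# Aggregation: one summary value per group, indexed by group key.
means = grouped.mean()
print(means)

# Transformation: one value per original row, indexed like the input.
centered = grouped.transform(lambda s: s - s.mean())
print(centered)

# Filtration: keep only the rows of groups that have at least two members.
kept = df.groupby("group").filter(lambda g: len(g) >= 2)
print(kept)
```

The shape of the result is what distinguishes the three: aggregation returns one row per group, transformation returns one row per input row, and filtration returns a subset of the original rows.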

The combine stage of the pattern is performed automatically by pandas, which will collect the results of the apply stage on all of the groups and construct a single merged result.

For more information on split-apply-combine, see the paper from the Journal of Statistical Software titled The Split-Apply-Combine Strategy for Data Analysis. This paper goes into more detail about the pattern, and although its examples use R, it is still a valuable read for someone learning pandas. You can get this paper at http://www.jstatsoft.org/v40/i01/paper.
