The split, apply, and combine (SAC) pattern

Many data analysis problems follow a processing pattern known as split-apply-combine. In this pattern, the analysis takes three steps:

  1. A data set is split into smaller pieces
  2. Each of these pieces is operated upon independently
  3. All of the results are combined back together and presented as a single unit

The following diagram demonstrates a simple split-apply-combine process to sum groups of numbers:

Figure: The split, apply, and combine (SAC) pattern
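
The three steps, applied to summing groups of numbers, can be sketched in plain Python without pandas. This is only an illustration of the pattern: the `records` data and its keys are invented for the example, not taken from the diagram.

```python
from collections import defaultdict

# Hypothetical sample data: (key, value) pairs to be summed per key.
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("c", 6)]

# Split: partition the values into one list per key.
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# Apply: sum each group independently of the others.
# Combine: collect the per-group sums into a single dictionary.
sums = {key: sum(values) for key, values in groups.items()}

print(sums)
```

Each group is processed without reference to any other group, which is what makes the apply step easy to reason about (and, in systems such as MapReduce, easy to distribute).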

This process is very similar to the concepts in MapReduce. In MapReduce, massive sets of data that are too big for a single computer are divided into manageable pieces and dispatched to many systems to spread the load (split). Each system then performs analysis on its piece of the data and calculates a result (apply). The results are then collected from each system and used for decision making (combine).

Split-apply-combine, as implemented in pandas, differs in the scope of the data and processing. In pandas, all of the data is in the memory of a single system. Because of this, the analysis is limited to that single system's processing capabilities, but it also makes analysis of data at that scale faster and more interactive.

Splitting in pandas is performed using the .groupby() method of a Series or DataFrame object. Given one or more index labels and/or column names, it divides the data based on the values present in the specified index labels and columns.
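
As a minimal sketch of the split step, the following groups a small DataFrame by one column, by two columns, and by an index level. The column names and values (`sensor`, `axis`, `reading`) are hypothetical sample data made up for this illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
df = pd.DataFrame({
    "sensor": ["accel", "accel", "orientation", "orientation"],
    "axis": ["X", "Y", "X", "Y"],
    "reading": [0.1, 0.4, 0.2, 0.5],
})

# Split on the values of a single column.
by_sensor = df.groupby("sensor")
print(by_sensor.ngroups)             # number of distinct groups
print(by_sensor.get_group("accel"))  # the rows that fell into one group

# Split on the combination of values in several columns.
by_both = df.groupby(["sensor", "axis"])
print(by_both.ngroups)

# Split on an index level instead of a column.
indexed = df.set_index(["sensor", "axis"])
by_level = indexed.groupby(level="sensor")
print(by_level.ngroups)
```

Note that .groupby() by itself only describes how the data is divided; nothing is computed until an operation is applied to the resulting groups.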

Once the data is split into groups, one or more of the following three broad classes of operations is applied:

  • Aggregation: This calculates a summary statistic, such as group means or counts of the items in each group
  • Transformation: This performs group- or item-specific calculations and returns a set of like-indexed results
  • Filtration: This removes entire groups of data based on a group-level computation
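
The three classes of operations can be illustrated with one short sketch each. The DataFrame below is hypothetical sample data, and the particular computations (mean, centering on the group mean, keeping groups with more than one row) are arbitrary choices for the example.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c"],
    "value": [1, 3, 10, 20, 100],
})
grouped = df.groupby("group")["value"]

# Aggregation: one summary value per group.
means = grouped.mean()
counts = grouped.count()

# Transformation: a per-group calculation whose result has the
# same index as the original data (one output row per input row).
centered = grouped.transform(lambda s: s - s.mean())

# Filtration: keep or drop whole groups based on a group-level test.
# Here, groups with only a single row are removed.
kept = df.groupby("group").filter(lambda g: len(g) > 1)

print(means)
print(centered)
print(kept)
```

Aggregation returns one row per group, transformation returns one row per original row, and filtration returns a subset of the original rows.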

The combine stage of the pattern is performed automatically by pandas, which will collect the results of the apply stage on all of the groups and construct a single merged result.
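
To make the automatic combine stage concrete, the sketch below performs the apply and combine steps by hand and compares the outcome with what pandas builds on its own. The data is hypothetical, and the manual loop is for illustration only; it is not a description of how pandas works internally.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
df = pd.DataFrame({
    "group": ["a", "b", "a", "b"],
    "value": [1, 2, 3, 4],
})

# Manual version: iterate over the groups, apply a sum to each
# piece, then combine the per-group results into one Series.
pieces = {key: part["value"].sum() for key, part in df.groupby("group")}
manual = pd.Series(pieces)

# Automatic version: pandas applies the sum to every group and
# combines the results into a single Series indexed by group key.
automatic = df.groupby("group")["value"].sum()

print(manual)
print(automatic)
```

Both versions hold the same per-group sums; the automatic result additionally carries the grouping column's name on its index.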

For more information on split-apply-combine, see the paper The Split-Apply-Combine Strategy for Data Analysis, published in the Journal of Statistical Software. The paper goes into more detail on the pattern, and although its examples use R, it is still a valuable read for someone learning pandas. You can get this paper at http://www.jstatsoft.org/v40/i01/paper.
