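Many data analysis problems follow a processing pattern known as split-apply-combine. In this pattern, data is analyzed in three steps: the data set is split into smaller pieces, an operation is applied to each piece independently, and the results are combined into a single result.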
The following diagram demonstrates a simple split-apply-combine process to sum groups of numbers:
This process is very similar to the concepts in MapReduce. In MapReduce, data sets that are too big for a single computer are divided into manageable pieces and dispatched to many systems to spread the load (split). Each system then performs analysis on its piece of the data and calculates a result (apply). The results are then collected from each system and used for decision making (combine).
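The three stages can be sketched on a single machine in plain Python, without pandas or a MapReduce framework. This is only an illustration of the pattern; the `records` data and all variable names are invented for the example.

```python
from collections import defaultdict

# Hypothetical (key, value) records, invented for illustration.
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("c", 6)]

# Split: partition the values by key.
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# Apply: compute a result for each group independently of the others.
partial_sums = {key: sum(values) for key, values in groups.items()}

# Combine: collect the per-group results into one sorted list.
combined = sorted(partial_sums.items())
print(combined)  # [('a', 4), ('b', 7), ('c', 10)]
```

In a distributed setting, the apply step would run on separate systems; here every step runs in one process.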
Split-apply-combine, as implemented in pandas, differs in the scope of the data and processing. In pandas, all of the data resides in the memory of a single system. This limits the analysis to that system's processing capabilities, but it also makes analysis of data at that scale faster and more interactive.
Splitting in pandas is performed using the .groupby() method of a Series or DataFrame object, which, given one or more index labels and/or column names, will divide the data based on the values present in the specified index labels and columns.
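As a minimal sketch of the split stage, the following groups a small DataFrame first by a column and then by an index level. The data and the column names `key` and `value` are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "key": ["a", "b", "a", "c", "b", "c"],
    "value": [1, 2, 3, 4, 5, 6],
})

# Split on the values of the "key" column. The result is a
# DataFrameGroupBy object that describes the groups.
grouped = df.groupby("key")

# .groups maps each group name to the row labels that belong to it.
for name, labels in grouped.groups.items():
    print(name, list(labels))

# Iterating over the object yields (group name, sub-DataFrame) pairs.
for name, group in grouped:
    print(name, group["value"].tolist())

# Splitting on an index level instead of a column.
by_index = df.set_index("key").groupby(level="key")
print(by_index.ngroups)
```

Passing a list of column names, such as `df.groupby(["col1", "col2"])`, would split on each distinct combination of values in those columns.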
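Once the data is split into groups, one or more of three broad classes of operations is applied: aggregation, which computes a summary value for each group; transformation, which performs group-specific calculations and returns a result indexed like the input; and filtration, which discards whole groups based on a group-level computation.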
The combine stage of the pattern is performed automatically by pandas, which will collect the results of the apply stage on all of the groups and construct a single merged result.
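The sketch below shows one apply operation of each kind usually distinguished in pandas (aggregation, transformation, and filtration) and the combined result pandas builds in each case. The data, the column names, and the threshold of 5 are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "key": ["a", "b", "a", "c", "b", "c"],
    "value": [1, 2, 3, 4, 5, 6],
})
grouped = df.groupby("key")["value"]

# Aggregation: one summary value per group, combined into a Series
# indexed by the group keys.
sums = grouped.sum()

# Transformation: each value minus its group's mean, combined into a
# Series with the same index as the original data.
centered = grouped.transform(lambda s: s - s.mean())

# Filtration: keep only the rows of groups whose sum exceeds 5,
# combined into a DataFrame that preserves the original row order.
big = df.groupby("key").filter(lambda g: g["value"].sum() > 5)

print(sums)
print(centered)
print(big)
```

In all three cases no explicit merge step is written; the shape of the combined result follows from the kind of operation applied.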
For more information on split-apply-combine, see the paper from the Journal of Statistical Software titled The Split-Apply-Combine Strategy for Data Analysis. The paper covers the pattern in more detail, and although its examples use R, it is still a valuable read for someone learning pandas. You can get this paper at http://www.jstatsoft.org/v40/i01/paper.