Our examination of splitting a pandas objects will be broken into several sections. We will first load data to use in the examples. Then, we will look at creating a grouping based on columns, examining properties of a grouping in the process. Next, will be an examination of accessing the results of the grouping. The last subsection will examine grouping using index labels, instead of content in columns.
pandas' Series
and DataFrame
objects are split into groups using the .groupby()
method. To demonstrate, we will use a variant of the accelerometer sensor data introduced in the previous chapter. This version of the data adds another column (sensor) that can be used to specify multiple sensors:
In [2]: # load the sensors data sensors = pd.read_csv("data/sensors.csv") sensors Out[2]: interval sensor axis reading 0 0 accel Z 0.0 1 0 accel Y 0.5 2 0 accel X 1.0 3 1 accel Z 0.1 4 1 accel Y 0.4 .. ... ... ... ... 19 2 orientation Y 0.3 20 2 orientation X 0.2 21 3 orientation Z 0.0 22 3 orientation Y 0.4 23 3 orientation X 0.3 [24 rows x 4 columns]
The sensor data consists of three categorical variables (sensor
, interval
, and axis
) and one continuous variable (reading). In pandas, it is possible to group by any single categorical variable by passing its name to .groupby()
. The following groups by the sensor
column:
In [3]: # group this data by the sensor column / variable # returns a DataFrameGroupBy object grouped = sensors.groupby('sensor') grouped Out[3]: <pandas.core.groupby.DataFrameGroupBy object at 0x106915f90>
The result of calling .groupby()
on DataFrame
is not the actual grouped data, but a DataFrameGroupBy
object (SeriesGroupBy
when grouping on Series
). The actual process of grouping is a deferred/lazy process in pandas, and at this point, the grouping has not actually been performed. This object represents an interim description of the grouping to be performed. This allows pandas to first validate that the grouping description provided to it is valid, relative to the data before starting processing.
The .ngroups
property will retrieve the number of groups that will be formed in the result:
In [4]: # get the number of groups that this will create grouped.ngroups Out[4]: 2
The .groups
property will return a Python dictionary whose keys represent the names of each group (if multiple columns are specified, it is a tuple). The values in the dictionary are an array of the index labels contained within each respective group:
In [5]: # what are the groups that were found? grouped.groups Out[5]: {'accel': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 'orientation': [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]}
The grouped
variable can be thought of as a collection of named groups. We will use these properties, and the following function, to examine many of the results of groupings:
In [6]: # a helper function to print the contents of the groups def print_groups (groupobject): # loop over all groups, printing the group name # and group details for name, group in groupobject: print (name) print (group)
Using this method, we can see the results of the grouping we made a few steps back:
In [7]: # examine the content of the groups we created print_groups(grouped) Out[7]: accel interval sensor axis reading 0 0 accel Z 0.0 1 0 accel Y 0.5 2 0 accel X 1.0 3 1 accel Z 0.1 4 1 accel Y 0.4 .. ... ... ... ... 7 2 accel Y 0.3 8 2 accel X 0.8 9 3 accel Z 0.3 10 3 accel Y 0.2 11 3 accel X 0.7 [12 rows x 4 columns] orientation interval sensor axis reading 12 0 orientation Z 0.0 13 0 orientation Y 0.1 14 0 orientation X 0.0 15 1 orientation Z 0.0 16 1 orientation Y 0.2 .. ... ... ... ... 19 2 orientation Y 0.3 20 2 orientation X 0.2 21 3 orientation Z 0.0 22 3 orientation Y 0.4 23 3 orientation X 0.3 [12 rows x 4 columns]
The examination of these results gives us some insight into how pandas has performed the split that we specified. A group has been created for each distinct value in the sensors
column and has been named with that value. The group contains a DataFrame object whose content is the rows where the sensor value matched the name of the group.
We can use the .size()
method to get a summary of the size of all the groups:
In [8]: # get how many items are in each group grouped.size() Out[8]: sensor accel 12 orientation 12 dtype: int64
The .count()
method will return the number of items in each column of every group:
In [9]: # get the count of items in each column of each group grouped.count() Out[9]: interval axis reading sensor accel 12 12 12 orientation 12 12 12
Any specific group can be retrieved using the .get_group()
property. The following retrieves the accel
group:
In [10]: # get the data in one specific group grouped.get_group('accel') Out[10]: interval sensor axis reading 0 0 accel Z 0.0 1 0 accel Y 0.5 2 0 accel X 1.0 3 1 accel Z 0.1 4 1 accel Y 0.4 .. ... ... ... ... 7 2 accel Y 0.3 8 2 accel X 0.8 9 3 accel Z 0.3 10 3 accel Y 0.2 11 3 accel X 0.7 [12 rows x 4 columns]
The .head()
and .tail()
methods can be used to return the specified number of items in each group. The following code retrieved the first three rows in each group. There are six rows returned, as we requested the first three rows in each group, and there are two groups:
In [11]: # get the first three items in each group grouped.head(3) Out[11]: interval sensor axis reading 0 0 accel Z 0.0 1 0 accel Y 0.5 2 0 accel X 1.0 12 0 orientation Z 0.0 13 0 orientation Y 0.1 14 0 orientation X 0.0
The .nth()
method will return the n-th item in each group. The following demonstrates how to use this to retrieve the first, the second, and then the third row of each group:
In [12]: # get the first item in each group grouped.nth(0) Out[12]: interval sensor axis reading sensor accel 0 accel Z 0 orientation 0 orientation Z 0 In [13]: # get the 2nd item in each group grouped.nth(1) Out[13]: interval sensor axis reading sensor accel 0 accel Y 0.5 orientation 0 orientation Y 0.1 In [14]: # and so on... grouped.nth(2) Out[14]: interval sensor axis reading sensor accel 0 accel X 1 orientation 0 orientation X 0
Grouping can also be performed on multiple columns by passing a list of column names. The following groups the data by both sensor
and axis
variables:
In [15]: # group by both sensor and axis values mcg = sensors.groupby(['sensor', 'axis']) print_groups(mcg) Out[15]: ('accel', 'X') interval sensor axis reading 2 0 accel X 1.0 5 1 accel X 0.9 8 2 accel X 0.8 11 3 accel X 0.7 ('accel', 'Y') interval sensor axis reading 1 0 accel Y 0.5 4 1 accel Y 0.4 7 2 accel Y 0.3 10 3 accel Y 0.2 ('accel', 'Z') interval sensor axis reading 0 0 accel Z 0.0 3 1 accel Z 0.1 6 2 accel Z 0.2 9 3 accel Z 0.3 ('orientation', 'X') interval sensor axis reading 14 0 orientation X 0.0 17 1 orientation X 0.1 20 2 orientation X 0.2 23 3 orientation X 0.3 ('orientation', 'Y') interval sensor axis reading 13 0 orientation Y 0.1 16 1 orientation Y 0.2 19 2 orientation Y 0.3 22 3 orientation Y 0.4 ('orientation', 'Z') interval sensor axis reading 12 0 orientation Z 0 15 1 orientation Z 0 18 2 orientation Z 0 21 3 orientation Z 0
Since multiple columns were specified, the name of each group is now a tuple with the value from both sensor
and axis
, which represents the group.
The .describe()
method can be used to return descriptive statistics for each group:
In [16]: # get descriptive statistics for each mcg.describe() Out[16]: interval reading sensor axis accel X count 4.000000 4.000000 mean 1.500000 0.850000 std 1.290994 0.129099 min 0.000000 0.700000 25% 0.750000 0.775000 ... ... ... orientation Z min 0.000000 0.000000 25% 0.750000 0.000000 50% 1.500000 0.000000 75% 2.250000 0.000000 max 3.000000 0.000000 [48 rows x 2 columns]
By default, groups are sorted by their group name in an ascending order. This dataset already has them in an ascending order, but if you want to prevent sorting during grouping, use the sort=False
option.
The examples up to this point, have used DataFrame
without any specific indexing (just the default sequential numerical index). This type of data would actually be very well suited for a hierarchical index. This can then be used directly to group the data based upon index label(s).
To demonstrate, the following script creates a new DataFrame
object with MultiIndex
, consisting of the original sensor and interval columns:
In [17]: # make a copy of the data and reindex the copy mi = sensors.copy() mi = mi.set_index(['sensor', 'axis']) mi Out[17]: interval reading sensor axis accel Z 0 0.0 Y 0 0.5 X 0 1.0 Z 1 0.1 Y 1 0.4 ... ... ... orientation Y 2 0.3 X 2 0.2 Z 3 0.0 Y 3 0.4 X 3 0.3 [24 rows x 2 columns]
Grouping can now be performed using the levels of the hierarchical index. The following groups by index level 0
(the sensor names):
In [18]: # group by the first level of the index mig_l1 = mi.groupby(level=0) print_groups(mig_l1) Out[18]: accel interval reading sensor axis accel Z 0 0.0 Y 0 0.5 X 0 1.0 Z 1 0.1 Y 1 0.4 ... ... ... Y 2 0.3 X 2 0.8 Z 3 0.3 Y 3 0.2 X 3 0.7 [12 rows x 2 columns] orientation interval reading sensor axis orientation Z 0 0.0 Y 0 0.1 X 0 0.0 Z 1 0.0 Y 1 0.2 ... ... ... Y 2 0.3 X 2 0.2 Z 3 0.0 Y 3 0.4 X 3 0.3 [12 rows x 2 columns]
Grouping by multiple levels can be performed by passing the levels in a list to .groupby()
. Also, if MultiIndex
has names specified for the levels, then these names can be used instead of integers. The following code groups the two levels of MultiIndex
by their names:
In [19]: # group by multiple levels of the index mig_l12 = mi.groupby(level=['sensor', 'axis']) print_groups(mig_l12) Out[19]: ('accel', 'X') interval reading sensor axis accel X 0 1.0 X 1 0.9 X 2 0.8 X 3 0.7 ('accel', 'Y') interval reading sensor axis accel Y 0 0.5 Y 1 0.4 Y 2 0.3 Y 3 0.2 ('accel', 'Z') interval reading sensor axis accel Z 0 0.0 Z 1 0.1 Z 2 0.2 Z 3 0.3 ('orientation', 'X') interval reading sensor axis orientation X 0 0.0 X 1 0.1 X 2 0.2 X 3 0.3 ('orientation', 'Y') interval reading sensor axis orientation Y 0 0.1 Y 1 0.2 Y 2 0.3 Y 3 0.4 ('orientation', 'Z') interval reading sensor axis orientation Z 0 0 Z 1 0 Z 2 0 Z 3 0
3.133.132.99