Split

Our examination of splitting a pandas objects will be broken into several sections. We will first load data to use in the examples. Then, we will look at creating a grouping based on columns, examining properties of a grouping in the process. Next, will be an examination of accessing the results of the grouping. The last subsection will examine grouping using index labels, instead of content in columns.

Data for the examples

pandas' Series and DataFrame objects are split into groups using the .groupby() method. To demonstrate, we will use a variant of the accelerometer sensor data introduced in the previous chapter. This version of the data adds another column (sensor) that can be used to specify multiple sensors:

In [2]:
   # load the sensors data
   sensors = pd.read_csv("data/sensors.csv")
   sensors

Out[2]:
       interval       sensor axis  reading
   0          0        accel    Z      0.0
   1          0        accel    Y      0.5
   2          0        accel    X      1.0
   3          1        accel    Z      0.1
   4          1        accel    Y      0.4
   ..       ...          ...  ...      ...
   19         2  orientation    Y      0.3
   20         2  orientation    X      0.2
   21         3  orientation    Z      0.0
   22         3  orientation    Y      0.4
   23         3  orientation    X      0.3

   [24 rows x 4 columns]

Grouping by a single column's values

The sensor data consists of three categorical variables (sensor, interval, and axis) and one continuous variable (reading). In pandas, it is possible to group by any single categorical variable by passing its name to .groupby(). The following groups by the sensor column:

In [3]:
   # group this data by the sensor column / variable
   # returns a DataFrameGroupBy object
   grouped = sensors.groupby('sensor')
   grouped

Out[3]:
   <pandas.core.groupby.DataFrameGroupBy object at 0x106915f90>

The result of calling .groupby() on DataFrame is not the actual grouped data, but a DataFrameGroupBy object (SeriesGroupBy when grouping on Series). The actual process of grouping is a deferred/lazy process in pandas, and at this point, the grouping has not actually been performed. This object represents an interim description of the grouping to be performed. This allows pandas to first validate that the grouping description provided to it is valid, relative to the data before starting processing.

The .ngroups property will retrieve the number of groups that will be formed in the result:

In [4]:
   # get the number of groups that this will create
   grouped.ngroups

Out[4]:
   2

The .groups property will return a Python dictionary whose keys represent the names of each group (if multiple columns are specified, it is a tuple). The values in the dictionary are an array of the index labels contained within each respective group:

In [5]:
   # what are the groups that were found?
   grouped.groups

Out[5]:
   {'accel': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'orientation': [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]}

Accessing the results of grouping

The grouped variable can be thought of as a collection of named groups. We will use these properties, and the following function, to examine many of the results of groupings:

In [6]:
   # a helper function to print the contents of the groups
   def print_groups (groupobject):
       # loop over all groups, printing the group name 
       # and group details
       for name, group in groupobject:
           print (name)
           print (group)

Using this method, we can see the results of the grouping we made a few steps back:

In [7]:
   # examine the content of the groups we created
   print_groups(grouped)

Out[7]:
   accel
       interval sensor axis  reading
   0          0  accel    Z      0.0
   1          0  accel    Y      0.5
   2          0  accel    X      1.0
   3          1  accel    Z      0.1
   4          1  accel    Y      0.4
   ..       ...    ...  ...      ...
   7          2  accel    Y      0.3
   8          2  accel    X      0.8
   9          3  accel    Z      0.3
   10         3  accel    Y      0.2
   11         3  accel    X      0.7

   [12 rows x 4 columns]
   orientation
       interval       sensor axis  reading
   12         0  orientation    Z      0.0
   13         0  orientation    Y      0.1
   14         0  orientation    X      0.0
   15         1  orientation    Z      0.0
   16         1  orientation    Y      0.2
   ..       ...          ...  ...      ...
   19         2  orientation    Y      0.3
   20         2  orientation    X      0.2
   21         3  orientation    Z      0.0
   22         3  orientation    Y      0.4
   23         3  orientation    X      0.3

   [12 rows x 4 columns]

The examination of these results gives us some insight into how pandas has performed the split that we specified. A group has been created for each distinct value in the sensors column and has been named with that value. The group contains a DataFrame object whose content is the rows where the sensor value matched the name of the group.

We can use the .size() method to get a summary of the size of all the groups:

In [8]:
   # get how many items are in each group
   grouped.size()

Out[8]:
   sensor
   accel          12
   orientation    12
   dtype: int64

The .count() method will return the number of items in each column of every group:

In [9]:
   # get the count of items in each column of each group
   grouped.count()

Out[9]:
                interval  axis  reading
   sensor                              
   accel              12    12       12
   orientation        12    12       12

Any specific group can be retrieved using the .get_group() property. The following retrieves the accel group:

In [10]:
   # get the data in one specific group
   grouped.get_group('accel')

Out[10]:
       interval sensor axis  reading
   0          0  accel    Z      0.0
   1          0  accel    Y      0.5
   2          0  accel    X      1.0
   3          1  accel    Z      0.1
   4          1  accel    Y      0.4
   ..       ...    ...  ...      ...
   7          2  accel    Y      0.3
   8          2  accel    X      0.8
   9          3  accel    Z      0.3
   10         3  accel    Y      0.2
   11         3  accel    X      0.7

   [12 rows x 4 columns]

The .head() and .tail() methods can be used to return the specified number of items in each group. The following code retrieved the first three rows in each group. There are six rows returned, as we requested the first three rows in each group, and there are two groups:

In [11]:
   # get the first three items in each group
   grouped.head(3)

Out[11]:
       interval       sensor axis  reading
   0          0        accel    Z      0.0
   1          0        accel    Y      0.5
   2          0        accel    X      1.0
   12         0  orientation    Z      0.0
   13         0  orientation    Y      0.1
   14         0  orientation    X      0.0

The .nth() method will return the n-th item in each group. The following demonstrates how to use this to retrieve the first, the second, and then the third row of each group:

In [12]:
   # get the first item in each group
   grouped.nth(0)

Out[12]:
                interval       sensor axis  reading
   sensor                                          
   accel               0        accel    Z        0
   orientation         0  orientation    Z        0

In [13]:
   # get the 2nd item in each group
   grouped.nth(1)

Out[13]:
                interval       sensor axis  reading
   sensor                                          
   accel               0        accel    Y      0.5
   orientation         0  orientation    Y      0.1

In [14]:
   # and so on...
   grouped.nth(2)

Out[14]:
                interval       sensor axis  reading
   sensor                                          
   accel               0        accel    X        1
   orientation         0  orientation    X        0

Grouping can also be performed on multiple columns by passing a list of column names. The following groups the data by both sensor and axis variables:

In [15]:
   # group by both sensor and axis values
   mcg = sensors.groupby(['sensor', 'axis'])
   print_groups(mcg)

Out[15]:
   ('accel', 'X')
       interval sensor axis  reading
   2          0  accel    X      1.0
   5          1  accel    X      0.9
   8          2  accel    X      0.8
   11         3  accel    X      0.7
   ('accel', 'Y')
       interval sensor axis  reading
   1          0  accel    Y      0.5
   4          1  accel    Y      0.4
   7          2  accel    Y      0.3
   10         3  accel    Y      0.2
   ('accel', 'Z')
      interval sensor axis  reading
   0         0  accel    Z      0.0
   3         1  accel    Z      0.1
   6         2  accel    Z      0.2
   9         3  accel    Z      0.3
   ('orientation', 'X')
       interval       sensor axis  reading
   14         0  orientation    X      0.0
   17         1  orientation    X      0.1
   20         2  orientation    X      0.2
   23         3  orientation    X      0.3
   ('orientation', 'Y')
       interval       sensor axis  reading
   13         0  orientation    Y      0.1
   16         1  orientation    Y      0.2
   19         2  orientation    Y      0.3
   22         3  orientation    Y      0.4
   ('orientation', 'Z')
       interval       sensor axis  reading
   12         0  orientation    Z        0
   15         1  orientation    Z        0
   18         2  orientation    Z        0
   21         3  orientation    Z        0

Since multiple columns were specified, the name of each group is now a tuple with the value from both sensor and axis, which represents the group.

The .describe() method can be used to return descriptive statistics for each group:

In [16]:
   # get descriptive statistics for each
   mcg.describe()

Out[16]:
                           interval   reading
   sensor      axis                          
   accel       X    count  4.000000  4.000000
                    mean   1.500000  0.850000
                    std    1.290994  0.129099
                    min    0.000000  0.700000
                    25%    0.750000  0.775000
   ...                          ...       ...
   orientation Z    min    0.000000  0.000000
                    25%    0.750000  0.000000
                    50%    1.500000  0.000000
                    75%    2.250000  0.000000
                    max    3.000000  0.000000

   [48 rows x 2 columns]

By default, groups are sorted by their group name in an ascending order. This dataset already has them in an ascending order, but if you want to prevent sorting during grouping, use the sort=False option.

Grouping using index levels

The examples up to this point, have used DataFrame without any specific indexing (just the default sequential numerical index). This type of data would actually be very well suited for a hierarchical index. This can then be used directly to group the data based upon index label(s).

To demonstrate, the following script creates a new DataFrame object with MultiIndex, consisting of the original sensor and interval columns:

In [17]:
   # make a copy of the data and reindex the copy
   mi = sensors.copy()
   mi = mi.set_index(['sensor', 'axis'])
   mi

Out[17]:
                     interval  reading
   sensor      axis                   
   accel       Z            0      0.0
               Y            0      0.5
               X            0      1.0
               Z            1      0.1
               Y            1      0.4
   ...                    ...      ...
   orientation Y            2      0.3
               X            2      0.2
               Z            3      0.0
               Y            3      0.4
               X            3      0.3

   [24 rows x 2 columns]

Grouping can now be performed using the levels of the hierarchical index. The following groups by index level 0 (the sensor names):

In [18]:
   # group by the first level of the index 
   mig_l1 = mi.groupby(level=0)
   print_groups(mig_l1)

Out[18]:
   accel
                interval  reading
   sensor axis                   
   accel  Z            0      0.0
          Y            0      0.5
          X            0      1.0
          Z            1      0.1
          Y            1      0.4
   ...               ...      ...
          Y            2      0.3
          X            2      0.8
          Z            3      0.3
          Y            3      0.2
          X            3      0.7

   [12 rows x 2 columns]
   orientation
                     interval  reading
   sensor      axis                   
   orientation Z            0      0.0
               Y            0      0.1
               X            0      0.0
               Z            1      0.0
               Y            1      0.2
   ...                    ...      ...
               Y            2      0.3
               X            2      0.2
               Z            3      0.0
               Y            3      0.4
               X            3      0.3

   [12 rows x 2 columns]

Grouping by multiple levels can be performed by passing the levels in a list to .groupby(). Also, if MultiIndex has names specified for the levels, then these names can be used instead of integers. The following code groups the two levels of MultiIndex by their names:

In [19]:
   # group by multiple levels of the index
   mig_l12 = mi.groupby(level=['sensor', 'axis'])
   print_groups(mig_l12)

Out[19]:
   ('accel', 'X')
                interval  reading
   sensor axis                   
   accel  X            0      1.0
          X            1      0.9
          X            2      0.8
          X            3      0.7
   ('accel', 'Y')
                interval  reading
   sensor axis                   
   accel  Y            0      0.5
          Y            1      0.4
          Y            2      0.3
          Y            3      0.2
   ('accel', 'Z')
                interval  reading
   sensor axis                   
   accel  Z            0      0.0
          Z            1      0.1
          Z            2      0.2
          Z            3      0.3
   ('orientation', 'X')
                     interval  reading
   sensor      axis                   
   orientation X            0      0.0
               X            1      0.1
               X            2      0.2
               X            3      0.3
   ('orientation', 'Y')
                     interval  reading
   sensor      axis                   
   orientation Y            0      0.1
               Y            1      0.2
               Y            2      0.3
               Y            3      0.4
   ('orientation', 'Z')
                     interval  reading
   sensor      axis                   
   orientation Z            0        0
               Z            1        0
               Z            2        0
               Z            3        0
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.133.132.99