Data aggregation with pandas DataFrames

Data aggregation is a term known from relational databases. In a database query, we can group data by the value in a column or columns. We can then perform various operations on each of these groups. The pandas DataFrame has similar capabilities. We will generate data held in a Python dict and then use this data to create a pandas DataFrame. We will then practice the pandas aggregation features:

  1. Seed the NumPy random generator to make sure that the generated data will not differ between repeated program runs. The data will have four columns:
    • Weather (a string)
    • Food (also a string)
    • Price (a random float)
    • Number (a random integer between one and nine)

    The use case is that we have the results for some sort of a consumer-purchase research, combined with weather and market pricing, where we calculate the average of prices and keep a track of the sample size and parameters:

    import pandas as pd
    from numpy.random import seed
    from numpy.random import rand
    from numpy.random import random_integers
    import numpy as np
    
    seed(42)
    
    df = pd.DataFrame({'Weather' : ['cold', 'hot', 'cold', 'hot',
       'cold', 'hot', 'cold'],
       'Food' : ['soup', 'soup', 'icecream', 'chocolate',
       'icecream', 'icecream', 'soup'],
       'Price' : 10 * rand(7), 'Number' : random_integers(1, 9, size=(7,))})
    
    print df

    You should get an output similar to the following:

            Food  Number     Price Weather
    0       soup       8  3.745401    cold
    1       soup       5  9.507143     hot
    2   icecream       4  7.319939    cold
    3  chocolate       8  5.986585     hot
    4   icecream       8  1.560186    cold
    5   icecream       3  1.559945     hot
    6       soup       6  0.580836    cold
    
    [7 rows x 4 columns]
    

    Please note that the column labels come from the lexically ordered keys of the Python dict.

    Note

    Lexical or lexicographical order is based on the alphabetic order of characters in a string.

  2. Group the data by the Weather column and then iterate through the groups as follows:
    weather_group = df.groupby('Weather')
    
    i = 0
    
    for name, group in weather_group:
       i = i + 1
       print "Group", i, name
       print group

    We have two types of weather, hot and cold, so we get two groups:

    Group 1 cold
           Food  Number     Price Weather
    0      soup       8  3.745401    cold
    2  icecream       4  7.319939    cold
    4  icecream       8  1.560186    cold
    6      soup       6  0.580836    cold
    
    [4 rows x 4 columns]
    Group 2 hot
            Food  Number     Price Weather
    1       soup       5  9.507143     hot
    3  chocolate       8  5.986585     hot
    5   icecream       3  1.559945     hot
    
    [3 rows x 4 columns]
    
  3. The weather_group variable is a special pandas object that we get as a result of the groupby() method. This object has aggregation methods, which are demonstrated as follows:
    print "Weather group first
    ", weather_group.first()
    print "Weather group last
    ", weather_group.last()
    print "Weather group mean
    ", weather_group.mean()

    The preceding code snippet prints the first row, last row, and mean of each group:

    Weather group first
             Food  Number     Price
    Weather                        
    cold     soup       8  3.745401
    hot      soup       5  9.507143
    
    [2 rows x 3 columns]
    Weather group last
                 Food  Number     Price
    Weather                            
    cold         soup       6  0.580836
    hot      icecream       3  1.559945
    
    [2 rows x 3 columns]
    Weather group mean
               Number     Price
    Weather                    
    cold     6.500000  3.301591
    hot      5.333333  5.684558
    
    [2 rows x 2 columns]
    
  4. Just as in a database query, we are allowed to group on multiple columns. The groups attribute will then tell us the groups that are formed and the rows in each group:
    wf_group = df.groupby(['Weather', 'Food'])
    print "WF Groups", wf_group.groups

    For each possible combination of weather and food values, a new group is created. The membership of each row is indicated by their index values as follows:

    WF Groups {('hot', 'chocolate'): [3], ('cold', 'icecream'): [2, 4], ('hot', 'icecream'): [5], ('hot', 'soup'): [1], ('cold', 'soup'): [0, 6]}
    
  5. Apply a list of NumPy functions on groups with the agg() method:
    print "WF Aggregated
    ", wf_group.agg([np.mean, np.median])

    Obviously, we could apply even more functions, but it would look messier than the following output:

    WF Aggregated
                       Number             Price
                         mean  median      mean    median
    Weather Food
    cold    icecream        6       6  4.440063  4.440063
            soup            7       7  2.163119  2.163119
    hot     chocolate       8       8  5.986585  5.986585
            icecream        3       3  1.559945  1.559945
            soup            5       5  9.507143  9.507143
    
    [5 rows x 4 columns]
    

The full data aggregation example code is in the data_aggregation.py file, which can be found in this book's code bundle:

import pandas as pd
from numpy.random import seed
from numpy.random import rand
from numpy.random import random_integers
import numpy as np

seed(42)

df = pd.DataFrame({'Weather' : ['cold', 'hot', 'cold', 'hot',
   'cold', 'hot', 'cold'],
   'Food' : ['soup', 'soup', 'icecream', 'chocolate',
   'icecream', 'icecream', 'soup'],
   'Price' : 10 * rand(7), 'Number' : random_integers(1, 9, size=(7,))})

print df
weather_group = df.groupby('Weather')

i = 0

for name, group in weather_group:
   i = i + 1
   print "Group", i, name
   print group

print "Weather group first
", weather_group.first()
print "Weather group last
", weather_group.last()
print "Weather group mean
", weather_group.mean()

wf_group = df.groupby(['Weather', 'Food'])
print "WF Groups", wf_group.groups

print "WF Aggregated
", wf_group.agg([np.mean, np.median])
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.136.17.12