Data aggregation is a term known from relational databases. In a database query, we can group data by the value in a column or columns. We can then perform various operations on each of these groups. The pandas DataFrame has similar capabilities. We will generate data held in a Python dict and then use this data to create a pandas DataFrame. We will then practice the pandas aggregation features:
Weather
(a string)Food
(also a string)Price
(a random float)Number
(a random integer between one and nine)The use case is that we have the results for some sort of a consumer-purchase research, combined with weather and market pricing, where we calculate the average of prices and keep a track of the sample size and parameters:
import pandas as pd from numpy.random import seed from numpy.random import rand from numpy.random import random_integers import numpy as np seed(42) df = pd.DataFrame({'Weather' : ['cold', 'hot', 'cold', 'hot', 'cold', 'hot', 'cold'], 'Food' : ['soup', 'soup', 'icecream', 'chocolate', 'icecream', 'icecream', 'soup'], 'Price' : 10 * rand(7), 'Number' : random_integers(1, 9, size=(7,))}) print df
You should get an output similar to the following:
Food Number Price Weather 0 soup 8 3.745401 cold 1 soup 5 9.507143 hot 2 icecream 4 7.319939 cold 3 chocolate 8 5.986585 hot 4 icecream 8 1.560186 cold 5 icecream 3 1.559945 hot 6 soup 6 0.580836 cold [7 rows x 4 columns]
Please note that the column labels come from the lexically ordered keys of the Python dict.
Weather
column and then iterate through the groups as follows:weather_group = df.groupby('Weather') i = 0 for name, group in weather_group: i = i + 1 print "Group", i, name print group
We have two types of weather, hot and cold, so we get two groups:
Group 1 cold Food Number Price Weather 0 soup 8 3.745401 cold 2 icecream 4 7.319939 cold 4 icecream 8 1.560186 cold 6 soup 6 0.580836 cold [4 rows x 4 columns] Group 2 hot Food Number Price Weather 1 soup 5 9.507143 hot 3 chocolate 8 5.986585 hot 5 icecream 3 1.559945 hot [3 rows x 4 columns]
weather_group
variable is a special pandas object that we get as a result of the groupby()
method. This object has aggregation methods, which are demonstrated as follows:print "Weather group first ", weather_group.first() print "Weather group last ", weather_group.last() print "Weather group mean ", weather_group.mean()
The preceding code snippet prints the first row, last row, and mean of each group:
Weather group first Food Number Price Weather cold soup 8 3.745401 hot soup 5 9.507143 [2 rows x 3 columns] Weather group last Food Number Price Weather cold soup 6 0.580836 hot icecream 3 1.559945 [2 rows x 3 columns] Weather group mean Number Price Weather cold 6.500000 3.301591 hot 5.333333 5.684558 [2 rows x 2 columns]
groups
attribute will then tell us the groups that are formed and the rows in each group:wf_group = df.groupby(['Weather', 'Food']) print "WF Groups", wf_group.groups
For each possible combination of weather and food values, a new group is created. The membership of each row is indicated by their index values as follows:
WF Groups {('hot', 'chocolate'): [3], ('cold', 'icecream'): [2, 4], ('hot', 'icecream'): [5], ('hot', 'soup'): [1], ('cold', 'soup'): [0, 6]}
agg()
method:print "WF Aggregated ", wf_group.agg([np.mean, np.median])
Obviously, we could apply even more functions, but it would look messier than the following output:
WF Aggregated Number Price mean median mean median Weather Food cold icecream 6 6 4.440063 4.440063 soup 7 7 2.163119 2.163119 hot chocolate 8 8 5.986585 5.986585 icecream 3 3 1.559945 1.559945 soup 5 5 9.507143 9.507143 [5 rows x 4 columns]
The full data aggregation example code is in the data_aggregation.py
file, which can be found in this book's code bundle:
import pandas as pd from numpy.random import seed from numpy.random import rand from numpy.random import random_integers import numpy as np seed(42) df = pd.DataFrame({'Weather' : ['cold', 'hot', 'cold', 'hot', 'cold', 'hot', 'cold'], 'Food' : ['soup', 'soup', 'icecream', 'chocolate', 'icecream', 'icecream', 'soup'], 'Price' : 10 * rand(7), 'Number' : random_integers(1, 9, size=(7,))}) print df weather_group = df.groupby('Weather') i = 0 for name, group in weather_group: i = i + 1 print "Group", i, name print group print "Weather group first ", weather_group.first() print "Weather group last ", weather_group.last() print "Weather group mean ", weather_group.mean() wf_group = df.groupby(['Weather', 'Food']) print "WF Groups", wf_group.groups print "WF Aggregated ", wf_group.agg([np.mean, np.median])
3.135.187.210