Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Data aggregation with pandas DataFrames

Data aggregation is a term known from relational databases. In a database query, we can group data by the value in a column or columns. We can then perform various operations on each of these groups. The pandas DataFrame has similar capabilities. We will generate data held in a Python dict and then use this data to create a pandas DataFrame. We will then practice the pandas aggregation features:

Seed the NumPy random generator to make sure that the generated data will not differ between repeated program runs. The data will have four columns:

Weather (a string)
Food (also a string)
Price (a random float)
Number (a random integer between one and nine)

The use case is that we have the results for some sort of a consumer-purchase research, combined with weather and market pricing, where we calculate the average of prices and keep a track of the sample size and parameters:

import pandas as pd
from numpy.random import seed
from numpy.random import rand
from numpy.random import random_integers
import numpy as np

seed(42)

df = pd.DataFrame({'Weather' : ['cold', 'hot', 'cold', 'hot',
   'cold', 'hot', 'cold'],
   'Food' : ['soup', 'soup', 'icecream', 'chocolate',
   'icecream', 'icecream', 'soup'],
   'Price' : 10 * rand(7), 'Number' : random_integers(1, 9, size=(7,))})

print df

You should get an output similar to the following:

        Food  Number     Price Weather
0       soup       8  3.745401    cold
1       soup       5  9.507143     hot
2   icecream       4  7.319939    cold
3  chocolate       8  5.986585     hot
4   icecream       8  1.560186    cold
5   icecream       3  1.559945     hot
6       soup       6  0.580836    cold

[7 rows x 4 columns]

Please note that the column labels come from the lexically ordered keys of the Python dict.

Note

Lexical or lexicographical order is based on the alphabetic order of characters in a string.

Group the data by the Weather column and then iterate through the groups as follows:

weather_group = df.groupby('Weather')

i = 0

for name, group in weather_group:
   i = i + 1
   print "Group", i, name
   print group

We have two types of weather, hot and cold, so we get two groups:

Group 1 cold
       Food  Number     Price Weather
0      soup       8  3.745401    cold
2  icecream       4  7.319939    cold
4  icecream       8  1.560186    cold
6      soup       6  0.580836    cold

[4 rows x 4 columns]
Group 2 hot
        Food  Number     Price Weather
1       soup       5  9.507143     hot
3  chocolate       8  5.986585     hot
5   icecream       3  1.559945     hot

[3 rows x 4 columns]

The weather_group variable is a special pandas object that we get as a result of the groupby() method. This object has aggregation methods, which are demonstrated as follows:

print "Weather group first
", weather_group.first()
print "Weather group last
", weather_group.last()
print "Weather group mean
", weather_group.mean()

The preceding code snippet prints the first row, last row, and mean of each group:

Weather group first
         Food  Number     Price
Weather                        
cold     soup       8  3.745401
hot      soup       5  9.507143

[2 rows x 3 columns]
Weather group last
             Food  Number     Price
Weather                            
cold         soup       6  0.580836
hot      icecream       3  1.559945

[2 rows x 3 columns]
Weather group mean
           Number     Price
Weather                    
cold     6.500000  3.301591
hot      5.333333  5.684558

[2 rows x 2 columns]

Just as in a database query, we are allowed to group on multiple columns. The groups attribute will then tell us the groups that are formed and the rows in each group:
```
wf_group = df.groupby(['Weather', 'Food'])
print "WF Groups", wf_group.groups
```
For each possible combination of weather and food values, a new group is created. The membership of each row is indicated by their index values as follows:
```
WF Groups {('hot', 'chocolate'): [3], ('cold', 'icecream'): [2, 4], ('hot', 'icecream'): [5], ('hot', 'soup'): [1], ('cold', 'soup'): [0, 6]}
```

Apply a list of NumPy functions on groups with the agg() method:

print "WF Aggregated
", wf_group.agg([np.mean, np.median])

Obviously, we could apply even more functions, but it would look messier than the following output:

WF Aggregated
                   Number             Price
                     mean  median      mean    median
Weather Food
cold    icecream        6       6  4.440063  4.440063
        soup            7       7  2.163119  2.163119
hot     chocolate       8       8  5.986585  5.986585
        icecream        3       3  1.559945  1.559945
        soup            5       5  9.507143  9.507143

[5 rows x 4 columns]

The full data aggregation example code is in the data_aggregation.py file, which can be found in this book's code bundle:

import pandas as pd
from numpy.random import seed
from numpy.random import rand
from numpy.random import random_integers
import numpy as np

seed(42)

df = pd.DataFrame({'Weather' : ['cold', 'hot', 'cold', 'hot',
   'cold', 'hot', 'cold'],
   'Food' : ['soup', 'soup', 'icecream', 'chocolate',
   'icecream', 'icecream', 'soup'],
   'Price' : 10 * rand(7), 'Number' : random_integers(1, 9, size=(7,))})

print df
weather_group = df.groupby('Weather')

i = 0

for name, group in weather_group:
   i = i + 1
   print "Group", i, name
   print group

print "Weather group first
", weather_group.first()
print "Weather group last
", weather_group.last()
print "Weather group mean
", weather_group.mean()

wf_group = df.groupby(['Weather', 'Food'])
print "WF Groups", wf_group.groups

print "WF Aggregated
", wf_group.agg([np.mean, np.median])

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Data aggregation with pandas DataFrames

Create new playlist

Sign In

Sign Up

Data aggregation with pandas DataFrames

Note

Table of Contents for
Data aggregation with pandas DataFrames