Aggregation and GroupBy

Sometimes, we may wish to split data into subsets and apply a function such as the mean, max, or min to each subset. In R, we can do this via the aggregate or tapply functions.

Here, we will use the example of a dataset of statistics on the top five strikers of the four clubs that made it to the semi-final of the European Champions League Football tournament in 2014. We will use it to illustrate aggregation in R and its equivalent GroupBy functionality in pandas.

Aggregation in R

In R aggregation is done using the following command:

> goal_stats=read.csv('champ_league_stats_semifinalists.csv')
>goal_stats
              Club                 Player Goals GamesPlayed
1  Atletico Madrid            Diego Costa     8           9
2  Atletico Madrid             ArdaTuran     4           9
3  Atletico Madrid            RaúlGarcía     4          12
4  Atletico Madrid           AdriánLópez     2           9
5  Atletico Madrid            Diego Godín     2          10
6      Real Madrid      Cristiano Ronaldo    17          11
7      Real Madrid            Gareth Bale     6          12
8      Real Madrid          Karim Benzema     5          11
9      Real Madrid                   Isco     3          12
10     Real Madrid         Ángel Di María     3          11
11   Bayern Munich          Thomas Müller     5          12
12   Bayern Munich           ArjenRobben     4          10
13   Bayern Munich            Mario Götze     3          11
14   Bayern Munich Bastian Schweinsteiger     3           8
15   Bayern Munich        Mario Mandžukić     3          10
16         Chelsea        Fernando Torres     4           9
17         Chelsea               Demba Ba     3           6
18         Chelsea           Samuel Eto'o     3           9
19         Chelsea            Eden Hazard     2           9
20         Chelsea                Ramires     2          10

We can now compute the goals per game ratio for each striker, to measure their deadliness in front of a goal:

>goal_stats$GoalsPerGame<- goal_stats$Goals/goal_stats$GamesPlayed
>goal_stats
              Club   Player         Goals GamesPlayedGoalsPerGame
1  Atletico Madrid  Diego Costa     8           9    0.8888889
2  Atletico Madrid  ArdaTuran      4           9    0.4444444
3  Atletico Madrid  RaúlGarcía     4          12    0.3333333
4  Atletico Madrid  AdriánLópez    2           9    0.2222222
5  Atletico Madrid  Diego Godín     2          10    0.2000000
6  Real Madrid  Cristiano Ronaldo  17          11    1.5454545
7  Real Madrid  Gareth Bale         6          12    0.5000000
8  Real Madrid    Karim Benzema     5          11    0.4545455
9  Real Madrid       Isco           3          12    0.2500000
10 Real Madrid  Ángel Di María     3          11    0.2727273
11 Bayern Munich Thomas Müller     5          12    0.4166667
12 Bayern Munich  ArjenRobben     4          10    0.4000000
13 Bayern Munich  MarioGötze      3          11    0.2727273
14 Bayern Munich Bastian Schweinsteiger 3      8    0.3750000
15 Bayern Munich  MarioMandžukić  3          10    0.3000000
16 Chelsea       Fernando Torres   4           9    0.4444444
17 Chelsea           Demba Ba      3           6    0.5000000
18 Chelsea           Samuel Eto'o  3           9    0.3333333
19 Chelsea            Eden Hazard  2           9    0.2222222
20 Chelsea                Ramires  2          10    0.2000000

Let's suppose that we wanted to know the highest goals per game ratio for each team. We would calculate this as follows:

>aggregate(x=goal_stats[,c('GoalsPerGame')], by=list(goal_stats$Club),FUN=max)
          Group.1         x
1 Atletico Madrid 0.8888889
2   Bayern Munich 0.4166667
3         Chelsea 0.5000000
4     Real Madrid 1.5454545

The tapply function is used to apply a function to a subset of an array or vector that is defined by one or more columns. The tapply function can also be used as follows:

>tapply(goal_stats$GoalsPerGame,goal_stats$Club,max)
Atletico Madrid   Bayern Munich         Chelsea     Real Madrid 
      0.8888889       0.4166667       0.5000000       1.5454545

The pandas' GroupBy operator

In pandas, we can achieve the same result by using the GroupBy function:

In [6]: import pandas as pd
importnumpy as np
In [7]: goal_stats_df=pd.read_csv('champ_league_stats_semifinalists.csv')

In [27]: goal_stats_df['GoalsPerGame']=     goal_stats_df['Goals']/goal_stats_df['GamesPlayed']

In [27]: goal_stats_df['GoalsPerGame']= goal_stats_df['Goals']/goal_stats_df['GamesPlayed']

In [28]: goal_stats_df
Out[28]: Club           Player      Goals GamesPlayedGoalsPerGame
0       Atletico Madrid Diego Costa   8       9        0.888889
1       Atletico Madrid ArdaTuran    4       9         0.444444
2       Atletico Madrid RaúlGarcía   4       12        0.333333
3       Atletico Madrid AdriánLópez  2       9         0.222222
4       Atletico Madrid Diego Godín   2       10        0.200000
5       Real Madrid  Cristiano Ronaldo 17      11        1.545455
6       Real Madrid     Gareth Bale   6       12        0.500000
7       Real Madrid     Karim Benzema 5       11        0.454545
8       Real Madrid     Isco          3       12        0.250000
9       Real Madrid     Ángel Di María 3      11        0.272727
10      Bayern Munich   Thomas Müller  5       12        0.416667
11      Bayern Munich   ArjenRobben   4       10        0.400000
12      Bayern Munich   Mario Götze    3       11        0.272727
13      Bayern Munich  BastianSchweinsteiger 3   8     0.375000
14      Bayern Munich  MarioMandžukić  3       10        0.300000
15      Chelsea        Fernando Torres  4       9         0.444444
16      Chelsea        Demba Ba         3       6         0.500000
17      Chelsea        Samuel Eto'o     3       9         0.333333
18      Chelsea        Eden Hazard      2       9         0.222222
19      Chelsea        Ramires          2       10        0.200000

In [30]: grouped = goal_stats_df.groupby('Club')

In [17]: grouped['GoalsPerGame'].aggregate(np.max)
Out[17]: Club
         Atletico Madrid    0.888889
         Bayern Munich      0.416667
         Chelsea            0.500000
         Real Madrid        1.545455
         Name: GoalsPerGame, dtype: float64

In [22]: grouped['GoalsPerGame'].apply(np.max)

Out[22]: Club
         Atletico Madrid    0.888889
         Bayern Munich      0.416667
         Chelsea            0.500000
         Real Madrid        1.545455
         Name: GoalsPerGame, dtype: float64
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.149.253.210