Chapter 5. Operations in pandas, Part II – Grouping, Merging, and Reshaping of Data

In this chapter, we tackle the question of rearranging data in our data structures. We examine the various functions that enable us to rearrange data by utilizing them on real-world datasets. Such functions include groupby, concat, aggregate, append, and so on. The topics that we'll discuss are as follow:

  • Aggregation/grouping of data
  • Merging and concatenating data
  • Reshaping data

Grouping of data

We often detail granular data that we wish to aggregate or combine based on a grouping variable. We will illustrate some ways of doing this in the following sections.

The groupby operation

The groupby operation can be thought of as part of a process that involves the following three steps:

  • Splitting the dataset
  • Analyzing the data
  • Aggregating or combining the data

The groupby clause is an operation on DataFrames. A Series is a 1D object, so performing a groupby operation on it is not very useful. However, it can be used to obtain distinct rows of the Series. The result of a groupby operation is not a DataFrame but dict of DataFrame objects. Let us start with a dataset involving the world's most popular sport—soccer.

This dataset, obtained from Wikipedia, contains data for the finals of the European club championship since its inception in 1955. For reference, you can go to http://en.wikipedia.org/wiki/UEFA_Champions_League.

Convert the .csv file into a DataFrame by using the following command:

In [27]: uefaDF=pd.read_csv('./euro_winners.csv')
In [28]: uefaDF.head()
Out[28]:
The groupby operation

Thus, the output shows the season, the nations to which the winning and runner-up clubs belong, the score, the venue, and the attendance figures. Suppose we wanted to rank the nations by the number of European club championships they had won. We can do this by using groupby. First, we apply groupby to the DataFrame and see what is the type of the result:

In [84]: nationsGrp =uefaDF.groupby('Nation'),
         type(nationsGrp)
Out[84]: pandas.core.groupby.DataFrameGroupBy

Thus, we see that nationsGrp is of the pandas.core.groupby.DataFrameGroupBy type. The column on which we use groupby is referred to as the key. We can see what the groups look like by using the groups attribute on the resulting DataFrameGroupBy object:

In [97]: nationsGrp.groups
Out[97]: {'England': [12, 21, 22, 23, 24, 25, 26, 28, 43, 49, 52,
                      56],
     'France': [37],
    'Germany': [18, 19, 20, 27, 41, 45, 57],
     'Italy': [7, 8, 9, 13, 29, 33, 34, 38, 40, 47, 51, 54],
     'Netherlands': [14, 15, 16, 17, 32, 39],
     'Portugal': [5, 6, 31, 48],
     'Romania': [30],
     'Scotland': [11],
     'Spain': [0, 1, 2, 3, 4, 10, 36, 42, 44, 46, 50, 53, 55],
     'Yugoslavia': [35]}

This is basically a dictionary that just shows the unique groups and the axis labels corresponding to each group—in this case the row number. The number of groups is obtained by using the len() function:

In [109]: len(nationsGrp.groups)
Out[109]: 10

We can now display the number of wins of each nation in descending order by applying the size() function to the group and subsequently the sort() function, which sorts according to place:

In [99]: nationWins=nationsGrp.size()   
In [100] nationWins.sort(ascending=False)
         nationWins
Out[100]: Nation
    Spain          13
    Italy          12
    England        12
    Germany         7
    Netherlands     6
    Portugal        4
    Yugoslavia      1
    Scotland        1
    Romania         1
    France          1
    dtype: int64

The size() function returns a Series with the group names as the index and the size of each group. The size() function is also an aggregation function. We will examine aggregation functions later in the chapter.

To do a further breakup of wins by country and club, we apply a multicolumn groupby function before applying size() and sort():

In [106]: winnersGrp =uefaDF.groupby(['Nation','Winners'])
          clubWins=winnersGrp.size()
          clubWins.sort(ascending=False)
          clubWins
Out[106]: Nation       Winners          
          Spain        Real Madrid          9
          Italy        Milan                7
          Germany      Bayern Munich        5
          England      Liverpool            5
          Spain        Barcelona            4
          Netherlands  Ajax                 4
          England      Manchester United    3
          Italy        Internazionale       3
                       Juventus             2
          Portugal     Porto                2
                       Benfica              2
          England      Nottingham Forest    2
                       Chelsea              1
          France       Marseille            1
          Yugoslavia   Red Star Belgrade    1
          Germany      Borussia Dortmund    1
                       Hamburg              1
          Netherlands  Feyenoord            1
                       PSV Eindhoven        1
          Romania      Steaua Bucuresti     1
          Scotland     Celtic               1
          England      Aston Villa          1
          dtype: int64

A multicolumn groupby specifies more than one column to be used as the key by specifying the key columns as a list. Thus, we can see that the most successful club in this competition has been Real Madrid of Spain. We now examine a richer dataset that will enable us to illustrate many more features of groupby. This dataset is also soccer related and provides statistics for the top four European soccer leagues in the 2012-2013 season:

  • English Premier League or EPL
  • Spanish Primera Division or La Liga
  • Italian First Division or Serie A
  • German Premier League or Bundesliga

The source of this information is at http://soccerstats.com.

Let us now read the goal stats data into a DataFrame as usual. In this case, we create a row index on the DataFrame using the month:

In [68]: goalStatsDF=pd.read_csv('./goal_stats_euro_leagues_2012-13.csv')
         goalStatsDF=goalStatsDF.set_index('Month')

We look at the snapshot of the head and tail ends of our dataset:

In [115]: goalStatsDF.head(3)
Out[115]:         Stat          EPL  La Liga Serie A  Bundesliga
   Month
   08/01/2012  MatchesPlayed  20     20      10       10
  09/01/2012  MatchesPlayed  38     39      50       44
   10/01/2012  MatchesPlayed  31     31      39       27

In [116]: goalStatsDF.tail(3)
Out[116]:         Stat         EPL  La Liga Serie A  Bundesliga
      Month
      04/01/2013  GoalsScored  105   127     102      104
     05/01/2013  GoalsScored   96   109     102      92
     06/01/2013  GoalsScored NaN   80     NaN      NaN

There are two measures in this data frame—MatchesPlayed and GoalsScored—and the data is ordered first by Stat and then by Month. Note that the last row in the tail() output has the NaN values for all the columns except La Liga but we'll discuss this in more detail later. We can use groupby to display the stats, but this will be done by grouped year instead. Here is how this is done:

In [117]: goalStatsGroupedByYear = goalStatsDF.groupby(
lambda Month: Month.split('/')[2])

We can then iterate over the resulting groupby object and display the groups. In the following command, we see the two sets of statistics grouped by year. Note the use of the lambda function to obtain the year group from the first day of the month. For more information about lambda functions, go to http://bit.ly/1apJNwS:

In [118]: for name, group in goalStatsGroupedByYear:
            print name
            print group
    2012
                Stat  EPL  La Liga  Serie A  Bundesliga
          Month
    08/01/2012  MatchesPlayed   20       20       10          10
    09/01/2012  MatchesPlayed   38       39       50          44
    10/01/2012  MatchesPlayed   31       31       39          27
    11/01/2012  MatchesPlayed   50       41       42          46
    12/01/2012  MatchesPlayed   59       39       39          26
    08/01/2012    GoalsScored   57       60       21          23
    09/01/2012    GoalsScored  111      112      133         135
    10/01/2012    GoalsScored   95       88       97          77
    11/01/2012    GoalsScored  121      116      120         137
    12/01/2012    GoalsScored  183      109      125          72
        2013
                          Stat  EPL  La Liga  Serie A  Bundesliga
          Month
    01/01/2013  MatchesPlayed   42       40       40          18
    02/01/2013  MatchesPlayed   30       40       40          36
    03/01/2013  MatchesPlayed   35       38       39          36
    04/01/2013  MatchesPlayed   42       42       41          36
    05/01/2013  MatchesPlayed   33       40       40          27
    06/02/2013  MatchesPlayed  NaN       10      NaN         NaN
    01/01/2013    GoalsScored  117      121      104          51
    02/01/2013    GoalsScored   87      110      100         101
    03/01/2013    GoalsScored   91      101       99         106
    04/01/2013    GoalsScored  105      127      102         104
    05/01/2013    GoalsScored   96      109      102          92
    06/01/2013    GoalsScored  NaN       80      NaN         NaN

If we wished to group by individual month instead, we would need to apply groupby with a level argument, as follows:

In [77]: goalStatsGroupedByMonth = goalStatsDF.groupby(level=0)

In [81]: for name, group in goalStatsGroupedByMonth:
           print name
             print group
           print "
"

01/01/2013
                     Stat  EPL  La Liga  Serie A  Bundesliga
Month
01/01/2013  MatchesPlayed   42       40       40          18
01/01/2013    GoalsScored  117      121      104          51


02/01/2013
                     Stat  EPL  La Liga  Serie A  Bundesliga
Month                                                       
02/01/2013  MatchesPlayed   30       40       40          36
02/01/2013    GoalsScored   87      110      100         101


03/01/2013
                     Stat  EPL  La Liga  Serie A  Bundesliga
Month
03/01/2013  MatchesPlayed   35       38       39          36
03/01/2013    GoalsScored   91      101       99         106


04/01/2013
                     Stat  EPL  La Liga  Serie A  Bundesliga
Month
04/01/2013  MatchesPlayed   42       42       41          36
04/01/2013    GoalsScored  105      127      102         104


05/01/2013
                     Stat  EPL  La Liga  Serie A  Bundesliga
Month
05/01/2013  MatchesPlayed   33       40       40          27
05/01/2013    GoalsScored   96      109      102          92


06/01/2013
                   Stat  EPL  La Liga  Serie A  Bundesliga
Month
06/01/2013  GoalsScored  NaN       80      NaN         NaN

06/02/2013
                     Stat  EPL  La Liga  Serie A  Bundesliga
Month
06/02/2013  MatchesPlayed  NaN       10      NaN         NaN


08/01/2012
                     Stat  EPL  La Liga  Serie A  Bundesliga
Month
08/01/2012  MatchesPlayed   20       20       10          10
08/01/2012    GoalsScored   57       60       21          23

09/01/2012
                     Stat  EPL  La Liga  Serie A  Bundesliga
Month
09/01/2012  MatchesPlayed   38       39       50          44
09/01/2012    GoalsScored  111      112      133         135

10/01/2012
                     Stat  EPL  La Liga  Serie A  Bundesliga
Month
10/01/2012  MatchesPlayed   31       31       39          27
10/01/2012    GoalsScored   95       88       97          77

11/01/2012
                     Stat  EPL  La Liga  Serie A  Bundesliga
Month
11/01/2012  MatchesPlayed   50       41       42          46
11/01/2012    GoalsScored  121      116      120         137

12/01/2012
                     Stat  EPL  La Liga  Serie A  Bundesliga
Month
12/01/2012  MatchesPlayed   59       39       39          26
12/01/2012    GoalsScored  183      109      125          72

Note that since in the preceding commands we're grouping on an index, we need to specify the level argument as opposed to just using a column name. When we group by multiple keys, the resulting group name is a tuple, as shown in the upcoming commands. First, we reset the index to obtain the original DataFrame and define a MultiIndex in order to be able to group by multiple keys. If this is not done, it will result in a ValueError:

In [246]: goalStatsDF=goalStatsDF.reset_index()
          goalStatsDF=goalStatsDF.set_index(['Month','Stat'])

In [247]: monthStatGroup=goalStatsDF.groupby(level=['Month','Stat'])

In [248]: for name, group in monthStatGroup:
          print name
          print group

('01/01/2013', 'GoalsScored')
                          EPL  La Liga  Serie A  Bundesliga
Month      Stat
01/01/2013 GoalsScored    117      121   104      51
('01/01/2013', 'MatchesPlayed')
                           EPL  La Liga  Serie A  Bundesliga
Month      Stat
01/01/2013 MatchesPlayed   42       40    40       18
('02/01/2013', 'GoalsScored')
                         EPL  La Liga  Serie A  Bundesliga
Month      Stat
02/01/2013 GoalsScored   87      110   100      101

Using groupby with a MultiIndex

If our DataFrame has a MultiIndex, we can use groupby to group by different levels of the hierarchy and compute some interesting statistics. Here is the goal stats data using a MultiIndex consisting of Month and then Stat:

In [134]:goalStatsDF2=pd.read_csv('./goal_stats_euro_leagues_2012-13.csv')
       goalStatsDF2=goalStatsDF2.set_index(['Month','Stat'])
In [141]: print goalStatsDF2.head(3)
          print goalStatsDF2.tail(3)
                          EPL  La Liga  Serie A  Bundesliga
Month      Stat
08/01/2012 MatchesPlayed   20       20       10          10
09/01/2012 MatchesPlayed   38       39       50          44
10/01/2012 MatchesPlayed   31       31       39          27
                        EPL  La Liga  Serie A  Bundesliga
Month      Stat
04/01/2013 GoalsScored  105      127      102         104
05/01/2013 GoalsScored   96      109      102          92
06/01/2013 GoalsScored  NaN       80      NaN         NaN

Suppose we wish to compute the total number of goals scored and the total matches played for the entire season for each league, we could do this as follows:

In [137]: grouped2=goalStatsDF2.groupby(level='Stat')
In [139]: grouped2.sum()
Out[139]:         EPL   La Liga  Serie A  Bundesliga   Stat
          GoalsScored   1063  1133     1003  898
          MatchesPlayed 380    380      380  306

Incidentally, the same result as the preceding one can be obtained by using sum directly and passing the level as a parameter:

In [142]: goalStatsDF2.sum(level='Stat')
Out[142]:            EPL   La Liga  Serie A  Bundesliga   Stat
          GoalsScored    1063  1133     1003  898
          MatchesPlayed   380  380      380  306

Now, let us obtain a key statistic to determine how exciting the season was in each of the leagues—the goals per game ratio:

In [174]: totalsDF=grouped2.sum()

In [175]: totalsDF.ix['GoalsScored']/totalsDF.ix['MatchesPlayed']
Out[175]: EPL           2.797368
          La Liga       2.981579
          Serie A       2.639474
          Bundesliga    2.934641
          dtype: float64

This is returned as a Series, as shown in the preceding command. We can now display the goals per game ratio along with the goals scored and matches played to give a summary of how exciting the league was, as follows:

  1. Obtain goals per game data as a DataFrame. Note that we have to transpose it since gpg is returned as a Series:
    In [234]: gpg=totalsDF.ix['GoalsScored']/totalsDF.ix['MatchesPlayed']
           goalsPerGameDF=pd.DataFrame(gpg).T
    
    In [235]: goalsPerGameDF
    Out[235]:     EPL   La Liga   Serie A   Bundesliga
            0   2.797368   2.981579  2.639474  2.934641
    
  2. Reindex the goalsPerGameDF DataFrame so that the 0 index is replaced by GoalsPerGame:
    In [207]: goalsPerGameDF=goalsPerGameDF.rename(index={0:'GoalsPerGame'}) 
    
    In [208]: goalsPerGameDF
    Out[208]:          EPL      La Liga   Serie A   Bundesliga
            GoalsPerGame  2.797368  2.981579  2.639474  2.934641
    
  3. Append the goalsPerGameDF DataFrame to the original one:
    In [211]: pd.options.display.float_format='{:.2f}'.format
                totalsDF.append(goalsPerGameDF)
    Out[211]:      EPL    La Liga     Serie A   Bundesliga
      GoalsScored    1063.00  1133.00  1003.00   898.00
      MatchesPlayed  380.00    380.00   380.00   306.00
      GoalsPerGame      2.80    2.98     2.64   2.93
    

The following is a graph that shows the goals per match of the European leagues, that we discussed, from 1955-2012. The source for this can be found at http://mattstil.es/images/europe-football.png.

Using groupby with a MultiIndex

Using the aggregate method

Another way to generate summary statistics is by using the aggregate method explicitly:

In [254]: pd.options.display.float_format=None
In [256]: grouped2.aggregate(np.sum)
Out[256]:       EPL  La Liga  Serie A  Bundesliga   Stat
          GoalsScored     1063  1133  1003   898
          MatchesPlayed  380    380   380   306

This generates a grouped DataFrame object that is shown in the preceding command. We also reset the float format to None, so the integer-valued data would not be shown as floats due to the formatting from the previous section.

Applying multiple functions

For a grouped DataFrame object, we can specify a list of functions to be applied to each column:

In [274]: grouped2.agg([np.sum, np.mean,np.size])
Out[274]:      EPL          La Liga      Serie A        Bundesliga
       sum mean size  sum mean size  sum mean size sum mean size Stat
       GoalsScored  1063 106.3 11 1133 103.0 11 1003 100.3 11 898 89.8  11
       MatchesPlayed 380 38.0 11  380 34.6  11  380 38.0 11  306 30.6  11

Note the preceding output that shows NA values are excluded from aggregate calculations. The agg is an abbreviation form for aggregate. Thus, the calculations for the mean for EPL, Serie A, and Bundesliga are based on a size of 10 months and not 11. This is because no matches were played in the last month of June in these three leagues as opposed to La Liga, which had matches in June.

In the case of a grouped Series, we return to the nationsGrp example and compute some statistics on the attendance figures for the country of the tournament winners:

In [297]: nationsGrp['Attendance'].agg({'Total':np.sum, 'Average':np.mean, 'Deviation':np.std})
Out[297]:       Deviation   Average     Total
         Nation
         England    17091.31    66534.25   798411
         France     NaN         64400      64400
         Germany    13783.83    67583.29   473083
         Italy       17443.52    65761.25   789135
         Netherlands 16048.58   67489.0    404934
         Portugal    15632.86   49635.5    198542
         Romania     NaN      70000       70000
         Scotland    NaN      45000        45000
         Spain        27457.53   73477.15   955203
         Yugoslavia  NaN      56000      56000

For a grouped Series, we can pass a list or dict of functions. In the preceding case, a dict was specified and the key values were used for the names of the columns in the resulting DataFrame. Note that in the case of groups of a single sample size, the standard deviation is undefined and NaN is the result—for example, Romania.

The transform() method

The groupby-transform function is used to perform transformation operations on a groupby object. For example, we could replace NaN values in the groupby object using the fillna method. The resulting object after using transform has the same size as the original groupby object. Let us consider a DataFrame showing the goals scored for each month in the four soccer leagues:

In[344]: goalStatsDF3=pd.read_csv('./goal_stats_euro_leagues_2012-13.csv')
goalStatsDF3=goalStatsDF3.set_index(['Month'])
goalsScoredDF=goalStatsDF3.ix[goalStatsDF3['Stat']=='GoalsScored']

goalsScoredDF.iloc[:,1:]
Out[344]:        EPL  La Liga  Serie A  Bundesliga
Month
08/01/2012   57   60    21         23
09/01/2012   111   112    133        135
10/01/2012   95   88    97         77
11/01/2012   121   116    120        137
12/01/2012   183   109    125         72
01/01/2013   117   121    104         51
02/01/2013   87   110    100        101
03/01/2013   91   101    99        106
04/01/2013   105   127    102        104
05/01/2013   96   109    102         92
06/01/2013   NaN   80    NaN        NaN

We can see that for June 2013, the only league for which matches were played was La Liga, resulting in the NaN values for the other three leagues. Let us group the data by year:

In [336]: goalsScoredPerYearGrp=goalsScoredDF.groupby(lambda Month: Month.split('/')[2])
          goalsScoredPerYearGrp.mean()
Out[336]:           EPL    La Liga   Serie A  Bundesliga
         2012       113.4   97        99.2     88.8
         2013       99.2    108       101.4    90.8

The preceding function makes use of a lambda function to obtain the year by splitting the Month variable on the / character and taking the third element of the resulting list.

If we do a count of the number of months per year during which matches were held in the various leagues, we have:

In [331]: goalsScoredPerYearGrp.count()
Out[331]:         EPL  La Liga  Serie A  Bundesliga
         2012     5     5       5         5
         2013     5     6       5         5

It is often undesirable to display data with missing values and one common method to resolve this situation would be to replace the missing values with the group mean. This can be achieved using the transform-groupby function. First, we must define the transformation using a lambda function and then apply this transformation using the transform method:

In [338]: fill_fcn = lambda x: x.fillna(x.mean())
            trans = goalsScoredPerYearGrp.transform(fill_fcn)
         tGroupedStats = trans.groupby(lambda Month:   Month.split('/')[2])
         tGroupedStats.mean() 
Out[338]:           EPL     La Liga   Serie A  Bundesliga
         2012       113.4   97        99.2     88.8
         2013       99.2    108       101.4    90.8

One thing to note from the preceding results is that replacing the NaN values with the group mean in the original group, keeps the group means unchanged in the transformed data.

However, when we do a count on the transformed group, we see that the number of matches has changed from five to six for the EPL, Serie A, and Bundesliga:

In [339]: tGroupedStats.count()
Out[339]:        EPL    La Liga   Serie A  Bundesliga
         2012     5     5         5          5
         2013     6     6         6          6

Filtering

The filter method enables us to apply filtering on a groupby object that results in a subset of the initial object. Here, we illustrate how to display the months of the season in which more than 100 goals were scored in each of the four leagues:

In [391]:  goalsScoredDF.groupby(level='Month').filter(lambda x:  
                                 np.all([x[col] > 100 
              for col in goalsScoredDF.columns]))
Out[391]:            EPL  La Liga  Serie A  Bundesliga
         Month
        09/01/2012   111   112       133     135
        11/01/2012   121   116       120     137
        04/01/2013   105   127       102     104

Note the use of the np.all operator to ensure that the constraint is enforced for all the columns.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.146.176.145