Pivots and reshaping data

This section deals with how you can reshape data. Sometimes, data is stored in what is known as the stacked format. Here is an example of a stacked data using the PlantGrowth dataset:

In [344]: plantGrowthRawDF=pd.read_csv('./PlantGrowth.csv')
       plantGrowthRawDF
Out[344]:     observation   weight  group
         0    1             4.17    ctrl
         1    2             5.58    ctrl
         2    3             5.18    ctrl
         ...
         10   1             4.81    trt1
         11   2             4.17    trt1
         12   3             4.41    trt1
        ... 
         20   1             6.31    trt2
         21   2             5.12    trt2
         22   3             5.54    trt2

This data consists of results from an experiment to compare dried weight yields of plants that were obtained under a control (ctrl) and two different treatment conditions (trt1, trt2). Suppose we wanted to do some analysis of this data by their group value. One way to do this would be to use a logical filter on the data frame:

In [346]: plantGrowthRawDF[plantGrowthRawDF['group']=='ctrl']
Out[346]:   observation   weight  group
        0     1      4.17   ctrl
        1   2      5.58   ctrl
        2   3      5.18   ctrl
        3   4      6.11   ctrl
       ...

This can be tedious, so we would instead like to pivot/unstack this data and display it in a form that is more conducive to analysis. We can do this using the DataFrame.pivot function as follows:

In [345]: plantGrowthRawDF.pivot(index='observation',columns='group',values='weight')
Out[345]: weight    
          group   ctrl   trt1   trt2
          observation
          1       4.17   4.81   6.31
          2       5.58   4.17   5.12
          3       5.18   4.41   5.54
          4       6.11   3.59   5.50
          5       4.50   5.87   5.37
          6       4.61   3.83   5.29
          7       5.17   6.03   4.92
          8       4.53   4.89   6.15
          9       5.33   4.32   5.80
          10      5.14   4.69   5.26

Here, a DataFrame is created with columns corresponding to different values of a group, or in statistical parlance, levels of the factor. The same result can be achieved via the pandas pivot_table function, as follows:

In [427]: pd.pivot_table(plantGrowthRawDF,values='weight', 
                         rows='observation', cols=['group'])

Out[427]:   group  ctrl   trt1   trt2
          observation
           1       4.17   4.81   6.31
           2       5.58   4.17   5.12
           3       5.18   4.41   5.54
           4       6.11   3.59   5.50
           5       4.50   5.87   5.37
           6       4.61   3.83   5.29
           7       5.17   6.03   4.92
           8       4.53   4.89   6.15
           9       5.33   4.32   5.80
           
10      5.14   4.69   5.26

The key difference between the pivot and the pivot_table functions is that pivot_table allows the user to specify an aggregate function over which the values can be aggregated. So, for example, if we wish to obtain the mean for each group over the 10 observations, we would do the following, which would result in a Series:

In [430]: pd.pivot_table(plantGrowthRawDF,values='weight',cols=['group'],aggfunc=np.mean)
Out[430]: group
    ctrl     5.032
    trt1     4.661
    trt2     5.526
    Name: weight, dtype: float64

The full synopsis of pivot_table is available at http://bit.ly/1QomJ5A. You can find more information and examples on its usage at: http://bit.ly/1BYGsNn and https://www.youtube.com/watch?v=mCLuwCql5t4.

Stacking and unstacking

In addition to the pivot functions, the stack and unstack functions are also available on Series and DataFrames, that work on objects containing MultiIndexes.

The stack() function

First, we set the group and observation column values to be the components of the row index respectively, which results in a MultiIndex:

In [349]: plantGrowthRawDF.set_index(['group','observation'])
Out[349]:                    weight
          group  observation
          ctrl     1         4.17
                   2         5.58
                   3         5.18
                   ...
          trt1     1         4.81
                   2         4.17
                   3         4.41
                   ...
          trt2     1        6.31
                   2        5.12
                   3        5.54
                  ...

Here, we see that the row index consists of a MultiIndex on the group and observation with the weight column as the data value. Now, let us see what happens if we apply unstack to the group level:

In [351]: plantGrowthStackedDF.unstack(level='group')
Out[351]:                weight
           group   ctrl trt1   trt2
       observation
            1     4.17 4.81   6.31
            2     5.58 4.17   5.12
            3     5.18 4.41   5.54
            4     6.11 3.59   5.50
            5     4.50 5.87   5.37
            6     4.61 3.83   5.29
            7     5.17 6.03   4.92
            8     4.53 4.89   6.15
            9    5.33   4.32  5.80
           10    5.14   4.69  5.26

The following call is equivalent to the preceding one: plantGrowthStackedDF.unstack(level=0).

Here, we can see that the DataFrame is pivoted and the group has now changed from a row index (headers) to a column index (headers), resulting in a more compact looking DataFrame. To understand what's going on in more detail, we have a MultiIndex as a row index initially on group, observation:

In [356]: plantGrowthStackedDF.index
Out[356]: MultiIndex
          [(u'ctrl', 1), (u'ctrl', 2), (u'ctrl', 3), (u'ctrl', 4), (u'ctrl', 5), 
           (u'ctrl', 6), (u'ctrl', 7), (u'ctrl', 8), (u'ctrl', 9), (u'ctrl', 10), 
           (u'trt1', 1), (u'trt1', 2), (u'trt1', 3), (u'trt1', 4), (u'trt1', 5), 
           (u'trt1', 6), (u'trt1', 7), (u'trt1', 8), (u'trt1', 9), (u'trt1', 10), 
           (u'trt2', 1), (u'trt2', 2), (u'trt2', 3), (u'trt2', 4), (u'trt2', 5), 
           (u'trt2', 6), (u'trt2', 7), (u'trt2', 8), (u'trt2', 9), (u'trt2', 10)]

In [355]: plantGrowthStackedDF.columns
Out[355]: Index([u'weight'], dtype=object)

The unstacking operation removes the group from the row index, changing it into a single-level index:

In [357]: plantGrowthStackedDF.unstack(level='group').index
Out[357]: Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=int64)

The MultiIndex is now on the columns:

In [352]: plantGrowthStackedDF.unstack(level='group').columns
Out[352]: MultiIndex
          [(u'weight', u'ctrl'), (u'weight', u'trt1'), (u'weight', u'trt2')]

Let us see what happens when we call the reverse operation, stack:

In [366]: plantGrowthStackedDF.unstack(level=0).stack('group')
Out[366]:              weight
          observation  group
          1  ctrl   4.17
             trt1   4.81
             trt2   6.31
          2  ctrl   5.58
             trt1   4.17
             trt2   5.12
          3  ctrl   5.18
             trt1   4.41
             trt2   5.54
          4  ctrl   6.11
             trt1   3.59
             trt2   5.50
         ...
         10  ctrl   5.14
             trt1   4.69
             trt2   5.26

Here, we see that what we get isn't the original stacked DataFrame since the stacked level—that is, 'group'—becomes the new lowest level in a MultiIndex on the columns. In the original stacked DataFrame, group was the highest level. Here are the sequence of calls to stack and unstack that are exactly reversible. The unstack() function by default unstacks the last level, which is observation, which is shown as follows:

In [370]: plantGrowthStackedDF.unstack()
Out[370]:                                weight
          observation     1   2   3    4   5    6    7    8     9    10
          group
          ctrl        4.17 5.58 5.18 6.11 4.50 4.61 5.17 4.53 5.33 5.14
          trt1        4.81 4.17 4.41 3.59 5.87  3.83 6.03 4.89 4.32 4.69
          trt2        6.31 5.12 5.54 5.50 5.37  5.29 4.92 6.15 5.80 5.26

The stack() function by default sets the stacked level as the lowest level in the resulting MultiIndex on the rows:

In [369]: plantGrowthStackedDF.unstack().stack()
Out[369]:              weight
         group  observation
         ctrl    1   4.17
                 2   5.58
                 3   5.18
         ...
                 10  5.14
         trt1    1   4.81
                 2   4.17
                 3   4.41
         ...
                 10  4.69
         trt2    1   6.31
                 2   5.12
                 3   5.54
         ...
                 10  5.26

Other methods to reshape DataFrames

There are various other methods that are related to reshaping DataFrames; we'll discuss them here.

Using the melt function

The melt function enables us to transform a DataFrame by designating some of its columns as ID columns. This ensures that they will always stay as columns after any pivoting transformations. The remaining non-ID columns can be treated as variable and can be pivoted and become part of a name-value two column scheme. ID columns uniquely identify a row in the DataFrame.

The names of those non-ID columns can be customized by supplying the var_name and value_name parameters. The use of melt is perhaps best illustrated by an example, as follows:

In [385]: from pandas.core.reshape import melt

In [401]: USIndexDataDF[:2]
Out[401]:    TradingDate   Nasdaq    S&P 500   Russell 2000  DJIA
         0   2014/01/30    4123.13   1794.19   1139.36       15848.61
         1   2014/01/31    4103.88   1782.59   1130.88       15698.85

In [402]: melt(USIndexDataDF[:2], id_vars=['TradingDate'], var_name='Index Name', value_name='Index Value')
Out[402]:
            TradingDate   Index Name    Index value
         0  2014/01/30    Nasdaq        4123.13
         1  2014/01/31    Nasdaq        4103.88
         2  2014/01/30    S&P 500       1794.19
         3  2014/01/31    S&P 500       1782.59
         4  2014/01/30    Russell 2000  1139.36
         5  2014/01/31    Russell 2000  1130.88
         6  2014/01/30    DJIA          15848.61
         7  2014/01/31    DJIA          15698.85

The pandas.get_dummies() function

This function is used to convert a categorical variable into an indicator DataFrame, which is essentially a truth table of possible values of the categorical variable. An example of this is the following command:

In [408]: melted=melt(USIndexDataDF[:2], id_vars=['TradingDate'], var_name='Index Name', value_name='Index Value')
       melted
Out[408]:      TradingDate   Index Name    Index Value
         0     2014/01/30    Nasdaq        4123.13
         1     2014/01/31    Nasdaq        4103.88
         2     2014/01/30    S&P 500       1794.19
         3     2014/01/31    S&P 500       1782.59
         4     2014/01/30    Russell 2000  1139.36
5     2014/01/31   Russell 2000   1130.88
6     2014/01/30   DJIA        15848.61
7     2014/01/31   DJIA        15698.85

In [413]: pd.get_dummies(melted['Index Name'])
Out[413]:    DJIA  Nasdaq  Russell 2000  S&P 500
         0   0     1        0            0
         1   0     1        0            0
         2   0     0        0            1
         3   0     0        0            1
         4   0     0        1            0
         5   0     0        1            0
         6   1     0        0            0
         7   1     0        0            0

The source of the preceding data can be found at http://vincentarelbundock.github.io/Rdatasets/csv/datasets/PlantGrowth.csv.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.225.149.238