This section deals with how you can reshape data. Sometimes, data is stored in what is known as the stacked format. Here is an example of a stacked data using the PlantGrowth
dataset:
In [344]: plantGrowthRawDF=pd.read_csv('./PlantGrowth.csv') plantGrowthRawDF Out[344]: observation weight group 0 1 4.17 ctrl 1 2 5.58 ctrl 2 3 5.18 ctrl ... 10 1 4.81 trt1 11 2 4.17 trt1 12 3 4.41 trt1 ... 20 1 6.31 trt2 21 2 5.12 trt2 22 3 5.54 trt2
This data consists of results from an experiment to compare dried weight yields of plants that were obtained under a control (ctrl) and two different treatment conditions (trt1, trt2). Suppose we wanted to do some analysis of this data by their group value. One way to do this would be to use a logical filter on the data frame:
In [346]: plantGrowthRawDF[plantGrowthRawDF['group']=='ctrl'] Out[346]: observation weight group 0 1 4.17 ctrl 1 2 5.58 ctrl 2 3 5.18 ctrl 3 4 6.11 ctrl ...
This can be tedious, so we would instead like to pivot/unstack this data and display it in a form that is more conducive to analysis. We can do this using the DataFrame.pivot
function as follows:
In [345]: plantGrowthRawDF.pivot(index='observation',columns='group',values='weight') Out[345]: weight group ctrl trt1 trt2 observation 1 4.17 4.81 6.31 2 5.58 4.17 5.12 3 5.18 4.41 5.54 4 6.11 3.59 5.50 5 4.50 5.87 5.37 6 4.61 3.83 5.29 7 5.17 6.03 4.92 8 4.53 4.89 6.15 9 5.33 4.32 5.80 10 5.14 4.69 5.26
Here, a DataFrame is created with columns corresponding to different values of a group, or in statistical parlance, levels of the factor. The same result can be achieved via the pandas pivot_table
function, as follows:
In [427]: pd.pivot_table(plantGrowthRawDF,values='weight', rows='observation', cols=['group']) Out[427]: group ctrl trt1 trt2 observation 1 4.17 4.81 6.31 2 5.58 4.17 5.12 3 5.18 4.41 5.54 4 6.11 3.59 5.50 5 4.50 5.87 5.37 6 4.61 3.83 5.29 7 5.17 6.03 4.92 8 4.53 4.89 6.15 9 5.33 4.32 5.80 10 5.14 4.69 5.26
The key difference between the pivot
and the pivot_table
functions is that pivot_table
allows the user to specify an aggregate function over which the values can be aggregated. So, for example, if we wish to obtain the mean for each group over the 10 observations, we would do the following, which would result in a Series:
In [430]: pd.pivot_table(plantGrowthRawDF,values='weight',cols=['group'],aggfunc=np.mean) Out[430]: group ctrl 5.032 trt1 4.661 trt2 5.526 Name: weight, dtype: float64
The full synopsis of pivot_table
is available at http://bit.ly/1QomJ5A. You can find more information and examples on its usage at: http://bit.ly/1BYGsNn and https://www.youtube.com/watch?v=mCLuwCql5t4.
In addition to the pivot functions, the stack
and unstack
functions are also available on Series and DataFrames, that work on objects containing MultiIndexes.
First, we set the group and observation column values to be the components of the row index respectively, which results in a MultiIndex:
In [349]: plantGrowthRawDF.set_index(['group','observation']) Out[349]: weight group observation ctrl 1 4.17 2 5.58 3 5.18 ... trt1 1 4.81 2 4.17 3 4.41 ... trt2 1 6.31 2 5.12 3 5.54 ...
Here, we see that the row index consists of a MultiIndex on the group and observation with the weight column as the data value. Now, let us see what happens if we apply unstack
to the group
level:
In [351]: plantGrowthStackedDF.unstack(level='group') Out[351]: weight group ctrl trt1 trt2 observation 1 4.17 4.81 6.31 2 5.58 4.17 5.12 3 5.18 4.41 5.54 4 6.11 3.59 5.50 5 4.50 5.87 5.37 6 4.61 3.83 5.29 7 5.17 6.03 4.92 8 4.53 4.89 6.15 9 5.33 4.32 5.80 10 5.14 4.69 5.26
The following call is equivalent to the preceding one: plantGrowthStackedDF.unstack(level=0)
.
Here, we can see that the DataFrame is pivoted and the group has now changed from a row index (headers) to a column index (headers), resulting in a more compact looking DataFrame. To understand what's going on in more detail, we have a MultiIndex as a row index initially on group, observation:
In [356]: plantGrowthStackedDF.index Out[356]: MultiIndex [(u'ctrl', 1), (u'ctrl', 2), (u'ctrl', 3), (u'ctrl', 4), (u'ctrl', 5), (u'ctrl', 6), (u'ctrl', 7), (u'ctrl', 8), (u'ctrl', 9), (u'ctrl', 10), (u'trt1', 1), (u'trt1', 2), (u'trt1', 3), (u'trt1', 4), (u'trt1', 5), (u'trt1', 6), (u'trt1', 7), (u'trt1', 8), (u'trt1', 9), (u'trt1', 10), (u'trt2', 1), (u'trt2', 2), (u'trt2', 3), (u'trt2', 4), (u'trt2', 5), (u'trt2', 6), (u'trt2', 7), (u'trt2', 8), (u'trt2', 9), (u'trt2', 10)] In [355]: plantGrowthStackedDF.columns Out[355]: Index([u'weight'], dtype=object)
The unstacking operation removes the group from the row index, changing it into a single-level index:
In [357]: plantGrowthStackedDF.unstack(level='group').index Out[357]: Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=int64)
The MultiIndex is now on the columns:
In [352]: plantGrowthStackedDF.unstack(level='group').columns Out[352]: MultiIndex [(u'weight', u'ctrl'), (u'weight', u'trt1'), (u'weight', u'trt2')]
Let us see what happens when we call the reverse operation, stack
:
In [366]: plantGrowthStackedDF.unstack(level=0).stack('group') Out[366]: weight observation group 1 ctrl 4.17 trt1 4.81 trt2 6.31 2 ctrl 5.58 trt1 4.17 trt2 5.12 3 ctrl 5.18 trt1 4.41 trt2 5.54 4 ctrl 6.11 trt1 3.59 trt2 5.50 ... 10 ctrl 5.14 trt1 4.69 trt2 5.26
Here, we see that what we get isn't the original stacked DataFrame since the stacked level—that is, 'group'—
becomes the new lowest level in a MultiIndex on the columns. In the original stacked DataFrame, group
was the highest level. Here are the sequence of calls to stack
and unstack
that are exactly reversible. The unstack()
function by default unstacks the last level, which is observation
, which is shown as follows:
In [370]: plantGrowthStackedDF.unstack() Out[370]: weight observation 1 2 3 4 5 6 7 8 9 10 group ctrl 4.17 5.58 5.18 6.11 4.50 4.61 5.17 4.53 5.33 5.14 trt1 4.81 4.17 4.41 3.59 5.87 3.83 6.03 4.89 4.32 4.69 trt2 6.31 5.12 5.54 5.50 5.37 5.29 4.92 6.15 5.80 5.26
The stack()
function by default sets the stacked level as the lowest level in the resulting MultiIndex on the rows:
In [369]: plantGrowthStackedDF.unstack().stack() Out[369]: weight group observation ctrl 1 4.17 2 5.58 3 5.18 ... 10 5.14 trt1 1 4.81 2 4.17 3 4.41 ... 10 4.69 trt2 1 6.31 2 5.12 3 5.54 ... 10 5.26
There are various other methods that are related to reshaping DataFrames; we'll discuss them here.
The melt
function enables us to transform a DataFrame by designating some of its columns as ID columns. This ensures that they will always stay as columns after any pivoting transformations. The remaining non-ID columns can be treated as variable and can be pivoted and become part of a name-value two column scheme. ID columns uniquely identify a row in the DataFrame.
The names of those non-ID columns can be customized by supplying the var_name
and value_name
parameters. The use of melt
is perhaps best illustrated by an example, as follows:
In [385]: from pandas.core.reshape import melt In [401]: USIndexDataDF[:2] Out[401]: TradingDate Nasdaq S&P 500 Russell 2000 DJIA 0 2014/01/30 4123.13 1794.19 1139.36 15848.61 1 2014/01/31 4103.88 1782.59 1130.88 15698.85 In [402]: melt(USIndexDataDF[:2], id_vars=['TradingDate'], var_name='Index Name', value_name='Index Value') Out[402]: TradingDate Index Name Index value 0 2014/01/30 Nasdaq 4123.13 1 2014/01/31 Nasdaq 4103.88 2 2014/01/30 S&P 500 1794.19 3 2014/01/31 S&P 500 1782.59 4 2014/01/30 Russell 2000 1139.36 5 2014/01/31 Russell 2000 1130.88 6 2014/01/30 DJIA 15848.61 7 2014/01/31 DJIA 15698.85
This function is used to convert a categorical variable into an indicator DataFrame, which is essentially a truth table of possible values of the categorical variable. An example of this is the following command:
In [408]: melted=melt(USIndexDataDF[:2], id_vars=['TradingDate'], var_name='Index Name', value_name='Index Value') melted Out[408]: TradingDate Index Name Index Value 0 2014/01/30 Nasdaq 4123.13 1 2014/01/31 Nasdaq 4103.88 2 2014/01/30 S&P 500 1794.19 3 2014/01/31 S&P 500 1782.59 4 2014/01/30 Russell 2000 1139.36 5 2014/01/31 Russell 2000 1130.88 6 2014/01/30 DJIA 15848.61 7 2014/01/31 DJIA 15698.85 In [413]: pd.get_dummies(melted['Index Name']) Out[413]: DJIA Nasdaq Russell 2000 S&P 500 0 0 1 0 0 1 0 1 0 0 2 0 0 0 1 3 0 0 0 1 4 0 0 1 0 5 0 0 1 0 6 1 0 0 0 7 1 0 0 0
The source of the preceding data can be found at http://vincentarelbundock.github.io/Rdatasets/csv/datasets/PlantGrowth.csv.
3.133.158.36