Hierarchical indexing is a feature of pandas that allows specifying two or more index levels on an axis. The specification of multiple levels in an index allows for efficient selection of subsets of data. A pandas index that has multiple levels of hierarchy is referred to as a MultiIndex
.
We can demonstrate creating a MultiIndex
using the sp500
data. Suppose we want to organize this data by both the Sector
and Symbol
. We can accomplish this with the following code:
In [107]: # first, push symbol into a column reindexed = sp500.reset_index() # and now index sp500 by sector and symbol multi_fi = reindexed.set_index(['Sector', 'Symbol']) multi_fi Out[107]: Price BookValue Sector Symbol Industrials MMM 141.14 26.668 Health Care ABT 39.60 15.573 ABBV 53.95 2.954 Information Technology ACN 79.79 8.326 Financials ACE 102.91 86.897 ... ... ... Information Technology YHOO 35.02 12.768 Consumer Discretionary YUM 74.77 5.147 Health Care ZMH 101.84 37.181 Financials ZION 28.43 30.191 Health Care ZTS 30.53 2.150 [500 rows x 2 columns]
We can now examine the .index
property and check whether it is a MultiIndex
object:
In [108]: # the index is a MultiIndex type(multi_fi.index) Out[108]: pandas.core.index.MultiIndex
Then, examine the index itself:
In [109]: # examine the index print (multi_fi.index) Out[109]: Sector Symbol Industrials MMM Health Care ABT ABBV Information Technology ACN ... Information Technology YHOO Consumer Discretionary YUM Health Care ZMH Financials ZION Health Care ZTS
A MultiIndex
contains two or more levels:
In [110]: # this has two levels len(multi_fi.index.levels) Out[110]: 2
Also, each level is a distinct Index
object:
In [111]: # each index level is an index multi_fi.index.levels[0] Out[111]: Index([u'Consumer Discretionary', u'Consumer Discretionary ', u'Consumer Staples', u'Consumer Staples ', u'Energy', u'Financials', u'Health Care', u'Industrials', u'Industries', u'Information Technology', u'Materials', u'Telecommunications Services', u'Utilities'], dtype='object') In [112]: # each index level is an index multi_fi.index.levels[1] Out[112]: Index([u'A', u'AA', u'AAPL', u'ABBV', u'ABC', u'ABT', u'ACE', u'ACN', u'ACT', u'ADBE', u'ADI', u'ADM', u'ADP', u'ADS', u'ADSK', u'ADT', u'AEE', u'AEP', u'AES', u'AET', u'AFL', u'AGN', u'AIG', u'AIV', u'AIZ', u'AKAM', u'ALL', u'ALLE', u'ALTR', u'ALXN', u'AMAT', u'AME', u'AMGN', u'AMP', u'AMT', u'AMZN', u'AN', u'AON', u'APA', u'APC', u'APD', u'APH', u'ARG', u'ATI', u'AVB', u'AVP', u'AVY', u'AXP', u'AZO', u'BA', u'BAC', u'BAX', u'BBBY', u'BBT', u'BBY', u'BCR', u'BDX', u'BEAM', u'BEN', u'BF-B', u'BHI', u'BIIB', u'BK', u'BLK', u'BLL', u'BMS', u'BMY', u'BRCM', u'BRK-B', u'BSX', u'BTU', u'BWA', u'BXP', u'C', u'CA', u'CAG', u'CAH', u'CAM', u'CAT', u'CB', u'CBG', u'CBS', u'CCE', u'CCI', u'CCL', u'CELG', u'CERN', u'CF', u'CFN', u'CHK', u'CHRW', u'CI', u'CINF', u'CL', u'CLF', u'CLX', u'CMA', u'CMCSA', u'CME', u'CMG', ...], dtype='object')
Values of the index, at a specific level for every row, can be retrieved by the .get_level_values()
method:
In [113]: # values of the index level 0 multi_fi.index.get_level_values(0) Out[113]: Index([u'Industrials', u'Health Care', u'Health Care', u'Information Technology', u'Financials', u'Health Care', u'Information Technology', u'Utilities', u'Health Care', u'Financials', u'Health Care', u'Utilities', u'Materials', u'Materials', u'Information Technology', u'Materials', u'Health Care', u'Materials', u'Industrials', u'Health Care', u'Information Technology', u'Financials', u'Information Technology', u'Consumer Staples', u'Consumer Discretionary', u'Utilities', u'Utilities', u'Financials', u'Financials', u'Financials', u'Financials', u'Health Care', u'Information Technology', u'Health Care', u'Industrials', u'Energy', u'Information Technology', u'Financials', u'Energy', u'Financials', u'Information Technology', u'Information Technology', u'Consumer Staples', u'Financials', u'Telecommunications Services', u'Information Technology', u'Information Technology', u'Consumer Discretionary', u'Consumer Discretionary', u'Financials', u'Industrials', u'Consumer Staples', u'Energy', u'Materials', u'Financials', u'Health Care', u'Health Care', u'Financials', u'Consumer Discretionary', u'Health Care', u'Consumer Discretionary', u'Materials', u'Financials', u'Consumer Discretionary', u'Health Care', u'Financials', u'Consumer Discretionary', u'Industrials', u'Consumer Discretionary', u'Financials', u'Health Care', u'Health Care', u'Information Technology', u'Consumer Staples', u'Information Technology', u'Consumer Discretionary', u'Energy', u'Energy', u'Consumer Staples', u'Financials', u'Health Care', u'Health Care', u'Consumer Discretionary', u'Consumer Discretionary', u'Industrials', u'Financials', u'Consumer Discretionary', u'Health Care', u'Utilities', u'Telecommunications Services', u'Health Care', u'Materials', u'Industrials', u'Energy', u'Energy', u'Consumer Discretionary', u'Financials', u'Health Care', u'Financials', u'Industrials', ...], dtype='object')
Access of elements via a hierarchical index is performed using the .xs()
method. This method works similar to the .ix
attribute but provides parameters to specify multidimensionality of the index.
The following code selects all items with a level 0
index value of Industrials
:
In [114]: # get all stocks that are Industrials # note the result drops level 0 of the index multi_fi.xs('Industrials') Out[114]: Price BookValue Symbol MMM 141.14 26.668 ALLE 52.46 0.000 APH 95.71 18.315 AVY 48.20 15.616 BA 132.41 19.870 ... ... ... UNP 196.26 46.957 UPS 102.73 6.790 UTX 115.54 35.252 WM 43.37 12.330 XYL 38.42 12.127 [64 rows x 2 columns]
To select the rows with a specific value of the index at level 1
, use the level parameter. The following code selects rows where the Symbol
component of the index is ALLE
.
In [115]: # select rows where level 1 (Symbol) is ALLE # note that the Sector level is dropped from the result multi_fi.xs('ALLE', level=1) Out[115]: Price BookValue Sector Industrials 52.46 0
To prevent levels from being dropped, you can use the drop_levels=False
option:
In [116]: # Industrials, without dropping the level multi_fi.xs('Industrials', drop_level=False) Out[116]: Price BookValue Sector Symbol Industrials MMM 141.14 26.668 ALLE 52.46 0.000 APH 95.71 18.315 AVY 48.20 15.616 BA 132.41 19.870 ... ... ... UNP 196.26 46.957 UPS 102.73 6.790 UTX 115.54 35.252 WM 43.37 12.330 XYL 38.42 12.127 [64 rows x 2 columns]
To select from a hierarchy of indexes you can chain .xs()
calls with different levels together. The following code selects the row with Industrials
at level 0
and UPS
at level 1
:
In [117]: # drill through the levels multi_fi.xs('Industrials').xs('UPS') Out[117]: Price 102.73 BookValue 6.79 Name: UPS, dtype: float64
An alternate syntax is to pass the values of each level of the hierarchical index as a tuple:
In [118]: # drill through using tuples multi_fi.xs(('Industrials', 'UPS')) Out[118]: Price 102.73 BookValue 6.79 Name: (Industrials, UPS), dtype: float64
Note that .xs()
can only be used for getting, not setting, values.
One of the things I'd like to point out about indexing in pandas, is that a pandas index is its own set of data, not references to data in the Series
or DataFrame
. This is different from how indexes are used in SQL databases, where the index is built upon the actual data in the table. The values in a pandas index can be completely different from the data in the row that it references, and it can be changed as needed to support much more interactive analysis than can be done with SQL.
3.12.136.186