In addition to the standard indexing operator []
and attribute operator, there are operators provided in pandas to make the job of indexing easier and more convenient. By label indexing, we generally mean indexing by a header name, which tends to be a string value in most cases. These operators are as follows:
We will now turn our attention to these operators.
The .loc
operator supports pure label-based indexing. It accepts the following as valid inputs:
For our illustrative dataset, we use the average snowy weather temperature data for New York city from the following:
Create DataFrame
In [723]: NYC_SnowAvgsData={'Months' : ['January','February','March', 'April', 'November', 'December'], 'Avg SnowDays' : [4.0,2.7,1.7,0.2,0.2,2.3], 'Avg Precip. (cm)' : [17.8,22.4,9.1,1.5,0.8,12.2], 'Avg Low Temp. (F)' : [27,29,35,45,42,32] } In [724]: NYC_SnowAvgsData Out[724]:{'Avg Low Temp. (F)': [27, 29, 35, 45, 42, 32], 'Avg Precip. (cm)': [17.8, 22.4, 9.1, 1.5, 0.8, 12.2], 'Avg SnowDays': [4.0, 2.7, 1.7, 0.2, 0.2, 2.3], 'Months': ['January', 'February', 'March', 'April', 'November', 'December']} In [726]:NYC_SnowAvgs=pd.DataFrame(NYC_SnowAvgsData, index=NYC_SnowAvgsData['Months'], columns=['Avg SnowDays','Avg Precip. (cm)', 'Avg Low Temp. (F)']) NYC_SnowAvgs Out[726]: Avg SnowDays Avg Precip. (cm) Avg Low Temp. (F) January 4.0 17.8 27 February 2.7 22.4 29 March 1.7 9.1 35 April 0.2 1.5 45 November 0.2 0.8 42 December 2.3 12.2 32
Using a single label:
In [728]: NYC_SnowAvgs.loc['January'] Out[728]: Avg SnowDays 4.0 Avg Precip. (cm) 17.8 Avg Low Temp. (F) 27.0 Name: January, dtype: float64
Using a list of labels:
In [730]: NYC_SnowAvgs.loc[['January','April']] Out[730]: Avg SnowDays Avg Precip. (cm) Avg Low Temp. (F) January 4.0 17.8 27 April 0.2 1.5 45
Using a label range:
In [731]: NYC_SnowAvgs.loc['January':'March'] Out[731]: Avg SnowDays Avg Precip. (cm) Avg Low Temp. (F) January 4.0 17.8 27 February 2.7 22.4 29 March 1.7 9.1 35
Note that while using the .loc
, .iloc
, and .ix
operators on a DataFrame, the row index must always be specified first. This is the opposite of the []
operator, where only columns can be selected directly. Hence, we get an error if we do the following:
In [771]: NYC_SnowAvgs.loc['Avg SnowDays'] KeyError: 'Avg SnowDays'
The correct way to do this is to specifically select all rows by using the colon (:
) operator as follows:
In [772]: NYC_SnowAvgs.loc[:,'Avg SnowDays'] Out[772]: January 4.0 February 2.7 March 1.7 April 0.2 November 0.2 December 2.3 Name: Avg SnowDays, dtype: float64
Here, we see how to select a specific coordinate value, namely the average number of snow days in March:
In [732]: NYC_SnowAvgs.loc['March','Avg SnowDays'] Out[732]: 1.7
This alternative style is also supported:
In [733]: NYC_SnowAvgs.loc['March']['Avg SnowDays'] Out[733]: 1.7
The following is the equivalent of the preceding case using the square bracket operator []
:
In [750]: NYC_SnowAvgs['Avg SnowDays']['March'] Out[750]: 1.7
Note again, however, that specifying the row index value first as is done with the .loc
operator will result in Keyerror
. This is a consequence of the fact discussed previously, that the []
operator cannot be used to select rows directly. The columns must be selected first to obtain a Series, which can then be selected by rows. Thus, you will get KeyError: u'no item named March'
if you use either of the following:
In [757]: NYC_SnowAvgs['March']['Avg SnowDays']
Or
In [758]: NYC_SnowAvgs['March']
We can use the .loc
operator to select the rows instead:
In [759]: NYC_SnowAvgs.loc['March'] Out[759]: Avg SnowDays 1.7 Avg Precip. (cm) 9.1 Avg Low Temp. (F) 35.0 Name: March, dtype: float64
Now, we will show how to select which months have less than one snow day on average by using a Boolean array:
In [763]: NYC_SnowAvgs.loc[NYC_SnowAvgs['Avg SnowDays']<1,:] Out[763]: Avg SnowDays Avg Precip. (cm) Avg Low Temp. (F) April 0.2 1.5 45 November 0.2 0.8 42
Or, in the case of the spot crude prices mentioned earlier, select the column corresponding to the brand of crude that was priced above 110 a barrel for row 2013-Q1:
In [768]: SpotCrudePrices_2013.loc[:,SpotCrudePrices_2013.loc['2013-Q1']>110] Out[768]: UK_Brent 2013-Q1 112.9 2013-Q2 103.0 2013-Q3 110.1 2013-Q4 109.4
Note that the preceding arguments involve the Boolean operators <
and >
that actually evaluate the Boolean arrays, for example:
In [769]: SpotCrudePrices_2013.loc['2013-Q1']>110 Out[769]: Dubai False UK_Brent True West_Texas_Intermediate False Name: 2013-Q1, dtype: bool
The .iloc
operator supports integer-based positional indexing. It accepts the following as inputs:
Let us create the following:
In [777]: import scipy.constants as phys import math In [782]: sci_values=pd.DataFrame([[math.pi, math.sin(math.pi), math.cos(math.pi)], [math.e,math.log(math.e), phys.golden], [phys.c,phys.g,phys.e], [phys.m_e,phys.m_p,phys.m_n]], index=list(range(0,20,5))) Out[782]: 0 1 2 0 3.141593e+00 1.224647e-16 -1.000000e+00 5 2.718282e+00 1.000000e+00 1.618034e+00 10 2.997925e+08 9.806650e+00 1.602177e-19 15 9.109383e-31 1.672622e-27 1.674927e-27
We can select the non-physical constants in the first two rows by using integer slicing:
In [789]: sci_values.iloc[:2] Out[789]: 0 1 2 0 3.141593 1.224647e-16 -1.000000 5 2.718282 1.000000e+00 1.618034
Alternatively, we can use the speed of light and the acceleration of gravity in the third row:
In [795]: sci_values.iloc[2,0:2] Out[795]: 0 2.997925e+08 1 9.806650e+00 dtype: float64
Note that the arguments to .iloc
are strictly positional and have nothing to do with the index values. Hence, consider a case where we mistakenly think that we can obtain the third row by using the following:
In [796]: sci_values.iloc[10] ------------------------------------------------------ IndexError Traceback (most recent call last) ... IndexError: index 10 is out of bounds for axis 0 with size 4
Here, we get IndexError
in the preceding result; so, now, we should use the label-indexing operator .loc
instead, as follows:
In [797]: sci_values.loc[10] Out[797]: 0 2.997925e+08 1 9.806650e+00 2 1.602177e-19 Name: 10, dtype: float64
To slice out a specific row, we can use the following:
In [802]: sci_values.iloc[2:3,:] Out[802]: 0 1 2 10 299792458 9.80665 1.602177e-19
To obtain a cross-section using an integer position, use the following:
In [803]: sci_values.iloc[3] Out[803]: 0 9.109383e-31 1 1.672622e-27 2 1.674927e-27 Name: 15, dtype: float64
If we attempt to slice past the end of the array, we obtain IndexError
as follows:
In [805]: sci_values.iloc[6,:] -------------------------------------------------------- IndexError Traceback (most recent call last) IndexError: index 6 is out of bounds for axis 0 with size 4
The .iat
and .at
operators can be used for a quick selection of scalar values. This is best illustrated as follows:
In [806]: sci_values.iloc[3,0] Out[806]: 9.1093829099999999e-31 In [807]: sci_values.iat[3,0] Out[807]: 9.1093829099999999e-31 In [808]: %timeit sci_values.iloc[3,0] 10000 loops, best of 3: 122 μs per loop In [809]: %timeit sci_values.iat[3,0] 10000 loops, best of 3: 28.4 μs per loop
Thus, we can see that .iat
is much faster than the .iloc
/.ix
operators. The same applies to .at
versus .loc
.
The .ix
operator behaves like a mixture of the .loc
and .iloc
operators, with the .loc
behavior taking precedence. It takes the following as possible inputs:
Let us re-create the following DataFrame by saving the stock index closing price data to a file (stock_index_closing.csv
) and reading it in:
TradingDate,Nasdaq,S&P 500,Russell 2000 2014/01/30,4123.13,1794.19,1139.36 2014/01/31,4103.88,1782.59,1130.88 2014/02/03,3996.96,1741.89,1094.58 2014/02/04,4031.52,1755.2,1102.84 2014/02/05,4011.55,1751.64,1093.59 2014/02/06,4057.12,1773.43,1103.93
The source for this data is http://www.economagic.com/sp.htm#Daily. Here's how we read the CSV data into a DataFrame:
In [939]: stockIndexDataDF=pd.read_csv('./stock_index_data.csv') In [940]: stockIndexDataDF Out[940]: TradingDate Nasdaq S&P 500 Russell 2000 0 2014/01/30 4123.13 1794.19 1139.36 1 2014/01/31 4103.88 1782.59 1130.88 2 2014/02/03 3996.96 1741.89 1094.58 3 2014/02/04 4031.52 1755.20 1102.84 4 2014/02/05 4011.55 1751.64 1093.59 5 2014/02/06 4057.12 1773.43 1103.93
What we see from the preceding example is that the DataFrame created has an integer-based row index. We promptly set the index to be the trading date to index it based on the trading date so that we can use the .ix
operator:
In [941]: stockIndexDF=stockIndexDataDF.set_index('TradingDate') In [942]:stockIndexDF Out[942]: Nasdaq S&P 500 Russell 2000 TradingDate 2014/01/30 4123.13 1794.19 1139.36 2014/01/31 4103.88 1782.59 1130.88 2014/02/03 3996.96 1741.89 1094.58 2014/02/04 4031.52 1755.20 1102.84 2014/02/05 4011.55 1751.64 1093.59 2014/02/06 4057.12 1773.43 1103.93
We now show examples of using the .ix
operator:
Using a single label:
In [927]: stockIndexDF.ix['2014/01/30'] Out[927]: Nasdaq 4123.13 S&P 500 1794.19 Russell 2000 1139.36 Name: 2014/01/30, dtype: float64
Using a list of labels:
In [928]: stockIndexDF.ix[['2014/01/30']] Out[928]: Nasdaq S&P 500 Russell 2000 2014/01/30 4123.13 1794.19 1139.36 In [930]: stockIndexDF.ix[['2014/01/30','2014/01/31']] Out[930]: Nasdaq S&P 500 Russell 2000 2014/01/30 4123.13 1794.19 1139.36 2014/01/31 4103.88 1782.59 1130.88
Note the difference in the output between using a single label versus using a list containing just a single label. The former results in a series and the latter, a DataFrame:
In [943]: type(stockIndexDF.ix['2014/01/30']) Out[943]: pandas.core.series.Series In [944]: type(stockIndexDF.ix[['2014/01/30']]) Out[944]: pandas.core.frame.DataFrame
For the former, the indexer is a scalar; for the latter, the indexer is a list. A list indexer is used to select multiple columns. A multi-column slice of a DataFrame can only result in another DataFrame since it is 2D; hence, what is returned in the latter case is a DataFrame.
Using a label-based slice:
In [932]: tradingDates=stockIndexDataDF.TradingDate In [934]: stockIndexDF.ix[tradingDates[:3]] Out[934]: Nasdaq S&P 500 Russell 2000 2014/01/30 4123.13 1794.19 1139.36 2014/01/31 4103.88 1782.59 1130.88 2014/02/03 3996.96 1741.89 1094.58
Using a single integer:
In [936]: stockIndexDF.ix[0] Out[936]: Nasdaq 4123.13 S&P 500 1794.19 Russell 2000 1139.36 Name: 2014/01/30, dtype: float64
Using a list of integers:
In [938]: stockIndexDF.ix[[0,2]] Out[938]: Nasdaq S&P 500 Russell 2000 TradingDate 2014/01/30 4123.13 1794.19 1139.36 2014/02/03 3996.96 1741.89 1094.58
Using an integer slice:
In [947]: stockIndexDF.ix[1:3] Out[947]: Nasdaq S&P 500 Russell 2000 TradingDate 2014/01/31 4103.88 1782.59 1130.88 2014/02/03 3996.96 1741.89 1094.58
Using a Boolean array:
In [949]: stockIndexDF.ix[stockIndexDF['Russell 2000']>1100] Out[949]: Nasdaq S&P 500 Russell 2000 TradingDate 2014/01/30 4123.13 1794.19 1139.36 2014/01/31 4103.88 1782.59 1130.88 2014/02/04 4031.52 1755.20 1102.84 2014/02/06 4057.12 1773.43 1103.93
As in the case of .loc
, the row index must be specified first for the .ix
operator.
We now turn to the topic of MultiIndexing. Multi-level or hierarchical indexing is useful because it enables the pandas user to select and massage data in multiple dimensions by using data structures such as Series and DataFrame. In order to start, let us save the following data to a file: stock_index_prices.csv
and read it in:
TradingDate,PriceType,Nasdaq,S&P 500,Russell 2000 2014/02/21,open,4282.17,1841.07,1166.25 2014/02/21,close,4263.41,1836.25,1164.63 2014/02/21,high,4284.85,1846.13,1168.43 2014/02/24,open,4273.32,1836.78,1166.74 2014/02/24,close,4292.97,1847.61,1174.55 2014/02/24,high,4311.13,1858.71,1180.29 2014/02/25,open,4298.48,1847.66,1176 2014/02/25,close,4287.59,1845.12,1173.95 2014/02/25,high,4307.51,1852.91,1179.43 2014/02/26,open,4300.45,1845.79,1176.11 2014/02/26,close,4292.06,1845.16,1181.72 2014/02/26,high,4316.82,1852.65,1188.06 2014/02/27,open,4291.47,1844.9,1179.28 2014/02/27,close,4318.93,1854.29,1187.94 2014/02/27,high,4322.46,1854.53,1187.94 2014/02/28,open,4323.52,1855.12,1189.19 2014/02/28,close,4308.12,1859.45,1183.03 2014/02/28,high,4342.59,1867.92,1193.5 In [950]:sharesIndexDataDF=pd.read_csv('./stock_index_prices.csv') In [951]: sharesIndexDataDF Out[951]: TradingDate PriceType Nasdaq S&P 500 Russell 2000 0 2014/02/21 open 4282.17 1841.07 1166.25 1 2014/02/21 close 4263.41 1836.25 1164.63 2 2014/02/21 high 4284.85 1846.13 1168.43 3 2014/02/24 open 4273.32 1836.78 1166.74 4 2014/02/24 close 4292.97 1847.61 1174.55 5 2014/02/24 high 4311.13 1858.71 1180.29 6 2014/02/25 open 4298.48 1847.66 1176.00 7 2014/02/25 close 4287.59 1845.12 1173.95 8 2014/02/25 high 4307.51 1852.91 1179.43 9 2014/02/26 open 4300.45 1845.79 1176.11 10 2014/02/26 close 4292.06 1845.16 1181.72 11 2014/02/26 high 4316.82 1852.65 1188.06 12 2014/02/27 open 4291.47 1844.90 1179.28 13 2014/02/27 close 4318.93 1854.29 1187.94 14 2014/02/27 high 4322.46 1854.53 1187.94 15 2014/02/28 open 4323.52 1855.12 1189.19 16 2014/02/28 close 4308.12 1859.45 1183.03 17 2014/02/28 high 4342.59 1867.92 1193.50
Here, we create a MultiIndex from the trading date and priceType
columns:
In [958]: sharesIndexDF=sharesIndexDataDF.set_index(['TradingDate','PriceType']) In [959]: mIndex=sharesIndexDF.index; mIndex Out[959]: MultiIndex [(u'2014/02/21', u'open'), (u'2014/02/21', u'close'), (u'2014/02/21', u'high'), (u'2014/02/24', u'open'), (u'2014/02/24', u'close'), (u'2014/02/24', u'high'), (u'2014/02/25', u'open'), (u'2014/02/25', u'close'), (u'2014/02/25', u'high'), (u'2014/02/26', u'open'), (u'2014/02/26', u'close'), (u'2014/02/26', u'high'), (u'2014/02/27', u'open'), (u'2014/02/27', u'close'), (u'2014/02/27', u'high'), (u'2014/02/28', u'open'), (u'2014/02/28', u'close'), (u'2014/02/28', u'high')] In [960]: sharesIndexDF Out[960]: Nasdaq S&P 500 Russell 2000 TradingDate PriceType 2014/02/21 open 4282.17 1841.07 1166.25 close 4263.41 1836.25 1164.63 high 4284.85 1846.13 1168.43 2014/02/24 open 4273.32 1836.78 1166.74 close 4292.97 1847.61 1174.55 high 4311.13 1858.71 1180.29 2014/02/25 open 4298.48 1847.66 1176.00 close 4287.59 1845.12 1173.95 high 4307.51 1852.91 1179.43 2014/02/26 open 4300.45 1845.79 1176.11 close 4292.06 1845.16 1181.72 high 4316.82 1852.65 1188.06 2014/02/27 open 4291.47 1844.90 1179.28 close 4318.93 1854.29 1187.94 high 4322.46 1854.53 1187.94 2014/02/28 open 4323.52 1855.12 1189.19 close 4308.12 1859.45 1183.03 high 4342.59 1867.92 1193.50
Upon inspection, we see that the MultiIndex consists of a list of tuples. Applying the get_level_values
function with the appropriate argument produces a list of the labels for each level of the index:
In [962]: mIndex.get_level_values(0) Out[962]: Index([u'2014/02/21', u'2014/02/21', u'2014/02/21', u'2014/02/24', u'2014/02/24', u'2014/02/24', u'2014/02/25', u'2014/02/25', u'2014/02/25', u'2014/02/26', u'2014/02/26', u'2014/02/26', u'2014/02/27', u'2014/02/27', u'2014/02/27', u'2014/02/28', u'2014/02/28', u'2014/02/28'], dtype=object) In [963]: mIndex.get_level_values(1) Out[963]: Index([u'open', u'close', u'high', u'open', u'close', u'high', u'open', u'close', u'high', u'open', u'close', u'high', u'open', u'close', u'high', u'open', u'close', u'high'], dtype=object)
However, IndexError
will be thrown if the value passed to get_level_values()
is invalid or out of range:
In [88]: mIndex.get_level_values(2) --------------------------------------------------------- IndexError Traceback (most recent call last) ...
You can achieve hierarchical indexing with a MultiIndexed DataFrame:
In [971]: sharesIndexDF.ix['2014/02/21'] Out[971]: Nasdaq S&P 500 Russell 2000 PriceType open 4282.17 1841.07 1166.25 close 4263.41 1836.25 1164.63 high 4284.85 1846.13 1168.43 In [976]: sharesIndexDF.ix['2014/02/21','open'] Out[976]: Nasdaq 4282.17 S&P 500 1841.07 Russell 2000 1166.25 Name: (2014/02/21, open), dtype: float64
We can slice using a MultiIndex:
In [980]: sharesIndexDF.ix['2014/02/21':'2014/02/24'] Out[980]: Nasdaq S&P 500 Russell 2000 TradingDate PriceType 2014/02/21 open 4282.17 1841.07 1166.25 close 4263.41 1836.25 1164.63 high 4284.85 1846.13 1168.43 2014/02/24 open 4273.32 1836.78 1166.74 close 4292.97 1847.61 1174.55 high 4311.13 1858.71 1180.29
We can try slicing at a lower level:
In [272]: sharesIndexDF.ix[('2014/02/21','open'):('2014/02/24','open')] ------------------------------------------------------------------ KeyError Traceback (most recent call last) <ipython-input-272-65bb3364d980> in <module>() ----> 1 sharesIndexDF.ix[('2014/02/21','open'):('2014/02/24','open')] ... KeyError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'
However, this results in KeyError
with a rather strange error message. The key lesson to be learned here is that the current incarnation of MultiIndex requires the labels to be sorted for the lower-level slicing routines to work correctly.
In order to do this, you can utilize the sortlevel()
method, which sorts the labels of an axis within a MultiIndex. To be on the safe side, sort first before slicing with a MultiIndex. Thus, we can do the following:
In [984]: sharesIndexDF.sortlevel(0).ix[('2014/02/21','open'):('2014/02/24','open')] Out[984]: Nasdaq S&P 500 Russell 2000 TradingDate PriceType 2014/02/21 open 4282.17 1841.07 1166.25 2014/02/24 close 4292.97 1847.61 1174.55 high 4311.13 1858.71 1180.29 open 4273.32 1836.78 1166.74
We can also pass a list of tuples:
In [985]: sharesIndexDF.ix[[('2014/02/21','close'),('2014/02/24','open')]] Out[985]: Nasdaq S&P 500 Russell 2000 TradingDate PriceType 2014/02/21 close 4263.41 1836.25 1164.63 2014/02/24 open 4273.32 1836.78 1166.74 2 rows × 3 columns
The swaplevel
function enables levels within the MultiIndex to be swapped:
In [281]: swappedDF=sharesIndexDF[:7].swaplevel(0, 1, axis=0) swappedDF Out[281]: Nasdaq S&P 500 Russell 2000 PriceType TradingDate open 2014/02/21 4282.17 1841.07 1166.25 close 2014/02/21 4263.41 1836.25 1164.63 high 2014/02/21 4284.85 1846.13 1168.43 open 2014/02/24 4273.32 1836.78 1166.74 close 2014/02/24 4292.97 1847.61 1174.55 high 2014/02/24 4311.13 1858.71 1180.29 open 2014/02/25 4298.48 1847.66 1176.00 7 rows × 3 columns
The reorder_levels
function is more general, allowing you to specify the order of the levels:
In [285]: reorderedDF=sharesIndexDF[:7].reorder_levels(['PriceType', 'TradingDate'], axis=0) reorderedDF Out[285]: Nasdaq S&P 500 Russell 2000 PriceType TradingDate open 2014/02/21 4282.17 1841.07 1166.25 close 2014/02/21 4263.41 1836.25 1164.63 high 2014/02/21 4284.85 1846.13 1168.43 open 2014/02/24 4273.32 1836.78 1166.74 close 2014/02/24 4292.97 1847.61 1174.55 high 2014/02/24 4311.13 1858.71 1180.29 open 2014/02/25 4298.48 1847.66 1176.00 7 rows × 3 columns
The xs
method provides a shortcut means of selecting data based on a particular index level value:
In [287]: sharesIndexDF.xs('open',level='PriceType') Out[287]: Nasdaq S&P 500 Russell 2000 TradingDate 2014/02/21 4282.17 1841.07 1166.25 2014/02/24 4273.32 1836.78 1166.74 2014/02/25 4298.48 1847.66 1176.00 2014/02/26 4300.45 1845.79 1176.11 2014/02/27 4291.47 1844.90 1179.28 2014/02/28 4323.52 1855.12 1189.19 6 rows × 3 columns
The more long-winded alternative to the preceding command would be to use swaplevel
to switch between the TradingDate
and PriceType
levels and then, perform the selection as follows:
In [305]: sharesIndexDF.swaplevel(0, 1, axis=0).ix['open'] Out[305]: Nasdaq S&P 500 Russell 2000 TradingDate 2014/02/21 4282.17 1841.07 1166.25 2014/02/24 4273.32 1836.78 1166.74 2014/02/25 4298.48 1847.66 1176.00 2014/02/26 4300.45 1845.79 1176.11 2014/02/27 4291.47 1844.90 1179.28 2014/02/28 4323.52 1855.12 1189.19 6 rows × 3 columns
Using .xs
achieves the same effect as obtaining a cross-section in the previous section on integer-oriented indexing.
3.129.210.91