Elements of an array or Series
are selected using the []
operator. DataFrame
overloads []
to select columns instead of rows, except for a specific case of slicing. Therefore, most operations of selection of one or more rows in a DataFrame
, require alternate methods to using []
.
Understanding this is important in pandas, as it is a common mistake is try and select rows using []
due to familiarity with other languages or data structures. When doing so, errors are often received, and can often be difficult to diagnose without realizing []
is working along a completely different axis than with a Series
object.
Row selection using the index on a DataFrame
then breaks down to the following general categories of operations:
We will briefly examine each of these techniques and attributes. Remember, all of these are working against the content of the index of the DataFrame
. There is no involvement with data in the columns in the selection of the rows. We will cover that in the next section on Boolean selection.
Slicing a DataFrame
across its index is syntactically identical to performing the same on a Series
. Because of this, we will not go into the details of the various permutations of slices in this section, and only give representative examples applied to a DataFrame
.
Slicing works along both positions and labels. The following code demonstrates several examples of slicing by position:
In [34]: # first five rows sp500[:5] Out[34]: Sector Price Book Value Symbol MMM Industrials 141.14 26.668 ABT Health Care 39.60 15.573 ABBV Health Care 53.95 2.954 ACN Information Technology 79.79 8.326 ACE Financials 102.91 86.897
The following code returns rows starting with the ABT
label through the ACN
label:
In [35]: # ABT through ACN labels sp500['ABT':'ACN'] Out[35]: Sector Price Book Value Symbol ABT Health Care 39.60 15.573 ABBV Health Care 53.95 2.954 ACN Information Technology 79.79 8.326
In general, although slicing of a DataFrame
has its uses, high performance systems tend to shy away from it and use other methods. Additionally, the slice notation for rows on a DataFrame
using integers can be confusing, as it looks like accessing columns by position, and hence can lead to subtle bugs.
Rows can be retrieved via an index label value using .loc[]
. This is shown in the following code:
In [36]: # get row with label MMM # returned as a Series sp500.loc['MMM'] Out[36]: Sector Industrials Price 141.14 Book Value 26.668 Name: MMM, dtype: object In [37]: # rows with label MMM and MSFT # this is a DataFrame result sp500.loc[['MMM', 'MSFT']] Out[37]: Sector Price Book Value Symbol MMM Industrials 141.14 26.668 MSFT Information Technology 40.12 10.584
Rows can also be retrieved by location using .iloc[]
:
In [38]: # get rows in locations 0 and 2 sp500.iloc[[0, 2]] Out[38]: Sector Price Book Value Symbol MMM Industrials 141.14 26.668 ABBV Health Care 53.95 2.954
It is possible to look up the location in the index of a specific label value, which can then be used to retrieve the row(s):
In [39]: # get the location of MMM and A in the index i1 = sp500.index.get_loc('MMM') i2 = sp500.index.get_loc('A') "{0} {1}".format(i1, i2) Out[39]: '0 10' In [40]: # and get the rows sp500.iloc[[i1, i2]] Out[40]: Sector Price Book Value Symbol MMM Industrials 141.14 26.668 A Health Care 56.18 16.928
Like a Series
, a DataFrame
also contains an .ix
property that can be used to lookup rows, either by index label or location, essentially combining .loc
and .iloc
in one. The following looks up rows by index label by passing a list of nonintegers:
In [41]: # by label sp500.ix[['MSFT', 'ZTS']] Out[41]: Sector Price Book Value Symbol MSFT Information Technology 40.12 10.584 ZTS Health Care 30.53 2.150
Location-based lookup can be performed by passing a list of integers:
In [42]: # by location sp500.ix[[10, 200, 450]] Out[42]: Sector Price Book Value Symbol A Health Care 56.18 16.928 GIS Consumer Staples 53.81 10.236 TRV Financials 92.86 73.056
In general, use of .ix
is not preferred due to potential confusion, and use of .loc
and .iloc
is recommended and gives higher performance.
Scalar values can be looked up by label using .at
, by passing both the row label and then the column name/value:
In [43]: # by label in both the index and column sp500.at['MMM', 'Price'] Out[43]: 141.14
Scalar values can also be looked up by location using .iat
by passing both the row location and then the column location. This is the preferred method of accessing single values and gives the highest performance.
In [44]: # by location. Row 0, column 1 sp500.iat[0, 1] Out[44]: 141.14
3.16.79.147