Selecting the data in specific columns of a DataFrame
is performed by using the []
operator. This can be passed either as a single object, or a list of objects. These objects are then used to lookup columns either by zero-based location, or by matching the objects to the values in the columns index.
Passing a single integer, or a list of integers, to []
will have the DataFrame
object attempt to perform a location based lookup of the columns. The following code retrieves the data in the second and third columns:
In [22]: # get first and second columns (1 and 2) by location sp500[[1, 2]].head() Out[22]: Price Book Value Symbol MMM 141.14 26.668 ABT 39.60 15.573 ABBV 53.95 2.954 ACN 79.79 8.326 ACE 102.91 86.897
Selecting columns by passing a list of values will result in another DataFrame
, with data copied from the original DataFrame
. This is true, even if the list only has a single integer / value, as the following code demonstrates:
In [23]: # just the price column sp500[[1]].head() Out[23]: Price Symbol MMM 141.14 ABT 39.60 ABBV 53.95 ACN 79.79 ACE 102.91 In [24]: # it's a DataFrame, not a Series type(sp500[[1]].head()) Out[24]: pandas.core.frame.DataFrame
Note that even though we asked for just a single column by position, the value was still in a list passed to the []
operator and hence the double set of brackets [[]]
. This is important, as not passing a list always results in a value based lookup of the column.
The following code, therefore, throws an exception as the columns index does not have a value of 1
:
In [25]: # this is an exception, hence it is commented # this tries to find a column named '1' # not the row at position 1 # df = sp500[1]
But, this would work if the dtype of the columns index is an integer and it has a value 1
:
In [26]: # create a new DataFrame with integers as the column names # make sure to use .copy() or change will be in-place df = sp500.copy() df.columns=[0, 1, 2] df.head() Out[26]: 0 1 2 Symbol MMM Industrials 141.14 26.668 ABT Health Care 39.60 15.573 ABBV Health Care 53.95 2.954 ACN Information Technology 79.79 8.326 ACE Financials 102.91 86.897 In [27]: # this is not an exception df[1] Out[27]: Symbol MMM 141.14 ABT 39.60 ABBV 53.95 ... ZMH 101.84 ZION 28.43 ZTS 30.53 Name: 1, Length: 500, dtype: float64 In [28]: # because the column names are actually integers # and therefore [1] is found as a column df.columns Out[28]: Int64Index([0, 1, 2], dtype='int64')
In the preceding code, notice the selection of a single column using a single integer; when the DataFrame
has an integer column index, it returns a Series
and not a DataFrame
.
In [29]: # this is a Series not a DataFrame type(df[1]) Out[29]: pandas.core.series.Series
If you need a DataFrame
, you can pass this Series
to the constructor of a new DataFrame
object, or pass the single value in a list.
If the values passed to []
consist of nonintegers, then the DataFrame
will attempt to match the values to the values in the columns index. The following code retrieves the Price
column by name:
In [30]: # get price column by name # result is a Series sp500['Price'] Out[30]: Symbol MMM 141.14 ABT 39.60 ABBV 53.95 ... ZMH 101.84 ZION 28.43 ZTS 30.53 Name: Price, Length: 500, dtype: float64
Like with a column selection with a single location, this returned a Series
and not a DataFrame
object.
Multiple columns can be selected by name by passing a list of the values, and results in a DataFrame
(even if a single item is passed in the list). The code is as follows:
In [31]: # get Price and Sector columns # since a list is passed, the result is a DataFrame sp500[['Price', 'Sector']] Out[31]: Price Sector Symbol MMM 141.14 Industrials ABT 39.60 Health Care ABBV 53.95 Health Care ACN 79.79 Information Technology ACE 102.91 Financials ... ... ... YHOO 35.02 Information Technology YUM 74.77 Consumer Discretionary ZMH 101.84 Health Care ZION 28.43 Financials ZTS 30.53 Health Care [500 rows x 2 columns]
Columns can also be retrieved using what is referred to as attribute access. Each column in a DataFrame
dynamically adds a property to the DataFrame
for each column where the name of the property is the name of the column. Since this selects a single column, the resulting value is a Series
:
In [32]: # attribute access of the column by name sp500.Price Out[32]: Symbol MMM 141.14 ABT 39.60 ABBV 53.95 ... ZMH 101.84 ZION 28.43 ZTS 30.53 Name: Price, Length: 500, dtype: float64
Note that this will not work for the Book Value
column, as the name has a space.
If you do want to find the zero-based location of one or more columns using the name of the column (technically, the value of the index entry of a column), use the .get_loc()
method of the columns
index:
In [33]: # get the position of the column with the value of Price loc = sp500.columns.get_loc('Price') loc Out[33]: 1
3.144.19.243