Selecting columns of a DataFrame

Selecting the data in specific columns of a DataFrame is performed by using the [] operator. This can be passed either as a single object, or a list of objects. These objects are then used to lookup columns either by zero-based location, or by matching the objects to the values in the columns index.

Passing a single integer, or a list of integers, to [] will have the DataFrame object attempt to perform a location based lookup of the columns. The following code retrieves the data in the second and third columns:

In [22]:
   # get first and second columns (1 and 2) by location
   sp500[[1, 2]].head()

Out[22]:
            Price  Book Value
   Symbol                    
   MMM     141.14      26.668
   ABT      39.60      15.573
   ABBV     53.95       2.954
   ACN      79.79       8.326
   ACE     102.91      86.897

Selecting columns by passing a list of values will result in another DataFrame, with data copied from the original DataFrame. This is true, even if the list only has a single integer / value, as the following code demonstrates:

In [23]:
   # just the price column
   sp500[[1]].head()

Out[23]:
            Price
   Symbol        
   MMM     141.14
   ABT      39.60
   ABBV     53.95
   ACN      79.79
   ACE     102.91

In [24]:
   # it's a DataFrame, not a Series
   type(sp500[[1]].head())

Out[24]:
   pandas.core.frame.DataFrame

Note that even though we asked for just a single column by position, the value was still in a list passed to the [] operator and hence the double set of brackets [[]]. This is important, as not passing a list always results in a value based lookup of the column.

The following code, therefore, throws an exception as the columns index does not have a value of 1:

In [25]:
   # this is an exception, hence it is commented
   # this tries to find a column named '1'
   # not the row at position 1
   # df = sp500[1]

But, this would work if the dtype of the columns index is an integer and it has a value 1:

In [26]:
   # create a new DataFrame with integers as the column names
   # make sure to use .copy() or change will be in-place
   df = sp500.copy()
   df.columns=[0, 1, 2]
   df.head()

Out[26]:
                                0       1       2
   Symbol                                        
   MMM                Industrials  141.14  26.668
   ABT                Health Care   39.60  15.573
   ABBV               Health Care   53.95   2.954
   ACN     Information Technology   79.79   8.326
   ACE                 Financials  102.91  86.897

In [27]:
   # this is not an exception
   df[1]

Out[27]:
   Symbol
   MMM       141.14
   ABT        39.60
   ABBV       53.95
   ...
   ZMH       101.84
   ZION       28.43
   ZTS        30.53
   Name: 1, Length: 500, dtype: float64

In [28]:
   # because the column names are actually integers
   # and therefore [1] is found as a column
   df.columns

Out[28]:
   Int64Index([0, 1, 2], dtype='int64')

In the preceding code, notice the selection of a single column using a single integer; when the DataFrame has an integer column index, it returns a Series and not a DataFrame.

In [29]:
   # this is a Series not a DataFrame
   type(df[1])

Out[29]:
   pandas.core.series.Series

If you need a DataFrame, you can pass this Series to the constructor of a new DataFrame object, or pass the single value in a list.

If the values passed to [] consist of nonintegers, then the DataFrame will attempt to match the values to the values in the columns index. The following code retrieves the Price column by name:

In [30]:
   # get price column by name
   # result is a Series
   sp500['Price']

Out[30]:
   Symbol
   MMM       141.14
   ABT        39.60
   ABBV       53.95
   ...
   ZMH       101.84
   ZION       28.43
   ZTS        30.53
   Name: Price, Length: 500, dtype: float64

Like with a column selection with a single location, this returned a Series and not a DataFrame object.

Multiple columns can be selected by name by passing a list of the values, and results in a DataFrame (even if a single item is passed in the list). The code is as follows:

In [31]:
   # get Price and Sector columns
   # since a list is passed, the result is a DataFrame
   sp500[['Price', 'Sector']]

Out[31]:
            Price                  Sector
   Symbol                                
   MMM     141.14             Industrials
   ABT      39.60             Health Care
   ABBV     53.95             Health Care
   ACN      79.79  Information Technology
   ACE     102.91              Financials
   ...        ...                     ...
   YHOO     35.02  Information Technology
   YUM      74.77  Consumer Discretionary
   ZMH     101.84             Health Care
   ZION     28.43              Financials
   ZTS      30.53             Health Care

   [500 rows x 2 columns]

Columns can also be retrieved using what is referred to as attribute access. Each column in a DataFrame dynamically adds a property to the DataFrame for each column where the name of the property is the name of the column. Since this selects a single column, the resulting value is a Series:

In [32]:
   # attribute access of the column by name
   sp500.Price

Out[32]:
   Symbol
   MMM       141.14
   ABT        39.60
   ABBV       53.95
   ...
   ZMH       101.84
   ZION       28.43
   ZTS        30.53
   Name: Price, Length: 500, dtype: float64

Note that this will not work for the Book Value column, as the name has a space.

If you do want to find the zero-based location of one or more columns using the name of the column (technically, the value of the index entry of a column), use the .get_loc() method of the columns index:

In [33]:
   # get the position of the column with the value of Price
   loc = sp500.columns.get_loc('Price')
   loc

Out[33]:
   1
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.19.243