Chapter 4. Operations in pandas, Part I – Indexing and Selecting

In this chapter, we will focus on the indexing and selection of data from pandas objects. This is important since effective use of pandas requires a good knowledge of the indexing and selection of data. The topics that we will address in this chapter include the following:

  • Basic indexing
  • Label, integer, and mixed indexing
  • MultiIndexing
  • Boolean indexing
  • Operations on indexes

Basic indexing

We have already discussed basic indexing on Series and DataFrames in the previous chapter, but here we will include some examples for the sake of completeness. Here, we list a time series of crude oil spot prices for the 4 quarters of 2013, taken from IMF data: http://www.imf.org/external/np/res/commod/pdf/monthly/011014.pdf.

In [642]:SpotCrudePrices_2013_Data={
                'U.K. Brent' : {'2013-Q1':112.9, '2013-Q2':103.0, '2013-Q3':110.1, '2013-Q4':109.4},
                'Dubai':{'2013-Q1':108.1, '2013-Q2':100.8, '2013-Q3':106.1,'2013-Q4':106.7},
                'West Texas Intermediate':{'2013-Q1':94.4, '2013-Q2':94.2, '2013-Q3':105.8,'2013-Q4':97.4}}
                
             SpotCrudePrices_2013=pd.DataFrame.from_dict(SpotCrudePrices_2013_Data)
          SpotCrudePrices_2013
Out[642]:        Dubai   U.K. Brent  West Texas Intermediate
        2013-Q1  108.1   112.9          94.4
        2013-Q2  100.8   103.0          94.2
        2013-Q3  106.1   110.1          105.8
        2013-Q4  106.7   109.4          97.4

We can select the prices for the available time periods of Dubai crude oil by using the [] operator:

In [644]: dubaiPrices=SpotCrudePrices_2013['Dubai']; dubaiPrices
Out[644]: 2013-Q1    108.1
          2013-Q2    100.8
          2013-Q3    106.1
          2013-Q4    106.7
          Name: Dubai, dtype: float64

We can pass a list of columns to the [] operator in order to select the columns in a particular order:

In [647]: SpotCrudePrices_2013[['West Texas Intermediate','U.K. Brent']]
Out[647]:          West Texas Intermediate        U.K. Brent
         2013-Q1                      94.4        112.9
         2013-Q2                      94.2        103.0
         2013-Q3                     105.8        110.1
         2013-Q4                      97.4        109.4

If we specify a column that is not listed in the DataFrame, we will get a KeyError exception:

In [649]: SpotCrudePrices_2013['Brent Blend']
          --------------------------------------------------------
          KeyError                                  Traceback (most recent call last)
          <ipython-input-649-cd2d76b24875> in <module>()
          ...
          KeyError: u'no item named Brent Blend'

We can avoid this error by using the get operator and specifying a default value in the case when the column is not present, as follows:

In [650]: SpotCrudePrices_2013.get('Brent Blend','N/A')
Out[650]: 'N/A'

Note

Note that rows cannot be selected with the bracket operator [] in a DataFrame.

Hence, we get an error in the following case:

In [755]:SpotCrudePrices_2013['2013-Q1']
         --------------------------------------------------
         KeyError        Traceback (most recent call last)
         ...
         KeyError: u'no item named 2013-Q1'

This was a design decision made by the creators in order to avoid ambiguity. In the case of a Series, where there is no ambiguity, selecting rows by using the [] operator works:

In [756]: dubaiPrices['2013-Q1']
Out[756]: 108.1

We shall see how we can perform row selection by using one of the newer indexing operators later in this chapter.

Accessing attributes using dot operator

One can retrieve values from a Series, DataFrame, or Panel directly as an attribute as follows:

In [650]: SpotCrudePrices_2013.Dubai
Out[650]: 2013-Q1    108.1
          2013-Q2    100.8
          2013-Q3    106.1
          2013-Q4    106.7
          Name: Dubai, dtype: float64

However, this only works if the index element is a valid Python identifier as follows:

In [653]: SpotCrudePrices_2013."West Texas Intermediate"
            File "<ipython-input-653-2a782563c15a>", line 1
            SpotCrudePrices_2013."West Texas Intermediate"
                                                 ^
            SyntaxError: invalid syntax

Otherwise, we get SyntaxError as in the preceding case because of the space in the column name. A valid Python identifier must follow the following lexical convention:

identifier::= (letter|"_") (letter | digit | "_")*

Thus, a valid Python identifier cannot contain a space. See the Python Lexical Analysis documents for more details at http://docs.python.org/2.7/reference/lexical_analysis.html#identifiers.

We can resolve this by renaming the column index names so that they are all valid identifiers:

In [654]: SpotCrudePrices_2013
Out[654]:               Dubai    U.K. Brent     West Texas Intermediate
         2013-Q1        108.1    112.9          94.4
         2013-Q2        100.8    103.0          94.2
         2013-Q3        106.1    110.1          105.8
         2013-Q4        106.7    109.4          97.4

In [655]:SpotCrudePrices_2013.columns=['Dubai','UK_Brent', 
                                       'West_Texas_Intermediate']
SpotCrudePrices_2013
Out[655]:         Dubai    UK_Brent       West_Texas_Intermediate
         2013-Q1  108.1    112.9          94.4
         2013-Q2  100.8    103.0          94.2
         2013-Q3  106.1    110.1          105.8
         2013-Q4  106.7    109.4          97.4

We can then select the prices for West Texas Intermediate as desired:

In [656]:SpotCrudePrices_2013.West_Texas_Intermediate
Out[656]:2013-Q1     94.4
         2013-Q2     94.2
         2013-Q3    105.8
         2013-Q4     97.4
         Name: West_Texas_Intermediate, dtype: float64

We can also select prices by specifying a column index number to select column 1 (U.K. Brent) as follows:

In [18]: SpotCrudePrices_2013[[1]]
Out[18]:        U.K. Brent
        2013-Q1  112.9
        2013-Q2  103.0
        2013-Q3  110.1
        2013-Q4  109.4

Range slicing

As we saw in the section on NumPy ndarrays in Chapter 3, The pandas Data structures, we can slice a range by using the [] operator. The syntax of the slicing operator exactly matches that of NumPy:

ar[startIndex: endIndex: stepValue]

where the default values if not specified are as follows:

  • 0 for startIndex
  • arraysize -1 for endIndex
  • 1 for stepValue

For a DataFrame, [] slices across rows as follows:

Obtain the first 2 rows:

In [675]: SpotCrudePrices_2013[:2]
Out[675]:        Dubai  UK_Brent West_Texas_Intermediate
        2013-Q1  108.1   112.9   94.4
        2013-Q2  100.8   103.0   94.2

Obtain all rows starting from index 2:

In [662]: SpotCrudePrices_2013[2:]
Out[662]:        Dubai  UK_Brent West_Texas_Intermediate
        2013-Q3  106.1   110.1   105.8
        2013-Q4  106.7   109.4   97.4

Obtain rows at intervals of two, starting from row 0:

In [664]: SpotCrudePrices_2013[::2]
Out[664]:        Dubai  UK_Brent West_Texas_Intermediate
        2013-Q1  108.1   112.9   94.4
        2013-Q3  106.1   110.1   105.8

Reverse the order of rows in DataFrame:

In [677]: SpotCrudePrices_2013[::-1]
Out[677]:        Dubai  UK_Brent West_Texas_Intermediate
        2013-Q4  106.7   109.4   97.4
        2013-Q3  106.1   110.1   105.8
        2013-Q2  100.8   103.0   94.2
        2013-Q1  108.1   112.9   94.4

For a Series, the behavior is just as intuitive:

In [666]: dubaiPrices=SpotCrudePrices_2013['Dubai']

Obtain the last 3 rows or all rows but the first:

In [681]: dubaiPrices[1:]
Out[681]: 2013-Q2    100.8
          2013-Q3    106.1
          2013-Q4    106.7
          Name: Dubai, dtype: float64

Obtain all rows but the last:

In [682]: dubaiPrices[:-1]
Out[682]: 2013-Q1    108.1
          2013-Q2    100.8
          2013-Q3    106.1
          Name: Dubai, dtype: float64

Reverse the rows:

In [683]: dubaiPrices[::-1]
Out[683]: 2013-Q4    106.7
          2013-Q3    106.1
          2013-Q2    100.8
          2013-Q1    108.1
          Name: Dubai, dtype: float64
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.129.194.123