Boolean indexing

We use Boolean indexing to filter or select parts of the data. The operators are as follows:

Operators

Symbol

OR

|

AND

&

NOT

~

These operators must be grouped using parentheses when used together. Using the earlier DataFrame from the previous section, here, we display the trading dates for which the NASDAQ closed above 4300:

In [311]: sharesIndexDataDF.ix[(sharesIndexDataDF['PriceType']=='close') & 
                     (sharesIndexDataDF['Nasdaq']>4300) ]
Out[311]:        PriceType  Nasdaq   S&P 500   Russell 2000
  TradingDate
  2014/02/27   close  4318.93   1854.29   1187.94
  2014/02/28   close  4308.12   1859.45   1183.03
  2 rows × 4 columns

You can also create Boolean conditions in which you use arrays to filter out parts of the data:

In [316]: highSelection=sharesIndexDataDF['PriceType']=='high'
          NasdaqHigh=sharesIndexDataDF['Nasdaq']<4300
          sharesIndexDataDF.ix[highSelection & NasdaqHigh]
Out[316]: TradingDate  PriceType Nasdaq  S&P 500  Russell 2000
          2014/02/21    high     4284.85  1846.13  1168.43

Thus, the preceding code snippet displays the only date in the dataset for which the Nasdaq Composite index stayed below the 4300 level for the entire trading session.

The is in and any all methods

These methods enable the user to achieve more with Boolean indexing than the standard operators used in the preceding sections. The isin method takes a list of values and returns a Boolean array with True at the positions within the Series or DataFrame that match the values in the list. This enables the user to check for the presence of one or more elements within a Series. Here is an illustration using Series:

In [317]:stockSeries=pd.Series(['NFLX','AMZN','GOOG','FB','TWTR'])
           stockSeries.isin(['AMZN','FB'])
Out[317]:0    False
        1     True
        2    False
        3     True
        4    False
        dtype: bool

Here, we use the Boolean array to select a sub-Series containing the values that we're interested in:

In [318]: stockSeries[stockSeries.isin(['AMZN','FB'])]
Out[318]: 1    AMZN
         3      FB
        dtype: object

For our DataFrame example, we switch to a more interesting dataset for those of us who are of a biological anthropology bent, that of classifying Australian mammals (a pet interest of mine):

In [324]: australianMammals=
              {'kangaroo': {'Subclass':'marsupial', 
                              'Species Origin':'native'},
               'flying fox' : {'Subclass':'placental', 
                               'Species Origin':'native'},
               'black rat': {'Subclass':'placental', 
                             'Species Origin':'invasive'},
               'platypus' : {'Subclass':'monotreme', 
                             'Species Origin':'native'},
               'wallaby' :  {'Subclass':'marsupial', 
                             'Species Origin':'native'},
        'palm squirrel' : {'Subclass':'placental', 
                           'Origin':'invasive'},
        'anteater':     {'Subclass':'monotreme', 'Origin':'native'},
        'koala':        {'Subclass':'marsupial', 'Origin':'native'}
}

Note

Some more information on mammals: Marsupials are pouched mammals, monotremes are egg-laying, and placentals give birth to live young. The source of this information is the following: http://en.wikipedia.org/wiki/List_of_mammals_of_Australia.

The is in and any all methods

The source of the preceding image is Bennett's wallaby at http://bit.ly/NG4R7N.

In [328]: ozzieMammalsDF=pd.DataFrame(australianMammals)
In [346]: aussieMammalsDF=ozzieMammalsDF.T; aussieMammalsDF
Out[346]:       Subclass  Origin
  anteater      monotreme	 native
  black rat     placental   invasive
  flying fox    placental   native
  kangaroo      marsupial   native
  koala          marsupial   native
  palm squirrel placental   invasive
  platypus      monotreme	 native
  wallaby   marsupial   native
  8 rows × 2 columns

Let us try to select mammals that are native to Australia:

In [348]: aussieMammalsDF.isin({'Subclass':['marsupial'],'Origin':['native']})
Out[348]:    Subclass Origin
  anteater   False   True
  black rat   False   False
  flying fox   False   True
  kangaroo   True   True
  koala      True   True
  palm squirrel False False
  platypus   False   True
  wallaby   True   True
  8 rows × 2 columns

The set of values passed to isin can be an array or a dictionary. That works somewhat, but we can achieve better results by creating a mask as a combination of the isin and all() methods:

In [349]: nativeMarsupials={'Mammal Subclass':['marsupial'],
                            'Species Origin':['native']}
       nativeMarsupialMask=aussieMammalsDF.isin(nativeMarsupials).all(True)
    aussieMammalsDF[nativeMarsupialMask]
Out[349]:      Subclass   Origin
  kangaroo  marsupial  native
  koala      marsupial  native
  wallaby   marsupial  native
  3 rows × 2 columns

Thus, we see that kangaroo, koala, and wallaby are the native marsupials in our dataset. The any() method returns whether any element is True in a Boolean DataFrame. The all() method filters return whether all elements are True in a Boolean DataFrame.

The source for this is http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.any.html.

Using the where() method

The where() method is used to ensure that the result of Boolean filtering is the same shape as the original data. First, we set the random number generator seed to 100 so that the user can generate the same values as shown next:

In [379]: np.random.seed(100)
       normvals=pd.Series([np.random.normal() for i in np.arange(10)])
    normvals
Out[379]: 0   -1.749765
          1    0.342680
          2    1.153036
          3   -0.252436
          4    0.981321
          5    0.514219
          6    0.221180
          7   -1.070043
          8   -0.189496
          9    0.255001
    dtype: float64

In [381]: normvals[normvals>0]
Out[381]: 1    0.342680
          2    1.153036
          4    0.981321
          5    0.514219
          6    0.221180
          9    0.255001
    dtype: float64

In [382]: normvals.where(normvals>0)
Out[382]: 0         NaN
          1    0.342680
          2    1.153036
          3         NaN
          4    0.981321
          5    0.514219
          6    0.221180
          7         NaN
          8         NaN
          9    0.255001
    dtype: float64

This method appears to be useful only in the case of a Series, as we get this behavior for free in the case of a DataFrame:

In [393]: np.random.seed(100) 
       normDF=pd.DataFrame([[round(np.random.normal(),3) for i in np.arange(5)] for j in range(3)], 
             columns=['0','30','60','90','120'])
    normDF
Out[393]:  0  30  60  90  120
           0  -1.750   0.343   1.153  -0.252   0.981
           1   0.514   0.221  -1.070  -0.189   0.255
           2  -0.458   0.435  -0.584   0.817   0.673
           3 rows × 5 columns
In [394]: normDF[normDF>0]
Out[394]:  0  30  60  90  120
           0   NaN   0.343   1.153   NaN   0.981
           1   0.514   0.221   NaN	  NaN   0.255
           2   NaN   0.435   NaN   0.817   0.673
           3 rows × 5 columns
In [395]: normDF.where(normDF>0)
Out[395]:  0  30  60  90  120
           0   NaN     0.343   1.153   NaN   0.981
           1   0.514   0.221   NaN     NaN   0.255
           2   NaN     0.435   NaN     0.817 0.673
           3   rows × 5 columns

The inverse operation of the where method is mask:

In [396]: normDF.mask(normDF>0)
Out[396]:  0  30  60  90  120
           0  -1.750  NaN   NaN    -0.252  NaN
           1   NaN    NaN  -1.070  -0.189  NaN
           2  -0.458  NaN  -0.584   NaN    NaN
           3  rows × 5 columns

Operations on indexes

To complete this chapter, we will discuss operations on indexes. We sometimes need to operate on indexes when we wish to re-align our data or select it in different ways. There are various operations:

The set_index - allows for the creation of an index on an existing DataFrame and returns an indexed DataFrame.

As we have seen before:

In [939]: stockIndexDataDF=pd.read_csv('./stock_index_data.csv')
In [940]: stockIndexDataDF
Out[940]:   TradingDate  Nasdaq   S&P 500  Russell 2000
  0         2014/01/30   4123.13  1794.19  1139.36
  1         2014/01/31   4103.88  1782.59  1130.88
  2         2014/02/03   3996.96  1741.89  1094.58
  3         2014/02/04   4031.52  1755.20  1102.84
  4         2014/02/05   4011.55  1751.64  1093.59
  5         2014/02/06   4057.12  1773.43  1103.93

Now, we can set the index as follows:

In [941]: stockIndexDF=stockIndexDataDF.set_index('TradingDate')
In [942]: stockIndexDF
Out[942]:    Nasdaq   S&P 500  Russell 2000
  TradingDate
  2014/01/30  4123.13   1794.19  1139.36
  2014/01/31	4103.88   1782.59  1130.88
  2014/02/03  3996.96   1741.89  1094.58
  2014/02/04  4031.52   1755.20  1102.84
  2014/02/05  4011.55   1751.64  1093.59
  2014/02/06  4057.12   1773.43  1103.93

The reset_index reverses set_index:

In [409]: stockIndexDF.reset_index()
Out[409]:    
   TradingDate   Nasdaq   S&P 500  Russell 2000
0   2014/01/30   4123.13   1794.19   1139.36
1   2014/01/31   4103.88   1782.59   1130.88
2   2014/02/03   3996.96   1741.89   1094.58
3   2014/02/04   4031.52   1755.20   1102.84
4   2014/02/05   4011.55   1751.64   1093.59
5   2014/02/06   4057.12   1773.43   1103.93
6 rows × 4 columns
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.129.42.243