Boolean selection

Items in a Series can be selected by value, rather than by index label, using a Boolean selection. A Boolean selection applies a logical expression to the values of the Series and returns a new Series of Boolean values, one for each item, giving the result of the expression for that value. The following code identifies the items in a Series whose values are greater than 5:

In [58]:
   # which rows have values that are > 5?
   s = pd.Series(np.arange(0, 10))
   s > 5

Out[58]:
   0    False
   1    False
   2    False
   3    False
   4    False
   5    False
   6     True
   7     True
   8     True
   9     True
   dtype: bool

To obtain the rows in the Series where the logical expression is True, pass the result of the Boolean expression to the [] operator of the Series. The result is a new Series containing copies of the index labels and values of the selected rows:

In [59]:
   # select rows where values are > 5
   logicalResults = s > 5
   s[logicalResults]

Out[59]:
   6    6
   7    7
   8    8
   9    9
   dtype: int64
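
As a minimal sketch of what "a new Series with a copy" means in practice, the following changes a value in the selection and checks that the original Series is unaffected. The variable name selected and the replacement value 100 are illustrative choices, not part of the example above.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(0, 10))

# Boolean selection returns a new Series holding the matching rows
selected = s[s > 5]

# change a value in the selection (6 is an index label here)
selected.loc[6] = 100

print(selected.loc[6])  # value stored in the selection
print(s.loc[6])         # value in the original Series, which is unchanged
```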

pandas performs this Boolean selection by overloading the Series object's [] operator. When the operator is passed a Series of Boolean values, it returns only those items of the outer Series (in this case, s) whose labels have True values in the Series passed to [].

This is very similar to how selection works in R, and it can feel a bit unnatural at first to someone used to a procedural or statistical programming language. However, it turns out to be a valuable and efficient way to express many types of data analysis algorithms, and a convenient way to extract subsets of data based on their contents.
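
The matching of labels between the Boolean Series and the outer Series can be sketched as follows. The names s2 and mask, the string labels, and the values are hypothetical; the mask deliberately lists its labels in a different order so that matching by label and matching by position would give different results.

```python
import pandas as pd

# a Series with string labels, so labels cannot be confused with positions
s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])

# a Boolean Series with the same labels in a different order;
# only label "c" is marked True
mask = pd.Series([True, False, False], index=["c", "b", "a"])

# the mask is matched to s2 by label, not by position
selected = s2[mask]
print(selected)
```

If the mask were applied by position, the first item (label "a") would be selected; because it is matched by label, the item labeled "c" is returned instead.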

There is a shortcut syntax for this operation. You can write the logical expression, which uses the name of the Series, directly inside the [] operator, as follows:

In [60]:
   # a little shorter version
   s[s > 5]

Out[60]:
   6    6
   7    7
   8    8
   9    9
   dtype: int64

Unfortunately, multiple conditions cannot be combined using Python's normal and/or keywords. As an example, the following causes an exception to be thrown:

In [61]:
   # commented as it throws an exception
   # s[s > 5 and s < 8]

The preceding code fails because Python's and keyword tries to reduce each Series to a single True or False value, which is ambiguous for a Series with more than one item. The solution is to express the condition differently, putting parentheses around each logical condition and using the & operator in place of and, and the | operator in place of or:

In [62]:
   # correct syntax
   s[(s > 5) & (s < 8)]

Out[62]:
   6    6
   7    7
   dtype: int64
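
A short sketch of the failure and of the element-wise operators, assuming the same s as above. The ~ operator (element-wise not) and the variable names are additions for illustration and do not appear in the preceding examples. The parentheses are needed because & and | bind more tightly than comparison operators in Python.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(0, 10))

# the `and` keyword asks for a single truth value of a whole Series,
# which pandas refuses to guess, so a ValueError is raised
try:
    s[s > 5 and s < 8]
except ValueError as err:
    message = str(err)
    print("ValueError:", message)

# element-wise operators, with each condition in parentheses
both = s[(s > 5) & (s < 8)]     # and
either = s[(s < 2) | (s > 7)]   # or
negated = s[~(s > 5)]           # not

print(both.tolist(), either.tolist(), negated.tolist())
```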

It is possible to determine whether all the values in a Series match a given expression using the .all() method. The following asks if all elements in the series are greater than or equal to 0:

In [63]:
   # are all items >= 0?
   (s >= 0).all()

Out[63]:
   True

The .any() method returns True if any value satisfies the expression. The following asks whether any elements are less than 2:

In [64]:
   # any items < 2?
   (s < 2).any()

Out[64]:
   True

Note

Apply .any() to the Boolean result, as in (s < 2).any(), rather than to the selected values, as in s[s < 2].any(). The second form tests whether any of the selected values is itself non-zero, so with this data it returns True only because the selected value 1 is non-zero.
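
A minimal sketch of the difference between calling .any() on the Boolean Series and calling it on the selected values, using a condition (s < 1) that only the value 0 satisfies. The condition and the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(0, 10))

# asks: does any item satisfy s < 1?
on_mask = (s < 1).any()

# asks: is any of the selected values non-zero?
# the only selected value is 0, so the answer differs
on_values = s[s < 1].any()

print(on_mask, on_values)
```

Calling .any() or .all() on the Boolean Series itself answers the question about the condition regardless of what the matching values happen to be.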

One more point is worth mentioning. The result of each of these logical expressions is a Series of True and False values. When the .sum() method is called on a Series of Boolean values, it treats True as 1 and False as 0. The following uses this to determine the number of items in a Series that satisfy a given expression:

In [65]:
   # how many values < 2?
   (s < 2).sum()

Out[65]:
   2
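
The same treatment of True as 1 and False as 0 applies to other numeric reductions. As a sketch that goes slightly beyond the example above, the following uses .mean() to obtain the fraction of items that satisfy the expression; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(0, 10))

matches = s < 2

# number of items that satisfy the expression
count = int(matches.sum())

# fraction of items that satisfy the expression (count / number of items)
fraction = float(matches.mean())

print(count, fraction)
```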