Discretization and binning

Often when working with continuous datasets, we need to convert them into discrete or interval forms. Each interval is referred to as a bin, and hence the name binning comes into play:

  1. Let's say we have data on the heights of a group of students as follows:
height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]

We want to convert this dataset into the intervals 118 to 125, 126 to 135, 136 to 160, and finally 161 and higher.

  1. To convert the preceding dataset into intervals, we can use the cut() method provided by the pandas library:
import pandas as pd
bins = [118, 125, 135, 160, 200]
category = pd.cut(height, bins)
category

The output of the preceding code is as follows:

[(118, 125], (118, 125], (118, 125], (125, 135], (118, 125], ..., (125, 135], (160, 200], (135, 160], (135, 160], (125, 135]]
Length: 12
Categories (4, interval[int64]): [(118, 125] < (125, 135] < (135, 160] < (160, 200]]

If you look closely at the output, you'll see that there are mathematical notations for intervals. Do you recall what these parentheses mean from your elementary mathematics class? If not, here is a quick recap:

  • A parenthesis indicates that the side is open. 
  • A square bracket means that it is closed or inclusive. 

From the preceding code block, (118, 125] means the left-hand side is open and the right-hand side is closed. This is mathematically denoted as follows:

118 < x ≤ 125

Hence, 118 is not included, but anything greater than 118 is included, while 125 is included in the interval.
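The open and closed ends can be checked directly with a small sketch. It builds a single pandas Interval (the variable name interval is our own choice) and tests which endpoints belong to it:

```python
import pandas as pd

# Left-open, right-closed interval, matching the (118, 125] notation.
interval = pd.Interval(118, 125, closed="right")

print(118 in interval)    # left endpoint is excluded
print(118.5 in interval)  # anything above 118 is included
print(125 in interval)    # right endpoint is included
```

Only the first check should report False, because the parenthesis on the left excludes 118 itself.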

  1. We can set a right=False argument to change the form of interval:
category2 = pd.cut(height, [118, 126, 136, 161, 200], right=False)
category2

And the output of the preceding code is as follows:

[[118, 126), [118, 126), [118, 126), [126, 136), [118, 126), ..., [126, 136), [161, 200), [136, 161), [136, 161), [126, 136)]
Length: 12
Categories (4, interval[int64]): [[118, 126) < [126, 136) < [136, 161) < [161, 200)]

Note that the closed side of each interval has changed. The results are now left-closed and right-open: 118 is included in the first bin, while 126 falls into the next one.
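The effect of right=False is easiest to see for a value that sits exactly on a bin edge. The following is a minimal sketch with a shortened, illustrative list of edges (not the full bins used earlier) that places the value 125 with both settings:

```python
import pandas as pd

edges = [118, 125, 135]   # illustrative edges only
boundary = [125]          # a value sitting exactly on a bin edge

right_closed = pd.cut(boundary, edges)              # default, right=True
left_closed = pd.cut(boundary, edges, right=False)  # closed on the left

print(right_closed[0])  # the bin that ends at 125
print(left_closed[0])   # the bin that starts at 125
```

With the default setting, 125 belongs to the bin that ends at 125; with right=False, it moves to the bin that starts at 125.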

  1. We can check the number of values in each bin by using the pd.value_counts() method:
pd.value_counts(category)

And the output is as follows:

(118, 125] 5
(135, 160] 3
(125, 135] 3
(160, 200] 1
dtype: int64

The output shows that there are five values in the interval (118, 125].
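The same counts can also be obtained through the Series method, which is a reasonable alternative if your pandas version warns about the top-level function (recent releases are understood to deprecate it). This sketch reuses the heights and bins from above and sorts the result by bin rather than by frequency:

```python
import pandas as pd

height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]
bins = [118, 125, 135, 160, 200]

# Wrapping the result in a Series exposes the value_counts() method.
counts = pd.Series(pd.cut(height, bins)).value_counts()

# sort_index() orders the rows by bin instead of by frequency.
print(counts.sort_index())
```

Sorting by index makes it easier to read the counts as a simple histogram, from the lowest bin to the highest.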

  1. We can also indicate the bin names by passing a list of labels:
bin_names = ['Short Height', 'Average height', 'Good Height', 'Taller']
pd.cut(height, bins, labels=bin_names)

And the output is as follows:

[Short Height, Short Height, Short Height, Average height, Short Height, ..., Average height, Taller, Good Height, Good Height, Average height]
Length: 12
Categories (4, object): [Short Height < Average height < Good Height < Taller]

Note that cut() needs at least two arguments: the data to be discretized and the bins, given either as a list of bin edges (as above) or as a number of bins. Optional arguments such as right=False and labels then adjust the form of the intervals and their names.
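If we only need to know which bin each value falls into, and not a name for it, cut() also accepts labels=False. The sketch below reuses the same heights and bins; the variable name codes is our own:

```python
import pandas as pd

height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]
bins = [118, 125, 135, 160, 200]

# labels=False returns the zero-based position of each value's bin
# instead of an interval or a name.
codes = pd.cut(height, bins, labels=False)
print(codes)
```

Here the first height, 120, maps to bin 0, while 161 maps to bin 3, the last one. Integer codes like these are convenient when the binned column feeds into further numeric processing.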

  1. Now, it is essential to note that if we pass just an integer for our bins, it will compute equal-length bins based on the minimum and maximum values in the data. Okay, let's verify what we mentioned here:
import numpy as np
pd.cut(np.random.rand(40), 5, precision=2)

In the preceding code, we have just passed 5 as the number of required bins, and the output of the preceding code is as follows:

[(0.81, 0.99], (0.094, 0.27], (0.81, 0.99], (0.45, 0.63], (0.63, 0.81], ..., (0.81, 0.99], (0.45, 0.63], (0.45, 0.63], (0.81, 0.99], (0.81, 0.99]]
Length: 40
Categories (5, interval[float64]): [(0.094, 0.27] < (0.27, 0.45] < (0.45, 0.63] < (0.63, 0.81] < (0.81, 0.99]]
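To inspect the edges behind output of this kind, we can ask cut() to return them with retbins=True. This is a rough sketch: the seed and the names rng, data, and edges are our own choices, so the numbers will differ from the output shown above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
data = rng.random(40)

# retbins=True also returns the bin edges that cut() computed.
categories, edges = pd.cut(data, 5, retbins=True)

print(edges)
print(np.diff(edges))  # widths of the five bins
```

The widths should come out equal, except that pandas lowers the first edge slightly so that the minimum value falls inside the first bin. The first width may therefore be marginally larger than the others.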

We can see that, based on the number of bins, it created five categories. There isn't anything here that you don't understand, right? Good work so far. Now, let's take this one step further. Another technical term of interest to us from mathematics is quantiles. Remember the concept? If not, don't worry, as we are going to learn about quantiles and other measures in Chapter 5, Descriptive Statistics. For now, it is sufficient to understand that quantiles divide the range of a probability distribution into contiguous intervals with equal probabilities.
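As a quick illustration of that definition, the following sketch computes the quartile cut points of a random sample with NumPy and counts how many values fall into each quarter. The seed and the variable names are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
sample = rng.random(2000)

# Cut points that split the sample into four equally populated groups.
quartile_edges = np.quantile(sample, [0, 0.25, 0.5, 0.75, 1.0])
print(quartile_edges)

# Assign each value to a quarter using the three inner cut points,
# then count the values per quarter.
inner = quartile_edges[1:-1]
group = np.searchsorted(inner, sample, side="left")
print(np.bincount(group))
```

Each quarter should hold the same number of observations, even though the quarters need not have the same width.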

Pandas provides a qcut method that forms the bins based on sample quantiles. Let's check this with an example:

randomNumbers = np.random.rand(2000)
category3 = pd.qcut(randomNumbers, 4) # cut into quartiles
category3

And the output of the preceding code is as follows:

[(0.77, 0.999], (0.261, 0.52], (0.261, 0.52], (-0.000565, 0.261], (-0.000565, 0.261], ..., (0.77, 0.999], (0.77, 0.999], (0.261, 0.52], (-0.000565, 0.261], (0.261, 0.52]]
Length: 2000
Categories (4, interval[float64]): [(-0.000565, 0.261] < (0.261, 0.52] < (0.52, 0.77] < (0.77, 0.999]]

Note that based on the number of bins, which we set to 4, it converted our data into four different categories. If we count the number of values in each category, we should get equal-sized bins as per our definition. Let's verify that with the following command:

pd.value_counts(category3)

And the output of the command is as follows:

(0.77, 0.999] 500
(0.52, 0.77] 500
(0.261, 0.52] 500
(-0.000565, 0.261] 500
dtype: int64

Our claim is hence verified: each category contains 500 values. Note that, similar to cut, we can also pass our own list to qcut. Here, the values are quantiles between 0 and 1 rather than bin edges:

pd.qcut(randomNumbers, [0, 0.3, 0.5, 0.7, 1.0])

And the output of the preceding code is as follows:

[(0.722, 0.999], (-0.000565, 0.309], (0.309, 0.52], (-0.000565, 0.309], (-0.000565, 0.309], ..., (0.722, 0.999], (0.722, 0.999], (0.309, 0.52], (-0.000565, 0.309], (0.309, 0.52]]
Length: 2000
Categories (4, interval[float64]): [(-0.000565, 0.309] < (0.309, 0.52] < (0.52, 0.722] < (0.722, 0.999]]

Note that it created four different categories based on our code. Congratulations! We successfully learned how to convert continuous datasets into discrete datasets. 
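As a closing check, the unequal quantiles passed to qcut should produce bins that hold roughly 30%, 20%, 20%, and 30% of the data. The sketch below uses a fresh random sample with an arbitrary seed, so the interval edges will not match the output above exactly:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
random_numbers = rng.random(2000)

# Quantiles, not bin edges: 0-30%, 30-50%, 50-70%, 70-100%.
binned = pd.qcut(random_numbers, [0, 0.3, 0.5, 0.7, 1.0])

# normalize=True reports the share of values in each bin.
shares = pd.Series(binned).value_counts(normalize=True).sort_index()
print(shares)
```

This is the main practical difference from cut: qcut fixes the share of observations per bin and lets the edges follow from the data, whereas cut fixes the edges and lets the counts vary.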
