Until now, we dealt with problems where we had a similar number of datapoints in all our classes. In the real world, we might not be able to get data in such an orderly fashion. Sometimes, one class contains a lot more datapoints than the others. If this happens, the classifier tends to get biased: the boundary won't reflect the true nature of your data, simply because there is a big difference in the number of datapoints between the two classes. Therefore, it is important to account for this discrepancy and neutralize it so that our classifier remains impartial.
Let's load the input data:

input_file = 'data_multivar_imbalance.txt'
X, y = utilities.load_data(input_file)
The full code is in the file svm_imbalance.py, which is already provided to you. If you run it, you will see a figure showing the decision boundary, with almost the entire region assigned to one class and Class-0 barely represented. You will also see a classification report printed on your Terminal, as shown in the following screenshot. As we expected, Class-0 has 0% precision.
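This failure mode can be reproduced on synthetic data. The sketch below is a made-up stand-in for the book's data_multivar_imbalance.txt (the class sizes, cluster centers, and random seed are all assumptions), but it shows the same bias: an unweighted linear SVM labels nearly every point as the majority class.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Made-up imbalanced dataset: 10 minority (Class-0) points
# overlapping 300 majority (Class-1) points
X = np.vstack([rng.randn(10, 2) + 1.5, rng.randn(300, 2)])
y = np.array([0] * 10 + [1] * 300)

# Unweighted linear SVM, as in the original code
classifier = SVC(kernel='linear')
classifier.fit(X, y)

# The biased boundary labels nearly every point as the majority Class-1,
# leaving Class-0 essentially undetected
pred = classifier.predict(X)
print((pred == 1).mean())
```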
In the code, the classifier parameters are currently defined as follows:

params = {'kernel': 'linear'}
Replace the preceding line with the following:
params = {'kernel': 'linear', 'class_weight': 'auto'}

Note that the 'auto' setting has since been deprecated and removed in newer versions of scikit-learn; use class_weight='balanced' there instead.
The class_weight parameter counts the number of datapoints in each class and adjusts the weights so that the imbalance doesn't adversely affect the performance.
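Concretely, scikit-learn documents the balanced heuristic as weighting each class by n_samples / (n_classes * n_samples_in_class), so rarer classes receive proportionally larger penalties. A quick illustration with made-up class counts:

```python
from collections import Counter

# Made-up labels: 10 Class-0 points versus 300 Class-1 points
y = [0] * 10 + [1] * 300

counts = Counter(y)
n_samples = len(y)
n_classes = len(counts)

# Weight for each class: n_samples / (n_classes * n_samples_in_class)
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}
print(weights)  # Class-0: 310/20 = 15.5, Class-1: 310/600 ~= 0.52
```

Errors on the rare class are thus penalized roughly thirty times more heavily than errors on the common class, which is what pushes the boundary back toward an impartial position.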
As we can see, Class-0 is now detected with nonzero precision.
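The before-and-after effect can be sketched on synthetic data as well. The dataset below is a made-up stand-in for data_multivar_imbalance.txt, and class_weight='balanced' is used as the modern spelling of 'auto':

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import recall_score

rng = np.random.RandomState(0)

# Made-up imbalanced data: 10 minority (Class-0) vs 300 majority (Class-1)
X = np.vstack([rng.randn(10, 2) + 1.5, rng.randn(300, 2)])
y = np.array([0] * 10 + [1] * 300)

plain = SVC(kernel='linear').fit(X, y)
weighted = SVC(kernel='linear', class_weight='balanced').fit(X, y)

# Recall on Class-0: typically near zero without weighting,
# clearly nonzero with it
plain_recall = recall_score(y, plain.predict(X), pos_label=0)
weighted_recall = recall_score(y, weighted.predict(X), pos_label=0)
print(plain_recall, weighted_recall)
```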