Until now, we dealt with problems where we had a similar number of datapoints in all our classes. In the real world, we might not be able to get data in such an orderly fashion. Sometimes, one class contains a lot more datapoints than the others. If this happens, the classifier tends to get biased: the boundary won't reflect the true nature of your data, simply because there is a big difference in the number of datapoints between the two classes. Therefore, it is important to account for this discrepancy and neutralize it so that our classifier remains impartial.
Let's load the input data:

input_file = 'data_multivar_imbalance.txt'
X, y = utilities.load_data(input_file)
The full code is in the file svm_imbalance.py, which is already provided to you. If you run it, you will see a figure showing the decision boundary, with almost the entire region assigned to one class and Class-0 barely represented. You will also see a classification report printed on your Terminal, as shown in the following screenshot. As we expected, Class-0 has 0% precision.
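This failure mode can be reproduced on synthetic data. The sketch below is a made-up stand-in for the book's data_multivar_imbalance.txt (the class sizes, cluster centers, and random seed are all assumptions), but it shows the same bias: an unweighted linear SVM labels nearly every point as the majority class.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Made-up imbalanced dataset: 10 minority (Class-0) points
# overlapping 300 majority (Class-1) points
X = np.vstack([rng.randn(10, 2) + 1.5, rng.randn(300, 2)])
y = np.array([0] * 10 + [1] * 300)

# Unweighted linear SVM, as in the original code
classifier = SVC(kernel='linear')
classifier.fit(X, y)

# The biased boundary labels nearly every point as the majority Class-1,
# leaving Class-0 essentially undetected
pred = classifier.predict(X)
print((pred == 1).mean())
```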
In the code, the classifier parameters are currently defined as follows:

params = {'kernel': 'linear'}
Replace the preceding line with the following:
params = {'kernel': 'linear', 'class_weight': 'auto'}

Note that the 'auto' setting has since been deprecated and removed in newer versions of scikit-learn; use class_weight='balanced' there instead.
The class_weight parameter counts the number of datapoints in each class and adjusts the weights so that the imbalance doesn't adversely affect the performance.
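Concretely, scikit-learn documents the balanced heuristic as weighting each class by n_samples / (n_classes * n_samples_in_class), so rarer classes receive proportionally larger penalties. A quick illustration with made-up class counts:

```python
from collections import Counter

# Made-up labels: 10 Class-0 points versus 300 Class-1 points
y = [0] * 10 + [1] * 300

counts = Counter(y)
n_samples = len(y)
n_classes = len(counts)

# Weight for each class: n_samples / (n_classes * n_samples_in_class)
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}
print(weights)  # Class-0: 310/20 = 15.5, Class-1: 310/600 ~= 0.52
```

Errors on the rare class are thus penalized roughly thirty times more heavily than errors on the common class, which is what pushes the boundary back toward an impartial position.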
As we can see, Class-0 is now detected with nonzero precision.
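The before-and-after effect can be sketched on synthetic data as well. The dataset below is a made-up stand-in for data_multivar_imbalance.txt, and class_weight='balanced' is used as the modern spelling of 'auto':

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import recall_score

rng = np.random.RandomState(0)

# Made-up imbalanced data: 10 minority (Class-0) vs 300 majority (Class-1)
X = np.vstack([rng.randn(10, 2) + 1.5, rng.randn(300, 2)])
y = np.array([0] * 10 + [1] * 300)

plain = SVC(kernel='linear').fit(X, y)
weighted = SVC(kernel='linear', class_weight='balanced').fit(X, y)

# Recall on Class-0: typically near zero without weighting,
# clearly nonzero with it
plain_recall = recall_score(y, plain.predict(X), pos_label=0)
weighted_recall = recall_score(y, weighted.predict(X), pos_label=0)
print(plain_recall, weighted_recall)
```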