Dealing with unbalanced classes

The iris dataset is completely balanced, in the sense that each category has exactly the same number of observations: 50 setosas, 50 versicolors, and 50 virginicas. This is something to thank Ronald Fisher for, unlike his dedication to popularizing the use of p-values.

In contrast, many datasets are unbalanced, that is, they contain many more data points from one class than from another. When this happens, logistic regression can run into trouble: the boundary cannot be determined as accurately as when the dataset is more balanced.

To see an example of this behavior, we are going to use the iris dataset and arbitrarily remove some data points from the setosa class:

df = iris.query("species == ('setosa', 'versicolor')")
df = df[45:]  # drop the first 45 rows, keeping only 5 setosas but all 50 versicolors
y_3 = pd.Categorical(df['species']).codes
x_n = ['sepal_length', 'sepal_width']
x_3 = df[x_n].values
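Before fitting, it is worth confirming just how unbalanced the resulting classes are. A minimal sketch, assuming the iris rows are ordered by species so that slicing with `df[45:]` leaves 5 setosas and 50 versicolors (the labels below are a stand-in for the `y_3` codes):

```python
import numpy as np

# Stand-in for y_3 = pd.Categorical(df['species']).codes after df[45:]:
# 5 setosas (code 0) and 50 versicolors (code 1)
y_3 = np.array([0] * 5 + [1] * 50)

counts = np.bincount(y_3)
print(counts)  # observations per class: setosa vs versicolor
```

A 10:1 ratio like this is enough to noticeably shift and widen the inferred boundary.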

And then, we are going to run a multiple logistic regression model, just as before:

with pm.Model() as model_3:
    α = pm.Normal('α', mu=0, sd=10)
    β = pm.Normal('β', mu=0, sd=2, shape=len(x_n))

    μ = α + pm.math.dot(x_3, β)
    θ = 1 / (1 + pm.math.exp(-μ))  # logistic inverse-link
    bd = pm.Deterministic('bd', -α/β[1] - β[0]/β[1] * x_3[:, 0])

    yl = pm.Bernoulli('yl', p=θ, observed=y_3)

    trace_3 = pm.sample(1000)
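In case the `bd` expression looks mysterious: the decision boundary is the set of points where θ = 0.5, which happens exactly when μ = 0. Solving α + β₀x₀ + β₁x₁ = 0 for x₁ gives x₁ = -α/β₁ - (β₀/β₁)x₀. A quick numerical check with made-up parameter values (not posterior estimates from the model above):

```python
import numpy as np

# Made-up intercept and coefficients, for illustration only
alpha, beta = -1.5, np.array([2.0, -3.0])
x0 = 5.0  # an arbitrary sepal_length value

# Point on the boundary: x1 = -α/β1 - (β0/β1) * x0
x1_bd = -alpha / beta[1] - beta[0] / beta[1] * x0

# At that point μ = 0, so θ = 0.5 by construction
mu = alpha + beta[0] * x0 + beta[1] * x1_bd
theta = 1 / (1 + np.exp(-mu))
print(theta)  # 0.5
```

This is why a single curve (plus its uncertainty band) can summarize the classifier in the space of the two predictors.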

As we can see from Figure 4.8, the decision boundary is now shifted toward the less abundant class, and the uncertainty is larger than before. This is the typical behavior of a logistic model for unbalanced data. But wait a minute! You may argue that I am cheating here, since the wider uncertainty could be the product of having less total data, and not just fewer setosas than versicolors. That is a valid point; try exercise 6 to verify that what explains this plot is the unbalanced data:

idx = np.argsort(x_3[:, 0])
bd = trace_3['bd'].mean(0)[idx]  # posterior mean of the boundary, sorted by sepal length
plt.scatter(x_3[:, 0], x_3[:, 1], c=[f'C{x}' for x in y_3])
plt.plot(x_3[:, 0][idx], bd, color='k')

az.plot_hpd(x_3[:, 0], trace_3['bd'], color='k')  # HPD band for the boundary

plt.xlabel(x_n[0])
plt.ylabel(x_n[1])

Figure 4.8

What do we do if we find unbalanced data? Well, the obvious solution is to get a dataset with roughly the same number of data points per class. This is important to keep in mind if you are collecting or generating the data. If you have no control over the dataset, be careful when interpreting the results for unbalanced data: check the uncertainty of the model and run some posterior predictive checks to see whether the results are useful to you. Another option would be to incorporate more prior information, if available, and/or run an alternative model, as explained later in this chapter.
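If you can afford to discard data, one simple way to obtain roughly equal classes is to subsample the majority class down to the size of the minority class. A minimal sketch with NumPy, using synthetic stand-ins for the feature matrix and labels (the names `X` and `y` are hypothetical, not from the model above):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical unbalanced data: 5 points of class 0, 50 of class 1
y = np.array([0] * 5 + [1] * 50)
X = rng.normal(size=(55, 2))  # stand-in features

# Undersample the majority class to match the minority-class size
idx_min = np.flatnonzero(y == 0)
idx_maj = rng.choice(np.flatnonzero(y == 1), size=len(idx_min), replace=False)
keep = np.concatenate([idx_min, idx_maj])
X_bal, y_bal = X[keep], y[keep]

print(np.bincount(y_bal))  # both classes now equally represented
```

Keep in mind that undersampling throws information away, so it is a crude fix; informative priors or an alternative model are often preferable when data is scarce.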
