Robust logistic regression

We just saw how to fix an excess of zeros without directly modeling the factor that generates them. A similar approach, suggested by Kruschke, can be used to perform a more robust version of logistic regression. Remember that in logistic regression, we model the data as binomial, that is, zeros and ones. So it may happen that we find a dataset with unusual zeros and/or ones. Take, as an example, the iris dataset that we already saw, but with some added intruders:

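iris = sns.load_dataset("iris")
# keep two species so the response is binary: setosa -> 0, versicolor -> 1
df = iris.query("species == ('setosa', 'versicolor')")
y_0 = pd.Categorical(df['species']).codes
x_n = 'sepal_length'
x_0 = df[x_n].values
# add six intruders: points labeled 1 (versicolor) with very short sepals
y_0 = np.concatenate((y_0, np.ones(6, dtype=int)))
x_0 = np.concatenate((x_0, [4.2, 4.5, 4.0, 4.3, 4.2, 4.4]))
# center the predictor
x_c = x_0 - x_0.mean()
plt.plot(x_c, y_0, 'o', color='k');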

Here, we have some versicolors (1s) with an unusually short sepal length. We can fix this with a mixture model. We are going to say that the output variable comes, with probability $\pi$, from random guessing, or, with probability $1 - \pi$, from a logistic regression model. Mathematically we have:

$$p = \pi \, 0.5 + (1 - \pi) \, \operatorname{logistic}(\alpha + \beta x)$$

Notice that when $\pi = 1$, we get $p = 0.5$, and for $\pi = 0$, we recover the expression for logistic regression. Implementing this model is a straightforward modification of the first model from this chapter:

with pm.Model() as model_rlg:
    α = pm.Normal('α', mu=0, sd=10)
    β = pm.Normal('β', mu=0, sd=10)

    μ = α + x_c * β
    θ = pm.Deterministic('θ', pm.math.sigmoid(μ))
    bd = pm.Deterministic('bd', -α/β)

    π = pm.Beta('π', 1., 1.)
    p = π * 0.5 + (1 - π) * θ

    yl = pm.Bernoulli('yl', p=p, observed=y_0)

    trace_rlg = pm.sample(1000)
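
The line `p = π * 0.5 + (1 - π) * θ` in the model can be checked outside PyMC3 with plain NumPy. The following is a minimal sketch, not part of the original model: the helper names `logistic` and `robust_p` and the values of `alpha`, `beta`, and `x` are illustrative assumptions, not posterior estimates.

```python
import numpy as np

def logistic(z):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-z))

def robust_p(x, alpha, beta, pi):
    """Mixture of random guessing (0.5) and a logistic regression."""
    theta = logistic(alpha + beta * x)
    return pi * 0.5 + (1.0 - pi) * theta

# Illustrative values only (assumed, not taken from the fitted model)
x = np.linspace(-2.0, 2.0, 5)
alpha, beta = 0.3, 4.0

p_guess = robust_p(x, alpha, beta, pi=1.0)  # pure guessing: 0.5 everywhere
p_logit = robust_p(x, alpha, beta, pi=0.0)  # plain logistic regression
p_mixed = robust_p(x, alpha, beta, pi=0.2)  # stays inside [0.1, 0.9]
```

For any π between 0 and 1, the mixture probability is confined to the interval [π/2, 1 − π/2]. A label on the "wrong" side of the curve therefore never has a likelihood below π/2, which is what limits the pull of the intruders on α and β.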

If we compare the results from model_rlg with those from model_0 (the first model in this chapter), we see that we get more or less the same boundary, as can be seen by comparing Figure 4.13 with Figure 4.4:

Figure 4.13

You may also want to compute the summary for model_0 and model_rlg to compare the values of the boundary according to each model.
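
The boundary `bd = -α/β` means the same thing in both models, so the two summaries are directly comparable: at that value of x the logistic term equals 0.5, and mixing 0.5 with 0.5 gives 0.5 for any π. A small numerical check, assuming illustrative values for `alpha`, `beta`, and the grid `pis` (none of them come from the fitted traces):

```python
import numpy as np

def logistic(z):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative coefficients (assumed, not posterior estimates)
alpha, beta = 0.3, 4.0
bd = -alpha / beta  # decision boundary, as defined in the model

# Several mixing weights, from plain logistic (0.0) to mostly guessing (0.9)
pis = np.array([0.0, 0.1, 0.5, 0.9])

theta_bd = logistic(alpha + beta * bd)       # logistic term at the boundary
p_bd = pis * 0.5 + (1.0 - pis) * theta_bd    # mixture probability at the boundary

print(p_bd)
```

What changes with π is the steepness of the curve around the boundary, not where it crosses 0.5.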

