We just saw how to fix an excess of zeros without directly modeling the factor that generates them. A similar approach, suggested by Kruschke, can be used to perform a more robust version of logistic regression. Remember that in logistic regression we model the data as binomial, that is, zeros and ones, so it may happen that we find a dataset with unusual zeros and/or ones. Take, as an example, the iris dataset that we have already seen, but with some added intruders:
iris = sns.load_dataset("iris")
df = iris.query("species == ('setosa', 'versicolor')")
y_0 = pd.Categorical(df['species']).codes
x_n = 'sepal_length'
x_0 = df[x_n].values
y_0 = np.concatenate((y_0, np.ones(6, dtype=int)))
x_0 = np.concatenate((x_0, [4.2, 4.5, 4.0, 4.3, 4.2, 4.4]))
x_c = x_0 - x_0.mean()
plt.plot(x_c, y_0, 'o', color='k');
Here, we have some versicolors (1s) with an unusually short sepal length. We can fix this with a mixture model. We are going to say that the output variable comes, with probability π, from random guessing, or, with probability 1 − π, from a logistic regression model. Mathematically, we have:

p = π 0.5 + (1 − π) logistic(α + β x)

Notice that when π = 1, we get p = 0.5, and for π = 0, we recover the expression for logistic regression. Implementing this model is a straightforward modification of the first model from this chapter:
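The effect of the mixture can be sketched numerically: whatever value θ takes, p is squeezed into the interval [π/2, 1 − π/2], so no single observation can be assigned a probability arbitrarily close to 0 or 1. A minimal sketch (the helper function names here are illustrative, not part of the model code):

```python
import numpy as np

def sigmoid(mu):
    """Logistic function."""
    return 1 / (1 + np.exp(-mu))

def robust_p(mu, pi):
    """Mixture of random guessing (probability 0.5) and logistic regression."""
    theta = sigmoid(mu)
    return pi * 0.5 + (1 - pi) * theta

mu = np.linspace(-10, 10, 101)

# With pi = 0 we recover plain logistic regression
print(np.allclose(robust_p(mu, 0.0), sigmoid(mu)))  # True

# With pi = 0.2, p is bounded between 0.1 and 0.9,
# so outliers cannot dominate the likelihood
p = robust_p(mu, 0.2)
print(p.min(), p.max())
```

This bounding is what makes the regression robust: the six intruders we added can contribute at most a limited amount of surprise, instead of dragging the boundary toward them.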
with pm.Model() as model_rlg:
    α = pm.Normal('α', mu=0, sd=10)
    β = pm.Normal('β', mu=0, sd=10)
    μ = α + x_c * β
    θ = pm.Deterministic('θ', pm.math.sigmoid(μ))
    bd = pm.Deterministic('bd', -α/β)
    π = pm.Beta('π', 1., 1.)
    p = π * 0.5 + (1 - π) * θ
    yl = pm.Bernoulli('yl', p=p, observed=y_0)
    trace_rlg = pm.sample(1000)
If we compare these results with those from model_0 (the first model in this chapter), we will see that we get more or less the same boundary, as we can see by comparing Figure 4.13 with Figure 4.4:
You may also want to compute the summary for model_0 and model_rlg to compare the values of the boundary according to each model.
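That comparison amounts to summarizing the boundary, −α/β, under each posterior. A minimal numpy sketch of the computation, using made-up stand-in arrays rather than actual trace output (with a real run you would use the α and β samples stored in each trace, or the bd deterministic directly):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical posterior samples standing in for the α and β draws in a trace;
# the means and spreads here are invented for illustration only.
alpha_samples = rng.normal(0.3, 0.2, size=1000)
beta_samples = rng.normal(4.0, 0.5, size=1000)

# The boundary is the (centered) sepal-length value at which θ = 0.5, i.e. -α/β
bd_samples = -alpha_samples / beta_samples

# Posterior mean and a 94% interval for the boundary
print(bd_samples.mean(), np.percentile(bd_samples, [3, 97]))
```

Running the same summary on both models' samples lets you check how much (or how little) the mixture component shifts the estimated boundary.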