The logistic model applied to the iris dataset

We are going to begin with the simplest possible classification problem: two classes, setosa and versicolor, and just one independent variable or feature, the sepal_length. As is usually done, we are going to encode the two categories, setosa and versicolor, with the numbers 0 and 1. Using pandas, we can do the following:

import pandas as pd

# iris is assumed to be the iris dataset, already loaded as a DataFrame
df = iris.query("species == ('setosa', 'versicolor')")
y_0 = pd.Categorical(df['species']).codes  # setosa -> 0, versicolor -> 1
x_n = 'sepal_length'
x_0 = df[x_n].values
x_c = x_0 - x_0.mean()  # center the predictor

As with other linear models, centering the data can help with the sampling. Now that we have the data in the proper format, we can finally build the model with PyMC3.

Notice how the first part of model_0 resembles a linear regression model. Also pay attention to the two deterministic variables: θ and bd. θ is the output of the logistic function applied to the μ variable, and bd is the decision boundary, the value of the predictor used to separate the classes; we will discuss this in detail later. Another point worth mentioning is that instead of explicitly writing the logistic function, we are using pm.math.sigmoid (this is just an alias for the Theano function with the same name):

with pm.Model() as model_0:
    α = pm.Normal('α', mu=0, sd=10)
    β = pm.Normal('β', mu=0, sd=10)

    μ = α + pm.math.dot(x_c, β)
    θ = pm.Deterministic('θ', pm.math.sigmoid(μ))
    bd = pm.Deterministic('bd', -α/β)

    yl = pm.Bernoulli('yl', p=θ, observed=y_0)

    trace_0 = pm.sample(1000)
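With the inference done, the usual diagnostics are a couple of ArviZ calls away; a minimal sketch, reusing the trace_0 object sampled above:

# trace plot and numerical summary for the two top-level parameters
az.plot_trace(trace_0, var_names=['α', 'β'])
az.summary(trace_0, var_names=['α', 'β'])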

To save pages and to avoid boring you with the same type of plot over and over again, we are going to omit the trace plot and other similar summaries here, but I encourage you to make your own plots and summaries to further explore the examples in the book. Instead, we are going to jump directly to generating Figure 4.4, a plot of the data together with the fitted sigmoid curve and the decision boundary:

theta = trace_0['θ'].mean(axis=0)
idx = np.argsort(x_c)
# posterior mean of the sigmoid curve
plt.plot(x_c[idx], theta[idx], color='C2', lw=3)
# decision boundary and its 94% HPD interval
plt.vlines(trace_0['bd'].mean(), 0, 1, color='k')
bd_hpd = az.hpd(trace_0['bd'])
plt.fill_betweenx([0, 1], bd_hpd[0], bd_hpd[1], color='k', alpha=0.5)

# jittered data points and the HPD band for θ
plt.scatter(x_c, np.random.normal(y_0, 0.02),
            marker='.', color=[f'C{x}' for x in y_0])
az.plot_hpd(x_c, trace_0['θ'], color='C2')

plt.xlabel(x_n)
plt.ylabel('θ', rotation=0)
# use the original scale for the xticks
locs, _ = plt.xticks()
plt.xticks(locs, np.round(locs + x_0.mean(), 1))

Figure 4.4

Figure 4.4 shows the sepal length versus the iris species (setosa = 0, versicolor = 1). To avoid over-plotting, the binary response variable is jittered. The S-shaped (green) line is the mean value of θ. This line can be interpreted as the probability of a flower being versicolor, given that we know the value of the sepal length. The semi-transparent S-shaped (green) band is the 94% HPD interval. The decision boundary is represented as a (black) vertical line with a semi-transparent band for its 94% HPD. According to the decision boundary, the xᵢ values (sepal lengths, in this case) to the left correspond to class 0 (setosa), and the values to the right to class 1 (versicolor).
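For example, to get the posterior probability that a flower with a given sepal length is versicolor, we can evaluate the logistic function at that value using the posterior samples; a minimal sketch, where 5.5 cm is just an illustrative value (remember that the model works with the centered predictor):

from scipy.special import expit  # the logistic (inverse-logit) function

new_x = 5.5                   # hypothetical sepal length, in cm
new_x_c = new_x - x_0.mean()  # center it, as we did with the training data

# one probability per posterior sample of α and β
p_versicolor = expit(trace_0['α'] + trace_0['β'] * new_x_c)
print(p_versicolor.mean(), az.hpd(p_versicolor))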

The decision boundary is defined as the value of xᵢ for which θ = 0.5, and it turns out to be -α/β, which we can derive as follows:

  • From the definition of the model, we have the following relationship:

    θ = logistic(α + xβ)

  • And from the definition of the logistic function, we have θ = 0.5 when its argument is 0:

    0.5 = logistic(α + xᵢβ) ⇔ 0 = α + xᵢβ    (Equation 4.5)

  • By reordering Equation 4.5, we find that the value of xᵢ for which θ = 0.5 corresponds to the following expression:

    xᵢ = -α/β
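We can check this result numerically by computing -α/β directly from the posterior samples and comparing it with the bd variable we already stored in the trace; a minimal sketch, reusing trace_0 and x_0 from above:

import numpy as np

# decision boundary computed from the posterior samples of α and β
bd_manual = -trace_0['α'] / trace_0['β']

# this agrees with the Deterministic bd defined inside the model
print(np.allclose(bd_manual, trace_0['bd']))

# convert back to the original sepal_length scale (we centered x earlier)
print(bd_manual.mean() + x_0.mean())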

There are a few key points to mention:

  • The value of θ is, generally speaking, p(y=1|x). In this sense, logistic regression is a true regression; the key detail is that we are regressing the probability that a data point belongs to class 1, given a linear combination of features.
  • We are modeling the mean of a dichotomous variable, that is, a number in the [0, 1] interval. Then, we introduce a rule to turn this probability into a two-class assignment. In this case, if p(y=1) ≥ 0.5, we assign class 1, otherwise we assign class 0 (see the sketch after this list).
  • There is nothing special about the value of 0.5, other than it being the midpoint between 0 and 1. We may argue this boundary is only reasonable if we are OK with making a mistake in either one direction or the other—in other words, if it is the same for us to misclassify a setosa as a versicolor or a versicolor as a setosa. It turns out that this is not always the case, and the cost associated with the misclassification does not need to be symmetrical, as you may remember from Chapter 2, Programming Probabilistically, when we discussed loss functions.
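As a sketch of that assignment rule in code, using the posterior mean of θ from trace_0 and the symmetric 0.5 cutoff discussed above:

# posterior mean probability of being versicolor, one value per data point
p_mean = trace_0['θ'].mean(axis=0)

# the symmetric 0.5 rule: probabilities of 0.5 or more go to class 1
y_pred = (p_mean >= 0.5).astype(int)

# in-sample agreement with the observed classes
print((y_pred == y_0).mean())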