Softmax regression

One way to generalize logistic regression to more than two classes is softmax regression. We need to introduce two changes with respect to logistic regression. First, we replace the logistic function with the softmax function:

$$\text{softmax}_i(\mu) = \frac{e^{\mu_i}}{\sum_{k} e^{\mu_k}}$$

In other words, to obtain the output of the softmax function for the i-th element of a vector μ, we take the exponential of the i-th value and divide it by the sum of all the exponentiated values in the μ vector.

Softmax guarantees we will get positive values that sum up to 1. The softmax function reduces to the logistic function when we have only two classes (k = 2). As a side note, the softmax function has the same form as the Boltzmann distribution used in statistical mechanics, a very powerful branch of physics dealing with the probabilistic description of atomic and molecular systems. The Boltzmann distribution (and softmax, in some fields) has a parameter called temperature, T, that divides μ; as T → ∞, the probability distribution becomes flat and all states are equally likely, and as T → 0, only the most probable state gets populated, and thus softmax behaves like a max function.
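To see the effect of temperature numerically, here is a minimal NumPy sketch (softmax_t is a hypothetical helper written for this illustration, not part of the book's code):

import numpy as np

def softmax_t(mu, T=1.0):
    # subtract the max for numerical stability, then divide by the temperature
    e = np.exp((mu - np.max(mu)) / T)
    return e / e.sum()

mu = np.array([1.0, 2.0, 3.0])
softmax_t(mu, T=1)     # ordinary softmax, roughly [0.09, 0.24, 0.67]
softmax_t(mu, T=100)   # large T: the distribution is nearly uniform
softmax_t(mu, T=0.01)  # small T: almost all the mass on the largest value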

The second modification is that we replace the Bernoulli distribution with the categorical distribution. The categorical distribution is the generalization of the Bernoulli distribution to more than two outcomes. Also, just as the Bernoulli distribution (a single coin flip) is a special case of the binomial distribution (n coin flips), the categorical distribution (a single roll of a die) is a special case of the multinomial distribution (n rolls of a die). You may try this brain teaser with your nieces and nephews!
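As a quick illustrative sketch (not taken from the original text), NumPy exposes these relationships directly: a Bernoulli draw is a binomial draw with a single trial, and a categorical draw is a multinomial draw with a single roll:

import numpy as np

p = [0.2, 0.3, 0.5]           # probabilities for a three-sided "die"
np.random.binomial(1, 0.5)    # Bernoulli: binomial with n=1
np.random.multinomial(1, p)   # categorical: multinomial with a single roll,
                              # returned as a one-hot vector, e.g. [0, 1, 0]
np.random.multinomial(10, p)  # multinomial: counts over 10 rolls, e.g. [2, 3, 5]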

To exemplify the softmax regression, we are going to continue working with the iris dataset, only this time we are going to use its three classes (setosa, versicolor, and virginica) and its four features (sepal length, sepal width, petal length, and petal width). We are also going to standardize the data, since this will help the sampler to run more efficiently (we could have also just centered the data):

iris = sns.load_dataset('iris')
y_s = pd.Categorical(iris['species']).codes       # encode the three species as 0, 1, 2
x_n = iris.columns[:-1]                           # the four feature names
x_s = iris[x_n].values
x_s = (x_s - x_s.mean(axis=0)) / x_s.std(axis=0)  # standardize each feature

The PyMC3 code reflects the changes between the logistic and softmax models. Notice the shapes of the α and β parameters. We use the softmax function from Theano; note that we have used the import theano.tensor as tt idiom, which is the convention used by PyMC3 developers:

with pm.Model() as model_s:
    α = pm.Normal('α', mu=0, sd=5, shape=3)
    β = pm.Normal('β', mu=0, sd=5, shape=(4, 3))
    μ = pm.Deterministic('μ', α + pm.math.dot(x_s, β))
    θ = tt.nnet.softmax(μ)
    yl = pm.Categorical('yl', p=θ, observed=y_s)
    trace_s = pm.sample(2000)

How well does our model perform? Let's find out by checking how many cases we can predict correctly. In the following code, we use the mean of the parameters to compute the probability of each data point belonging to each of the three classes; we then assign each data point to the class with the highest probability using the argmax function, and compare the result with the observed values:

data_pred = trace_s['μ'].mean(0)                  # posterior mean of the linear predictor
y_pred = [np.exp(point) / np.sum(np.exp(point), axis=0)
          for point in data_pred]                 # softmax, applied row by row
f'{np.sum(y_s == np.argmax(y_pred, axis=1)) / len(y_s):.2f}'

The result is that we classify about 98% of the data points correctly (147 out of 150); that is, we miss only three cases. This is very good. Nevertheless, a true test of the performance of our model would be to check it on data not used to fit the model; otherwise, we may be overestimating the model's ability to generalize to other data. We will discuss this subject in detail in Chapter 5, Modeling with Linear Regression. For now, we will take this just as a self-consistency test indicating that the model runs OK.

You may have noticed that the posterior, or more properly the marginal distributions of each parameter, are very wide; in fact, they are as wide as indicated by the priors. Even though we were able to make correct predictions, this does not look OK. This is the same non-identifiability problem we have already encountered for correlated data in other regression models, or with perfectly separable classes. In this case, the wide posterior is due to the condition that all the probabilities must sum to 1. Given this condition, we are using more parameters than we need to fully specify the model. In simple terms, if you have ten numbers that sum to 1, you only need to give me nine of them; the remaining one I can compute. One solution is to fix the extra parameters to some value, for example, zero. The following code shows how to achieve this using PyMC3:

with pm.Model() as model_sf:
    α = pm.Normal('α', mu=0, sd=2, shape=2)
    β = pm.Normal('β', mu=0, sd=2, shape=(4, 2))
    α_f = tt.concatenate([[0], α])                       # fix the first intercept to 0
    β_f = tt.concatenate([np.zeros((4, 1)), β], axis=1)  # fix the first column of β to 0
    μ = α_f + pm.math.dot(x_s, β_f)
    θ = tt.nnet.softmax(μ)
    yl = pm.Categorical('yl', p=θ, observed=y_s)
    trace_sf = pm.sample(1000)
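As a rough self-consistency check for the reduced model, one could repeat the earlier accuracy computation. The following sketch (assuming trace_sf from the model above) rebuilds μ from the posterior means, prepending the fixed zero column, and compares the argmax predictions with the observed classes:

α_m = np.concatenate([[0], trace_sf['α'].mean(0)])                       # add back the fixed intercept
β_m = np.concatenate([np.zeros((4, 1)), trace_sf['β'].mean(0)], axis=1)  # add back the fixed column
μ_f = α_m + x_s.dot(β_m)
# softmax is monotonic, so the argmax of μ is also the argmax of the probabilities
f'{np.sum(y_s == np.argmax(μ_f, axis=1)) / len(y_s):.2f}'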