Gaussian process classification

Gaussian processes are not restricted to regression; we can also use them for classification. As we saw in Chapter 4, Generalizing Linear Models, we turn a linear model into a classifier by using a Bernoulli likelihood with a logistic inverse link function (and then applying a decision boundary to separate the classes). We will recapitulate model_0 from Chapter 4, Generalizing Linear Models, for the iris dataset, this time using a GP instead of a linear model.
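In symbols, the model we are about to build has the standard GP classification structure (the same structure the code later in this section implements):

f(x) ~ GP(μ(x) = 0, K(x, x'))
θ = logistic(f(x))
y ~ Bernoulli(θ)

The only change with respect to GP regression is that the Gaussian likelihood is replaced by a Bernoulli one, with the logistic function mapping the unconstrained latent function f into the (0, 1) interval.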

Let's invite the iris dataset to the stage one more time:

iris = pd.read_csv('../data/iris.csv')
iris.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

We are going to begin with the simplest possible classification problem: two classes, setosa and versicolor, and just one independent variable, the sepal length. As usual, we will encode the categories setosa and versicolor as the numbers 0 and 1:

df = iris.query("species == ('setosa', 'versicolor')")
y = pd.Categorical(df['species']).codes
x_1 = df['sepal_length'].values
# covariance functions expect a 2D input, so add a column dimension
X_1 = x_1[:, None]

For this model, instead of instantiating the GP prior with the pm.gp.Marginal class, we will use pm.gp.Latent. The pm.gp.Latent class is more general and can be combined with any likelihood, whereas pm.gp.Marginal is restricted to Gaussian likelihoods; in exchange, pm.gp.Marginal is more efficient, because it exploits the mathematical tractability of combining a GP prior with a Gaussian likelihood.
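To make this distinction concrete: with a Gaussian likelihood, the latent function f can be integrated out analytically, which is what pm.gp.Marginal exploits. This is the standard GP regression result, stated here only for reference:

p(y | X) = ∫ p(y | f) p(f | X) df = N(y | 0, K(X, X) + σ²I)

With a Bernoulli likelihood, this integral has no closed form, so pm.gp.Latent keeps f as an explicit set of random variables and samples it together with the hyperparameters. With that in mind, the model is: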

with pm.Model() as model_iris:
    ℓ = pm.Gamma('ℓ', 2, 0.5)
    cov = pm.gp.cov.ExpQuad(1, ℓ)
    gp = pm.gp.Latent(cov_func=cov)
    f = gp.prior("f", X=X_1)
    # logistic inverse link function and Bernoulli likelihood
    y_ = pm.Bernoulli("y", p=pm.math.sigmoid(f), observed=y)
    trace_iris = pm.sample(1000, chains=1, compute_convergence_checks=False)

Now that we have posterior samples for ℓ and the latent function f, we may want to get predictions from the GP at new input locations. As we did with the marginal_gp_model, we can compute the conditional distribution evaluated over a set of new input locations with the help of the gp.conditional function, as shown in the following code:

X_new = np.linspace(np.floor(x_1.min()), np.ceil(x_1.max()), 200)[:, None]

with model_iris:
    f_pred = gp.conditional('f_pred', X_new)
    pred_samples = pm.sample_posterior_predictive(
        trace_iris, vars=[f_pred], samples=1000)

To show the results of this model, we are going to create a figure similar to Figure 4.4. Instead of obtaining the decision boundary analytically, we will compute it directly from f_pred, using the following convenience function:

def find_midpoint(array1, array2, value):
    """
    Return the point (taken from array2) at which array1 crosses value,
    approximated as the midpoint between the two closest candidates.
    """
    array1 = np.asarray(array1)
    idx0 = np.argsort(np.abs(array1 - value))[0]
    idx1 = idx0 - 1 if array1[idx0] > value else idx0 + 1
    if idx1 == len(array1):
        idx1 -= 1
    return (array2[idx0] + array2[idx1]) / 2
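As a quick sanity check of what this helper returns, here is a toy call (not part of the original analysis):

# array1 crosses 0.5 between its second and third elements, so the function
# returns the midpoint of the corresponding entries of array2
find_midpoint([0.1, 0.4, 0.6, 0.9], [1, 2, 3, 4], 0.5)  # returns 2.5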

The following code is very similar to that used in Chapter 4, Generalizing Linear Models, to generate Figure 4.4:

_, ax = plt.subplots(figsize=(10, 6))

# logistic is the logistic sigmoid (e.g., scipy.special.expit),
# assumed to be defined or imported earlier in the book
fp = logistic(pred_samples['f_pred'])
fp_mean = np.mean(fp, 0)

# mean of the predicted function, on the probability scale
ax.plot(X_new[:, 0], fp_mean)
# plot the data with a little vertical jitter to reduce overplotting
ax.scatter(x_1, np.random.normal(y, 0.02),
           marker='.', color=[f'C{x}' for x in y])

az.plot_hpd(X_new[:, 0], fp, color='C2')

# decision boundary: where each posterior curve crosses 0.5
db = np.array([find_midpoint(f, X_new[:, 0], 0.5) for f in fp])
db_mean = db.mean()
db_hpd = az.hpd(db)
ax.vlines(db_mean, 0, 1, color='k')
ax.fill_betweenx([0, 1], db_hpd[0], db_hpd[1], color='k', alpha=0.5)
ax.set_xlabel('sepal_length')
ax.set_ylabel('θ', rotation=0)
plt.savefig('B11197_07_11.png')

Figure 7.11

As we can see, Figure 7.11 looks pretty similar to Figure 4.4. The curve of f_pred looks like a sigmoid, except for the tails, which go up at lower values of x_1 and down at higher values of x_1. This is a consequence of the predicted function reverting toward the prior where there is no data (or only a few data points). If we only care about the decision boundary, this is not a real problem, but if we want to model the probability of belonging to setosa or versicolor across the whole range of sepal_length, then we should improve the model to get a better fit in the tails. One way to achieve this is to add more structure to the Gaussian process: combining covariance functions is a general way to better capture details of the function we are trying to model.

The following model (model_iris2) is the same as model_iris, except for the covariance matrix, which we model as the combination of three kernels:

cov = K_{ExpQuad} + K_{Linear} + K_{WhiteNoise}(1E-5)

By adding the linear kernel, we fix the tail problem, as you can see in Figure 7.12. The white-noise kernel is just a computational trick to stabilize the computation of the covariance matrix. Covariance functions for Gaussian processes must yield positive-definite covariance matrices; nevertheless, numerical errors can lead to a violation of this condition. One manifestation of this problem is getting NaNs when computing posterior predictive samples of the fitted function. One way to mitigate the error is to stabilize the computation by adding a little bit of noise to the diagonal. As a matter of fact, PyMC3 already does this under the hood, but sometimes a little more noise is needed.
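The following is a minimal NumPy sketch of why the jitter helps (the specific matrix, length-scale, and jitter value are made up for illustration, and whether the unjittered factorization actually fails can depend on your linear algebra backend):

import numpy as np

# a covariance matrix built from many close input points and a long
# length-scale is nearly singular; round-off can make Cholesky fail
x = np.linspace(0, 1, 100)[:, None]
K = np.exp(-0.5 * (x - x.T) ** 2 / 0.5 ** 2)   # ExpQuad kernel, ℓ = 0.5

try:
    np.linalg.cholesky(K)
except np.linalg.LinAlgError:
    print("K is not numerically positive definite")

# a small diagonal term (white noise, or jitter) restores positive definiteness
L = np.linalg.cholesky(K + 1e-5 * np.eye(len(x)))

With that detail out of the way, the model is defined as follows: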

with pm.Model() as model_iris2:
    ℓ = pm.Gamma('ℓ', 2, 0.5)
    c = pm.Normal('c', x_1.min())
    τ = pm.HalfNormal('τ', 5)
    cov = (pm.gp.cov.ExpQuad(1, ℓ) +
           τ * pm.gp.cov.Linear(1, c) +
           pm.gp.cov.WhiteNoise(1E-5))
    gp = pm.gp.Latent(cov_func=cov)
    f = gp.prior("f", X=X_1)
    # logistic inverse link function and Bernoulli likelihood
    y_ = pm.Bernoulli("y", p=pm.math.sigmoid(f), observed=y)
    trace_iris2 = pm.sample(1000, chains=1, compute_convergence_checks=False)

Now we generate posterior predictive samples for model_iris2 for the values of X_new generated previously:

with model_iris2:
    f_pred = gp.conditional('f_pred', X_new)
    pred_samples = pm.sample_posterior_predictive(trace_iris2,
                                                  vars=[f_pred],
                                                  samples=1000)

_, ax = plt.subplots(figsize=(10, 6))

fp = logistic(pred_samples['f_pred'])
fp_mean = np.mean(fp, 0)

ax.scatter(x_1, np.random.normal(y, 0.02), marker='.',
           color=[f'C{ci}' for ci in y])

db = np.array([find_midpoint(f, X_new[:, 0], 0.5) for f in fp])
db_mean = db.mean()
db_hpd = az.hpd(db)
ax.vlines(db_mean, 0, 1, color='k')
ax.fill_betweenx([0, 1], db_hpd[0], db_hpd[1], color='k', alpha=0.5)

ax.plot(X_new[:, 0], fp_mean, 'C2', lw=3)
az.plot_hpd(X_new[:, 0], fp, color='C2')

ax.set_xlabel('sepal_length')
ax.set_ylabel('θ', rotation=0)

Figure 7.12

Now Figure 7.12 looks much more similar to Figure 4.4 than Figure 7.11. This example has two main aims:

  • Showing how we can easily combine kernels to get a more expressive model
  • Showing how we can recover a logistic regression using a Gaussian process

Regarding the second point, a logistic regression is indeed a special case of Gaussian processes, because a simple linear regression is just a particular case of a Gaussian process. In fact, many known models can be seen as special cases of GPs, or at least they are somehow connected to GPs. You can read Chapter 15 from Kevin Murphy's Machine Learning: A Probabilistic Perspective for details.
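To see this connection in the simplest case: if f(x) = wᵀx with a Gaussian prior w ~ N(0, Σ), then f is a Gaussian process with mean zero and covariance k(x, x') = xᵀΣx', which is exactly the linear kernel. Pushing such an f through the logistic inverse link function and a Bernoulli likelihood gives back Bayesian logistic regression. This is the standard textbook argument, summarized here for convenience rather than taken from this chapter.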

In practice, it does not make much sense to use a GP to model a problem we can solve with a logistic regression. Instead, we want to use a GP to model more complex data that is not well captured by less flexible models. For example, suppose we want to model the probability of getting a disease as a function of age, and it turns out that very young and very old people are at higher risk than people of middle age. The dataset space_flu.csv is a synthetic dataset inspired by this description.
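If you do not have the file at hand, you can simulate data with the same qualitative shape; the following is only a stand-in sketch (the seed, sample size, and risk curve are made up, not the actual contents of space_flu.csv):

import numpy as np
import pandas as pd

np.random.seed(42)
n = 60
age_sim = np.random.uniform(0, 80, n)
# risk is high for the very young and the very old, low around middle age
p_sick = 1 / (1 + np.exp(-(0.008 * (age_sim - 40) ** 2 - 3)))
sick = np.random.binomial(1, p_sick)
df_sim = pd.DataFrame({'age': age_sim, 'space_flu': sick})

Either way, let's load the data and plot it: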

df_sf = pd.read_csv('../data/space_flu.csv')
age = df_sf.age.values[:, None]
space_flu = df_sf.space_flu

ax = df_sf.plot.scatter('age', 'space_flu', figsize=(8, 5))
ax.set_yticks([0, 1])
ax.set_yticklabels(['healthy', 'sick'])

Figure 7.13

The following model is basically the same as model_iris:

with pm.Model() as model_space_flu:
    ℓ = pm.HalfCauchy('ℓ', 1)
    cov = pm.gp.cov.ExpQuad(1, ℓ) + pm.gp.cov.WhiteNoise(1E-5)
    gp = pm.gp.Latent(cov_func=cov)
    f = gp.prior('f', X=age)
    y_ = pm.Bernoulli('y', p=pm.math.sigmoid(f), observed=space_flu)
    trace_space_flu = pm.sample(
        1000, chains=1, compute_convergence_checks=False)

Now we generate posterior predictive samples for model_space_flu and then plot the results:

X_new = np.linspace(0, 80, 200)[:, None]

with model_space_flu:
    f_pred = gp.conditional('f_pred', X_new)
    pred_samples = pm.sample_posterior_predictive(trace_space_flu,
                                                  vars=[f_pred],
                                                  samples=1000)

_, ax = plt.subplots(figsize=(10, 6))

fp = logistic(pred_samples['f_pred'])
fp_mean = np.nanmean(fp, 0)

ax.scatter(age, np.random.normal(space_flu, 0.02),
           marker='.', color=[f'C{ci}' for ci in space_flu])

ax.plot(X_new[:, 0], fp_mean, 'C2', lw=3)

az.plot_hpd(X_new[:, 0], fp, color='C2')
ax.set_yticks([0, 1])
ax.set_yticklabels(['healthy', 'sick'])
ax.set_xlabel('age')

Figure 7.14

Notice, as illustrated in Figure 7.14, that the GP is able to fit this dataset very well, even though the data demands a function more complex than a logistic curve. Fitting this data well would be impossible for a simple logistic regression, unless we introduce some ad hoc modifications to help it a little (see exercise 6 for a discussion of such modifications).
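For reference, one possible direction for such a modification is to give the logistic regression a quadratic term in age, so the linear predictor can bend upward at both ends. The following is only a hypothetical sketch (the priors and the model name are made up), not the solution discussed in exercise 6:

import pymc3 as pm

# a plain logistic regression augmented with a quadratic term in age
age_c = age[:, 0] - age.mean()          # center age to help sampling
with pm.Model() as model_logreg_quad:
    α = pm.Normal('α', 0, 10)
    β1 = pm.Normal('β1', 0, 10)
    β2 = pm.Normal('β2', 0, 10)
    θ = pm.math.sigmoid(α + β1 * age_c + β2 * age_c ** 2)
    y_lr = pm.Bernoulli('y_lr', p=θ, observed=space_flu)
    trace_logreg_quad = pm.sample(1000)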
