4

Advanced Semi-Supervised Classification

In this chapter, we're going to introduce more advanced semi-supervised classification algorithms that are able to solve complex problems where simpler algorithms fail. In particular, we will discuss:

  • Contrastive Pessimistic Likelihood Estimation (CPLE)
  • Semi-Supervised Support Vector Machines (S3VM)
  • Transductive Support Vector Machines (TSVM)

For each of these algorithms, we'll explain the theory behind them and then show a Python-coded example of them in practice. We'll start with the CPLE Algorithm.

Contrastive Pessimistic Likelihood Estimation

As we discussed in the previous chapter, in many real-life problems, it's cheaper to retrieve unlabeled samples, rather than correctly labeled ones. For this reason, many researchers have worked to find out the best strategies to carry out a semi-supervised classification that could outperform its supervised counterpart. The idea is to train a classifier with a few labeled samples and then improve its accuracy after adding weighted unlabeled samples. One of the best results is the CPLE algorithm, proposed by Loog (in Loog M., Contrastive Pessimistic Likelihood Estimation for Semi-Supervised Classification, arXiv:1503.00269, 2015).

Before we can explain this algorithm, it's necessary to define Platt scaling. If we have a labeled dataset (X, Y) containing N samples, it's possible to define the log-likelihood cost function of a generic estimator, as follows:
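Here and in the rest of this section, θ denotes the parameter set of the estimator (an assumed notation, used consistently below); with it, the log-likelihood takes the standard form:

\mathcal{L}(\theta \mid X, Y) = \sum_{i=1}^{N} \log p(y_i \mid x_i; \theta)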

After training the model, it should be possible to determine p(y|xi), which is the probability of a label y given a sample xi. However, some classifiers (such as SVMs) are not based on this approach and determine the correct class, for example, by checking the sign of a parametrized decision function. As CPLE is a generic framework that can be used with any classification algorithm, when the probabilities are not available it's useful to implement a technique called Platt scaling, which allows us to transform the decision function into a probability through a parametrized sigmoid. For a binary classifier, it can be expressed as follows:
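Denoting the decision function of the trained classifier as d(x; θ) (a placeholder name), a common parametrization of the Platt sigmoid (sign conventions may vary across texts) is:

p(y = 1 \mid x) = \frac{1}{1 + e^{\alpha\, d(x; \theta) + \beta}}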

α and β are parameters that must be learned in order to maximize the likelihood. Luckily, scikit-learn provides the method predict_proba(), which returns the probabilities for all classes. Platt scaling is performed either automatically or on demand; for example, the SVC classifier needs the parameter probability=True in order to compute the probability mapping. I always recommend checking the documentation before implementing a custom solution.
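As a quick, self-contained check of this behavior, the following minimal snippet (the synthetic dataset and the variable names are only illustrative and are not part of the original example) fits an SVC with probability=True and prints the calibrated probabilities returned by predict_proba():

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy binary dataset, used only for illustration
X_demo, Y_demo = make_classification(n_samples=200, n_features=2,
                                     n_redundant=0, random_state=1000)

# probability=True enables the internal Platt-like calibration
svc_demo = SVC(kernel='linear', probability=True, random_state=1000)
svc_demo.fit(X_demo, Y_demo)

# Each row contains the probabilities of the two classes (rows sum to 1.0)
print(svc_demo.predict_proba(X_demo[0:5]))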

CPLE Theory

Consider a full dataset, made up of labeled and unlabeled samples. For simplicity, we can reorganize the original dataset, so that the first N samples are labeled, while the next M are unlabeled:
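With this purely notational reordering, the dataset can be sketched as:

X = \{x_1, \ldots, x_N, x_{N+1}, \ldots, x_{N+M}\} \quad \text{with } y_1, \ldots, y_N \text{ known and } y_{N+1}, \ldots, y_{N+M} \text{ unknown}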

As we don't know the labels for all xu samples, we can decide to use M k-dimensional (k is the number of classes) soft-labels qi that can be optimized during the training process:
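Writing q_i = (q_{i1}, \ldots, q_{ik}), where q_{ij} is the probability assigned to class j, the two conditions referred to below can be stated as:

q_{ij} \ge 0 \;\; \forall j = 1, \ldots, k \quad \text{and} \quad \sum_{j=1}^{k} q_{ij} = 1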

The second condition in the previous formula is necessary to guarantee that each qi represents a discrete probability (all the elements must sum up to 1.0). The complete log-likelihood cost function can, therefore, be expressed as follows:
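A plausible form of this cost function, consistent with the previous notation (the first sum runs over the N labeled samples, the second one over the M unlabeled samples with their soft-labels), is:

\mathcal{L}(\theta \mid X, Y, q) = \sum_{i=1}^{N} \log p(y_i \mid x_i; \theta) + \sum_{j=1}^{M} \sum_{c=1}^{k} q_{jc} \log p(c \mid x_{N+j}; \theta)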

The first term represents the log-likelihood for the supervised part, while the second one is responsible for the unlabeled points. If we train a classifier with only the labeled samples, excluding the second addend, we get a parameter set θsup. CPLE defines a contrastive condition (also expressed as a log-likelihood), as the improvement in the total cost function given by the semi-supervised approach compared to the supervised solution:
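Following the spirit of Loog's formulation (a sketch, not a verbatim reproduction: θsup is the parameter set obtained from the labeled samples alone, and both terms are evaluated on the soft-labeled data), the contrastive condition can be written as:

CL(\theta, q \mid X, Y) = \mathcal{L}(\theta \mid X, Y, q) - \mathcal{L}(\theta_{sup} \mid X, Y, q)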

This allows us to impose the condition that the semi-supervised solution must outperform the supervised one; in fact, by maximizing it, we both increase the first term and reduce the second one, obtaining a proportional increase of CL (the term contrastive is very common in machine learning and normally indicates a condition that is achieved as the difference between two opposite constraints). If CL doesn't increase, it probably means that the unlabeled samples have not been drawn from the marginal distribution p(x) extracted from pdata.

In the previous expression, we implicitly used soft-labels, but since they're initially randomly chosen and there's no ground truth to support their values, it's a good idea not to trust them. We do that by imposing a pessimistic condition (as another log-likelihood):
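The pessimistic condition can be sketched as the minimization of the contrastive log-likelihood with respect to the soft-labels:

CL_{pess}(\theta \mid X, Y) = \min_{q} \; CL(\theta, q \mid X, Y)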

By imposing this constraint, we try to find the soft-labels that minimize the contrastive log-likelihood; that's why this is defined as a pessimistic approach. It can seem like a contradiction; however, trusting soft-labels can be dangerous, because the semi-supervised log-likelihood could be increased even with a large percentage of misclassifications. Our goal is to find the parameter set that guarantees the highest accuracy starting from the supervised baseline (which has been obtained using the labeled samples) and improve it, without losing the structural features provided by the labeled samples.

Therefore, our final goal can be expressed as follows:
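Putting the two previous conditions together, the CPLE objective is a max-min problem over the parameters and the soft-labels:

\theta_{CPLE} = \underset{\theta}{\arg\max} \; \min_{q} \; CL(\theta, q \mid X, Y)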

At this point, we can create a complete Python example to show the practical abilities of this algorithm.

Example of contrastive pessimistic likelihood estimation

We're going to implement the CPLE algorithm in Python, using a subset extracted from the MNIST dataset. For simplicity, we're going to use only the samples representing the digits 0 and 1:

from sklearn.datasets import load_digits
import numpy as np

X_a, Y_a = load_digits(return_X_y=True)

# Keep only the samples representing the digits 0 and 1
X = np.vstack((X_a[Y_a == 0], X_a[Y_a == 1]))
Y = np.vstack((np.expand_dims(Y_a, axis=1)[Y_a == 0], np.expand_dims(Y_a, axis=1)[Y_a == 1]))

nb_samples = X.shape[0]
nb_dimensions = X.shape[1]
nb_unlabeled = 150

# Randomly mark 150 samples as unlabeled (y = -1), keeping a copy of the true labels
unlabeled_idx = np.random.choice(np.arange(0, nb_samples, 1), replace=False, size=nb_unlabeled)
Y_true = Y[unlabeled_idx].copy()
Y[unlabeled_idx] = -1

After creating the restricted dataset (X, Y), which contains 360 samples, we randomly select 150 samples (about 42%) to become unlabeled (the corresponding y is set to -1). At this point, we can measure the performance of a logistic regression trained only on the labeled subset:

from sklearn.linear_model import LogisticRegression
lr_test = LogisticRegression(solver="lbfgs", max_iter=10000, multi_class="auto", n_jobs=-1, random_state=1000)
lr_test.fit(X[Y.squeeze() != -1], Y[Y.squeeze() != -1].squeeze())
unlabeled_score = lr_test.score(X[Y.squeeze() == -1], Y_true)
print(unlabeled_score)

The output is:

0.573333333333

So, the logistic regression shows 57% accuracy for the classification of the unlabeled samples. We can also evaluate the cross-validation score on the whole dataset (before removing some random labels):

from sklearn.model_selection import cross_val_score
total_cv_scores = cross_val_score(LogisticRegression(solver="lbfgs", max_iter=10000, multi_class="auto", random_state=1000), 
  X, Y.squeeze(), cv=10, n_jobs=-1)
print(total_cv_scores)

The output of the previous snippet is:

[0.41666667 0.58333333 0.63888889 0.19444444 0.44444444 0.27777778
 0.44444444 0.38888889 0.5        0.41666667]

Thus, the classifier achieves an average 43% accuracy when using 10 folds (each test set contains 36 samples) if all the labels are known.

We can now implement a CPLE algorithm. The first thing is to initialize a LogisticRegression instance and the soft-labels:

lr = LogisticRegression(solver="lbfgs", max_iter=10000, multi_class="auto", random_state=1000)
q0 = np.random.uniform(0, 1, size=nb_unlabeled)

q0 is a random array of values bounded in the half-open interval [0, 1); therefore, we also need a converter to transform each qi into an actual binary label. We can achieve this using the NumPy function np.vectorize(), which allows us to apply a transformation to all the elements of a vector:

trh = np.vectorize(lambda x: 0.0 if x < 0.5 else 1.0)

In order to compute the log-likelihood, we also need a weighted log-loss (similar to the scikit-learn function log_loss(), which, however, computes the negative log-likelihood and doesn't support the per-class weighting that we need here):

def weighted_log_loss(yt, p, w=None, eps=1e-15):
    # Per-sample weights: w for the class 0 term and 1 - w for the class 1 term
    # (both columns are equal to 1 when no weights are provided)
    if w is None:
        w_t = np.ones((yt.shape[0], 2))
    else:
        w_t = np.vstack((w, 1.0 - w)).T
    
    # One-hot representation of the binary labels
    Y_t = np.vstack((1.0 - yt.squeeze(), yt.squeeze())).T
    # Per-sample weighted log-likelihood (probabilities are clipped for numerical stability)
    L_t = np.sum(w_t * Y_t * np.log(np.clip(p, eps, 1.0 - eps)), axis=1)
    
    return np.mean(L_t)

This function computes the following expression:
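Transcribing the code directly (p_{i,0} and p_{i,1} are the predicted probabilities of the two classes, y_i ∈ {0, 1} are the labels, and w_i is the optional weight, with both columns set to 1 when w is None), the function computes:

L(Y, P; w) = \frac{1}{M} \sum_{i=1}^{M} \left[ w_i (1 - y_i) \log p_{i,0} + (1 - w_i)\, y_i \log p_{i,1} \right]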

We also need a function to build the dataset with variable soft-labels qi:

def build_dataset(q):
    Y_unlabeled = trh(q)
    
    X_n = np.zeros((nb_samples, nb_dimensions))
    X_n[0:nb_samples - nb_unlabeled] = X[Y.squeeze()!=-1]
    X_n[nb_samples - nb_unlabeled:] = X[Y.squeeze()==-1]
    
    Y_n = np.zeros((nb_samples, 1))
    Y_n[0:nb_samples - nb_unlabeled] = Y[Y.squeeze()!=-1]
    Y_n[nb_samples - nb_unlabeled:] = np.expand_dims(Y_unlabeled, axis=1)
    
    return X_n, Y_n

At this point, we can define our contrastive log-likelihood:

def log_likelihood(q):
    X_n, Y_n = build_dataset(q)
    Y_soft = trh(q)
    
    lr.fit(X_n, Y_n.squeeze())
    
    p_sup = lr.predict_proba(X[Y.squeeze() != -1])
    p_semi = lr.predict_proba(X[Y.squeeze() == -1])
    
    l_sup = weighted_log_loss(Y[Y.squeeze() != -1], p_sup)
    l_semi = weighted_log_loss(Y_soft, p_semi, q)
    
    return l_semi - l_sup

This method will be called by the optimizer, passing a different q vector each time. The first step is building the new dataset and computing Y_soft, which contains the labels corresponding to q. Then, the logistic regression classifier is trained with the dataset (as Y_n is a (k, 1) array, it's necessary to squeeze it to avoid a warning; the same thing is done when using Y as a Boolean indicator). At this point, it's possible to compute both p_sup and p_semi using the method predict_proba(), and finally we can compute the semi-supervised and supervised log-losses. Their difference is a function of qi and is the term that we want to minimize, while the maximization of the log-likelihood with respect to the model parameters is performed implicitly when training the logistic regression.

The optimization is carried out using the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm implemented in SciPy:

from scipy.optimize import fmin_bfgs
q_end = fmin_bfgs(f=log_likelihood, x0=q0, maxiter=1000, disp=False)

This is a very fast algorithm, but the reader is encouraged to experiment with other methods or libraries. The two parameters we need in this case are f, which is the function to minimize, and x0, which is the initial condition for the independent variables. maxiter is helpful for avoiding an excessive number of iterations when no improvements are achieved (considering the non-convex nature of the problem, this can happen very frequently). Once the optimization is complete, q_end contains the optimal soft-labels. We can, therefore, rebuild our dataset:

X_n, Y_n = build_dataset(q_end)

With this final configuration, we can retrain the logistic regression and check the cross-validation accuracy:

final_semi_cv_scores = cross_val_score(
        LogisticRegression(solver="lbfgs", max_iter=10000, multi_class="auto", random_state=1000),
        X_n, Y_n.squeeze(), cv=10, n_jobs=-1)
print(final_semi_cv_scores)

The output of the CPLE cross-validation is:

[0.97297297 0.86486486 0.94594595 0.86486486 0.89189189 0.88571429
 0.48571429 0.91428571 0.88571429 0.48571429]

The semi-supervised solution based on the CPLE algorithm achieves an average 81% accuracy, outperforming, as expected, the supervised approach.

CPLE Summary

CPLE is able to outperform standard classification methods, at the price of a computational cost that can become relatively large due to the repeated re-evaluation of the log-likelihood during the optimization. However, extra complexity is a normal condition in semi-supervised learning and, at this point, it should be clear when this cost is reasonable and when it's preferable to stick to a smaller, labeled dataset. The reader can try other examples using different classifiers, such as SVMs or Decision Trees, and verify when CPLE yields higher accuracy than the corresponding supervised algorithm.

Semi-supervised Support Vector Machines (S3VM)

When we discussed the cluster assumption in the previous chapter, we also defined the low-density regions as boundaries and the corresponding problem as low-density separation. A common supervised classifier based on this concept is the Support Vector Machine (SVM), the objective of which is to maximize the margin between the dense regions where the samples lie.

S3VM Theory

For a complete description of linear and kernel-based SVMs, please refer to Bonaccorso G., Machine Learning Algorithms, Second Edition, Packt Publishing, 2018. However, it's useful to recall the basic model for a linear SVM with slack variables ηi:
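Using ηi ≥ 0 for the slack variables (the symbol is chosen here to match the variables used later for the labeled samples), the standard soft-margin problem reads:

\min_{w, b, \eta} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \eta_i \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 - \eta_i, \;\; \eta_i \ge 0 \;\; \forall i = 1, \ldots, N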

This model is based on the assumption that yi can be either -1 or 1. The slack variables ηi (or soft margins) are variables, one for each sample, introduced to relax the condition imposed by the original formulation (min ||w||), which is based on a hard margin that leaves misclassified all the samples lying on the wrong side. They are defined by the hinge loss, as follows:
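In terms of the decision function, each slack variable coincides with the hinge loss of the corresponding sample:

\eta_i = \max\left(0, \; 1 - y_i(w^T x_i + b)\right)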

With those variables, we allow some points to overcome the limit without being misclassified if they remain within a distance controlled by the corresponding slack variable (which is also minimized during the training phase, so as to avoid uncontrollable growth). In the following diagram, there's a schematic representation of this process:

Linear SVM classification scenario

The last elements of each high-density region are the support vectors. Between them, there's a low-density region (it can also be zero-density in some cases) where our separating hyperplane lies.

In Chapter 1, Machine Learning Model Fundamentals, we defined the concept of empirical risk as a proxy for expected risk; therefore, we can turn the SVM problem into the minimization of empirical risk, under the Hinge cost function (with or without Ridge Regularization on w):
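A standard way to state this minimization (λ weights the optional Ridge term) is:

\min_{w, b} \; \frac{1}{N} \sum_{i=1}^{N} \max\left(0, \; 1 - y_i(w^T x_i + b)\right) + \lambda \|w\|^2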

Theoretically, every function that is always bounded by two hyperplanes containing the support vectors is a good classifier, but we need to minimize the empirical risk (and so, the expected risk); hence, we look for the maximum margin between high-density regions. This model can separate two dense regions with irregular boundaries; by adopting a kernel function, it can also work in non-linear scenarios. The natural question, at this point, is about the best strategy to integrate labeled and unlabeled samples when we need to solve this kind of problem in a semi-supervised scenario.

The first element to consider is the ratio: if we have a low percentage of unlabeled points, the problem is mainly supervised, and the generalization ability learned using the training set should be enough to correctly classify all the unlabeled points. On the other hand, if the number of unlabeled samples is much larger, we return to an almost pure clustering scenario (like the one discussed in the paragraph about the Generative Gaussian mixtures). That means that in order to exploit the strength of semi-supervised methods in low-density separation problems, we should consider situations where the ratio labeled/unlabeled is about 1.0.

However, even if we have the predominance of a class (for example, if we have a huge unlabeled dataset and only a few labeled samples), it's always possible to use the algorithms we're going to discuss, even if their performance can sometimes be equal to or lower than a pure supervised/clustering solution. Transductive SVMs, for example, show better accuracies when the labeled/unlabeled ratio is very small, while other methods can behave in a completely different way. When working with semi-supervised learning (and its assumptions), it's always important to bear in mind that each problem is supervised and unsupervised at the same time; the best solution must be evaluated in every different context.

A solution for this problem is offered by the Semi-Supervised SVM (also known as S3VM) algorithm. If we have N labeled samples and M unlabeled samples, the objective function becomes as follows:
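In the notation used throughout this section (ηi for the labeled slack variables, ξj and zj for the two unlabeled sets described in the list below), the objective has the general form:

\min_{w, b, \eta, \xi, z} \; \frac{1}{2}\|w\|^2 + C\left[\sum_{i=1}^{N} \eta_i + \sum_{j=1}^{M} \min(\xi_j, z_j)\right]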

The first term imposes the standard SVM condition about the maximum separation distance, while the second block is divided into two parts:

  • We need to add N slack variables ηi to guarantee a soft-margin for the labeled samples.
  • At the same time, we have to consider the unlabeled points, which could be classified as +1 or -1. Therefore, we have two corresponding slack-variable sets, ξj and zj. However, we want to keep the smallest variable of each pair, so as to be sure that each unlabeled sample is placed on the sub-space where the maximum accuracy is achieved.

The constraints necessary to solve the problems become as follows:
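Consistent with the discussion that follows (the sign convention for the bias matches the standard SVM formulation; the example code below uses a slightly different parametrization), the constraints can be written as:

y_i(w^T x_i + b) \ge 1 - \eta_i \quad \forall i = 1, \ldots, N
w^T x_j + b \ge 1 - \xi_j \quad \forall j = 1, \ldots, M
-(w^T x_j + b) \ge 1 - z_j \quad \forall j = 1, \ldots, M
\eta_i \ge 0, \;\; \xi_j \ge 0, \;\; z_j \ge 0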

The first constraint is limited to the labeled points and is the same as in a supervised SVM. The following two, instead, consider the possibility that an unlabeled sample could be classified as +1 or -1. Let's suppose, for example, that the label yj for the sample xj should be +1 and that the first member of the second inequality is a positive number K (so the corresponding term of the third inequality is -K). It's easy to verify that the first slack variable only needs to satisfy ξj ≥ 1 - K (so it can remain small), while the second one must satisfy zj ≥ 1 + K. Therefore, in the objective, min(ξj, zj) = ξj is chosen and minimized.

The choice of the hyperparameter C should be based on the same criteria employed for standard SVM. In particular, when C → 0, the number of support vectors is reduced toward the minimum, while larger values (for example, C=1) imply an increased flexibility of the boundaries. Even if this model is structurally an SVM, the presence of unlabeled samples influences the optimization process, yielding very different results when, for example, C is increased from 0 to 1.

The reader must keep in mind that the algorithm will try to find the smallest slack values for the unlabeled samples, but since there are no guiding labels, the final choice could be incompatible with the problem. Considering both cluster and smoothness assumptions, the unlabeled samples should acquire the label of the closest coherent dense region. As it's possible to see by changing the value of C, this is not always true, because, in order to minimize the slack error, the algorithm could assign a label geometrically incompatible with the sample.

For example, the closest region to a point next to the boundary might be a cluster with fixed labels. In this case, we'd expect this point to have the same label.

However, if the objective is minimized by assigning a different label, the algorithm will make an inappropriate decision, because there are no penalties for not respecting the cluster assumption. Therefore, as a generic rule, we suggest starting with a larger value (for example, C=1) and reducing it only after having compared the results with an analogous supervised classifier. In most cases, the optimal value can be found with a few iterations, when a performance measure (for example, AUC, F1-score, or accuracy) of the S3VM overcomes the corresponding measure of the benchmark method.

This algorithm is inductive and generally yields good (if not excellent) performances. However, it has a very high computational cost, and should be solved using optimized (native) libraries. Unfortunately, it's a non-convex problem, and there are no methods to solve it in closed form; this also implies that the optimization might not reach the optimal configuration.

Let's now implement an S3VM in Python and evaluate the results.

Example of S3VM

We'll now implement an S3VM in Python using the SciPy optimization methods, which are mainly based on C and FORTRAN implementations. The reader can try it with other libraries, such as NLopt and LIBSVM, and compare the results.

NLopt is a complete optimization library developed at MIT. It is available for different operating systems and programming languages. The website is https://nlopt.readthedocs.io.

LIBSVM is an optimized library for solving SVM problems and it is adopted by scikit-learn together with LIBLINEAR. It's also available for different environments. The home page is https://www.csie.ntu.edu.tw/~cjlin/libsvm/.

A possibility suggested by Bennett and Demiriz is to use the L1-norm for w, so as to linearize the objective function; however, this choice seems to produce good results only for small datasets. We're going to keep the original formulation based on the L2-norm, using the Constrained Optimization by Linear Approximation (COBYLA) algorithm to optimize the objective.

Let's start by creating a bidimensional dataset, with both labeled and unlabeled samples (with 50% of each):

from sklearn.datasets import make_classification
nb_samples = 100
nb_unlabeled = 50
X, Y = make_classification(n_samples=nb_samples, n_features=2, n_redundant=0, random_state=1000)
Y[Y==0] = -1
Y[nb_samples - nb_unlabeled:nb_samples] = 0

For simplicity (and without any impact, because the samples are shuffled), we set the last 50 samples as unlabeled (y = 0). The corresponding plot is shown in the following graph:

Original labeled and unlabeled dataset

The empty circles represent unlabeled points, which are spread throughout the entire dataset, while full circles and triangle markers are assigned respectively to class 0 (y=-1) and 1 (y=1). At this point, we need to initialize all variables required for the optimization problem:

import numpy as np
w = np.random.uniform(-0.1, 0.1, size=X.shape[1])
eta = np.random.uniform(0.0, 0.1, size=nb_samples - nb_unlabeled)
xi = np.random.uniform(0.0, 0.1, size=nb_unlabeled)
zi = np.random.uniform(0.0, 0.1, size=nb_unlabeled)
b = np.random.uniform(-0.1, 0.1, size=1)
C = 1.0
theta0 = np.hstack((w, eta, xi, zi, b))

Since the optimization algorithm requires a single array, we've stacked all vectors into a horizontal array theta0 using the np.hstack() function. We also need to vectorize the min() function in order to apply it to arrays:

vmin = np.vectorize(lambda x1, x2: x1 if x1 <= x2 else x2)

Now, we can define the objective function:

def svm_target(theta, Xd, Yd):
    wt = theta[0:2].reshape((Xd.shape[1], 1))
    
    s_eta = np.sum(theta[2:2 + nb_samples - nb_unlabeled])
    s_min_xi_zi = np.sum(vmin(theta[2 + nb_samples - nb_unlabeled:2 + nb_samples], 
                                      theta[2 + nb_samples:2 
                                      + nb_samples + nb_unlabeled]))
    return C * (s_eta + s_min_xi_zi) + 0.5 * np.dot(wt.T, wt)

The arguments are the current theta vector and the complete datasets Xd and Yd. The dot product of w has been multiplied by 0.5 to keep the conventional notation used for supervised SVMs; the constant can be omitted without any impact. At this point, we need to define all the constraints; since they are based on the slack variables, each function (which shares the same parameters as the objective) is parametrized with an index, idx. The labeled constraint is as follows:

def labeled_constraint(theta, Xd, Yd, idx):
    wt = theta[0:2].reshape((Xd.shape[1], 1))
    
    c = Yd[idx] * (np.dot(Xd[idx], wt) + theta[-1]) + \
        theta[2:2 + nb_samples - nb_unlabeled][idx] - 1.0
    
    return (c >= 0)[0]

The unlabeled constraints, instead, are as follows:

def unlabeled_constraint_1(theta, Xd, idx):
    wt = theta[0:2].reshape((Xd.shape[1], 1))
    
    c = np.dot(Xd[idx], wt) - theta[-1] + \
        theta[2 + nb_samples - nb_unlabeled:2 + nb_samples][idx - nb_samples + nb_unlabeled] - 1.0
        
    return (c >= 0)[0]
def unlabeled_constraint_2(theta, Xd, idx):
    wt = theta[0:2].reshape((Xd.shape[1], 1))
    
    c = -(np.dot(Xd[idx], wt) - theta[-1]) + \
        theta[2 + nb_samples:2 + nb_samples + nb_unlabeled][idx - nb_samples + nb_unlabeled] - 1.0
        
    return (c >= 0)[0]

They're parametrized with the current theta vector, the Xd dataset, and an idx index. We also need to include the constraints for each slack variable (≥ 0):

def eta_constraint(theta, idx):
    return theta[2:2 + nb_samples - nb_unlabeled][idx] >= 0
def xi_constraint(theta, idx):
    return theta[2 + nb_samples - nb_unlabeled:2 + nb_samples][idx - nb_samples + nb_unlabeled] >= 0
def zi_constraint(theta, idx):
    return theta[2 + nb_samples:2 + nb_samples+nb_unlabeled ][idx - nb_samples + nb_unlabeled] >= 0

We can now set up the problem using the SciPy convention:

svm_constraints = []
for i in range(nb_samples - nb_unlabeled):
    svm_constraints.append({
            'type': 'ineq',
            'fun': labeled_constraint,
            'args': (X, Y, i)
        })
    svm_constraints.append({
            'type': 'ineq',
            'fun': eta_constraint,
            'args': (i,)
        })
    
for i in range(nb_samples - nb_unlabeled, nb_samples):
    svm_constraints.append({
            'type': 'ineq',
            'fun': unlabeled_constraint_1,
            'args': (X, i)
        })
    svm_constraints.append({
            'type': 'ineq',
            'fun': unlabeled_constraint_2,
            'args': (X, i)
        })
    svm_constraints.append({
            'type': 'ineq',
            'fun': xi_constraint,
            'args': (i,)
        })
    svm_constraints.append({
            'type': 'ineq',
            'fun': zi_constraint,
            'args': (i,)
        })

Each constraint is represented by a dictionary, where type is set to ineq to indicate that it is an inequality, fun points to the callable object, and args contains all the extra arguments (theta is the main x variable and it's automatically added). Using SciPy, it's possible to minimize the objective using either the Sequential Least Squares Programming (SLSQP) or the Constrained Optimization by Linear Approximation (COBYLA) algorithm. We've chosen the latter, but the reader is free to employ any other method or library. In order to limit the number of iterations, we've also set the optional dictionary parameter 'maxiter': 5000:

from scipy.optimize import minimize
result = minimize(fun=svm_target, 
                  x0=theta0, 
                  constraints=svm_constraints, 
                  args=(X, Y), 
                  method='COBYLA', 
                  tol=0.0001, 
                  options={'maxiter': 5000})

After the training process is complete, we can compute the labels for the unlabeled points:

theta_end = result['x']
w = theta_end[0:2]
b = theta_end[-1]
Xu= X[nb_samples - nb_unlabeled:nb_samples]
yu = -np.sign(np.dot(Xu, w) + b)

In the next graph, it's possible to compare the initial plot (left) with the final one where all points have been assigned a label (right):

Original dataset (left). Training results (right)

As you can see, S3VM succeeded in finding the right label for most unlabeled points, and it's also rather resistant to the noise induced by spurious points. Unfortunately, the problem is not perfectly linearly separable, and the final configuration tends to penalize the class 1 samples below a diagonal separation line (the dominant cluster is the stretched one in the upper-right quadrant). However, the result is promising and worth further investigation using different hyperparameters and optimization algorithms (COBYLA tends to remain stuck in local minima). I invite the reader to go on with such experiments to gain confidence in the method.

S3VM Summary

S3VM is a very powerful approach that offers high flexibility to adapt to different scenarios. It's particularly suitable when the structure of the unlabeled sample is partially (or even completely) unknown, and the main responsibility for labeling must be granted to the labeled samples.

Transductive Support Vector Machines (TSVM)

Another approach to the same problem is offered by Transductive Support Vector Machines (TSVM), proposed by T. Joachims (in Joachims T., Transductive Inference for Text Classification using Support Vector Machines, ICML Vol. 99/1999). TSVMs are particularly suited to cases where the unlabeled sample isn't very noisy and the overall structure of the dataset is trustworthy. A common application of TSVMs is classification on a dataset containing data points drawn from the same data-generating process (for example, medical photos collected using the same instrument) but only partially labeled due to, for example, economic reasons. Since all the images can be trusted, a TSVM can exploit the structure of the dataset to achieve a higher accuracy than the one reachable by a supervised classifier.

TSVM Theory

The idea is to keep the original objective with two sets of slack variables – the first for the labeled samples and the other for the unlabeled ones:
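Using ηi for the labeled slack variables, ξj for the unlabeled ones, and two separate penalty coefficients C_L and C_U (the roles played by C_labeled and C_unlabeled in the code below), the objective can be sketched as:

\min_{w, b, \eta, \xi, y^{(u)}} \; \frac{1}{2}\|w\|^2 + C_L \sum_{i=1}^{N} \eta_i + C_U \sum_{j=1}^{M} \xi_j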

Since this is a transductive approach, we need to consider the unlabeled samples as variable-labeled ones (subject to the learning process), imposing a constraint similar to the supervised points. From a certain viewpoint, this is equivalent to introducing a prior belief about the final classification, strongly based on both cluster and smoothness assumptions.

In other words, a TSVM can trust the structure of the dataset more than an S3VM, and the data scientist has more flexibility in choosing its behavior. Different combinations of CL and CU yield results that can shift from granting complete trust to the labeled points to the opposite condition. As explained in the introduction, the goal of transductive learning is only to classify the unlabeled samples, leveraging both the labeled ones and the structure of the dataset to do so. However, contrary to inductive methods, the constraints imposed by the labeled samples can be weakened in favor of a more geometrically coherent solution.

As for the previous algorithm, we assume we have N labeled samples and M unlabeled ones; therefore, the conditions become as follows:
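With y(u)j denoting the variable labels assigned to the unlabeled samples, the conditions described below take the form:

y_i(w^T x_i + b) \ge 1 - \eta_i, \;\; \eta_i \ge 0 \quad \forall i = 1, \ldots, N
y^{(u)}_j(w^T x_j + b) \ge 1 - \xi_j, \;\; \xi_j \ge 0 \quad \forall j = 1, \ldots, M
y^{(u)}_j \in \{-1, +1\}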

The first constraint is the classical SVM one and works only on the labeled samples. The second one uses the variable labels y(u)j, together with the corresponding slack variables ξj, to impose a similar condition on the unlabeled samples, while the third one is necessary to constrain the variable labels to be equal to -1 or 1.

Just like semi-supervised SVMs, this algorithm is non-convex, and it's useful to try different methods to optimize it. Moreover, the author, in the aforementioned paper, showed how TSVMs outperform a standard supervised SVM when the test set (unlabeled) is large and the training set (labeled) is relatively small. On the other hand, with large training sets and small test sets, a supervised SVM (or another algorithm) is always preferable, because it is faster and yields better accuracy.

We can now create a complete Python example of TSVM and evaluate the results.

Example of TSVM

In our Python implementation, we're going to use a bidimensional dataset similar to the one employed in the previous method. However, in this case, we'll impose 150 unlabeled samples out of a total of 200 points:

from sklearn.datasets import make_classification
nb_samples = 200
nb_unlabeled = 150
X, Y = make_classification(n_samples=nb_samples, n_features=2, n_redundant=0, random_state=1000)
Y[Y==0] = -1
Y[nb_samples - nb_unlabeled:nb_samples] = 0

The corresponding plot is shown in the following figure:

Original labeled and unlabeled dataset

The procedure is similar to the one we used before. First of all, we need to initialize our variables:

import numpy as np
w = np.random.uniform(-0.1, 0.1, size=X.shape[1])
eta_labeled = np.random.uniform(0.0, 0.1, size=nb_samples - nb_unlabeled)
eta_unlabeled = np.random.uniform(0.0, 0.1, size=nb_unlabeled)
y_unlabeled = np.random.uniform(-1.0, 1.0, size=nb_unlabeled)
b = np.random.uniform(-0.1, 0.1, size=1)
C_labeled = 2.0
C_unlabeled = 0.1
theta0 = np.hstack((w, eta_labeled, eta_unlabeled, y_unlabeled, b))

In this case, we also need to define the y_unlabeled vector for the variable labels. I also suggest using two C constants (C_labeled and C_unlabeled), in order to be able to weight the misclassification of labeled and unlabeled samples differently. We used a value of 2.0 for C_labeled and 0.1 for C_unlabeled, because we want to accept the guidance of the labeled samples more than the structure of the unlabeled ones. In a further example, we'll compare the results with the opposite scenario.

The objective function to optimize is as follows:

def svm_target(theta, Xd, Yd):
    wt = theta[0:2].reshape((Xd.shape[1], 1))
    
    s_eta_labeled = np.sum(theta[2:2 + nb_samples - nb_unlabeled])
    s_eta_unlabeled = np.sum(theta[2 + nb_samples - nb_unlabeled:2 + nb_samples])
    
    return (C_labeled * s_eta_labeled) + (C_unlabeled * s_eta_unlabeled) + (0.5 * np.dot(wt.T, wt))

While the labeled and unlabeled constraints are as follows:

def labeled_constraint(theta, Xd, Yd, idx):
    wt = theta[0:2].reshape((Xd.shape[1], 1))
    
    c = Yd[idx] * (np.dot(Xd[idx], wt) + theta[-1]) + \
        theta[2:2 + nb_samples - nb_unlabeled][idx] - 1.0
    
    return int((c >= 0)[0])
def unlabeled_constraint(theta, Xd, idx):
    wt = theta[0:2].reshape((Xd.shape[1], 1))
    
    c = theta[2 + nb_samples:2 + nb_samples + nb_unlabeled][idx - nb_samples + nb_unlabeled] * \
        (np.dot(Xd[idx], wt) + theta[-1]) + \
        theta[2 + nb_samples - nb_unlabeled:2 + nb_samples][idx - nb_samples + nb_unlabeled] - 1.0
    
    return int((c >= 0)[0])

In this example, we want to employ the SLSQP algorithm to optimize the objective. This method computes the Jacobian (that is, the matrix containing the first partial derivatives) of all constraints (including the Boolean ones) and in NumPy 1.8+ the difference operator (-) between Boolean arrays has been deprecated and must be replaced with a logical XOR.

Unfortunately, this can cause incompatibilities with SciPy; for this reason, we've transformed all Boolean outputs into integer values (0 and 1). This substitution doesn't affect either the performance or the final result. At this point, we can introduce the non-negativity constraints for both the labeled and unlabeled slack variables:

def eta_labeled_constraint(theta, idx):
    return int(theta[2:2 + nb_samples - nb_unlabeled][idx] >= 0)
def eta_unlabeled_constraint(theta, idx):
    return int(theta[2 + nb_samples - nb_unlabeled:2 + nb_samples][idx - nb_samples + nb_unlabeled] >= 0)

As in the previous example, we can create the constraint dictionary needed by SciPy:

svm_constraints = []
for i in range(nb_samples - nb_unlabeled):
    svm_constraints.append({
            'type': 'ineq',
            'fun': labeled_constraint,
            'args': (X, Y, i)
        })
    svm_constraints.append({
            'type': 'ineq',
            'fun': eta_labeled_constraint,
            'args': (i,)
        })
    
for i in range(nb_samples - nb_unlabeled, nb_samples):
    svm_constraints.append({
            'type': 'ineq',
            'fun': unlabeled_constraint,
            'args': (X, i)
        })
    svm_constraints.append({
            'type': 'ineq',
            'fun': eta_unlabeled_constraint,
            'args': (i,)
        })

After having defined all the constraints, we can minimize the objective function using method='SLSQP' and the dictionary option 'maxiter': 2000. In general, convergence is achieved in a smaller number of iterations, but we've chosen a conservative limit, as though we were working in a more general scenario:

from scipy.optimize import minimize
result = minimize(fun=svm_target, 
                  x0=theta0, 
                  constraints=svm_constraints, 
                  args=(X, Y), 
                  method='SLSQP', 
                  tol=0.0001, 
                  options={'maxiter': 2000})
print(result['message'])

The output of the previous snippet is:

Optimization terminated successfully.

Such a message confirms that SLSQP successfully found a minimum. I always check the output of the optimization function to make sure that nothing went wrong during the procedure, and I recommend that you do so too. In particular, when using methods like SLSQP, it's important that all the constraints are differentiable. When some of them are not, the algorithm can stop working properly, because the approximations of the Jacobian become unreliable.

When the process is complete, we can compute the labels for the unlabeled samples and compare the plots:

theta_end = result['x']
w = theta_end[0:2]
b = theta_end[-1]
Xu= X[nb_samples - nb_unlabeled:nb_samples]
yu = -np.sign(np.dot(Xu, w) + b)

The plot comparison is shown in the following figure:

Original dataset (left). Final labeled dataset (right)

The misclassification (based on the density distribution) is slightly lower than S3VM, because we have decided to trust the labeled samples more. Of course, it's possible to change the C values and the optimization method until the expected result has been reached. A good benchmark is provided by a supervised SVM, which can have better performance when the training set is huge enough (and when it represents the whole pdata correctly). However, considering the clusters in the original dataset, we can conclude that TSVM achieved the expected result.

In particular, if we consider the separation region, we can observe that a slight modification of the slope would force the nearby dense class 0 blob to be, at least partially, assigned to the label 1. The presence of separate slack-variable sets, contrary to S3VM, has allowed the smoothness of the separation boundary to increase, with a final result similar to the ones achievable with non-linear methods.

Analysis of different TSVM configurations

It's interesting to evaluate different combinations of the C parameters, starting from a standard supervised linear SVM. The dataset is smaller, with a large number of unlabeled samples:

nb_samples = 100
nb_unlabeled = 90
X, Y = make_classification(n_samples=nb_samples, n_features=2, n_redundant=0, random_state=100)
Y[Y==0] = -1
Y[nb_samples - nb_unlabeled:nb_samples] = 0

We use the standard SVM implementation provided by scikit-learn (the SVC() class) with a linear kernel and C=1.0:

from sklearn.svm import SVC
svc = SVC(kernel='linear', C=1.0)
svc.fit(X[Y != 0], Y[Y != 0])
Xu_svc= X[nb_samples - nb_unlabeled:nb_samples]
yu_svc = svc.predict(Xu_svc)

The SVM is trained with the labeled samples, and the vector yu_svc contains the prediction for the unlabeled samples. The resulting plot (in comparison with the original dataset) is shown in the following figure:

Original dataset (left). Final labeled dataset (right) with C = 1.0

All the labeled samples are represented with bigger squares and circles. The result meets our expectations, but there's an area (x ∈ [-1, 0], y ∈ [-2, -1]) where the SVM decided to impose the circle class (class 0), even though the unlabeled points are close to a square. This hypothesis might not always be acceptable considering the clustering assumption.

In fact, in a high-density region, there are samples belonging to two classes. A similar result is obtained using a TSVM with CL=1 and CU=10.0 (the reader can easily check it as an exercise):

Original dataset (left). Final labeled dataset (right) with CL = 1 and CU = 10

In this case, we are trusting the labeled samples to determine the final result. As it's possible to see, the boundary position is slightly worse than SVM. For example, the class 1 points in the region are aligned with a sequence of other points. We can consider the structure as a very long blob (with a strong directional cohesion). The standard SVM decided to split it into two regions with a separation line passing through the central empty area. This might be the most logical choice, considering the overall dataset.

On the other hand, using the TSVM, we have granted more responsibility to the labeled sample (that is, the unlabeled slack variables can have a larger variation so the configuration can more easily include the constraints imposed by the labeled sample) and such a choice led the algorithm to reduce the slope of the separation line. The result, although similar to the SVM one, has more cluster-coherence and the two classes are also more balanced. Remember that, in a semi-supervised scenario, we don't have any external guarantee about the class-balance, and it can only be obtained through the geometry of the dataset. In our case, we started with a balanced dataset; however, if it were potentially unbalanced, the only resources that a TSVM has in order to achieve this objective are the cluster and smoothness assumptions. By granting a larger weight to the unlabeled samples, we enforce this condition implicitly. In other words, we instruct the model to find the most appropriate labels according to the clustering structure of the dataset, and to minimize all abrupt transitions.

Let's try now to invert the responsibilities by setting CL=10 and CU=0.1:

Original dataset (left). Final labeled dataset (right) with CL = 10 and CU = 0.1

In this case, the labeled samples are allowed to exploit more flexible margins, while the unlabeled ones are restricted to an almost hard margin. The separation line is now diagonal, and the misclassification error is lower or comparable to the previous configuration. However, in this case, we can fully appreciate the effectiveness of TSVMs. In fact, the result shows a more coherent geometrical separation. All the clusters are compact, and the slack variables for the unlabeled samples are kept smaller, while the labeled samples are allowed to be more flexible. This condition determines an increased ability to adapt the final configuration to the geometrical structure of the dataset.

Of course, the reader may wonder which of the two results is correct. As we are working in a semi-supervised scenario, fully answering this question would require access to a larger labeled dataset, which is often impossible. My suggestion is to test several configurations (on sub-sampled datasets) before making the final decision. Whenever the labeled points must be trusted not only because they have been correctly pre-classified, but also because their position is geometrically strategic (for example, they are the centroids of specific populations), the choice should be a configuration similar to a standard SVM.

On the contrary, when the structure of the dataset is definitely more representative of the underlying data-generating process, a TSVM can grant more responsibility to the unlabeled sample and exploit the labeled one to find a reasonable classification that is also compatible with the position of the dense regions. Finally, it's also important to remember that the smoothness assumption is based on the idea that abrupt changes are very rare. Comparing the two previous examples, the reader will notice that, in the second configuration, the separation boundary is a little bit smoother than the first one.

In fact, while in the first configuration there are at least two abrupt class changes and all the central empty region is subject to the same behavior, the second one seems to transition from one class to another in a more flexible and smoother way, with only an abrupt (inevitable) separation. In this case, we can conclude that the second configuration is preferable.

In Chapelle O., Schölkopf B., Zien A. (edited by), Semi-Supervised Learning, The MIT Press, 2010, there are further details about possible optimization strategies, with strengths and weaknesses. I invite the reader to check the book if they're interested in studying other, more complex, problems and solution strategies.

TSVM Summary

TSVMs are powerful semi-supervised models that are particularly suited to scenarios where the geometrical structure of the dataset is trustworthy and all the points are drawn from the same data-generating process. If these conditions are met, the algorithm can leverage the structure of the dataset to find the most appropriate labeling for the unlabeled sample. On the other hand, if the unlabeled sample is noisy, or its structure can derive from multiple processes, TSVM is not an appropriate choice and might yield very inaccurate results.

Summary

The algorithms discussed in this chapter are generally more powerful than those analyzed in the previous one, but they have specific differences that must always be considered. CPLE and S3VM are inductive methods.

CPLE is an inductive, semi-supervised classification framework based on statistical learning concepts that can be adopted together with any supervised classifier. The main concept is to define a contrastive log-likelihood based on soft-labels that takes into account both labeled and unlabeled samples. The importance granted to the latter is conditioned to the maximization of the log-likelihood, and therefore the algorithm is less suited to tasks where fine control is needed.

Another inductive classification approach is provided by the S3VM algorithm, which is an extension of the classical SVM approach, based on two extra optimization constraints to address the unlabeled samples. This method is relatively powerful, but it's non-convex and, therefore, very sensitive to the algorithms employed to minimize the objective function. In both the S3VM and CPLE cases, the level of trust granted to the unlabeled sample is always relatively low, and the main responsibility for the classification lies with the labeled sample.

An alternative to S3VM is provided by the TSVM, which tries to minimize the objective with a condition based on variable labels. The problem is, hence, divided into two parts: a supervised one, which is exactly the same as in standard SVMs, and a semi-supervised one, which has a similar structure but without fixed y labels. This problem is non-convex too, and it's necessary to evaluate different optimization strategies to find the best trade-off between accuracy and computational complexity. The TSVM's transductive approach relies heavily on the structure of the dataset; it's only a reasonably good choice when both labeled and unlabeled samples are known to be drawn from the same data-generating process.

In the Further reading section, there are some useful resources so you can examine all these problems in depth and find a suitable solution for each particular scenario.

In the next chapter, Chapter 5, Graph-Based Semi-Supervised Learning, we'll continue this exploration by discussing some important algorithms based on the structure underlying the dataset. In particular, we're going to employ graph theory to perform the propagation of labels to unlabeled samples, and to reduce the dimensionality of datasets in non-linear contexts.

Further reading

  • Chapelle O., Schölkopf B., Zien A. (edited by), Semi-Supervised Learning, The MIT Press, 2010
  • Peters J., Janzing D., Schölkopf B., Elements of Causal Inference, The MIT Press, 2017
  • Howard R. A., Dynamic Programming and Markov Processes, The MIT Press, 1960
  • Hughes G. F., On the mean accuracy of statistical pattern recognizers, IEEE Transactions on Information Theory, 14/1, 1968
  • Belkin M., Niyogi P., Semi-supervised learning on Riemannian manifolds, Machine Learning 56, 2004
  • Blum A., Mitchell T., Combining Labeled and Unlabeled Data with Co-Training, 11th Annual Conference on Computational Learning Theory, 1998
  • Loog M., Contrastive Pessimistic Likelihood Estimation for Semi-Supervised Classification, arXiv:1503.00269, 2015
  • Joachims T., Transductive Inference for Text Classification using Support Vector Machines, ICML Vol. 99/1999
  • Bonaccorso G., Machine Learning Algorithms, Second Edition, Packt Publishing, 2018