5

Graph-Based Semi-Supervised Learning

In this chapter, we continue our discussion about semi-supervised learning, considering a family of algorithms that are based on the graph obtained from a dataset, and the existing relationships among samples. The problems that we are going to discuss belong to two main categories: the propagation of class labels to unlabeled samples, and the use of non-linear techniques based on the manifold assumption to reduce the dimensionality of the original dataset. In particular, this chapter covers the following propagation algorithms:

  • Label propagation based on the weight matrix
  • Label propagation in scikit-learn, based on transition probabilities
  • Label spreading
  • Laplacian regularization
  • Propagation based on Markov random walks

For the manifold learning section, we're discussing the following:

  • The Isomap algorithm and the multidimensional scaling approach
  • Locally linear embedding
  • Laplacian spectral embedding
  • t-distributed Stochastic Neighbor Embedding (t-SNE)

Each algorithm is described mathematically and a coded example is provided. We'll start with label propagation, which covers a family of algorithms that rely on the dataset structure to find the missing labels.

Label propagation

Label propagation is a family of semi-supervised algorithms that rely on the graph representation of a dataset to exploit the existing relations among nodes in order to propagate the labels to unlabeled points. In particular, if we have N labeled points (with bipolar labels +1 and –1) and M unlabeled points (denoted by y = 0), it's possible to build an undirected graph based on a measure of geometric affinity among samples. In the following figure, there's an example of such a structure:

Example of binary graph

The graph is defined as a structure containing two sets, G = {V, E}. V is the set of vertices (or nodes), each associated with a sample and its label y ∈ {-1, +1, 0}, while the edge set E is based on an affinity measure that encodes the strength of the relation between two nodes. For practical reasons, it's helpful to introduce a matrix W whose elements wij are:

  • The actual weight of the edge (i, j). In this case, W is called the affinity matrix, and the elements wij are defined through an actual affinity measure, so that wij > wiz means that node i has a stronger connection with j than with z.
  • A value from the set K = {0, 1}, where 1 indicates the presence of a connection, and 0 the opposite condition. In this case, W is often called the adjacency matrix and, contrary to the previous example, now W can also be singular when the graph is not fully connected (for example, if the outliers are too far away to be connected to any other nodes, their rows will be null and det W = 0). It's necessary to pay attention to these cases when the algorithms need to invert W, or its derived matrix, which will be singular too.

In the preceding example, there are four labeled points (two with y = +1 and two with y = -1), and two unlabeled points (y = 0). The affinity matrix W is normally symmetric and square, with dimensions equal to (N + M) × (N + M). It can be obtained with different approaches. The most common ones, also adopted by scikit-learn, are:

  • k-Nearest Neighbors (we are going to discuss this algorithm in further detail in Chapter 6, Clustering and Unsupervised Models), which yields an adjacency matrix: wij = 1 if xj is among the k nearest neighbors of xi (0 otherwise)
  • Radial basis function kernel, which instead outputs an actual affinity matrix with non-null entries: wij = exp(-γ||xi - xj||²)

Sometimes, in the radial basis function kernel, the parameter γ is represented as the reciprocal of 2σ²; however, small γ values correspond to a large variance and increase the radius, including farther points and smoothing the class over a larger number of samples, while large values restrict the boundaries to a subset that tends to a single sample. Instead, in the KNN kernel, the parameter k controls the number of samples to consider as neighbors.

To describe the basic algorithm, we also need to introduce the degree matrix D, a diagonal matrix whose elements are defined as Dii = Σj wij.

Each non-null element of D represents the degree of the corresponding vertex. This can be the number of incident edges, or a measure proportional to it (as in the case of W based on the radial basis function). The degree matrix plays a fundamental role in the definition of a particular operator, called the graph Laplacian, which is extremely helpful in many of the algorithms that we are going to discuss.
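
As a quick reference, here is a minimal NumPy sketch (the weight matrix is purely illustrative) that builds the degree matrix and the standard unnormalized Laplacian L = D - W:

import numpy as np

# Purely illustrative symmetric weight matrix for a 4-node graph
W = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 0.0, 0.2],
              [0.5, 0.0, 0.0, 1.0],
              [0.0, 0.2, 1.0, 0.0]])

# Degree matrix: each diagonal element is the sum of the weights incident to the node
D = np.diag(np.sum(W, axis=1))

# Unnormalized graph Laplacian (standard definition)
L = D - W
print(L)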

The general idea of label propagation is to let each node propagate its label to its neighbors and iterate the procedure until convergence.

Formally, we have a dataset containing both labeled and unlabeled samples, X = {x1, ..., xN, xN+1, ..., xN+M}, with the corresponding label vector Y = (y1, ..., yN, 0, ..., 0), where the last M components (associated with the unlabeled points) are set to 0.

The steps of the main label propagation algorithm (as proposed by Zhu and Ghahramani in Zhu X., Ghahramani Z., Learning from Labeled and Unlabeled Data with Label Propagation, CMU-CALD-02-107, 2002), are:

  1. Select an affinity matrix type (KNN or RBF) and compute W
  2. Compute the degree matrix D
  3. Define the matrix A = D⁻¹W
  4. Define the initial label vector Y(0) = (y1, ..., yN, 0, ..., 0)
  5. Iterate until the convergence of the following steps:

     Y(t+1) = AY(t)
     Yi(t+1) = yi for every labeled sample xi

The first update performs a propagation step with both labeled and unlabeled points. Each label is spread from a node through its outgoing edges, and the corresponding weight, normalized with the degree, increases or decreases the effect of each contribution. The second command instead resets the y values of the labeled samples to their original values. The final labels can be obtained as Y_final = sign(Y(∞)).

The proof of convergence is very easy. If we partition the matrix A and the vector Y according to the relationship among labeled and unlabeled samples (identified respectively with the subscripts L and U), we get:

A = (ALL ALU; AUL AUU) and Y = (YL; YU)

If we consider that only the first N components of Y are non-null and they are clamped at the end of each iteration, the update for the unlabeled block depends only on AUL and AUU. We are interested in proving the convergence for the part regarding the unlabeled samples (the labeled ones are fixed), so we can write the update rule as:

YU(t+1) = AUUYU(t) + AULYL

Transforming the recursion into an iterative process, the previous formula becomes:

YU(t+1) = (I + AUU + AUU^2 + ... + AUU^t)AULYL + AUU^(t+1)YU(0)

In the previous expression, the second term is null (YU(0) = 0), so we need to prove that the first term converges; however, it's easy to recognize a truncated matrix geometric series (Neumann series), which has a limit if the matrix (I - AUU) is invertible:

I + AUU + AUU^2 + ... = (I - AUU)⁻¹

In our case, AUU is constructed to have all eigenvalues |λi| < 1; therefore, (I - AUU) is invertible and the series converges to:

YU = (I - AUU)⁻¹AULYL

Therefore, the final labeling is unique (when (I - AUU) is invertible) and depends on both the existing labels and on the unlabeled part of the Laplacian. From a mathematical viewpoint, the labeled part is responsible for incorporating the starting condition in the model, while the unlabeled part has the role of defining the dynamics of the propagation according to the structure of the graph.
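
As a sanity check of the closed form, the following self-contained sketch (the tiny dataset and all variable names are purely illustrative) computes YU directly, without any iteration:

import numpy as np

# Tiny illustrative dataset: 4 labeled points (+1/-1) followed by 2 unlabeled ones
X = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.1, 1.9],
              [0.2, 0.1], [1.9, 2.1]])
Y = np.array([1.0, 1.0, -1.0, -1.0, 0.0, 0.0])
n_labeled = 4

# RBF affinity matrix and A = D^-1 W
gamma = 1.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.exp(-gamma * sq_dists)
A = np.diag(1.0 / np.sum(W, axis=1)).dot(W)

# Closed form: Y_U = (I - A_UU)^-1 A_UL Y_L
A_UL = A[n_labeled:, :n_labeled]
A_UU = A[n_labeled:, n_labeled:]
Y_U = np.linalg.solve(np.eye(A_UU.shape[0]) - A_UU, A_UL.dot(Y[:n_labeled]))
print(np.sign(Y_U))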

Example of label propagation

We can implement the algorithm in Python, using a test bidimensional dataset:

from sklearn.datasets import make_classification
nb_samples = 100
nb_unlabeled = 75
X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0, random_state=1000)
Y[Y==0] = -1
Y[nb_samples - nb_unlabeled:nb_samples] = 0

As in the other examples, we set y = 0 for all unlabeled samples (75 out of 100). The corresponding plot is shown in the following graph:

Partially labeled dataset

The dots marked with a cross are unlabeled. At this point, we can define the affinity matrix. In this case, we compute it using both methods:

from sklearn.neighbors import kneighbors_graph
nb_neighbors = 2
W_knn_sparse = kneighbors_graph(X, n_neighbors=nb_neighbors, mode='connectivity', include_self=True)
W_knn = W_knn_sparse.toarray()

The KNN matrix is obtained using the scikit-learn function kneighbors_graph() with the parameters n_neighbors=2 and mode='connectivity'; the alternative is distance, which returns the distances instead of 0 and 1 to indicate the absence/presence of an edge. The include_self=True parameter is useful, as we want to have Wii = 1.

For the RBF matrix, we need to define it manually:

import numpy as np
def rbf(x1, x2, gamma=10.0):
    n = np.linalg.norm(x1 - x2, ord=1)
    return np.exp(-gamma * np.power(n, 2))
W_rbf = np.zeros((nb_samples, nb_samples))
for i in range(nb_samples):
    for j in range(nb_samples):
        W_rbf[i, j] = rbf(X[i], X[j])

The default value for γ is 10, corresponding to a standard deviation σ ≈ 0.22. When using this method, it's important to set a correct value for γ; otherwise, the propagation can degenerate into the predominance of a single class (γ too small). As γ is analogous to the inverse of double the variance of a Gaussian distribution (γ = 1/2σ²), its value should be computed by considering the sample standard deviation of the labeled dataset and by assuming that the influence of the function almost vanishes after about 3σ from the mean. Therefore, the value chosen for γ should always be computed considering this reference. A small γ is equivalent to a large variance; therefore, the affinity matrix will also take very far neighbors into account, while a large γ might result in an almost empty neighborhood.
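
As a rough, purely heuristic starting point (this is an assumption of mine, not a prescription of the algorithm), γ can be derived from the sample standard deviation of the dataset created above, following the 3σ reasoning just described:

import numpy as np

# Scale of the labeled portion of the dataset created above
data_std = np.mean(np.std(X[0:nb_samples - nb_unlabeled], axis=0))

# Assumption: the kernel influence should almost vanish (about 3 sigma)
# within one data standard deviation, hence sigma = data_std / 3
sigma = data_std / 3.0
gamma = 1.0 / (2.0 * np.power(sigma, 2))
print("Suggested gamma: {:.2f}".format(gamma))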

Now, we can compute the degree matrix and its inverse. As the procedure is identical, from this point on, we continue using the RBF affinity matrix:

D_rbf = np.diag(np.sum(W_rbf, axis=1))
D_rbf_inv = np.linalg.inv(D_rbf)

The algorithm is implemented using a variable threshold (tolerance). The value adopted here is 0.01. As the algorithm iterates until ||Y(t) - Y(t-1)||1 < tolerance, the value should be small enough to guarantee that the sign function applied to the final label vector doesn't make any mistake; keeping the tolerance well below 1 is a reasonable rule of thumb. Smaller values are always preferable, but they can lead to a larger number of iterations.

Ideally, it's possible to set a double stop criterion based on the threshold and a maximum number of iterations:

tolerance = 0.01
max_iterations = 2000
# P doesn't change across the iterations, so it can be computed once
P = np.dot(D_rbf_inv, W_rbf)
Yt = Y.copy()
Y_prev = np.zeros((nb_samples,))
iterations = 0
while np.linalg.norm(Yt - Y_prev, ord=1) > tolerance and iterations < max_iterations:
    Y_prev = Yt.copy()
    # Propagation step followed by the clamping of the labeled samples
    Yt = np.dot(P, Yt)
    Yt[0:nb_samples - nb_unlabeled] = Y[0:nb_samples - nb_unlabeled]
    iterations += 1
Y_final = np.sign(Yt)

The final result is shown in the following double plot:

Original dataset (left); dataset after a complete label propagation (right)

As it's possible to see, in the original dataset there's a round dot at about (-0.9, -1), surrounded by square ones. As this algorithm keeps the original labels clamped, we find the same situation after the propagation of the labels. This condition could be acceptable, even if both the smoothness and clustering assumptions are contradicted. Assuming that a correction is reasonable (that is, the initial labeling can be considered flexible and there are no explicit constraints), it's possible to force it by relaxing the algorithm:

tolerance = 0.01
max_iterations = 2000
P = np.dot(D_rbf_inv, W_rbf)
Yt = Y.copy()
Y_prev = np.zeros((nb_samples,))
iterations = 0
while np.linalg.norm(Yt - Y_prev, ord=1) > tolerance and iterations < max_iterations:
    Y_prev = Yt.copy()
    # No clamping step: the original labels can also be updated
    Yt = np.dot(P, Yt)
    iterations += 1
Y_final = np.sign(Yt)

With this modification, the original labels are not clamped anymore, and the propagation algorithm is allowed to change all those values that disagree with the neighborhood. The result is shown in the following plot:

Original dataset (left); dataset after a complete label propagation with overwrite (right)

As you can see, the labeling process has been completed successfully and the point in the neighborhood centered at (-1, -1) has now been relabeled to match its neighbors' labels. We can now analyze an example using the scikit-learn API, which is based on a slightly different algorithm from the one previously discussed.

Label propagation in scikit-learn

Scikit-learn implements a slightly different algorithm, also proposed by Zhu and Ghahramani (in the aforementioned paper), that is equivalent in terms of results but has a different internal dynamic, based on a Markov random walk through the graph until a stationary configuration is found (that is, until the labels don't change anymore).

The affinity matrix W can be computed using both methods (KNN and RBF), but it is normalized to become a probability transition matrix P = D⁻¹W, whose elements are pij = Wij / Σk Wik.

The algorithm operates like a Markov random walk, with the following sequence (assuming that there are Q different labels):

  1. Define a matrix Y with dimensions (N + M) × Q, whose element Yij = P(label = yj) is the probability of the label yj for the sample xi, and where each row is normalized so that all the elements sum up to 1
  2. Define the initial state Y(0)
  3. Iterate until convergence of the following steps:

     Y(t+1) = PY(t)
     Row-normalize Y(t+1) so that each row sums up to 1
     Reset the rows of Y(t+1) corresponding to the labeled samples to their original values

The first update performs a label propagation step. As we're working with probabilities, it's necessary (second step) to renormalize the rows so that their elements sum up to 1. The last update resets the original labels for all labeled samples. In this case, it means imposing P(label = yi) = 1 on the corresponding label and setting all the others to zero. The proof of convergence is very similar to the one for label propagation algorithms, and can be found in Zhu X., Ghahramani Z., Learning from Labeled and Unlabeled Data with Label Propagation, CMU-CALD-02-107. The most important result is that the solution can be obtained in closed form (without any iteration) through this formula:

YU = (I - PUU)⁻¹PULYL

The first term is the sum of a generalized geometric series, where PUU is the unlabeled-unlabeled part of the transition matrix P. PUL, instead, is the unlabeled-labeled part of the same matrix.

For our Python example, we need to build the dataset differently, because scikit-learn considers a sample unlabeled if y = -1:

from sklearn.datasets import make_classification
nb_samples = 1000
nb_unlabeled = 750
X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0, random_state=100)
Y[nb_samples - nb_unlabeled:nb_samples] = -1

We can now train a LabelPropagation instance with an RBF kernel and gamma=10.0:

from sklearn.semi_supervised import LabelPropagation
lp = LabelPropagation(kernel='rbf', gamma=10.0)
lp.fit(X, Y)
Y_final = lp.predict(X)

The result is shown in the following double plot:

Original dataset (left). Dataset after a scikit-learn label propagation (right)

As expected, the propagation converged to a solution that respects both the smoothness and the clustering assumptions. In fact, as it's possible to see, the labels are assigned considering the clustering structure of the dataset (that is, the blobs are always coherent) and the transitions are relatively smooth (excluding those imposed by the clamped labeled samples).

Label spreading

Another algorithm that we need to analyze, proposed by Zhou et al., is called label spreading; it offers slightly better stability when the dataset is very noisy or dense. In these cases, standard label propagation might suffer a loss of precision due to the closeness of points with different labels. Conversely, label spreading is more robust because the Laplacian is normalized and abrupt transitions are more heavily penalized (the mathematical details are quite complex, but the reader can find them in Biyikoglu T., Leydold J., Stadler P. F., Laplacian Eigenvectors of Graphs, Springer, 2007).

The algorithm is based on the normalized graph Laplacian, defined as:

Ln = D^(-1/2)(D - W)D^(-1/2) = I - D^(-1/2)WD^(-1/2)

Considering it in matrix form, it has diagonal elements equal to 1 if the degree of the corresponding node is not null (0 otherwise), and all the other elements equal to:

Lij = -wij / √(DiiDjj)

This operator is a particular (normalized) case of the generic, unnormalized graph Laplacian L = D - W.
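
As a quick reference, here is a minimal NumPy sketch (with a purely illustrative weight matrix) that builds both the unnormalized and the symmetric normalized Laplacian and checks that the eigenvalues of the latter lie in [0, 2]:

import numpy as np

# Purely illustrative symmetric weight matrix (no self-loops)
W = np.array([[0.0, 1.0, 0.8, 0.0],
              [1.0, 0.0, 0.0, 0.5],
              [0.8, 0.0, 0.0, 1.0],
              [0.0, 0.5, 1.0, 0.0]])

D = np.diag(np.sum(W, axis=1))
L = D - W

# Symmetric normalized Laplacian: Ln = D^(-1/2) (D - W) D^(-1/2)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_norm = D_inv_sqrt.dot(L).dot(D_inv_sqrt)

# The eigenvalues of Ln always lie in the interval [0, 2]
print(np.linalg.eigvalsh(L_norm))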

The behavior of such an operator is analogous to a discrete Laplacian operator, whose real-valued version is the fundamental element of all diffusion equations. Let's suppose we have a small portion of a graph, as shown in the following figure:

Portion of graph containing two nodes

If there are no other connections that involve either node a or node b, the degrees are deg a = 4 and deg b = 1. Let's now consider the Laplacian value for the edge (a, b). As the nodes are connected and D is diagonal, Lab = -wab = -1, while Laa = deg a = 4.

If we think about a path through the graph (that is, an Eulerian path, so as to be sure to traverse each edge exactly once), the transition a → b can happen starting from three different previous states. Therefore, the flow is potentially greater than in the case where, for example, node a has a single connection. It's not difficult to understand that this concept is strictly related to the second derivative of a function (or, in the case of multiple variables, to the Laplace-Beltrami operator). As an analogy, let's consider the generic heat equation:

∂T/∂t = α∇²T
This equation describes the behavior of the temperature of a room when a point in the room is suddenly heated. From basic physics concepts, we know that heat will spread until the temperature reaches an equilibrium point, and the speed of variation is proportional to the Laplacian of the distribution. If we consider a bidimensional grid at the equilibrium (that is, when the derivative with respect to time becomes null) and we discretize the Laplacian operator considering the incremental ratios, we obtain:

T(x, y) = ¼[T(x + 1, y) + T(x - 1, y) + T(x, y + 1) + T(x, y - 1)]
Therefore, at the equilibrium, each point has a value that is the mean of its direct neighbors (in the case of a graph, the number of neighbors is encoded in the degree of a node). It's possible to prove that this finite-difference equation has a single fixed point that can be found iteratively, starting from any initial condition. Therefore, the graph Laplacian encodes both the structure of the graph and, consequently, its dynamic behavior when moving through it.
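
To make the analogy concrete, the following toy sketch (completely illustrative, not part of any algorithm discussed here) iterates the discrete averaging rule on a small grid with fixed boundary values; the interior converges to the equilibrium described above:

import numpy as np

# 6x6 grid: the upper and lower edges are kept at fixed "temperatures"
T = np.zeros((6, 6))
T[0, :] = 1.0
T[-1, :] = -1.0

for _ in range(500):
    # Each interior point becomes the mean of its four direct neighbors
    T[1:-1, 1:-1] = 0.25 * (T[0:-2, 1:-1] + T[2:, 1:-1] +
                            T[1:-1, 0:-2] + T[1:-1, 2:])

print(np.round(T, 2))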

In addition to this fundamental idea, label spreading adopts a clamping factor α for the labeled samples. If α = 0, the algorithm will always reset the labels to the original values (as in label propagation), while with a value in the interval (0, 1], the percentage of clamped labels decreases progressively until α = 1, when all the labels are overwritten.

The complete steps of the label spreading algorithm are:

  1. Select an affinity matrix type (KNN or RBF) and compute W
  2. Compute the degree matrix D
  3. Compute the normalized graph Laplacian Ln = D^(-1/2)(D - W)D^(-1/2) and the matrix S = D^(-1/2)WD^(-1/2) = I - Ln
  4. Define Y(0) = (y1, ..., yN, 0, ..., 0)
  5. Define the clamping factor α in the interval (0, 1)
  6. Iterate until the convergence of the following step (a minimal sketch of this update is shown right after the list):

     Y(t+1) = αSY(t) + (1 - α)Y(0)
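
A minimal sketch of this update rule (reusing the RBF affinity matrix W_rbf and the label vector Y built in the label propagation example, and treating α as a free parameter) could be the following; in practice, the scikit-learn implementation shown in the next example should be preferred:

import numpy as np

alpha = 0.2

# Symmetric normalization S = D^(-1/2) W D^(-1/2)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.sum(W_rbf, axis=1)))
S = D_inv_sqrt.dot(W_rbf).dot(D_inv_sqrt)

Y0 = Y.astype(np.float64)
Yt = Y0.copy()

for _ in range(100):
    # Soft clamping: a fraction (1 - alpha) of the original labels is reinjected
    Yt = alpha * np.dot(S, Yt) + (1.0 - alpha) * Y0

Y_final = np.sign(Yt)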

The structure of the equation (in particular, the first part) is straightforward and resembles a standard diffusion differential equation, with the main difference being that we're working with discrete time steps. However, also in this case, the variation (analogous to time derivative) is proportional to the graph Laplacian applied to the labels (which are analogous, for example, to a scalar physical entity like the temperature). It's possible to show (as demonstrated in Chapelle O., Schölkopf B., Zien A., (edited by), Semi-Supervised Learning, The MIT Press, 2010) that this algorithm is equivalent to the minimization of a quadratic cost function with the following structure:

The first term imposes consistency between the original labels and the estimated ones (for the labeled samples). The second term acts as a normalization factor, forcing the unlabeled terms to become zero, while the third term, which is probably the least intuitive, is needed to guarantee geometrical coherence in terms of smoothness. As we have seen in the previous paragraph, when hard clamping is adopted, the smoothness assumption can be violated. By minimizing this term (which is proportional to the quadratic form induced by the graph Laplacian), it's possible to penalize rapid changes inside the high-density regions. Also in this case, the proof of convergence is very similar to the one for label propagation algorithms, and will be omitted. The interested reader can find it in Chapelle O., Schölkopf B., Zien A. (edited by), Semi-Supervised Learning, The MIT Press, 2010.

Example of label spreading

We can test this algorithm using the scikit-learn implementation. Let's start by creating a very dense dataset:

from sklearn.datasets import make_classification
nb_samples = 5000
nb_unlabeled = 1000
X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0, random_state=100)
Y[nb_samples - nb_unlabeled:nb_samples] = -1

We can train a LabelSpreading instance with a clamping factor alpha=0.2. We want to preserve 80% of the original labels but, at the same time, we need a smooth solution:

from sklearn.semi_supervised import LabelSpreading
ls = LabelSpreading(kernel='rbf', gamma=10.0, alpha=0.2)
ls.fit(X, Y)
Y_final = ls.predict(X)

The result is shown, as usual, together with the original dataset:

Original dataset (left). Dataset after a complete label spreading (right)

As it's possible to see in the first figure (left), in the central part of one of the clusters there's an area of circle dots. Using hard clamping, this island would remain unchanged, violating both the smoothness and clustering assumptions. Setting α = 0.2, it's possible to avoid this problem. Of course, the correct choice of α is strictly problem-dependent. If we know that the original labels are absolutely correct, allowing the algorithm to change them can be counterproductive.

In this case, for example, it would be better to preprocess the dataset, filtering out all those samples that violate the semi-supervised assumptions. If, instead, we are not sure that all samples are drawn from the same pdata, and it's possible to be in the presence of spurious elements, using a higher α value can smooth the dataset without any other operation.

Increasing the smoothness with Laplacian regularization

One of the problems of standard label propagation algorithms is that they tend to show abrupt label changes, which are incompatible with the smoothness assumption. There are many strategies to limit this behavior (most of them are analyzed in Belkin M., Niyogi P., Sindhwani V., Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples, Journal of Machine Learning Research 7, 2006), and they are generally based on the introduction of a quadratic penalty term in the cost function. In a way similar to the one employed for label spreading, the authors proposed a cost function that augments the standard loss with a quadratic penalty term proportional to fᵀLf (where f is the labeling function and L the graph Laplacian).

The presence of this term forces the algorithm to penalize abrupt changes in the same neighborhood. As the Laplacian is directly connected to the affinity matrix, the quadratic penalty term behaves in a way similar to Ridge regularization. When two points have a large affinity and the algorithm tends to assign different labels, the penalty term forces the model to reduce the loss by selecting a smoother transition. Considering a generic classifier, this is equivalent to avoiding excessive oscillations of the separating hypersurface by inducing an implicit linearization.

The mathematical derivation of the algorithm, together with the theory, is a little bit complex. Therefore, we preferred to discuss another algorithm that is based on the behavior of manifolds (however, I invite the smartest readers to define the cost function and optimize it using a standard method such as BFGS).

The complete theory of Laplacian regularization methods is beyond the scope of this book (many details can be found in Lee J. M., Introduction to Smooth Manifolds, Springer, 2012), but it's helpful for the reader to know that the graph Laplacian defines an operator on the underlying manifold. According to the theory, it's possible to eigendecompose the operator to find a basis of eigenfunctions. It can be proven that the smoothness of these eigenfunctions is inversely related to the associated eigenvalues (the smaller the eigenvalue, the smoother the eigenfunction). Therefore, the authors proposed to compute the eigendecomposition of the Laplacian, select the k eigenvectors associated with the smallest eigenvalues, and establish a labeling function (that is, a function that encodes the labeling for both labeled and unlabeled points) built using only these first k eigenfunctions (so as to maximize the smoothness). Hence, given the matrix V whose columns are the first k eigenvectors, and a variable vector θ (subject to optimization), it's possible to build a cost function based on the labeled points:

C(θ) = ||Y - θVᵀ||²

The minimum corresponds to the optimal θ*, and the labels can be determined by taking the sign of the dot product: Y_final = sign(θ*Vᵀ).

Let's try this method using a dataset containing 200 points with 150 unlabeled ones:

from sklearn.datasets import make_classification
nb_samples=200
nb_unlabeled=150
X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0, random_state=1000)
Y[Y == 0] = -1
Y[nb_samples - nb_unlabeled:nb_samples] = 0

The original dataset is shown in the following figure:

Original dataset employed in the example of Laplacian regularization

We can now build the graph Laplacian using an RBF affinity matrix. In this case, we have chosen γ = 0.1, but I invite the reader to test different values and compare the results:

import numpy as np
k = 50
def rbf(x1, x2, gamma=0.1):
    n = np.linalg.norm(x1 - x2, ord=1)
    return np.exp(-gamma * np.power(n, 2))
W_rbf = np.zeros((nb_samples, nb_samples))
for i in range(nb_samples):
    for j in range(nb_samples):
        if i == j:
            W_rbf[i, j] = 0.0
        else:
            W_rbf[i, j] = rbf(X[i], X[j])
D_rbf = np.diag(np.sum(W_rbf, axis=1))
L_rbf = D_rbf - W_rbf

Once the Laplacian is ready, we can eigendecompose it and select the first k eigenvectors (in our case, we have chosen k = 50, but, again, this is a hyperparameter that the data scientist should check in every specific scenario):

import numpy as np
# As L_rbf is symmetric, eigh() is the most appropriate (and numerically stable) choice
w, v = np.linalg.eigh(L_rbf)
sw = np.argsort(w)[0:k]
V = v[:, sw]
theta = np.random.normal(0.0, 0.1, size=(1, k))
Yu = np.zeros(shape=(nb_unlabeled,))

The last step is building the cost function and minimizing it (we have chosen to work with the BFGS algorithm):

from scipy.optimize import minimize
def objective(t):
    return np.sum(np.power(Y - np.dot(t, V.T), 2))
result = minimize(objective, 
                  theta, 
                  method="BFGS",
                  options={
                      "maxiter": 500000,
                  })

The final labeling is obtained by taking the result of the optimization and using the sign function:

Y_final = np.sign(np.dot(result["x"], V.T))

The result of the process is shown in the following figure:

Original dataset (left). Dataset after labeling using the Laplacian regularization algorithm (right)

We can immediately observe that the labeling is very smooth, but the algorithm, analogously to standard label propagation, has not clamped the labels, and this has resulted in some changes (I invite the reader to modify the code as an exercise and clamp the original labels). However, the main clusters kept their structure and, moreover, induced their labels onto the neighbors. This behavior can be positive if the dataset is noisy and we can trust only the denser regions, but it might be undesirable when the labeled points have been checked by an expert. The point at (2.7, -3) is an outlier that the algorithm failed to label correctly (indeed, it assigned it to class 1). Such kinds of problems are a consequence of the choice of the parameter γ. In fact, in this particular case, we have the presence of both high-density and low-density regions. While the algorithm works fine with the high-density regions, it can have problems with outliers. A possible mitigation strategy is to increase γ while checking how many correctly labeled points change their labels. In some cases, an exploratory procedure is enough to find a slightly different value that avoids the mislabeling of outliers while keeping the correct labels unchanged.
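
The exploratory procedure described above can be sketched as follows (this snippet reuses X, Y, and the rbf() function defined in the example, the candidate γ values are arbitrary, and the whole loop is only meant as a starting point):

import numpy as np
from scipy.optimize import minimize

def fit_labels(gamma_value, k=50):
    # Rebuild the affinity matrix, the Laplacian, and the reduced eigenbasis
    W = np.zeros((nb_samples, nb_samples))
    for i in range(nb_samples):
        for j in range(nb_samples):
            if i != j:
                W[i, j] = rbf(X[i], X[j], gamma=gamma_value)
    L = np.diag(np.sum(W, axis=1)) - W
    w, v = np.linalg.eigh(L)
    V = v[:, np.argsort(w)[0:k]]
    result = minimize(lambda t: np.sum(np.power(Y - np.dot(t, V.T), 2)),
                      np.random.normal(0.0, 0.1, size=(k,)),
                      method="BFGS")
    return np.sign(np.dot(result["x"], V.T))

n_labeled = nb_samples - nb_unlabeled
for gamma_value in (0.05, 0.1, 0.5, 1.0):
    Y_check = fit_labels(gamma_value)
    changed = np.sum(Y_check[0:n_labeled] != Y[0:n_labeled])
    print("gamma={}: {} originally labeled points changed".format(gamma_value, changed))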

The level of smoothness can be controlled by changing both the number of eigenfunctions and the value of the parameter γ. In this particular case, given the simplicity of the dataset, the RBF kernel provides the larger contribution, because it determines the structure of the neighborhoods and either limits or encourages the label propagation.

As an exercise, I invite the reader to test this approach using a more complex dataset (it can always be bidimensional, but possibly non-linear), comparing the results with a standard label propagation. In general, the main advantages are more tangible when the complexity of the dataset is very high (together with the dimensionality). In these cases, the reader can try to use one of the methods discussed in the remaining part of the chapter to project high-dimensional datasets into a bidimensional plane, so as to visualize the results and have a better understanding of the underlying dynamics.

Label propagation based on Markov random walks

The goal of this algorithm, proposed by Zhu and Ghahramani, is to find the probability distribution of the target labels for the unlabeled samples, given a mixed dataset. This objective is achieved through the simulation of a stochastic process, where each unlabeled sample walks through the graph until it reaches a stationary absorbing state (a labeled sample), where it stops, acquiring the corresponding label. The main difference from other similar approaches is that in this case, we consider the probability of reaching a labeled sample. In this way, the problem acquires a closed form and can be easily solved.

The first step is, as always, to build a k-nearest neighbors graph with all N samples, and define a weight matrix W based on an RBF kernel:

Wij = exp(-||xi - xj||²/2σ²)

Wij = 0 means that xi and xj are not neighbors, and Wii = 0. The transition probability matrix is built similarly to the scikit-learn label propagation algorithm, as pij = Wij / Σk Wik.

In a more compact way, it can be rewritten as P = D⁻¹W. If we now consider a test data point xi, starting from the corresponding state and randomly walking until an absorbing labeled state is found (we call the corresponding label y∞), the probability P(y∞ = 1|xi) (we are referring to a binary classification) can be expressed as:

P(y∞ = 1|xi) = 1{yi = 1} if xi is labeled
P(y∞ = 1|xi) = Σj pij P(y∞ = 1|xj) if xi is unlabeled

When xi is labeled, the state is final, and the probability is represented by the indicator function based on the condition yi = 1. When the sample is unlabeled, we need to consider the sum over all possible transitions starting from xi and ending in the closest absorbing state with label y = 1, weighted by the relative transition probabilities.

We can rewrite this expression in matrix form. If we create a vector P∞, where the first block P∞L is based on the labeled points and the second block P∞U on the unlabeled ones, we can write:

P∞ = D⁻¹WP∞

If we now expand the matrices into their labeled (L) and unlabeled (U) blocks, we get:

P∞L = DLL⁻¹(WLLP∞L + WLUP∞U)
P∞U = DUU⁻¹(WULP∞L + WUUP∞U)

As we are interested only in the unlabeled points, we can consider just the second equation:

P∞U = DUU⁻¹(WULP∞L + WUUP∞U)

Simplifying the expression, we get the following linear system:

(DUU - WUU)P∞U = WULP∞L

The term (DUU - WUU) is the unlabeled-unlabeled part of the unnormalized graph Laplacian L = D - W. By solving this system, we can get the probabilities of the class y = 1 for all unlabeled points. By thresholding the probabilities (normally at 0.5), it's possible to transform the soft labeling into a hard one.

Example of label propagation based on Markov random walks

For this Python example of label propagation based on Markov random walks, we are going to use a bidimensional dataset containing 50 labeled points belonging to two different classes, and 1,950 unlabeled ones:

from sklearn.datasets import make_blobs
nb_samples = 2000
nb_unlabeled = 1950
nb_classes = 2
X, Y = make_blobs(n_samples=nb_samples, 
                  n_features=2, 
                  centers=nb_classes,
                  cluster_std=2.5,
                  random_state=1000)
Y[Y == 0] = -1
Y[nb_samples - nb_unlabeled:nb_samples] = 0

The plot of the dataset is shown in the following diagram (the crosses represent the unlabeled samples):

Partially labeled dataset

We can now create the graph (using n_neighbors=15) and the weight matrix:

from sklearn.neighbors import kneighbors_graph
W = kneighbors_graph(X, n_neighbors=15, mode='connectivity', include_self=True).toarray()

Now, we need to compute the unlabeled part of the unnormalized graph Laplacian and the unlabeled-labeled part of the matrix W:

D = np.diag(np.sum(W, axis=1))
L = D - W
Luu = L[nb_samples - nb_unlabeled:, nb_samples - nb_unlabeled:]
Wul = W[nb_samples - nb_unlabeled:, 0:nb_samples - nb_unlabeled]
Yl = Y[0:nb_samples - nb_unlabeled]

At this point, it's possible to solve the linear system using the NumPy function np.linalg.solve(), which accepts as parameters the matrix A and the vector b of a generic system in the form Ax = b. When the matrix A is very large, the system could be ill-conditioned. I suggest checking the condition number before solving the system. If it's very large, it's preferable to use one of the other methods previously discussed.
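
For example, the check can be performed with np.linalg.cond() before calling the solver (the threshold used below is an arbitrary choice of mine):

import numpy as np

# Condition number of the unlabeled-unlabeled block of the Laplacian
cond = np.linalg.cond(Luu)
print("Condition number: {:.2f}".format(cond))

# Purely indicative threshold: a very large value signals an ill-conditioned system
if cond > 1e10:
    print("The system is ill-conditioned; consider one of the iterative methods")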

Once we have the solution, we can merge the new labels with the original ones (where the unlabeled samples have been marked with 0). In this case, as we are using bipolar labels, the solution lies in the range (-1, 1) and it's enough to round it to obtain the final hard labeling. In general, when the labels are 0 and 1 and the solution represents a probability, it's necessary to use a threshold (0.5) to select the right label:

Yu = np.round(np.linalg.solve(Luu, np.dot(Wul, Yl)))
Y[nb_samples - nb_unlabeled:] = Yu.copy()

Replotting the dataset, we get:

Original dataset (left). Dataset after a complete Markov random walk label propagation (right)

As expected, without any iteration, the labels have been successfully propagated to all data points, meeting the requirements of the clustering assumption. Both this algorithm and label propagation can work using a closed-form solution, so they are very fast even when the number of samples is high; however, there's a fundamental problem regarding the choice of σ (or γ) for the RBF kernel. As the same authors Zhu and Ghahramani remark, there is no standard solution, but it's possible to consider the limit cases σ → 0 and σ → ∞. In the first case, only the nearest point has an influence, while in the second case, the influence is extended to the whole sample space, and the unlabeled points all tend to acquire the same label.

The authors suggest considering the entropy of all samples, trying to find the best σ value that minimizes it. This solution can be very effective, but sometimes the minimum entropy corresponds to a label configuration that is impossible to achieve using these algorithms. The best approach is to try different values (at different scales) and select the one corresponding to a valid configuration with the lowest entropy. In our case, it's possible to compute the entropy of the unlabeled samples as:

H = -Σi pi log pi (where pi is the estimated probability of the unlabeled sample xi)

The Python code to perform this computation is:

Pu = np.linalg.solve(Luu, np.dot(Wul, Yl))
# Rescale the bipolar solution into (0, 1) so that it can be treated as a probability
Pu = (Pu + 1.0) / 2.0
H = -np.sum(Pu * np.log(Pu + 1e-6))

The term 1e-6 has been added to avoid numerical problems when the probability is null. Repeating this process for different σ (or γ) values allows us to find a set of candidates that can be restricted to a single value with a direct evaluation of the labeling accuracy (for example, when there is no precise information about the real distribution, it's possible to consider the coherence of each cluster and the separation between them).
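
The scan over different values can be sketched as follows. Note that this snippet rebuilds the weight matrix with a full RBF kernel for each candidate σ (the binary KNN connectivity matrix used in the example doesn't depend on σ), maps the bipolar solution into (0, 1), and reuses X, Yl, nb_samples, and nb_unlabeled; the candidate values are arbitrary:

import numpy as np

# Squared pairwise distances, computed once
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

def entropy_for_sigma(sigma):
    # Full RBF weight matrix for the candidate sigma
    Wg = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(Wg, 0.0)
    Lg = np.diag(np.sum(Wg, axis=1)) - Wg
    Luu_g = Lg[nb_samples - nb_unlabeled:, nb_samples - nb_unlabeled:]
    Wul_g = Wg[nb_samples - nb_unlabeled:, 0:nb_samples - nb_unlabeled]
    # Solve the system and rescale the bipolar solution into (0, 1)
    Pu = np.linalg.solve(Luu_g, np.dot(Wul_g, Yl))
    Pu = np.clip((Pu + 1.0) / 2.0, 0.0, 1.0)
    return -np.sum(Pu * np.log(Pu + 1e-6))

for sigma in (0.5, 1.0, 2.5, 5.0):
    print("sigma={}: H={:.2f}".format(sigma, entropy_for_sigma(sigma)))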

Another approach is called class rebalancing, and it's based on the idea of reweighting the probabilities of the unlabeled samples in order to rebalance the number of points belonging to each class when the new unlabeled samples are added to the set. If we have N labeled points and M unlabeled ones, with K classes, the weight factor wj for the class j can be obtained as the ratio between two averages.

The numerator is the average computed over the labeled samples belonging to class j, while the denominator is the average over the unlabeled samples whose estimated class is j. The final decision about a class is no longer based only on the highest probability, but on the highest reweighted probability, argmaxj wj P(yi = j|xi).
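
One possible interpretation of this reweighting (the specific formula is an assumption of mine: the labeled class frequency divided by the average estimated probability mass of the class over the unlabeled samples) could be sketched as follows, with Pu containing the estimated class probabilities of the unlabeled samples and Yl the labeled classes:

import numpy as np

def rebalanced_classes(Pu, Yl, nb_classes):
    # Pu: (M, K) matrix with the estimated class probabilities of the unlabeled samples
    # Yl: (N,) array with the labeled classes in {0, ..., K - 1}
    w = np.zeros(nb_classes)
    for j in range(nb_classes):
        labeled_fraction = np.mean(Yl == j)
        unlabeled_mass = np.mean(Pu[:, j]) + 1e-6
        w[j] = labeled_fraction / unlabeled_mass
    # The decision is based on the reweighted probabilities instead of the raw ones
    return np.argmax(w * Pu, axis=1)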

In this section, we have discussed some common strategies to perform an effective label propagation in different scenarios. In the next section, we are going to exploit the manifold assumption to perform non-linear dimensionality reduction of complex datasets.

Manifold learning

In Chapter 3, Introduction to Semi-Supervised Classification, we discussed the manifold assumption, saying that high-dimensional data normally lies on low-dimensional manifolds. Of course, this is not a theorem, but in many real cases, the assumption is proven to be correct, and it allows us to work with non-linear dimensionality reduction algorithms that would otherwise be hard to justify. In this section, we're going to analyze some of these algorithms. They are all implemented in scikit-learn, so it's easy to try them with complex datasets.

Isomap

Isomap is one of the simplest algorithms, and it's based on the idea of reducing the dimensionality while trying to preserve the geodesic distances (that is, the lengths of the shortest paths between pairs of points on the manifold) measured on the original manifold where the input data lies. The algorithm works in three steps. The first operation is a KNN clustering and the construction of the corresponding graph: the vertices are the samples, while the edges represent the connections among nearest neighbors, and their weights are proportional to the distances from the corresponding neighbors.

The second step adopts the Dijkstra algorithm (check Cormen T. H., Leiserson C. E., Rivest R. L., Introduction to Algorithms, The MIT Press, 2009, for further details) to compute the shortest pairwise distances on the graph for all pairs of samples. In the following figure, there's a portion of a graph where some of the shortest distances are marked with a thicker line:

Example of a graph with marked shortest distances

For example, as x3 is a neighbor of x5 and x7, applying the Dijkstra algorithm, we could get the shortest paths x3 → x5 and x3 → x7. The computational complexity of this step is about O(n²(k + log n)), which is lower than O(n³) when k << n (a condition normally met); however, for large graphs (with n >> 1), this is often the most expensive part of the whole algorithm.

The third step is called metric multidimensional scaling, which is a technique for finding a low-dimensional representation while trying to preserve the inner products among samples. If we have a P-dimensional dataset X, the algorithm must find a Q-dimensional set Φ = {φ1, ..., φn}, with Q < P, minimizing the function:

Σij (⟨xi, xj⟩ - ⟨φi, φj⟩)²

As proven in Chapelle O., Schölkopf B., Zien A. (edited by), Semi-Supervised Learning, The MIT Press, 2010, the optimization is achieved by taking the top Q eigenvectors of the Gram matrix G, whose elements are Gij = xi · xj (or, in matrix form, G = XXᵀ if the samples are stored as rows); however, as the Isomap algorithm works with pairwise distances, we need to compute the matrix D of squared geodesic distances, whose elements are Dij = d(xi, xj)².

If the X dataset is zero-centered, it's possible to derive a simplified Gram matrix from D through double centering, as described by M. A. A. Cox and T. F. Cox:

GD = -½ JDJ, where J = I - (1/n)11ᵀ

Isomap computes the top Q eigenvalues λ1, ..., λQ of GD and the corresponding eigenvectors v1, ..., vQ, and determines the Q-dimensional vectors as:

φi = (√λ1 v1i, ..., √λQ vQi)

As we're going to discuss in Chapter 13, Component Analysis and Dimensionality Reduction (and also as pointed out by Saul, Weinberger, Sha, Ham, and Lee in Saul L. K., Weinberger K. Q., Sha F., Ham J., and Lee D. D., Spectral Methods for Dimensionality Reduction, UCSD, 2006), this kind of projection is also exploited by Principal Component Analysis (PCA), which finds the directions with the highest variance, corresponding to the top k eigenvectors of the covariance matrix.

In fact, when applying the SVD to the dataset X, we get:

X = UΣVᵀ

The diagonal matrix Σ² contains the eigenvalues of both XXᵀ and XᵀX; therefore, the eigenvalues of G = XXᵀ are equal to nλi, where λi are the eigenvalues of the sample covariance matrix C = (1/n)XᵀX. Hence, Isomap achieves the dimensionality reduction by trying to preserve the pairwise distances, while projecting the dataset onto the subspace determined by a group of eigenvectors where the maximum explained variance is achieved. In terms of information theory, this condition guarantees the minimum loss with an effective reduction of dimensionality.
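
The core of the third step (classical metric MDS on a matrix of squared distances) can be sketched in a few lines; this is a simplified standalone version with an illustrative toy input, not the actual scikit-learn implementation:

import numpy as np

def classical_mds(D_sq, q=2):
    # D_sq: (n, n) matrix of squared (geodesic) distances
    n = D_sq.shape[0]
    # Double centering: G = -1/2 J D_sq J, with J = I - (1/n) 1 1^T
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J.dot(D_sq).dot(J)
    # Top q eigenpairs of the (symmetric) Gram matrix
    w, v = np.linalg.eigh(G)
    idx = np.argsort(w)[::-1][:q]
    # Coordinates: eigenvectors scaled by the square roots of the eigenvalues
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Tiny illustrative example with three collinear points
X_toy = np.array([[0.0], [1.0], [3.0]])
D_sq = np.sum((X_toy[:, None, :] - X_toy[None, :, :]) ** 2, axis=-1)
print(classical_mds(D_sq, q=1))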

Scikit-learn also implements the Floyd-Warshall algorithm, which is slightly slower. For further information, please refer to Cormen T. H., Leiserson C. E., Rivest R. L., Introduction to Algorithms, The MIT Press, 2009.

Example of Isomap

We can now test the scikit-learn Isomap implementation using the Olivetti faces dataset (provided by AT&T Laboratories, Cambridge), which is made up of 400 64 × 64 grayscale portraits belonging to 40 different former employees. Examples of these images are shown here:

Subset of the Olivetti faces dataset

The original dimensionality is 4,096, but we want to visualize the dataset in two dimensions. It's important to understand that using the Euclidean distance for measuring the similarity of images might not be the best choice, and it's surprising to see how well the samples are clustered by such a simple algorithm.

The first step is loading the dataset:

from sklearn.datasets import fetch_olivetti_faces
faces = fetch_olivetti_faces()

The faces dictionary contains three main elements:

  • images: Image array with shape 400 × 64 × 64
  • data: Flattened array with shape 400 × 4096
  • target: Array with shape 400 × 1 containing the labels (0, 39)

At this point, we can instantiate the Isomap class provided by scikit-learn, setting n_components=2 and n_neighbors=5 (the reader can try different configurations), and then fitting the model:

from sklearn.manifold import Isomap
isomap = Isomap(n_neighbors=5, n_components=2)
X_isomap = isomap.fit_transform(faces['data'])

As the resulting plot with 400 elements is very dense, I preferred to show in the following plot only the first 100 samples:

Isomap applied to 100 samples drawn from the Olivetti faces dataset

As it's possible to see, samples belonging to the same class (or sub-class, if, for example, the same person has different expressions or wears/doesn't wear glasses) are grouped in rather dense agglomerates.

The classes that seem better separated are 7 and 1. Checking the corresponding faces, for class 7, we get:

Samples belonging to class 7

The set contains portraits of a young woman with a fair complexion, quite different from the majority of the other people. Instead, for class 1, we get:

Samples belonging to class 1

In this case, it's a man with big glasses and a particular mouth expression. In the dataset, there are only a few people with glasses, and one of them has a dark beard. We can conclude that Isomap created a low-dimensional representation that is really coherent with the original geodesic distances. In some cases, there's a partial clustering overlap that can be mitigated by increasing the dimensionality or adopting a more complex strategy.

Locally linear embedding

Contrary to Isomap, which works with pairwise distances, this algorithm is based on the assumption that a high-dimensional dataset lying on a smooth manifold can have local linear structures that it tries to preserve during the dimensionality reduction process. Locally Linear Embedding (LLE), like Isomap, is based on three steps. The first one is applying the KNN algorithm to create a directed graph (in Isomap, it was undirected), where the vertices are the input samples and the edges represent a neighborhood relationship. As the graph is directed, a point xi can be a neighbor of xj while xj is not a neighbor of xi. This means that the weight matrix can be asymmetric.

The second step is based on the main assumption of local linearity. For example, consider the following graph:

Graph where a neighborhood is marked with a shaded rectangle

The rectangle delimits a small neighborhood. If we consider a point xi inside it, the local linearity assumption allows us to think that xi can be approximated by a weighted combination of its neighbors, xi ≈ Σj wij xj, without considering the cyclic relationships. This concept can be formalized for all N P-dimensional points through the minimization of the following function:

E(W) = Σi ||xi - Σj wij xj||²

In order to address the problem of low-rank neighborhood matrices (think about the previous example, with a number of neighbors equal to 20), scikit-learn also implements a regularizer based on a small arbitrary additive constant added to the local weight computation (together with a variant called Modified LLE, or MLLE, that employs multiple local weight vectors). At the end of this step, the matrix W that best matches the linear relationships among neighbors will be selected for the next phase.

In the third step, LLE tries to determine the low-dimensional (Q < P) representation Y = {y1, ..., yN} that best reproduces the original relationships among nearest neighbors. This is achieved by keeping the weights wij fixed and minimizing the following function:

Φ(Y) = Σi ||yi - Σj wij yj||²

The solution to this problem is obtained through the adoption of the Rayleigh-Ritz method, an algorithm to extract a subset of eigenvectors and eigenvalues from a very large sparse matrix. For further details, read Schofield G., Chelikowsky J. R., Saad Y., A spectrum slicing method for the Kohn–Sham problem, Computer Physics Communications, 183, 2012. The initial part of the final procedure consists of determining the matrix M = (I - W)ᵀ(I - W).

It's possible to prove that the last eigenvector of this matrix (if the eigenvalues are sorted in descending order, it's the bottom one) has all components equal to a constant, and the corresponding eigenvalue is null. As Saul and Roweis (Saul L. K., Roweis S. T., An introduction to locally linear embedding, 2001) pointed out, all the other Q eigenvectors (taken from the bottom) are orthogonal to it, and this guarantees a zero-centered embedding. Hence, the last eigenvector is discarded, while the remaining Q eigenvectors determine the embedding vectors yi.

For further details about MLLE, please refer to Zhang Z., Wang J., MLLE: Modified Locally Linear Embedding Using Multiple Weights, NIPS, 2006.

Example of LLE

We can now apply this algorithm to the Olivetti faces dataset, instantiating the scikit-learn class LocallyLinearEmbedding with n_components=2 and n_neighbors=15:

from sklearn.manifold import LocallyLinearEmbedding
lle = LocallyLinearEmbedding(n_neighbors=15, n_components=2)
X_lle = lle.fit_transform(faces['data'])

The result (limited to the first 100 data points) is shown in the following plot:

LLE applied to 100 samples drawn from the Olivetti faces dataset

Even if the strategy is different from Isomap, we can determine some coherent clusters. In this case, the similarity is obtained through the conjunction of small linear blocks; for the faces, they can represent particular micro-features, like the shape of the nose or the presence of glasses, that remain invariant in the different portraits of the same person. LLE is, in general, preferable when the original dataset is intrinsically locally linear, possibly lying on a smooth manifold.

In other words, LLE is a reasonable choice when small parts of a sample are structured in a way that allows the reconstruction of a point given the neighbors and the weights. This is often true for images, but it can be difficult to determine for a generic dataset. When the result doesn't reproduce the original clustering, it's possible to employ the next algorithm, Laplacian Spectral Embedding, or t-SNE, which is one of the most advanced and will be discussed later.

Laplacian Spectral Embedding

This algorithm, based on the spectral decomposition of a graph Laplacian, has been proposed in order to perform a non-linear dimensionality reduction that tries to preserve the nearness of points of the P-dimensional manifold when remapping them onto a Q-dimensional (with Q < P) subspace.

The procedure is very similar to the other algorithms. The first step is a KNN clustering to generate a graph where the vertices (we can assume there are N of them) are the samples, and the edges are weighted using an RBF kernel:

Wij = exp(-||xi - xj||²/2σ²)

The resulting graph is undirected and symmetric. We can now define a pseudo-degree matrix D, whose diagonal elements are Dii = Σj Wij.

The low-dimensional representation Y = {y1, ..., yN} is obtained by minimizing the function:

Φ(Y) = Σij Wij ||yi/√Dii - yj/√Djj||²

If the two points xi and xj are near, the corresponding Wij is close to 1, while it tends to 0 when the distance tends to infinity. Dii is the sum of all weights originating from xi (and the same applies to Djj for xj). Now, let's suppose that xi is very close only to xj, so that Dii ≈ Djj ≈ Wij. The resulting formula is a squared loss based on the difference between the vectors yi/√Dii and yj/√Djj. When instead there are multiple closeness relationships to consider, the factor Wij divided by the square root of DiiDjj allows reweighting the new distances to find the best trade-off for the whole dataset. In practice, Φ(Y) is not minimized directly. In fact, it's possible to prove that the minimum can be obtained through the spectral decomposition of the symmetric normalized graph Laplacian (the name of the algorithm derives from this procedure):

Ln = D^(-1/2)(D - W)D^(-1/2)

Just like for the LLE algorithm, Laplacian Spectral Embedding also works with the bottom Q + 1 eigenvectors. The mathematical theory behind the last step is again based on the application of the Rayleigh-Ritz method. The bottom eigenvector (associated with the null eigenvalue) is discarded, and the remaining Q determine the low-dimensional representation yi.

Example of Laplacian Spectral Embedding

Let's apply this algorithm to the same dataset using the scikit-learn class SpectralEmbedding, with n_components=2 and n_neighbors=15:

from sklearn.manifold import SpectralEmbedding
se = SpectralEmbedding(n_components=2, n_neighbors=15, random_state=1000)
X_se = se.fit_transform(faces['data'])

The resulting plot (zoomed in due to the presence of a high-density region) is shown in the following graph:

Laplacian Spectral Embedding applied to the Olivetti faces dataset

Even in this case, we can see that some classes are grouped into small clusters, but at the same time, we observe many agglomerates where there are mixed samples. Both this and the previous method work with local pieces of information, trying to find low-dimensional representations that could preserve the geometrical structure of micro-features.

This condition drives to a mapping where close points share local features (this is almost always true for images, but it's very difficult to prove for generic samples). Therefore, we can observe small clusters containing elements belonging to the same class, but also some apparent outliers, which, on the original manifold, can be globally different even if they share local patches. Instead, methods such as Isomap or t-SNE work with the whole distribution and try to determine a representation that is almost isometric with the original dataset considering its global properties.

t-SNE

This algorithm, proposed by Van der Maaten and Hinton, and formally known as t-Distributed Stochastic Neighbor Embedding, is one of the most powerful manifold dimensionality reduction techniques. Contrary to the other methods, this algorithm starts with a fundamental assumption: the similarity between two N-dimensional points xi and xj can be represented as a conditional probability p(xj|xi), where each point xi is associated with a Gaussian distribution centered at xi and with variance σi². The variances are selected starting from the desired perplexity, defined as:

Perp(Pi) = 2^H(Pi), where H(Pi) = -Σj p(xj|xi) log2 p(xj|xi)

Low-perplexity values indicate low uncertainty (a small effective number of neighbors) and are normally preferable. In common t-SNE tasks, values in the range 5-50 are generally acceptable.
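
To clarify the relationship between σi and the perplexity, the following small sketch (with synthetic data; the bisection search over σi performed internally by real t-SNE implementations is omitted) computes the perplexity of a conditional distribution for a few σ values:

import numpy as np

def conditional_probabilities(X, i, sigma):
    # p(j|i) based on a Gaussian centered at x_i with the given sigma
    sq_dists = np.sum((X - X[i]) ** 2, axis=1)
    p = np.exp(-sq_dists / (2.0 * sigma ** 2))
    p[i] = 0.0
    return p / np.sum(p)

def perplexity(p):
    # Perp(P_i) = 2^H(P_i), with the entropy H computed in bits
    p_nz = p[p > 0.0]
    return 2.0 ** (-np.sum(p_nz * np.log2(p_nz)))

X_toy = np.random.RandomState(1000).normal(size=(100, 5))
for sigma in (0.5, 1.0, 2.0):
    perp = perplexity(conditional_probabilities(X_toy, 0, sigma))
    print("sigma={}: perplexity={:.2f}".format(sigma, perp))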

The assumption on the conditional probabilities can be interpreted by thinking that if two samples are very similar, the probability associated with the first sample conditioned on the second one is high, while dissimilar points yield low conditional probabilities. For example, thinking about images, a point centered on the pupil can have as neighbors some points belonging to an eyelash. In terms of probabilities, we can think that p(eyelash|pupil) is quite high, while p(nose|pupil) is obviously lower. t-SNE models these conditional probabilities as:

p(xj|xi) = exp(-||xi - xj||²/2σi²) / Σk≠i exp(-||xi - xk||²/2σi²)

The probabilities p(xi|xi) are set to zero, so the previous formula can be extended to the whole graph. In order to solve the problem in an easier way, the conditional probabilities are also symmetrized:

pij = [p(xj|xi) + p(xi|xj)] / 2N

The probability distribution so obtained represents the high-dimensional input relationships. As our goal is to reduce the dimensionality to a value M < N, we can think about a similar probabilistic representation for the target points yi, using a Student's t-distribution with one degree of freedom:

qij = (1 + ||yi - yj||²)⁻¹ / Σk Σl≠k (1 + ||yk - yl||²)⁻¹

We want the low-dimensional distribution Q to be as close as possible to the high-dimensional distribution P; therefore, the aim of the t-SNE algorithm is to minimize the Kullback-Leibler divergence between P and Q:

DKL(P||Q) = Σij pij log(pij/qij) = Σij pij log pij - Σij pij log qij

The first term is the entropy of the original distribution P, while the second one is the cross-entropy H(P, Q), which has to be minimized to solve the problem. The best approach is based on a gradient-descent algorithm (fully analyzed in Chapter 17, Modeling Neural Networks), but there are also some useful variations that can improve the performance discussed in Van der Maaten L.J.P., Hinton G.E., Visualizing High-Dimensional Data Using t-SNE, Journal of Machine Learning Research 9 (Nov), 2008.

Example of t-distributed stochastic neighbor embedding

We can apply this powerful algorithm to the same Olivetti faces dataset, using the scikit-learn class TSNE with n_components=2 and perplexity=20:

from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=20, random_state=1000)
X_tsne = tsne.fit_transform(faces['data'])
print("Final KL divergence: {}".format(tsne.kl_divergence_))

The output of the previous snippet is:

Final KL divergence: 0.5993183851242065

The final Kullback-Leibler divergence is about 0.6, which is a low value (the minimum achievable with a perplexity equal to 20), but probably not enough in some tasks. Let's start by looking at the graphical result for all 400 points:

t-SNE with perplexity equal to 20 applied to the Olivetti faces dataset

A visual inspection of the label distribution can confirm that t-SNE recreated a very good clustering structure starting from the original high-dimensional distribution. We can reasonably suppose that the points belonging to the same blob are close, but also relatively separated. However, we want to try to push the algorithm toward its limit by reducing the perplexity to 2:

from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=2, random_state=1000)
X_tsne = tsne.fit_transform(faces['data'])
print("Final KL divergence: {}".format(tsne.kl_divergence_))

The output is now:

Final KL divergence: 0.37610241770744324

Clearly, the distributions are now more overlapped and it's very difficult to obtain better results. Let's check again the visual representation of the dataset:

t-SNE with perplexity equal to 2 applied to the Olivetti faces dataset

The clusters are much more compact (unfortunately, the labels are often overlapped), confirming that the original dataset is naturally clustered and there are high-density regions split into groups of lower density (but still high) representing particular sets of pictures of the same person but with slightly different details. As t-SNE is strongly based on both cluster and smoothness assumptions, it seems that a larger perplexity yields smoother separations.

However, it's always helpful to consider the nature of the data-generating process. In this particular case, we have a set of portraits that can obviously be imagined as a coherent class of elements with respect, for example, to pictures of chairs. On the other side, we are interested in the sub-manifolds where the portraits of the same subject lie. Therefore, we need to get rid of some smoothness and rely more on the cluster assumption in order to achieve a better separation. This principle is general, and should be applied in all contexts where a homogeneous data-generating process must be analyzed at a local level.

The sub-manifolds are often less smooth than the overall manifold where the whole distribution lies (think about the surface of an orange).

Hence, we are not contradicting the initial assumptions when discovering that a higher detail yields a more abrupt class separation, as this is a consequence of a zoom process operating in an area that is already very smooth, but with some thin spots representing the sub-clusters with very similar features.

This algorithm can be employed in several non-linear dimensionality reduction tasks, such as images, word embeddings, or complex feature vectors. Its main strength is hidden in the assumption that we can consider similarities as probabilities, without the need to impose any constraint on the pairwise distances, either global or local. Under a certain viewpoint, it's possible to consider t-SNE as a reverse multiclass classification problem based on a cross-entropy cost function. Our goal is to find the labels (low-dimensional representation) given the original distribution and an assumption about the output distribution.

At this point, we could try to answer a natural question: which algorithm should be employed? The obvious answer is that it depends on the specific problem. When it's useful to reduce the dimensionality, preserving the global similarity among vectors (this is the case when the samples are long feature vectors without local properties, such as word embeddings or data encodings), t-SNE or Isomap are good choices. When, instead, it's necessary to keep the local distances (for example, the structure of a visual patch that can be shared by different samples also belonging to different classes) as close as possible to the original representation, LLE or spectral embedding algorithms are preferable.

Summary

In this chapter, we have introduced the most important label propagation techniques. In particular, we have seen how to build a dataset graph based on a weighting kernel, and how to use the geometric information provided by unlabeled samples to determine the most likely class. The basic approach works by iterating the multiplication of the label vector times the weight matrix until a stable point is reached and we have proven that, under simple assumptions, it is always possible.

Another approach, implemented by scikit-learn, is based on the transition probability from a state (represented by a sample) to another one, until the convergence to a labeled point. The probability matrix is obtained using a normalized weight matrix to encourage transitions associated with close points and discourage all the long jumps. The main drawback of these two methods is the hard clamping of labeled samples; this constraint can be useful if we trust our dataset, but it can be a limitation in the presence of outliers whose labels have been wrongly assigned.

Label spreading solves this problem by introducing a clamping factor that determines the percentage of clamped labels. The algorithm is very similar to label propagation, but it's based on graph Laplacian and can be employed in all those problems where the data-generating distribution is not well determined and the probability of noise is high.

The propagation based on Markov random walks is a very simple algorithm that can estimate the class distribution of unlabeled samples through a stochastic process. It's possible to imagine it as a test sample that walks through the graph until it reaches a final labeled state (acquiring the corresponding label). The algorithm is very fast and it has a closed-form solution that can be found by solving a linear system.

The next topic was the introduction of manifold learning with the Isomap algorithm, which is a simple but powerful solution based on a graph built using a KNN algorithm (this is a common step in most of these algorithms). The original pairwise distances are processed using the multidimensional scaling technique, which allows a low-dimensional representation to be obtained where the distances between samples are preserved.

Two different approaches, based on local pieces of information, are locally linear embedding and Laplacian Spectral Embedding. The former tries to preserve the local linearity present in the original manifold, while the latter, which is based on the spectral decomposition of the normalized graph Laplacian, tries to preserve the nearness of original samples. Both methods are suitable for all those tasks where it's important not to consider the whole original distribution, but the similarity induced by small data patches.

We closed this chapter by discussing t-SNE, which is a very powerful algorithm that tries to model a low-dimensional distribution that is as similar as possible to the original high-dimensional one. This task is achieved by minimizing the Kullback-Leibler divergence between the two distributions. t-SNE is a state-of-the-art algorithm, useful whenever it's important to consider the whole original distribution and the similarity between entire samples.

In the next chapter, we are going to introduce some common unsupervised algorithms to perform clustering and pattern discovery. Some of the concepts we are going to exploit have already been discussed in this chapter. Therefore, I suggest studying all the sections before continuing your reading.

Further reading

  • Zhu X., Ghahramani Z., Learning from Labeled and Unlabeled Data with Label Propagation, CMU-CALD-02-107, 2002
  • Chapelle O., Schölkopf B., Zien A. (edited by), Semi-Supervised Learning, The MIT Press, 2010
  • Saul L. K., Weinberger K. Q., Sha F., Ham J., and Lee D. D., Spectral Methods for Dimensionality Reduction, UCSD, 2006
  • Cormen T. H., Leiserson C. E., Rivest R. L., Introduction to Algorithms, The MIT Press, 2009
  • Schofield G., Chelikowsky J. R., Saad Y., A spectrum slicing method for the Kohn–Sham problem, Computer Physics Communications, 183, 2012
  • Saul L. K., Roweis S. T., An introduction to locally linear embedding, 2001
  • Zhang Z., Wang J., MLLE: Modified Locally Linear Embedding Using Multiple Weights, NIPS, 2006
  • Belkin M., Niyogi P., Sindhwani V., Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples, Journal of Machine Learning Research 7, 2006
  • Van der Maaten L.J.P., Hinton G. E., Visualizing High-Dimensional Data Using t-SNE, Journal of Machine Learning Research 9 (Nov), 2008
  • Biyikoglu T., Leydold J., Stadler P. F., Laplacian Eigenvectors of Graphs, Springer, 2007
  • Lee J. M., Introduction to Smooth Manifolds, Springer, 2012