Example of spectral clustering with Scikit-Learn

In this example, we are going to use the sinusoidal dataset previously shown. The first step is creating it (with 1,000 samples): 

import numpy as np

from sklearn.preprocessing import StandardScaler

nb_samples = 1000

X = np.zeros(shape=(nb_samples, 2))

for i in range(nb_samples):
    X[i, 0] = float(i)

    if i % 2 == 0:
        X[i, 1] = 1.0 + (np.random.uniform(0.65, 1.0) * np.sin(float(i) / 100.0))
    else:
        X[i, 1] = 0.1 + (np.random.uniform(0.5, 0.85) * np.sin(float(i) / 100.0))

ss = StandardScaler()
Xs = ss.fit_transform(X)

At this point, we can try to cluster it using K-means (with n_clusters=2):

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, random_state=1000)
Y_km = km.fit_predict(Xs)

The result is shown in the following graph:

K-means clustering result using the sinusoidal dataset

As expected, K-means isn't able to separate the two sinusoids. The reader is free to try different parameters, but the result will always be unacceptable: in two dimensions, K-means clusters are circles (convex regions), and no valid configuration exists for this geometry. We can now employ spectral clustering using an affinity matrix based on the KNN algorithm (in this case, Scikit-Learn can produce a warning because the graph is not fully connected, but this normally doesn't affect the results). Scikit-Learn implements the SpectralClustering class, whose most important parameters are n_clusters, the number of expected clusters; affinity, which can be either 'rbf' or 'nearest_neighbors'; gamma (only for RBF); and n_neighbors (only for KNN). For our test, we have chosen to have 20 neighbors:

from sklearn.cluster import SpectralClustering

sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', n_neighbors=20, random_state=1000)
Y_sc = sc.fit_predict(Xs)
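As mentioned, the affinity can also be 'rbf', in which case gamma controls the kernel width. The following is a minimal sketch of this variant; the dataset is rebuilt as above (with a seeded generator for reproducibility), and the gamma value is only an illustrative assumption, not a value taken from the text:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import SpectralClustering

# Rebuild the sinusoidal dataset (same procedure as above, seeded)
nb_samples = 1000
rs = np.random.RandomState(1000)

X = np.zeros(shape=(nb_samples, 2))

for i in range(nb_samples):
    X[i, 0] = float(i)

    if i % 2 == 0:
        X[i, 1] = 1.0 + (rs.uniform(0.65, 1.0) * np.sin(float(i) / 100.0))
    else:
        X[i, 1] = 0.1 + (rs.uniform(0.5, 0.85) * np.sin(float(i) / 100.0))

Xs = StandardScaler().fit_transform(X)

# RBF affinity: gamma=2.0 is an arbitrary illustrative choice;
# the quality of the separation depends strongly on this value
sc_rbf = SpectralClustering(n_clusters=2, affinity='rbf',
                            gamma=2.0, random_state=1000)
Y_rbf = sc_rbf.fit_predict(Xs)
```

Unlike the KNN affinity, the RBF kernel produces a fully connected affinity matrix, so no connectivity warning is raised; on the other hand, it requires tuning gamma to obtain a comparable separation.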

The result of the spectral clustering is shown in the following graph:

Spectral clustering result using the sinusoidal dataset

As expected, the algorithm was able to separate the two sinusoids perfectly. As an exercise, I invite the reader to apply this method to the MNIST dataset, using both an RBF (with different gamma values) and KNN (with different numbers of neighbors). I also suggest replotting the t-SNE diagram and comparing all the assignment errors. As the clusters are strictly non-convex, we don't expect a high Silhouette score. Other useful exercises are: drawing the Silhouette plot and checking the result, assigning ground truth labels, and measuring the homogeneity and the completeness.
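The metric-based exercises can be sketched as follows. Since the dataset alternates between the two sinusoids, the ground truth label of sample i is simply i % 2; a seeded generator replaces the unseeded np.random calls so the run is reproducible (both are assumptions, not part of the original listing):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score, homogeneity_score, completeness_score

# Rebuild the sinusoidal dataset together with its ground truth labels
nb_samples = 1000
rs = np.random.RandomState(1000)

X = np.zeros(shape=(nb_samples, 2))
Y_true = np.zeros(shape=(nb_samples,), dtype=int)

for i in range(nb_samples):
    X[i, 0] = float(i)
    Y_true[i] = i % 2  # even samples belong to the upper sinusoid

    if i % 2 == 0:
        X[i, 1] = 1.0 + (rs.uniform(0.65, 1.0) * np.sin(float(i) / 100.0))
    else:
        X[i, 1] = 0.1 + (rs.uniform(0.5, 0.85) * np.sin(float(i) / 100.0))

Xs = StandardScaler().fit_transform(X)

sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        n_neighbors=20, random_state=1000)
Y_sc = sc.fit_predict(Xs)

# Silhouette is geometric, so it penalizes the non-convex clusters;
# homogeneity and completeness instead compare against the ground truth
print('Silhouette:   {:.3f}'.format(silhouette_score(Xs, Y_sc)))
print('Homogeneity:  {:.3f}'.format(homogeneity_score(Y_true, Y_sc)))
print('Completeness: {:.3f}'.format(completeness_score(Y_true, Y_sc)))
```

With a correct separation, homogeneity and completeness approach 1.0 while the Silhouette score stays modest, which illustrates why geometric scores alone are misleading for non-convex clusters.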
