In this example, we are going to use the sinusoidal dataset previously shown. The first step is creating it (with 1,000 samples):
import numpy as np
from sklearn.preprocessing import StandardScaler

nb_samples = 1000
X = np.zeros(shape=(nb_samples, 2))

for i in range(nb_samples):
    X[i, 0] = float(i)
    if i % 2 == 0:
        X[i, 1] = 1.0 + (np.random.uniform(0.65, 1.0) * np.sin(float(i) / 100.0))
    else:
        X[i, 1] = 0.1 + (np.random.uniform(0.5, 0.85) * np.sin(float(i) / 100.0))

ss = StandardScaler()
Xs = ss.fit_transform(X)
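As a quick sanity check (a minimal sketch, not part of the original listing), we can verify that StandardScaler has brought each column to zero mean and unit variance, which is the preprocessing assumption behind the clustering steps that follow:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Rebuild the same dataset so the snippet runs standalone
nb_samples = 1000
X = np.zeros(shape=(nb_samples, 2))

for i in range(nb_samples):
    X[i, 0] = float(i)
    if i % 2 == 0:
        X[i, 1] = 1.0 + (np.random.uniform(0.65, 1.0) * np.sin(float(i) / 100.0))
    else:
        X[i, 1] = 0.1 + (np.random.uniform(0.5, 0.85) * np.sin(float(i) / 100.0))

ss = StandardScaler()
Xs = ss.fit_transform(X)

# After standardization, each column has (approximately) zero mean and unit std
print(np.round(Xs.mean(axis=0), 6))
print(np.round(Xs.std(axis=0), 6))
```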
At this point, we can try to cluster it using K-means (with n_clusters=2):
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2, random_state=1000)
Y_km = km.fit_predict(Xs)
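Before looking at the plot, it can be instructive to attach a number to the K-means partition. The following is a hedged sketch (not in the original text) that computes the Silhouette score of the K-means result; the score lies in [-1, 1] and measures only geometric cohesion and separation, so even a moderate value cannot tell us that the "true" clusters are the two sinusoids:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Rebuild the dataset (same recipe as above, vectorized) so the snippet runs standalone
nb_samples = 1000
t = np.arange(nb_samples, dtype=float)
amp = np.where(t % 2 == 0,
               1.0 + np.random.uniform(0.65, 1.0, nb_samples) * np.sin(t / 100.0),
               0.1 + np.random.uniform(0.5, 0.85, nb_samples) * np.sin(t / 100.0))
Xs = StandardScaler().fit_transform(np.column_stack((t, amp)))

km = KMeans(n_clusters=2, random_state=1000)
Y_km = km.fit_predict(Xs)

# Silhouette is a purely geometric score in [-1, 1]; it is blind to the
# sinusoidal structure we actually want to recover
print(silhouette_score(Xs, Y_km))
```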
The result is shown in the following graph:
As expected, K-means isn't able to separate the two sinusoids. The reader is free to try different parameters, but the result will always be unacceptable: in two dimensions, K-means produces convex (roughly circular) clusters, and no such configuration can capture the two interleaved sinusoids. We can instead employ spectral clustering with an affinity matrix based on the KNN algorithm (in this case, Scikit-Learn may emit a warning because the graph is not fully connected, but this normally doesn't affect the results). Scikit-Learn implements the SpectralClustering class, whose most important parameters are n_clusters, the number of expected clusters; affinity, which can be either 'rbf' or 'nearest_neighbors'; gamma (only for RBF); and n_neighbors (only for KNN). For our test, we have chosen 20 neighbors:
from sklearn.cluster import SpectralClustering
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', n_neighbors=20, random_state=1000)
Y_sc = sc.fit_predict(Xs)
The result of the spectral clustering is shown in the following graph:
As expected, the algorithm was able to separate the two sinusoids perfectly. As an exercise, I invite the reader to apply this method to the MNIST dataset, using both an RBF (with different gamma values) and KNN (with different numbers of neighbors). I also suggest replotting the t-SNE diagram and comparing all the assignment errors. As the clusters are strictly non-convex, we don't expect a high Silhouette score. Other useful exercises include drawing the Silhouette plot and checking the result, and assigning the ground-truth labels in order to measure the homogeneity and the completeness.
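The last exercise can be sketched as follows. This is a hedged example, not part of the original text: the ground-truth label i % 2 follows directly from how the dataset was built (even rows belong to one sinusoid, odd rows to the other), and the exact scores depend on the random draws. Both homogeneity and completeness lie in [0, 1] and are invariant to label permutations, so no remapping of cluster indices is needed:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import SpectralClustering
from sklearn.metrics import homogeneity_score, completeness_score

# Rebuild the dataset; the ground truth is the parity of the row index
nb_samples = 1000
X = np.zeros(shape=(nb_samples, 2))
Y_true = np.zeros(nb_samples, dtype=int)

for i in range(nb_samples):
    X[i, 0] = float(i)
    Y_true[i] = i % 2
    if i % 2 == 0:
        X[i, 1] = 1.0 + (np.random.uniform(0.65, 1.0) * np.sin(float(i) / 100.0))
    else:
        X[i, 1] = 0.1 + (np.random.uniform(0.5, 0.85) * np.sin(float(i) / 100.0))

Xs = StandardScaler().fit_transform(X)

sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        n_neighbors=20, random_state=1000)
Y_sc = sc.fit_predict(Xs)

# Both metrics are in [0, 1] and don't depend on which cluster got which index
print(homogeneity_score(Y_true, Y_sc))
print(completeness_score(Y_true, Y_sc))
```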