Example of K-means with Scikit-Learn

In this example, we continue using the MNIST dataset (the X_train array is the same defined in the paragraph dedicated to KNN), but we also want to analyze different clustering evaluation methods. The first step is visualizing the inertia corresponding to different numbers of clusters. We are going to use the KMeans class, which accepts the n_clusters parameter and employs the K-means++ initialization as the default method (as explained in the previous section, in order to find the best initial configuration, Scikit-Learn performs several attempts and selects the configuration with the lowest inertia; it's possible to change the number of attempts through the n_init parameter):

import numpy as np

from sklearn.cluster import KMeans

min_nb_clusters = 2
max_nb_clusters = 20

inertias = np.zeros(shape=(max_nb_clusters - min_nb_clusters + 1,))

# Train one model for each number of clusters and store the final inertia
for i in range(min_nb_clusters, max_nb_clusters + 1):
    km = KMeans(n_clusters=i, random_state=1000)
    km.fit(X_train)
    inertias[i - min_nb_clusters] = km.inertia_

We analyze the range [2, 20]. After each training session, the final inertia can be retrieved using the inertia_ instance variable. The following graph shows the plot of the values as a function of the number of clusters:

Inertia as a function of the number of clusters
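
The plot above can be reproduced with a few lines of matplotlib (a minimal sketch; the figure size and axis labels are assumptions):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 6))

# Plot the stored inertias against the corresponding numbers of clusters
ax.plot(np.arange(min_nb_clusters, max_nb_clusters + 1), inertias)
ax.set_xlabel('Number of clusters')
ax.set_ylabel('Inertia')
ax.set_xticks(np.arange(min_nb_clusters, max_nb_clusters + 1))
plt.show()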

As expected, the function is decreasing, starting from a value of about 7,500 and reaching about 3,700 with 20 clusters. In this case, we know that the real number is 10, but it's also possible to discover it by observing the trend: the slope is quite steep before 10 clusters, but it decreases more and more slowly after this threshold. This signals that some clusters are not well separated, even if their internal cohesion is high. In order to confirm this hypothesis, we can set n_clusters=10 and, first of all, check the centroids at the end of the training process:

km = KMeans(n_clusters=10, random_state=1000)
Y = km.fit_predict(X_train)

The centroids are available through the cluster_centers_ instance variable. The following figure shows the corresponding bidimensional arrays plotted as images:

K-means centroids at the end of the training process
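
A plot like this can be obtained by reshaping each centroid back into image form (a minimal sketch; it assumes the samples are 28 × 28 MNIST images, so adjust the reshape accordingly, for example to (8, 8) when working with the Scikit-Learn digits dataset):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 5, figsize=(10, 5))

for idx, ax in enumerate(axes.ravel()):
    # Each centroid is a flat vector; reshape it into the original image shape
    ax.imshow(km.cluster_centers_[idx].reshape((28, 28)), cmap='gray')
    ax.axis('off')

plt.show()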

All the digits are present and there are no duplicates. This confirms that the algorithm has successfully separated the sets, but the final inertia (which is about 4,500) informs us that there are probably wrong assignments. To obtain confirmation, we can plot the dataset using a dimensionality-reduction method, such as t-SNE (see Chapter 3, Graph-Based Semi-Supervised Learning for further details):

from sklearn.manifold import TSNE

# Project the dataset onto a bidimensional space for visualization
tsne = TSNE(n_components=2, perplexity=20.0, random_state=1000)
X_tsne = tsne.fit_transform(X_train)

At this point, we can plot the bidimensional dataset with the corresponding cluster labels:

t-SNE representation of the MNIST dataset; the labels correspond to the clusters
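
The scatter plot can be generated directly from X_tsne and the cluster labels Y (a minimal sketch; the figure size, marker size, and colormap are assumptions):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 12))

# Color each projected sample according to its K-means cluster label
scatter = ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=Y, cmap='tab10', s=10)
ax.legend(*scatter.legend_elements(), title='Cluster')
plt.show()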

The plot confirms that the dataset is made up of well-separated blobs, but a few samples are assigned to the wrong cluster (this is not surprising, considering the similarity between some pairs of digits). An important observation can further explain the trend of the inertia. In fact, the point where the slope changes almost abruptly corresponds to 9 clusters. Observing the t-SNE plot, we can immediately discover the reason: the cluster corresponding to the digit 7 is split into 3 blocks. The main one contains the majority of samples, but there are 2 smaller blobs that are wrongly attached to clusters 1 and 9. This is not surprising, considering that the digit 7 can be very similar to a distorted 1 or 9. However, these two spurious blobs are always at the boundaries of the wrong clusters (remember that the geometric structures are hyperspheres), confirming that the metric has successfully detected a low similarity. If a group of wrongly assigned samples were in the middle of a cluster, it would mean that the separation had failed dramatically and another method should be employed.
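
When the ground-truth labels are available (as they are for MNIST), this qualitative analysis can be complemented with a quick quantitative check, for example through the adjusted Rand score provided by Scikit-Learn. This is a sketch only; Y_train is a hypothetical name for the array containing the true digit labels:

from sklearn.metrics import adjusted_rand_score

# Y_train is assumed to contain the ground-truth digit labels.
# Values close to 1.0 indicate almost perfect agreement between clusters
# and true classes; wrong assignments lower the score
print(adjusted_rand_score(Y_train, Y))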
