Silhouette score

This measure doesn't require knowledge of the ground truth and allows checking, at the same time, the intra-cluster cohesion and the inter-cluster separation. In order to define the Silhouette score, we need to introduce two auxiliary functions. The first one is the average intra-cluster distance of a sample xi belonging to a cluster Cj:

$$a(x_i) = \frac{1}{n(j) - 1} \sum_{x_t \in C_j,\, x_t \neq x_i} d(x_i, x_t)$$
In the previous expression, n(j) is the number of samples assigned to the cluster Cj (the sample xi itself is excluded from the average) and d(a, b) is a standard distance function (in the majority of cases, the Euclidean distance is chosen). We also need to define the lowest inter-cluster distance, which can be interpreted as the average nearest-cluster distance. For the sample xi ∈ Cj, let's call Ct the nearest cluster; the function is then defined as:

$$b(x_i) = \frac{1}{n(t)} \sum_{x_k \in C_t} d(x_i, x_k)$$

Here, n(t) is analogously the number of samples assigned to the cluster Ct.
The Silhouette score for sample xi is:

$$s(x_i) = \frac{b(x_i) - a(x_i)}{\max\left(a(x_i),\, b(x_i)\right)}$$
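The following is a minimal sketch of how the three functions can be computed directly with NumPy and SciPy (the dataset is a toy example generated with make_blobs, and the helper name silhouette_manual is hypothetical); in real applications, scikit-learn provides the equivalent silhouette_samples and silhouette_score functions:

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import make_blobs

def silhouette_manual(X, Y):
    # Pairwise Euclidean distances between all samples
    D = cdist(X, X)
    scores = np.zeros(X.shape[0])
    for i in range(X.shape[0]):
        # a(x_i): average distance to the other members of the same
        # cluster (assuming every cluster contains at least 2 samples)
        same = (Y == Y[i])
        same[i] = False
        a = np.mean(D[i, same])
        # b(x_i): average distance to the members of the nearest other cluster
        b = np.min([np.mean(D[i, Y == t]) for t in np.unique(Y) if t != Y[i]])
        scores[i] = (b - a) / max(a, b)
    return scores

# Toy check on a convex dataset
X, Y = make_blobs(n_samples=300, centers=3, random_state=1000)
print('Average Silhouette score: %.3f' % np.mean(silhouette_manual(X, Y)))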
The value of s(xi), like the Adjusted Rand Index, is bounded between -1 and 1. A value close to -1 indicates that b(xi) << a(xi), so the average intra-cluster distance is greater than the average nearest-cluster distance and the sample xi has been wrongly assigned. Vice versa, a value close to 1 indicates that the algorithm achieved a very good level of internal cohesion and inter-cluster separation (because a(xi) << b(xi)). Contrary to the other measures, the Silhouette score isn't a cumulative function and must be computed for each sample. A feasible strategy is to analyze the average value, but in that way, it's not possible to determine which clusters have the highest impact on the result. Another approach (the most common one) is based on Silhouette plots, which display the sorted scores for each cluster. In the following snippet, we create plots for four different values of n_clusters (3, 5, 10, 12):

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

fig, ax = plt.subplots(2, 2, figsize=(15, 10))

nb_clusters = [3, 5, 10, 12]
mapping = [(0, 0), (0, 1), (1, 0), (1, 1)]

# X_train is the dataset already used in the previous examples
for i, n in enumerate(nb_clusters):
    km = KMeans(n_clusters=n, random_state=1000)
    Y = km.fit_predict(X_train)

    silhouette_values = silhouette_samples(X_train, Y)

    ax[mapping[i]].set_xticks([-0.15, 0.0, 0.25, 0.5, 0.75, 1.0])
    ax[mapping[i]].set_yticks([])
    ax[mapping[i]].set_title('%d clusters' % n)
    ax[mapping[i]].set_xlim([-0.15, 1])
    ax[mapping[i]].grid()
    y_lower = 20

    for t in range(n):
        # Sorted Silhouette values of the samples belonging to cluster t
        ct_values = silhouette_values[Y == t]
        ct_values.sort()

        y_upper = y_lower + ct_values.shape[0]

        # Draw the horizontal block corresponding to cluster t
        color = cm.Accent(float(t) / n)
        ax[mapping[i]].fill_betweenx(np.arange(y_lower, y_upper), 0, ct_values,
                                     facecolor=color, edgecolor=color)

        y_lower = y_upper + 20

plt.show()

The result is shown in the following graph:

Silhouette plots for different numbers of clusters
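As a numerical complement to the plots, the average value mentioned earlier can be obtained with scikit-learn's silhouette_score function; the following sketch reuses the X_train array from the previous snippet:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Average Silhouette score for each candidate number of clusters
for n in [3, 5, 10, 12]:
    Y = KMeans(n_clusters=n, random_state=1000).fit_predict(X_train)
    print('%d clusters -> average score: %.3f' % (n, silhouette_score(X_train, Y)))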

The analysis of a Silhouette plot should follow some common guidelines:

  • The width of each block must be proportional to the number of samples expected to belong to the corresponding cluster. If the label distribution is uniform, all the blocks must have a similar width, and any asymmetry indicates wrong assignments. For example, in our case, we know that the right number of clusters is ten, but a couple of blocks are thinner than the other ones. This means that some clusters contain fewer samples than expected and the remaining samples have been assigned to the wrong partitions.
  • The shape of a block shouldn't be sharp and peaked (like a knife), because this means that many samples have a low Silhouette score. The ideal (realistic) scenario is made up of shapes similar to cigars, with a minimal difference between the highest and lowest values. Unfortunately, this is not always possible to achieve, but it's always preferable to tune the algorithm when the shapes are like the ones plotted in the first diagram (three clusters).
  • The maximum Silhouette score should be close to 1. Lower values (like in our example) indicate the presence of partial overlaps and wrong assignments. Negative values must be absolutely avoided (or limited to a very small number of samples) because they show a failure of the clustering process. Moreover, it's possible to prove that convex clusters (like the K-means hyperspheres) lead to higher values. This is due to the properties of the common distance functions (like the Euclidean distance), which can suggest a low internal cohesion whenever the shape of a cluster is concave (think about a circle and a half-moon). In this case, the process of embedding the shape into a convex geometry leads to a lower density, and this negatively affects the Silhouette score (see the sketch after this list).
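To see the last point in action, the following is a minimal sketch based on scikit-learn's make_moons toy dataset (two concave, interleaving clusters): the convex partition found by K-means typically obtains a higher average Silhouette score than the ground-truth assignment, even though the latter is the correct one.

from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

# Two interleaving half-moons: the ground-truth clusters are concave
X, Y_true = make_moons(n_samples=500, noise=0.05, random_state=1000)

# K-means forces a convex (hyperspherical) partition
Y_km = KMeans(n_clusters=2, random_state=1000).fit_predict(X)

# Compare the score of the correct concave partition with the convex one
print('Silhouette (ground truth): %.3f' % silhouette_score(X, Y_true))
print('Silhouette (K-means): %.3f' % silhouette_score(X, Y_km))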

In our particular case, we cannot accept a number of clusters different from ten. However, the corresponding Silhouette plot is not perfect. We know the reasons for such imperfections (the structure of the samples and the high similarity of different digits), and it's quite difficult to avoid them using an algorithm like K-means. The reader can try to improve the performance by increasing the number of iterations, but in such cases, if the result doesn't meet the requirements, it's preferable to adopt another method (like spectral clustering, which can manage asymmetric clusters and more complex geometries).
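As a reference, this is a minimal sketch of spectral clustering applied to the same dataset (the affinity type and the number of neighbors are illustrative assumptions that normally require tuning):

from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

# A nearest-neighbors affinity lets the algorithm follow non-convex geometries
# (affinity and n_neighbors are illustrative values, not tuned choices)
sc = SpectralClustering(n_clusters=10, affinity='nearest_neighbors',
                        n_neighbors=20, random_state=1000)
Y_sc = sc.fit_predict(X_train)
print('Average Silhouette score: %.3f' % silhouette_score(X_train, Y_sc))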
