Evaluating cluster quality

Cluster quality metrics help select among alternative clustering results. The kmeans_evaluation notebook illustrates the following approaches:

  1. The k-Means objective function suggests we compare the evolution of the inertia or within-cluster variance.
  2. Initially, additional centroids decrease the inertia sharply because new clusters improve the overall fit.
  3. Once an appropriate number of clusters has been found (assuming it exists), new centroids reduce the within-cluster variance by much less as they tend to split natural groupings.
  4. Hence, when k-Means finds a good cluster representation of the data, the inertia tends to follow an elbow-shaped path similar to the explained variance ratio for PCA, as shown in the following screenshot (see notebook for implementation details): 
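The elbow pattern described in the steps above can be sketched as follows. This is a minimal illustration using scikit-learn and synthetic data, not the notebook's actual dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four natural groupings (illustrative only).
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Fit k-Means for a range of k and record the inertia, i.e. the
# within-cluster sum of squared distances to the nearest centroid.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 9)}

# Inertia drops sharply until k reaches the number of natural
# groupings, then flattens out: the "elbow".
for k, inertia in inertias.items():
    print(f'k={k}: {inertia:,.0f}')
```

Plotting `inertias` against `k` yields the elbow-shaped curve; the kink marks a candidate number of clusters.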

The silhouette coefficient provides a more detailed picture of cluster quality. It answers the question: how far are the points in the nearest cluster, relative to the points in the assigned cluster?

To this end, it compares the mean intra-cluster distance (a) with the mean distance to the points in the nearest cluster (b) and computes the following score s for each sample:

s = (b - a) / max(a, b)

The score ranges between -1 and 1, but negative values are unlikely in practice because they imply that the majority of points are assigned to the wrong cluster. A useful visualization of the silhouette score compares the values for each data point to the global average because it highlights the coherence of each cluster relative to the global configuration. The rule of thumb is to avoid clusters whose mean score falls below the average for all samples.
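A minimal sketch of computing the mean silhouette coefficient with scikit-learn, on synthetic data rather than the notebook's dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three groupings (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean of s = (b - a) / max(a, b) over all samples; values near 1
# indicate compact, well-separated clusters.
score = silhouette_score(X, labels)
print(f'silhouette: {score:.3f}')
```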

The following screenshot shows an excerpt from the silhouette plot for three and four clusters. In the three-cluster case, the below-average silhouette values of cluster 1 highlight its poor fit, whereas with four clusters, every cluster contains at least some samples with above-average scores:
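The per-cluster comparison underlying such a plot can be sketched with `silhouette_samples`, which returns one score per data point. Again a minimal example on synthetic data, not the notebook's implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

# Per-sample silhouette values; the rule of thumb flags clusters
# whose mean falls below the global average.
values = silhouette_samples(X, labels)
global_avg = values.mean()
cluster_means = {c: values[labels == c].mean() for c in np.unique(labels)}
flagged = [c for c, m in cluster_means.items() if m < global_avg]
print(f'global average: {global_avg:.3f}, flagged clusters: {flagged}')
```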

In sum, given the usually unsupervised nature of clustering, it is necessary to vary the hyperparameters of the clustering algorithms and evaluate the different results. It is also important to calibrate the scale of the features, in particular when some should carry a higher weight and should thus be measured on a larger scale.
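Feature scaling matters because k-Means relies on Euclidean distances, so a feature on a larger scale dominates the objective. A minimal sketch of standardization with scikit-learn's `StandardScaler` (the weights and scales here are purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: without scaling, the second
# one dominates Euclidean distances and hence the cluster assignments.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200),        # unit scale
                     rng.normal(0, 1_000, 200)])   # dominates distances

# Standardizing puts both features on equal footing; to deliberately
# up-weight a feature, multiply its standardized column by a factor.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.std(axis=0))  # both features now have unit variance
```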

Finally, to validate the robustness of the results, use subsets of data to identify whether particular patterns emerge consistently.
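One way to sketch this subset-based robustness check is to re-cluster random subsamples and measure how well their assignments agree with the full-sample solution, for example via the adjusted Rand index. This is an illustrative approach on synthetic data, not the notebook's procedure:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
rng = np.random.default_rng(0)

# Reference solution on the full sample.
full = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Re-cluster random subsets and compare assignments on the overlap:
# consistently high agreement suggests the pattern is robust.
scores = []
for seed in range(5):
    idx = rng.choice(len(X), size=300, replace=False)
    sub = KMeans(n_clusters=4, n_init=10, random_state=seed).fit_predict(X[idx])
    scores.append(adjusted_rand_score(full[idx], sub))
print(np.round(scores, 3))
```

The adjusted Rand index is label-permutation invariant, which matters here because k-Means assigns arbitrary cluster IDs on each fit.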
