Visualization – dendrograms

Hierarchical clustering provides insight into degrees of similarity among observations as it continues to merge data. A significant change in the similarity metric from one merge to the next suggests a natural clustering existed prior to this point.

The dendrogram visualizes the successive merges as a binary tree, displaying the individual data points as leaves and the final merge as the root of the tree. It also shows how the similarity monotonically decreases from bottom to top. Hence, it is natural to select a clustering by cutting the dendrogram.

The following screenshot (see the hierarchical_clustering notebook for implementation details) illustrates the dendrograms for the classic Iris dataset, which has three classes and four features, using the four different distance metrics introduced earlier.
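A minimal sketch of how such dendrograms can be produced, assuming scikit-learn and SciPy are available; the Iris data stands in here, and the four linkage methods shown are illustrative choices:

```python
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = load_iris().data  # 150 observations, 4 features

# One dendrogram per linkage method; each merge history Z has n-1 rows,
# one per pairwise merge, with the merge height in the third column.
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, method in zip(axes.flat, ['single', 'complete', 'average', 'ward']):
    Z = linkage(X, method=method)
    dendrogram(Z, ax=ax, no_labels=True)
    ax.set_title(method)
plt.tight_layout()
```

Cutting the tree at a given height (or requesting a fixed number of clusters with `scipy.cluster.hierarchy.fcluster`) then yields a flat clustering.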

The notebook also evaluates the fit of the hierarchical clustering using the cophenetic correlation coefficient, which compares the pairwise distances among points with the cluster similarity metric at which each pairwise merge occurred. A coefficient of 1 implies that closer points always merge earlier.
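A sketch of computing the cophenetic correlation coefficient with SciPy, again using Iris as an assumed example dataset; values closer to 1 indicate that the merge heights faithfully preserve the original pairwise distances:

```python
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = load_iris().data
Z = linkage(X, method='average')

# cophenet correlates the original pairwise distances with the
# cophenetic distances (the merge heights at which pairs first join).
coeff, coph_dists = cophenet(Z, pdist(X))
print(f'cophenetic correlation: {coeff:.3f}')
```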

Different linkage methods produce dendrograms of different appearance, so we cannot use this visualization to compare results across methods. In addition, for Ward's method, which minimizes within-cluster variance, the dendrogram may reflect not the change in variance at each merge but the total variance, which can be misleading. Instead, other quality metrics such as the cophenetic correlation, or measures such as inertia (if aligned with the overall goal), may be more appropriate.
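Since dendrogram shapes are not comparable across linkage methods, one alternative is to rank the methods by their cophenetic correlation instead; a sketch under the same assumed Iris setup:

```python
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = load_iris().data
dists = pdist(X)  # compute the pairwise distances once

# Higher cophenetic correlation means the hierarchy better preserves
# the original distances; this gives a method-agnostic comparison.
results = {}
for method in ['single', 'complete', 'average', 'ward']:
    coeff, _ = cophenet(linkage(X, method=method), dists)
    results[method] = coeff
    print(f'{method:>8}: {coeff:.3f}')
```

Note that such scores measure fidelity to the distance structure, not whether the resulting clusters serve the analysis goal.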

The strengths of hierarchical clustering include:

  • It does not require specifying the number of clusters in advance
  • It offers insight about potential clustering by means of an intuitive visualization
  • It produces a hierarchy of clusters that can serve as taxonomy
  • It can be combined with k-Means to reduce the number of items at the start of the agglomerative process
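The last point can be sketched as follows: first compress the data into many k-Means centroids, then run the agglomerative process on the much smaller set of centroids. The cluster counts below are illustrative assumptions, not values from the text:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

X = load_iris().data  # 150 points

# Step 1: reduce 150 points to 30 k-Means centroids.
centroids = (KMeans(n_clusters=30, n_init=10, random_state=0)
             .fit(X).cluster_centers_)

# Step 2: agglomerate only the 30 centroids, then cut into 3 clusters.
Z = linkage(centroids, method='ward')
labels = fcluster(Z, t=3, criterion='maxclust')
```

This keeps the quadratic-cost agglomerative step cheap while retaining a hierarchy over representative points.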

Weaknesses of hierarchical clustering include:

  • The high cost in terms of computation and memory because of the numerous similarity matrix updates
  • It may not reach the global optimum because all merges are final
  • The curse of dimensionality leads to difficulties with noisy, high-dimensional data