An overview of HC algorithm and challenges

A hierarchical clustering technique is computationally different from the centroid-based clustering in the way the distances are computed. This is one of the most popular and widely used clustering analysis technique that looks to build a hierarchy of clusters. Since a cluster usually consists of multiple objects, there will be other candidates to compute the distance too. Therefore, with the exception of the usual choice of distance functions, you also need to decide on the linkage criterion to be used. In short, there are two types of strategies in hierarchical clustering:

  • Bottom-up approach: In this approach, each observation starts within its own cluster. After that, the pairs of clusters are merged together and one moves up the hierarchy.
  • Top-down approach: In this approach, all observations start in one cluster, splits are performed recursively, and one moves down the hierarchy.

These bottom-up or top-down approaches are based on the single-linkage clustering (SLINK) technique, which considers the minimum object distances, the complete linkage clustering (CLINK), which considers the maximum of object distances, and the unweighted pair group method with arithmetic mean (UPGMA). The latter is also known as average-linkage clustering. Technically, these methods will not produce unique partitions out of the dataset (that is, different clusters).

A comparative analysis on these three approaches can be found at https://nlp.stanford.edu/IR-book/completelink.html.

However, the user still needs to choose appropriate clusters from the hierarchy for better cluster prediction and assignment. Although algorithms of this class like bisecting K-means are computationally faster than the K-means algorithm, there are three disadvantages to this type of algorithm:

  • First, these methods are not very robust toward outliers or datasets containing noise or missing values. This disadvantage either imposes additional clusters or even causes other clusters to merge. This problem is commonly referred to as the chaining phenomenon, especially for single-linkage clustering.
  • Second, from the algorithmic analysis, the complexity is for agglomerative clustering and for divisive clustering, which makes them too slow for large data sets.
  • Third, SLINK and CLINK were previously used widely in data mining tasks as theoretical foundations of cluster analysis, but nowadays they are considered obsolete.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
13.58.8.127