Hierarchical clustering

The hierarchical clustering algorithm is based on a dissimilarity measure between observations. A common measure, and the one we will use, is Euclidean distance. Other distance measures are also available.
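As a brief sketch of how a pairwise dissimilarity matrix is computed, here is an example using SciPy's `pdist`; the four observations are hypothetical toy data, and swapping in another distance measure is a one-word change:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: 4 observations with 2 features each.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [0.0, 1.0],
              [6.0, 8.0]])

# Pairwise Euclidean distances, returned as a condensed vector.
d_euclidean = pdist(X, metric="euclidean")

# Other measures only require changing the metric argument,
# e.g. Manhattan (city block) distance:
d_manhattan = pdist(X, metric="cityblock")

# squareform() expands the condensed vector into the full n x n matrix.
D = squareform(d_euclidean)
print(D[0, 1])  # distance between observations 0 and 1 -> 5.0
```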

Hierarchical clustering is an agglomerative, or bottom-up, technique. By this, we mean that every observation starts as its own cluster. From there, the algorithm proceeds iteratively, searching the pairwise distances and merging the two clusters that are most similar. So, after the first iteration there are n-1 clusters, after the second iteration there are n-2 clusters, and so forth.
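This iterative merging can be sketched with SciPy's `linkage` function on some hypothetical data: each row of the returned matrix records one merge, so n observations produce exactly n-1 merge steps.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical data: n = 10 observations with 2 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))

# Each row of Z records one merge (the two clusters joined, the
# distance at which they were joined, and the new cluster's size),
# so 10 observations yield exactly 9 merges.
Z = linkage(X, method="complete", metric="euclidean")
print(Z.shape[0])  # -> 9
```

After k merges there are n-k clusters remaining, matching the n-1, n-2, ... progression described above.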

As the iterations continue, it is important to understand that in addition to the distance measure, we must also specify the linkage between the groups of observations. Different types of data will demand different cluster linkages. As you experiment with the linkages, you may find that some create highly unbalanced numbers of observations in one or more clusters. For example, if you have 30 observations, one technique may create a cluster of just one observation, regardless of how many total clusters you specify. In this situation, your judgment will likely be needed to select the most appropriate linkage as it relates to the data and business case.

The following table lists the common linkage types, but note that there are others:

Ward: Minimizes the total within-cluster variance, as measured by the sum of squared errors from each cluster's points to its centroid.

Complete: The distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster.

Single: The distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster.

Average: The distance between two clusters is the mean distance between an observation in one cluster and an observation in the other cluster.

Centroid: The distance between two clusters is the distance between the cluster centroids.
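To see how the choice of linkage can change the balance of the resulting clusters, here is a sketch using SciPy on hypothetical data: 30 observations made of two compact groups plus one extreme outlier, cut into three clusters under each linkage. The exact sizes printed depend on the random data, so they are not claimed here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: two compact groups of 15 and 14 points plus one
# extreme outlier, for 30 observations in total.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(15, 2)),
    rng.normal(5.0, 0.5, size=(14, 2)),
    [[50.0, 50.0]],
])

# Cut each tree into three clusters and compare the cluster sizes;
# some linkages may isolate the outlier in a one-observation cluster.
for method in ["ward", "complete", "single", "average", "centroid"]:
    Z = linkage(X, method=method)          # Euclidean distance by default
    labels = fcluster(Z, t=3, criterion="maxclust")
    sizes = sorted(np.bincount(labels)[1:], reverse=True)
    print(method, sizes)
```

Comparing the printed sizes across linkages is one quick way to apply the judgment described above before committing to a linkage.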


The output of hierarchical clustering is a dendrogram, a tree-like diagram that shows the arrangement of the various clusters.
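A dendrogram can be produced with SciPy's `dendrogram` function; on hypothetical data, and with `no_plot=True` so the tree layout is returned as a dictionary rather than drawn (pass the linkage matrix without that flag, under matplotlib, to render the actual diagram):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical data: 8 observations with 2 features each.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))
Z = linkage(X, method="average")

# no_plot=True returns the tree layout (leaf order, branch coordinates)
# without drawing; omit it in a matplotlib session to display the plot.
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])  # leaf labels in left-to-right display order
```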

As we will see, it can often be difficult to identify a clear-cut breakpoint when selecting the number of clusters. Once again, your decision should be iterative in nature and grounded in the context of the business decision.
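One heuristic sketch for finding a breakpoint, assuming SciPy and deliberately well-separated synthetic data: the merge heights recorded in the linkage matrix can be scanned for a large jump, and cutting the tree just before the biggest jump suggests a candidate number of clusters. On messy real data the jumps are rarely this clean, which is why judgment remains necessary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic, well-separated data: two groups of 10 observations.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 1.0, size=(10, 2)),
               rng.normal(8.0, 1.0, size=(10, 2))])
Z = linkage(X, method="ward")

# The merge heights in Z[:, 2] are candidate cut points; a large jump
# between consecutive heights suggests a natural stopping point.
heights = Z[:, 2]
gaps = np.diff(heights)
k = len(X) - (np.argmax(gaps) + 1)  # clusters remaining before the biggest jump
print(k)

# fcluster then assigns labels for the chosen number of clusters.
labels = fcluster(Z, t=k, criterion="maxclust")
```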
