Hierarchical clustering

The hierarchical clustering algorithm is based on a dissimilarity measure between observations. A common measure, and the one we will use, is Euclidean distance. Other distance measures are also available.
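As a brief sketch of how a pairwise dissimilarity matrix is computed, here is an example using SciPy's `pdist`; the four observations are hypothetical toy data, and swapping in another distance measure is a one-word change:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: 4 observations with 2 features each.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [0.0, 1.0],
              [6.0, 8.0]])

# Pairwise Euclidean distances, returned as a condensed vector.
d_euclidean = pdist(X, metric="euclidean")

# Other measures only require changing the metric argument,
# e.g. Manhattan (city block) distance:
d_manhattan = pdist(X, metric="cityblock")

# squareform() expands the condensed vector into the full n x n matrix.
D = squareform(d_euclidean)
print(D[0, 1])  # distance between observations 0 and 1 -> 5.0
```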

Hierarchical clustering is an agglomerative, or bottom-up, technique. By this, we mean that every observation starts as its own cluster. From there, the algorithm proceeds iteratively, searching the pairwise distances and merging the two clusters that are most similar. So, after the first iteration there are n-1 clusters, after the second iteration there are n-2 clusters, and so forth.
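This iterative merging can be sketched with SciPy's `linkage` function on some hypothetical data: each row of the returned matrix records one merge, so n observations produce exactly n-1 merge steps.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical data: n = 10 observations with 2 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))

# Each row of Z records one merge (the two clusters joined, the
# distance at which they were joined, and the new cluster's size),
# so 10 observations yield exactly 9 merges.
Z = linkage(X, method="complete", metric="euclidean")
print(Z.shape[0])  # -> 9
```

After k merges there are n-k clusters remaining, matching the n-1, n-2, ... progression described above.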

As the iterations continue, it is important to understand that in addition to the distance measure, we must also specify the linkage between the groups of observations. Different types of data will demand different cluster linkages. As you experiment with the linkages, you may find that some create highly unbalanced numbers of observations in one or more clusters. For example, if you have 30 observations, one technique may create a cluster of just one observation, regardless of how many total clusters you specify. In this situation, your judgment will likely be needed to select the most appropriate linkage as it relates to the data and business case.

The following table lists the common linkage types, but note that there are others:

Ward: Minimizes the total within-cluster variance, as measured by the sum of squared errors from each cluster's points to its centroid.

Complete: The distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster.

Single: The distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster.

Average: The distance between two clusters is the mean distance between an observation in one cluster and an observation in the other cluster.

Centroid: The distance between two clusters is the distance between the cluster centroids.
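To see how the choice of linkage can change the balance of the resulting clusters, here is a sketch using SciPy on hypothetical data: 30 observations made of two compact groups plus one extreme outlier, cut into three clusters under each linkage. The exact sizes printed depend on the random data, so they are not claimed here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: two compact groups of 15 and 14 points plus one
# extreme outlier, for 30 observations in total.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(15, 2)),
    rng.normal(5.0, 0.5, size=(14, 2)),
    [[50.0, 50.0]],
])

# Cut each tree into three clusters and compare the cluster sizes;
# some linkages may isolate the outlier in a one-observation cluster.
for method in ["ward", "complete", "single", "average", "centroid"]:
    Z = linkage(X, method=method)          # Euclidean distance by default
    labels = fcluster(Z, t=3, criterion="maxclust")
    sizes = sorted(np.bincount(labels)[1:], reverse=True)
    print(method, sizes)
```

Comparing the printed sizes across linkages is one quick way to apply the judgment described above before committing to a linkage.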


The output of hierarchical clustering is a dendrogram, a tree-like diagram that shows the arrangement of the various clusters.
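A dendrogram can be produced with SciPy's `dendrogram` function; on hypothetical data, and with `no_plot=True` so the tree layout is returned as a dictionary rather than drawn (pass the linkage matrix without that flag, under matplotlib, to render the actual diagram):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical data: 8 observations with 2 features each.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))
Z = linkage(X, method="average")

# no_plot=True returns the tree layout (leaf order, branch coordinates)
# without drawing; omit it in a matplotlib session to display the plot.
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])  # leaf labels in left-to-right display order
```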

As we will see, it can often be difficult to identify a clear-cut breakpoint when selecting the number of clusters. Once again, your decision should be iterative in nature and grounded in the context of the business decision.
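One heuristic sketch for finding a breakpoint, assuming SciPy and deliberately well-separated synthetic data: the merge heights recorded in the linkage matrix can be scanned for a large jump, and cutting the tree just before the biggest jump suggests a candidate number of clusters. On messy real data the jumps are rarely this clean, which is why judgment remains necessary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic, well-separated data: two groups of 10 observations.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 1.0, size=(10, 2)),
               rng.normal(8.0, 1.0, size=(10, 2))])
Z = linkage(X, method="ward")

# The merge heights in Z[:, 2] are candidate cut points; a large jump
# between consecutive heights suggests a natural stopping point.
heights = Z[:, 2]
gaps = np.diff(heights)
k = len(X) - (np.argmax(gaps) + 1)  # clusters remaining before the biggest jump
print(k)

# fcluster then assigns labels for the chosen number of clusters.
labels = fcluster(Z, t=k, criterion="maxclust")
```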
