Hierarchical clustering

To build a hierarchical cluster model in R, you can use the hclust() function in the base stats package. The two primary inputs the function needs are a distance matrix and a clustering method. The distance matrix is easily produced with the dist() function; we will use Euclidean distance. A number of clustering methods are available, and the default for hclust() is complete linkage.

We will try complete linkage, but I also recommend Ward's linkage method, which tends to produce clusters with a similar number of observations.

With complete linkage, the distance between two clusters is the maximum distance between any single observation in one cluster and any single observation in the other. Ward's linkage method, by contrast, merges the pair of clusters that minimizes the increase in the total within-cluster sum of squares.
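To make the two criteria concrete, here is a small, purely illustrative sketch on two made-up clusters of points (none of these objects come from the wine data):

> set.seed(123)
> cluster_a <- matrix(rnorm(10), ncol = 2)            # five 2-D points
> cluster_b <- matrix(rnorm(10, mean = 3), ncol = 2)  # five more, shifted away

# Complete linkage: the largest pairwise distance between the two clusters
> cross_dist <- as.matrix(dist(rbind(cluster_a, cluster_b)))[1:5, 6:10]
> max(cross_dist)

# Ward's criterion looks at the within-cluster sum of squares instead;
# it favors the merge that increases this total the least
> wss <- function(x) sum(scale(x, scale = FALSE)^2)
> wss(cluster_a) + wss(cluster_b)    # total before merging
> wss(rbind(cluster_a, cluster_b))   # total if the two clusters are merged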

It is noteworthy that the R method ward.D2 squares the dissimilarities internally, so it implements Ward's linkage correctly when you pass it ordinary Euclidean distances. The ward.D method is also available, but it expects the distance matrix you supply to already contain squared distances. As we will be building a distance matrix of non-squared values, we will require ward.D2.
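If you want to convince yourself of the ward.D versus ward.D2 distinction, a quick sanity check along these lines should work (a sketch, assuming wine_df is the scaled wine data used throughout): squaring the distances for ward.D should give the same three-cluster assignments as handing the raw distances to ward.D2.

> d <- dist(wine_df, method = "euclidean")
> ward_d  <- hclust(d^2, method = "ward.D")   # ward.D wants squared distances
> ward_d2 <- hclust(d,   method = "ward.D2")  # ward.D2 squares them internally

# The merge orders should agree, so the cross-tabulation of the three-cluster
# cuts should be concentrated on the diagonal
> table(cutree(ward_d, 3), cutree(ward_d2, 3))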

Now, the big question is how many clusters should we create? As stated in the introduction, the short, and probably not very satisfying, answer is that it depends. Even though there are cluster validity measures to help with this dilemma, which we will look at, choosing the number of clusters really requires an intimate knowledge of the business context, the underlying data, and, quite frankly, trial and error. As our sommelier partner is fictional, we will have to rely on the validity measures. However, that is no panacea for selecting the number of clusters, as there are several dozen validity measures.

As exploring the pros and cons of the vast array of cluster validity measures is well outside the scope of this chapter, we can turn to a couple of papers, and even R itself, to simplify the problem for us. A paper by Milligan and Cooper (1985) explored the performance of 30 different measures/indices on simulated data. The top five performers were the CH index, Duda index, Cindex, Gamma, and Beale index. Another well-known method to determine the number of clusters is the gap statistic (Tibshirani, Walther, and Hastie, 2001). These are two good papers to explore if your cluster validity curiosity gets the better of you.

With R, we can use the NbClust() function in the NbClust package to pull results on 23 indices, including the top five from Milligan and Cooper and the gap statistic. You can see a list of all the available indices in the help file for the package. There are two ways to approach the process: one is to pick your favorite index or indices and call them individually; the other is to include all of them in the analysis and go with the majority rules method, which the function summarizes nicely for you. The function also produces a couple of plots.
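If you prefer the single-index approach, a call along these lines should work (a sketch, assuming the same scaled wine_df object used below; swap in your favorite index name from the help file, for example "gap" or "silhouette"):

> ch_only <- NbClust::NbClust(
    wine_df,
    distance = "euclidean",
    min.nc = 2,
    max.nc = 6,
    method = "complete",
    index = "ch"
  )
> ch_only$Best.nc   # the CH-recommended number of clusters and its index value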

With the stage set, let's walk through the example of using the complete linkage method. When using the function, you need to specify the minimum and maximum number of clusters, the distance measure, and the indices, in addition to the linkage method. As you can see in the following code, we will create an object called numComplete. The function specifications are Euclidean distance, a minimum of two clusters, a maximum of six clusters, complete linkage, and all indices. When you run the command, the function automatically produces output similar to what you see here: a description of the graphical methods and the majority rules conclusion:

> numComplete <- NbClust::NbClust(
    wine_df,
    distance = "euclidean",
    min.nc = 2,
    max.nc = 6,
    method = "complete",
    index = "all"
  )
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure that is, the significant peak in Hubert
index second differences plot.

*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 1 proposed 2 as the best number of clusters
* 11 proposed 3 as the best number of clusters
* 6 proposed 5 as the best number of clusters
* 5 proposed 6 as the best number of clusters

***** Conclusion *****
* According to the majority rule, the best number of clusters is 3

*******************************************************************

Going with the majority rules method, we would select three clusters as the optimal solution, at least for hierarchical clustering. The two plots that are produced contain two graphs each. As the preceding output states, you are looking for a significant knee in the plot (the graph on the left-hand side) and the peak of the graph on the right-hand side. This is the Hubert Index plot:

You can see that the bend or knee is at three clusters in the graph on the left-hand side. Additionally, the graph on the right-hand side has its peak at three clusters. The following Dindex plot provides the same information:

There are a number of values you can extract from the NbClust object, and there is one in particular I would like to show: the best number of clusters for each index, along with the index value at that number of clusters. This is accessed with $Best.nc. I've abbreviated the output to the first few indices:

> numComplete$Best.nc
                     KL      CH Hartigan   CCC    Scott
Number_clusters  5.0000  3.0000   3.0000 5.000   3.0000
Value_Index     14.2227 48.9898  27.8971 1.148 340.9634

You can see that the first index, KL, has the optimal number of clusters as five and the next index, CH, has it as three.
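If you want to reproduce the majority rules tally from the earlier output yourself, counting the votes in the first row of Best.nc should get you there (a sketch; the graphical Hubert and D indices do not cast a vote and may show up as zeros you can ignore):

> table(numComplete$Best.nc["Number_clusters", ])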

With three clusters as the recommended selection, we will now compute the distance matrix and build our hierarchical cluster object. This builds the distance matrix:

> euc_dist <- dist(wine_df, method = "euclidean")

Then, we will use this matrix as the input for the actual clustering with hclust():

> hc_complete <- hclust(euc_dist, method = "complete")

The common way to visualize hierarchical clustering is to plot a dendrogram. We will do this with the functionality provided by the dendextend package:

> dend_complete <- as.dendrogram(hc_complete)

> dend1 <- dendextend::color_branches(dend_complete, k = 3)

> plot(dend1, main = "Complete-Linkage")

The output of the preceding code is as follows:

The dendrogram is a tree diagram that shows you how the individual observations are grouped together. The arrangement of the connections (branches, if you will) tells us which observations are similar. The height at which two branches merge indicates how similar or dissimilar the observations or clusters being joined are, according to the distance matrix.
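If you want to put numbers on those branch heights, the hclust object stores them; a quick look at the largest merge heights is a sketch like this (big jumps near the top of the tree correspond to joining very dissimilar groups):

> tail(sort(hc_complete$height), 5)   # the five tallest merges in the tree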

Next, we cut the tree into three clusters with cutree() and tabulate the cluster sizes:

> complete_clusters <- cutree(hc_complete, 3)

> table(complete_clusters)
complete_clusters
 1  2  3 
69 58 51 

Out of curiosity, let's compare how this clustering algorithm compares to the cultivar labels:

> table(complete_clusters, wine$Class)

complete_clusters  1  2  3
                1 51 18  0
                2  8 50  0
                3  0  3 48

In this table, the rows are the clusters and columns are the cultivars. This method matched the cultivar labels at an 84 percent rate. Note that we are not trying to use the clusters to predict a cultivar, and in this example, we have no a priori reason to match clusters to the cultivars, but it is revealing nonetheless. 
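For reference, here is one way the 84 percent figure can be computed from that table: take the best-matching cultivar count in each cluster row and divide by the total number of observations (a sketch, assuming the same objects as above):

> xt <- table(complete_clusters, wine$Class)
> sum(apply(xt, 1, max)) / sum(xt)   # (51 + 50 + 48) / 178
[1] 0.8370787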

We will now try Ward's linkage. The code is the same as before; we first identify the number of clusters, which means changing the method to ward.D2:

> numWard <- NbClust::NbClust(
    wine_df,
    distance = "euclidean",
    min.nc = 2,
    max.nc = 6,
    method = "ward.D2",
    index = "all"
  )
# Output abbreviated to just show the algorithm's conclusion.
***** Conclusion *****

* According to the majority rule, the best number of clusters is 3

Once again, the majority rule points to a three-cluster solution. I'll let you peruse the plots on your own.

Let's move on to the actual clustering and production of the dendrogram for Ward's linkage:

> hc_ward <- hclust(euc_dist, method = "ward.D2")

> dend_ward <- as.dendrogram(hc_ward)

> dend2 <- dendextend::color_branches(dend_ward, k = 3)

> plot(dend2, main = "Ward Method")

This is the output:

The plot shows three pretty distinct clusters that are roughly equal in size. Let's get the cluster sizes and show them in relation to the cultivar labels:

> ward_clusters <- cutree(hc_ward, 3)

> table(ward_clusters, wine$Class)

ward_clusters  1  2  3
            1 59  5  0
            2  0 58  0
            3  0  8 48

So, cluster one has 64 observations, cluster two has 58, and cluster three has 56. This method matches the cultivar categories more closely than complete linkage did: 165 of the 178 observations (roughly 93 percent) fall into a cluster dominated by a single cultivar.

With another table, we can compare how the two methods match observations:

> table(complete_clusters, ward_clusters)
                 ward_clusters
complete_clusters  1  2  3
                1 53 11  5
                2 11 47  0
                3  0  0 51

While every observation in cluster three under complete linkage also lands in cluster three under Ward's method, the other two clusters do not line up exactly. The question now is how to identify those differences for interpretation. In many textbook examples, the datasets are small enough that you can simply inspect the labels for each cluster. In the real world, that is often impossible. I like to aggregate results by cluster and compare accordingly.
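For a small dataset like this one, a quick way to see which observations the two methods disagree on is to compare the label vectors directly (a sketch; this works here only because the cluster numbering happens to line up across the two solutions, which is not guaranteed in general):

> mismatched <- which(complete_clusters != ward_clusters)
> length(mismatched)   # 27, the off-diagonal total in the preceding table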

Putting aggregated results by cluster into an interactive spreadsheet or business intelligence tool facilitates understanding by you and your business partners, helping to select the appropriate clustering method and number of clusters.

I'm going to demonstrate this by looking at the mean of the features grouped by the clusters from Ward's method. First, create a separate data frame with the scaled data (or original data, if you prefer) and the results:

> ward_df <- wine_df %>%
    dplyr::mutate(cluster = ward_clusters)

Now, do the aggregation:

> ward_df %>%
    dplyr::group_by(cluster) %>%
    dplyr::summarise_all(mean) -> ward_results

You can now view that data frame in RStudio or export it to your favorite BI tool. Maybe you are interested in a plot? If so, give this a try:

> ggplot2::ggplot(ward_results, ggplot2::aes(cluster, Alcohol)) +
    ggplot2::geom_bar(stat = "identity") +
    ggthemes::theme_stata()

This is the output:

A clear separation exists between the clusters in alcohol content. If you would rather compare every feature at once instead of one at a time, a faceted version of this plot, sketched below, does the job.
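This is a sketch, assuming the ward_results data frame created above and the tidyr package for reshaping; the column name mean_value is just an illustrative choice:

> ward_results %>%
    tidyr::pivot_longer(-cluster, names_to = "feature", values_to = "mean_value") %>%
    ggplot2::ggplot(ggplot2::aes(factor(cluster), mean_value)) +
    ggplot2::geom_col() +
    ggplot2::facet_wrap(~ feature, scales = "free_y") +
    ggplot2::labs(x = "cluster", y = "mean (scaled)") +
    ggthemes::theme_stata()

With that said, let's move on to k-means.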
