Chapter 5. Agglomerative Clustering Using hclust()

Unlike partition clustering, which requires the user to specify the number of clusters k and then builds k homogeneous groups, hierarchical clustering defines clusters without user intervention, using the distances in the data to build a tree of clusters. Hierarchical clustering is particularly useful when the data is suspected to be hierarchical (leaves nested in nodes, which are nested in higher-level nodes). It can also be used to determine the number of clusters to use in k-means clustering. Hierarchical clustering can be agglomerative or divisive. Agglomerative clustering usually yields a higher number of clusters, with fewer leaf nodes per cluster.

Agglomerative clustering refers to algorithms that start with as many clusters as there are cases (each case being its own cluster) and merge clusters iteratively, one pair at a time, until only one cluster remains, corresponding to the entire dataset. Divisive clustering is the opposite: it starts with a single cluster, which is then split in two as a function of the similarities or distances in the data. These new clusters are split in turn, and so on, until each case is its own cluster. In this chapter, we will discuss agglomerative clustering. The reader might be interested in consulting the paper Hierarchical clustering schemes by Johnson (1967).
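
To make the distinction concrete, here is a minimal sketch (not part of this chapter's example) that builds both kinds of trees in R. It assumes the cluster package, whose diana() function performs divisive clustering, and uses the built-in USArrests data purely for illustration:

# Agglomerative tree with hclust() and, for comparison, a divisive tree
# with diana() from the cluster package (USArrests is only an example)
library(cluster)
d = dist(scale(USArrests))
agglo = hclust(d)                  # bottom-up: merge cases step by step
divis = diana(d)                   # top-down: split the data step by step
par(mfrow = c(1, 2))
plot(agglo, hang = -1, main = "Agglomerative (hclust)")
pltree(divis, main = "Divisive (diana)")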

In this chapter, we will cover the following:

  • The inner working of agglomerative clustering
  • The use of hclust() for agglomerative clustering with numerical attributes
  • The use of hclust() with binary attributes

The inner working of agglomerative clustering

As briefly mentioned, agglomerative clustering refers to algorithms that iteratively merge clusters, starting from one cluster per case. Let's start with the data we used at the end of the previous chapter:

1  rownames(life.scaled) = life$country
2  a = hclust(dist(life.scaled))
3  par(mfrow = c(1, 2))
4  plot(a, hang = -1, xlab = "Case number", main = "Euclidean")

We started by adding the name of each country as the row name of the corresponding case (line 1), in order to display it on the graph. The hclust() function was then used to generate a hierarchical agglomerative clustering solution from the data (line 2). The algorithm uses a distance matrix, provided as an argument (here dist() computes its default, the Euclidean distance), to determine how to build the hierarchy of clusters. We discussed measures of distance in the previous chapter; please refer to that explanation if in doubt. At line 3, we set the plotting area to include two plots side by side. Finally, the hclust object a created at line 2 was plotted as a dendrogram (line 4), shown in the following figure.
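
If we want to see the merge order numerically rather than on the dendrogram, the components of the hclust object can be inspected directly. The following is a minimal sketch, assuming the object a created at line 2:

# Each row of a$merge describes one merge: negative entries are single
# cases (row numbers in life.scaled), positive entries refer to clusters
# formed at earlier steps; a$height gives the distance at which they merge
head(cbind(a$merge, height = a$height))
a$labels    # the country names attached as row names at line 1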

The following figure (left panel) displays the clustering tree. At the start of the construction of the cluster hierarchy (or tree), each case is a cluster. Clusters are then joined one by one on the basis of the distance between them (represented by the vertical lines); the clusters with the smallest distance are selected for merging. As can be seen from the distances plotted in the figure, hclust() first joined Panama and Costa Rica in a new cluster. It then aggregated Canada and US (White) in a cluster, as they were the next cases with the smallest Euclidean distance, and it did the same for Grenada and Trinidad, Mexico and Colombia, and El Salvador and Ecuador. Next, it joined the cluster formed by Grenada and Trinidad with Jamaica. The clusters with the next-smallest distance were US (Nonwhite) and Chile, which were merged together. Then the cluster formed by Canada and US (White) was merged with Argentina. The algorithm continued merging clusters until only one remained. It is quite interesting that hclust() produces life expectancy clusters that are closely related to the geographical distance between countries; countries that are close geographically generally have similar life expectancies.

Dendrograms of remaining life expectancy in several countries (left panel: Euclidean distance; right panel: Manhattan distance)
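
The dendrogram also provides a convenient way to extract a flat partition, for instance to pick the number of clusters to use in k-means, as mentioned at the beginning of the chapter. The following is a minimal sketch, assuming the hclust object a from above; the choice of five clusters is arbitrary and only for illustration:

# Cut the tree into a given number of clusters; cutree() returns the
# cluster membership of each country
groups = cutree(a, k = 5)
table(groups)     # size of each of the five clusters
sort(groups)      # countries listed by cluster membership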

As we have mentioned, we used hclust() with the default distance matrix (Euclidean distance) in the previous example. Results can differ, although usually not dramatically, with other distance metrics. We have no reason to believe that the Euclidean distance is not the best measure here, but to illustrate the potentially different results obtained with different distance measures, we will use the Manhattan distance instead. On line 1 of the following code, we assign to a the hclust object returned by hclust(), passing method = "manhattan" to dist() to set the distance metric. We then simply plot the resulting tree. The hang argument is set to a negative value so that the leaves are drawn down to 0 on the height axis:

1  a = hclust(dist(life.scaled, method = "manhattan"))
2  plot(a, hang = -1, xlab = "Case number", main = "Manhattan")

The preceding figure (right panel) displays the result of our agglomerative cluster analysis with hclust(). We can notice that, although there are many similarities, the clustering solution is not the same as the one obtained with the Euclidean distance. In both solutions, Madagascar and Cameroon form a cluster that is not joined to the other clusters until the last step. Also, in both solutions, US (White) and Canada are very similar to each other (they form a cluster very early), as are Costa Rica and Panama. But let's have a look at Algeria. In the Euclidean distance solution, it forms a cluster with Nicaragua, but in the Manhattan distance solution, it clusters with Nicaragua and the Dominican Republic. Seychelles forms a cluster with Tunisia in the Euclidean distance solution, but in the Manhattan distance solution it clusters with South Africa, Argentina, Canada, and US.
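
One way to quantify how much the two solutions differ, rather than comparing the dendrograms by eye, is to cut both trees at the same number of clusters and cross-tabulate the memberships. The following is a minimal sketch; the object names a.euclid and a.manhat and the choice of five clusters are ours, for illustration only:

a.euclid = hclust(dist(life.scaled))
a.manhat = hclust(dist(life.scaled, method = "manhattan"))
# Each cell counts the countries falling in a given pair of
# Euclidean and Manhattan clusters
table(Euclidean = cutree(a.euclid, k = 5),
      Manhattan = cutree(a.manhat, k = 5))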

Indeed, the choice of the distance measure does affect the results. It has to be chosen according to the nature of the data (what the attributes represent). Choosing the method used to determine cluster proximity (the linkage) also plays an important role in agglomerative clustering. hclust() provides several methods, including single linkage, complete linkage, average linkage, and Ward's minimum variance. The single linkage method computes the proximity of two clusters as the smallest distance between their points. The complete linkage method computes it as the maximum distance between their points; average linkage uses the average distance between the points of one cluster and the points of the other. Finally, Ward's minimum variance method merges the pair of clusters that leads to the smallest increase in variance around the centroids of the resulting clusters. Note that Ward's minimum variance method requires a squared Euclidean distance matrix.

These methods have different properties (The R Core Team, 2013, p.1301):

"Ward's minimum variance method aims at finding compact, spherical clusters. The complete linkage method finds similar clusters. The single linkage method (which is closely related to the minimal spanning tree) adopts a 'friends of friends' clustering strategy. The other methods can be regarded as aiming for clusters with characteristics somewhere between the single and complete link methods."

We will examine differences in results later in the chapter.
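
In the meantime, readers who want to experiment can select the linkage with the method argument of hclust(). The following is a minimal sketch, assuming the life.scaled data from above; ward.D2 is available from R 3.1.0 onwards and squares the dissimilarities internally, so a plain Euclidean distance matrix can be passed to it:

d = dist(life.scaled)    # Euclidean distance matrix
linkages = c("single", "complete", "average", "ward.D2")
par(mfrow = c(2, 2))
for (m in linkages) {
  plot(hclust(d, method = m), hang = -1, main = m, xlab = "Case number")
}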

Before we proceed, let's examine, this time visually, how the algorithm proceeds, using a simple fictitious dataset. As visual representations are limited to three dimensions, we will only use three attributes, but the computation is similar with more attributes. We will display them using the scatterplot3d() function of the scatterplot3d package, which we will install and load after creating the attributes. We will then examine the clustering solution provided by hclust() in order to assess whether it confirms the impressions we get from visual inspection:

# Three identical attributes: the nine points lie on a diagonal in 3D space
A1 = c(2,3,5,7,8,10,20,21,23)
A2 = A1
A3 = A1

install.packages("scatterplot3d")
library(scatterplot3d)
scatterplot3d(A1, A2, A3, angle = 25, type = "h")

# Cluster the nine points and plot the resulting dendrogram
demo = hclust(dist(cbind(A1, A2, A3)))
plot(demo)
A 3D scatterplot and dendrogram exemplifying agglomerative clustering

As can be noticed in the left panel of the previous figure, there are three groups of two points that are very close to each other. Another point lies quite close to each of these groups of two, so that each group of two, together with its closest point, forms a group of three. Finally, the two groups on the left are closer to each other than they are to the group of three on the right. If we have a look at the dendrogram (right panel), we can see exactly the same pattern.
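
We can also read the grouping off the tree numerically. The following sketch, which assumes the demo object created above, cuts the tree into three and then two clusters so that the memberships can be compared with the visual impression:

cutree(demo, k = 3)    # membership of each of the nine points, three clusters
cutree(demo, k = 2)    # two clusters: the three right-hand points versus the rest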
