Agglomerative clustering with hclust()

In what follows, we explore the use of agglomerative clustering with hclust() on two datasets: one with numerical attributes and one with binary attributes.

Exploring the results of votes in Switzerland

In this section, we will examine another dataset, which contains the percentage of acceptance of the topics of federal (national) votes in Switzerland in 2001. The first rows of the data are shown in the following table. The rows represent the cantons (the Swiss equivalent of states). The columns (except the first) represent the topics of the votes, and the values are the percentages of acceptance of each topic. The data was retrieved from the Swiss Statistics Office (www.bfs.admin.ch) and is provided in the folder for this chapter (file swiss_votes.dat).

The first five rows of the dataset

To load the data, save the file in your working directory or change the working directory (using setwd() with the path to the file) and type the following lines of code. Here, the sep argument indicates that tab characters are used as separators, and the header argument is set to T (TRUE), meaning that the data has column headers (attribute names):

swiss_votes = read.table("swiss_votes.dat", sep = "\t",
   header = T)
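
As an optional quick check (not part of the original walkthrough, just standard base R inspection), one can verify that the file was read as expected, that is, that the first column contains the canton names and the remaining ten columns the acceptance percentages:

str(swiss_votes)
head(swiss_votes)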

Here, we are interested in knowing whether we can find clusters in the voting behavior of the populations of the cantons in 2001. We will perform the analysis three times, once with each of the linkage methods discussed previously, and examine the potential differences.

After computing the distance matrix (line 1), we start with the default method (complete linkage, on line 2), followed by the single (line 3) and average (line 4) linkage methods. On line 5, we set the plotting area to contain three vertically stacked graphs, and we then plot the dendrograms on lines 6 to 14:

1  dist_matrix = dist(swiss_votes[2:11])
2  clust_compl  = hclust(dist_matrix)
3  clust_single  = hclust(dist_matrix, method = "single")
4  clust_ave  = hclust(dist_matrix, method = "average")
5  par(mfrow = c(3,1))
6  plot(clust_compl, labels=swiss_votes$Canton, hang = -1,
7     main = "Complete linkage", xlab = "Canton",
8     ylab = "Distance")
9  plot(clust_single, labels=swiss_votes$Canton, hang = -1,
10     main = "Single linkage", xlab = "Canton",
11     ylab = "Distance")
12  plot(clust_ave, labels=swiss_votes$Canton, hang = -1,
13     main = "Average linkage", xlab = "Canton",
14     ylab = "Distance")

It can be seen that, as mentioned by the R Core Team (2013), the Complete linkage method produces compact clusters, whereas the Single linkage method is more inclusive; the Average linkage method produces a clustering solution that lies somewhere in between. It is also interesting to notice that points (here, cantons) that are grouped together by one method are not necessarily grouped together by another. The choice of the clustering solution is therefore important when analyzing the data.
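
To see this concretely, one could cross-tabulate the memberships produced by two of the methods, for instance by cutting the Complete and Single linkage trees into the same number of clusters with cutree() (a function we use more formally below). This is a small additional check of our own, relying on the clust_compl and clust_single objects computed above:

compl_4 = cutree(clust_compl, k = 4)
single_4 = cutree(clust_single, k = 4)
table(complete = compl_4, single = single_4)

Each row of the resulting table shows how the cantons of one Complete linkage cluster are distributed over the Single linkage clusters.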

One way to make such a choice is based on the interpretability of the results and domain knowledge. One factor that explains shared opinions is geographical proximity, as people who are close communicate more frequently and thereby share ideas (Latané, 1996). The clustering solution obtained with Complete linkage best exemplifies the impact of geographical proximity (take a look at a map of Switzerland), which allows the results to be interpreted using this rationale. We will therefore investigate the differences between clusters in this solution more precisely.

Dendrograms of Swiss votes

We can see, on the corresponding cluster tree for Complete linkage, that there are four subgroups. The first is composed of NW, OW, AI, and SZ; the second of TI, SH, GL, UR, TG, AR, and SG; and so on. We know that most of the cantons composing a cluster are geographically close, but we don't know exactly what this means in relation to the data. We now want to examine the differences between the clusters in terms of voting behavior. The first step is to assign each canton to a cluster on the basis of the Complete linkage solution. After examining the clustering tree, we determined that four clusters are appropriate for this data. The following code displays the cluster assignment:

clusters=cutree(clust_compl, k = 4)
cbind(clusters,swiss_votes[1])

The output is provided below:

The cluster assignment of each canton

We can see that cluster 1 corresponds to the last (rightmost) subgroup on the dendrogram, cluster 2 to the first, cluster 3 to the second, and cluster 4 to the third.
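
As a side note, the number of cantons assigned to each cluster can be obtained with a one-line addition of our own, using the clusters vector computed above:

table(clusters)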

We will create a scatterplot of the acceptance rates of two votes in order to examine these patterns. The first vote, whose acceptance rates are depicted on the x axis (called Protection in the data frame), was related to the implementation of a protection and support service aimed at guaranteeing peace within the country. The second vote, whose acceptance rates are depicted on the y axis (called Taxes2 in the data frame), was related to the implementation of an environmental tax on nonrenewable energy. Both themes were rejected by the population in all cantons; yet, there are important differences in the extent of the rejection.

We plot the graph using the following code. Notice the order of the colors, which indicates the cluster that each color represents:

plot(swiss_votes$Protection,swiss_votes$Taxes2,
   pch=15, col=gray.colors(4)[clusters],
   xlab="Protection and support services for peace",
   ylab = "Tax on non-renewable energy")

A scatterplot of two votes, by cluster
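
If needed, a legend can make the mapping between shades of gray and cluster numbers explicit. The following is a minimal sketch of our own, assuming the scatterplot above is still the active graphics device; the position "topleft" is an arbitrary choice:

legend("topleft", legend = paste("Cluster", 1:4),
   pch = 15, col = gray.colors(4))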

Generally, the levels of agreement with the two votes are related: the more the population of a canton agrees with one of the votes, the more it agrees with the other. We can notice that the cluster on the left (cluster 2 in the preceding output) has the lowest agreement on both votes, and the cluster on the right (cluster 4 in the output) has the highest agreement on both votes. The other two clusters lie in between for these two votes. This is an interesting visual representation, but wouldn't it be nice to have a table with the mean agreement for each vote and cluster? This is very easy to obtain:

round(aggregate(swiss_votes[2:11], list(clusters), mean),1)

The output is provided in the following figure:

The mean acceptance of each vote, by cluster

One can notice that the second cluster (corresponding to the leftmost cluster on the scatterplot) has mean values that are generally smaller than those of the other clusters. We can see that the agreement in the other clusters varies as a function of the topic of the votes, with cluster 4 being generally less conservative. The interested reader can find more about Swiss politics on the following page: http://www.admin.ch/org/polit/00054/index.html?lang=en.

The use of hierarchical clustering on binary attributes

The previous datasets that we have used are composed of numerical attributes. Data is sometimes composed of binary attributes. By binary, we mean that there are only two possible modalities for the attribute. Examples of such attributes include characteristics such as gender (woman/man), currently married (yes/no), and organization type (private/public).

The distance metric provided to hclust() has to take the nature of the attributes into account; that is, it must be computed accordingly. The binary distance is the required type for such data: it is the proportion of attributes on which exactly one of the two cases is on, among the attributes on which at least one of them is on. A distance matrix containing the binary distance can be obtained using the following line of code, where df is the data frame on which to compute the distance:

dist(df, method="binary")
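
To illustrate how this distance behaves, here is a toy example of our own (the data is made up purely for illustration). The two cases below are both on for the first attribute, differ on the second and third, and are both off for the fourth; among the three attributes where at least one case is on, they differ on two, so their binary distance is 2/3:

toy = rbind(c(1, 1, 0, 0),
            c(1, 0, 1, 0))
dist(toy, method = "binary")   # 0.6666667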

Here we will use the Trucks dataset from the vcd package. Install and load it (as well as the data) using the following code:

install.packages("vcd")
library(vcd)
data(Trucks)
head(Trucks)

The first few lines of the data displayed here (requested on the last line of the preceding code) show that it requires frequency weighting:

The first rows of the Trucks dataset

For instance, the first row represents 712 cases of truck accidents that occurred during the day, before a new safety policy was implemented, with a collision at the back of the vehicle, when the truck was parked. We want 712 rows (one per case) instead of a single row with a frequency count. This can be done using the following code:

Trucks.wd <- Trucks[rep(1:nrow(Trucks), Trucks$Freq), ]
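
As a quick sanity check of our own, the expanded data frame should now contain one row per accident, so its number of rows should equal the sum of the original frequencies:

nrow(Trucks.wd) == sum(Trucks$Freq)   # expected to be TRUE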

For further analyses, we will remove attributes Freq (number 1) and light (number 5):

Trucks.rm = Trucks.wd[, -(c(1,5))]

As the large number of cases makes the process very slow, and also for readability of the plots, we will randomly sample only 100 cases:

set.seed(456)
Trucks.sample = Trucks.rm[sample(nrow(Trucks.rm), 100), ]

We also need to recode each attribute as on/off: the value taken by the first sampled case is coded as on (1), and all other values as off (0):

# Create an empty data frame with the same dimensions as the sample
Trucks.onoff = data.frame(matrix(nrow =
   nrow(Trucks.sample), ncol = ncol(Trucks.sample)))
# Code each value as 1 if it matches the value of the first sampled
# case on that attribute, and as 0 otherwise
for (i in 1:nrow(Trucks.sample)) {
   for (j in 1:ncol(Trucks.sample)) {
      if (Trucks.sample[i,j] != Trucks.sample[1,j])
         Trucks.onoff[i,j] = 0
      else Trucks.onoff[i,j] = 1
   }
}
names(Trucks.onoff) = names(Trucks.sample)
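
The same recoding can also be written more compactly without explicit loops. The following vectorized version is a sketch of our own (the name Trucks.onoff2 is arbitrary) and should produce the same 0/1 coding:

Trucks.onoff2 = as.data.frame(sapply(Trucks.sample,
   function(column) as.integer(column == column[[1]])))
all(Trucks.onoff2 == Trucks.onoff)   # expected to be TRUE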

We can now proceed with the analysis and plot the dendrogram:

b = hclust(dist(Trucks.onoff, method= "binary"))
plot(b)

As can be seen in the following figure (because of the random sampling, your figure might look different, but what follows should still be valid), there are lots of cases with a distance of 0. Because we did not configure the plot to align the leaves (as we did earlier with hang = -1), the height at which cases are joined reflects their actual distance; a height of 0 therefore means that the cases are identical on the selected attributes. As the distance is often 0, hclust() did not proceed with the usual pairwise grouping that we have seen in the previous examples. One can further see that most cases are part of clusters that are relatively distant from one another, which makes sense when looking at the data.

A dendrogram of truck accidents (binary data)
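
One way to confirm that many of the sampled cases are indeed identical on the selected attributes, and therefore at a distance of 0 from each other, is to count the duplicated rows; this check is our own addition:

sum(duplicated(Trucks.onoff))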

Even though we have randomly selected only 100 cases, the dendrogram is still quite dense. For this reason, we will only comment on its left part. Let's have a look at cases 96, 92, 76, 52, 5, and 23. In the output, we can see that these cases all share the same values on all attributes:

Trucks.onoff[c(96,92,76,52,5,23),]

The output is provided here:

The values of the selected cases in Trucks.onoff