Learning by doing – partition clustering with kmeans()

Perhaps the most widely used clustering family of algorithms is k-means. In this section, we will examine how it works and ways to assess the quality of a clustering solution.

K-means is a partitioning algorithm that produces k clusters (where k is a user-defined number) of cases that are more similar to each other than to cases outside their cluster. K-means starts by randomly initializing the centroid of each cluster (that is, its value on each of the considered dimensions). From there, the process is iterative and aims at creating homogeneous clusters: for each case, the distance to the centroid of each cluster is computed, and the case is assigned to the closest cluster. K-means then recomputes the centroid of each cluster as the mean of all the cases belonging to it. The process stops when the total distance between the cases and their cluster centroids no longer decreases. It is noteworthy that the final solution at convergence depends on the initial random values of the centroids. For this reason, it is advisable to configure kmeans() to repeat the process several times and keep the best solution (using the nstart argument). The figure below illustrates this process. We will only use the default algorithm with kmeans(), but the function provides several to choose from (Hartigan-Wong, Lloyd, Forgy, and MacQueen) through the algorithm argument; we will come back to this later. The interested reader can inquire about the differences using the references provided in the corresponding R manual page (type ?kmeans in the console).

[Figure: Steps in the k-means algorithm family]
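
As a quick preview of the function we will rely on later, here is a minimal sketch of how the arguments mentioned above fit together in a call to kmeans(). The data used here are the four numeric iris measurements that we also work with later in this section; the object name fit is simply chosen for this sketch.

fit = kmeans(iris[1:4], centers = 3, nstart = 25, algorithm = "Hartigan-Wong")
# centers: the number of clusters (k)
# nstart: number of random initializations; the best solution is kept
# algorithm: "Hartigan-Wong" is the default
fit$centers    # the final centroids
fit$cluster    # the cluster assigned to each case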

In what follows, we will build our own k-means implementation in order to better understand how it works. There is a slight difference in our implementation in that we do not initialize the centroids randomly. Instead, we will assign cases to the clusters randomly, and then compute the centroids. This is equivalent, but easier to program. This implementation is intended only for pedagogical purposes: it lacks several important features, and sometimes converges non-optimally. For proper analyses, please use kmeans() or an implementation from another package available on CRAN.

Setting the centroids

First, we need a function that assigns each observation to a cluster at random. We do this for all the observations at the same time, and return the result as the vector clusters (see line 2). Our function takes two arguments: the number of observations (numrows) and the number of clusters (k).

1  set.random.clusters = function (numrows, k) {
2     clusters = sample(1:k, numrows, replace=T)
3  }
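
As a quick illustrative check (the numbers 10 and 3 are arbitrary), we can randomly assign ten observations to three clusters. Note that, because the last statement of the function is an assignment, the result is returned invisibly, so we store it before printing it:

set.seed(1)
cl = set.random.clusters(10, 3)
cl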

We then create a function that computes the centroids, that is, for each cluster, the mean of its cases on each dimension. Our function takes two arguments: the data frame containing the data to cluster (df) and the current cluster assignments (clusters).

1  compute.centroids = function (df, clusters) {
2      means = tapply(df[,1], clusters, mean)
3      for (i in 2:ncol(df)) {
4          mean.case = tapply(df[,i], clusters, mean)
5          means=rbind(means, mean.case)
6      }
7      centroids = data.frame(t(means))
8      names(centroids) = names(df)
9      centroids
10  }

On line 2, we assign to the vector means the mean values of the attribute in column 1 for each of the clusters. In a loop, we assign to the vector mean.case the mean values of the attribute in column i for each cluster, and append it to means (lines 4 and 5). We then make a data frame from the transpose of the object means and assign it to the object centroids (line 7). We name the columns of this object after the columns of the original data frame (line 8). Finally, we return centroids (line 9).
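
To see the function in action, here is a small illustrative check (purely a sketch): we compute the centroids of the four numeric iris measurements, which we use again later in this section, for a random assignment to three clusters:

set.seed(1)
cl = set.random.clusters(nrow(iris), 3)
compute.centroids(iris[1:4], cl)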

Computing distances to centroids

We then need a function that computes the distance between each case and the centroid of each cluster. This function takes the data and the centroids as arguments. It first creates a blank matrix that will contain the distances (line 2). It then loops over the cases and the clusters (see the code blocks starting on lines 3 and 4) and, for the current case and the current cluster, computes the sum of the squared differences on each dimension, that is, the squared Euclidean distance, and assigns the value to the corresponding cell of the matrix (line 5). It finally returns the matrix of squared Euclidean distances (line 8).

1  euclid.sqrd = function (df, centroids) {
2      distances = matrix(nrow=nrow(df), ncol=nrow(centroids))
3      for (i in 1:nrow(df)) {
4          for (j in 1:nrow(centroids)) {
5              distances[i,j] = sum((df[i,]-centroids[j,])^2)
6          }
7      }
8      distances
9  }
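
Continuing the illustrative example above (using the random clusters cl and their centroids), we can compute the squared distances of each case to each centroid and inspect the first rows; there is one column per cluster:

ctr = compute.centroids(iris[1:4], cl)
d = euclid.sqrd(iris[1:4], ctr)
head(d)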

Computing the closest cluster for each case

We now need a function that, for each case, compares its distance to each of the cluster centroids and selects the cluster for which this value is minimal; in other words, it assigns each case to its closest cluster. For this purpose, on line 2, we simply use the which.min() function, which indicates in which column the minimal value lies, as an argument of the apply() function, which applies a function to each row of its input. We wrap the result in cbind() so that it takes the shape of a single column, and in data.frame() so that a data frame is returned. If the number of distinct clusters returned is less than expected (tested on line 3), we restart the process by assigning cases to random clusters, using the set.random.clusters() function (line 5). In other words, if a cluster ends up empty, we take the precaution of starting over from a random assignment.

1  assign = function (distances) {
2      clusters = data.frame(cbind(c(apply(distances, 1, which.min))))
3      if (nrow(unique(clusters)) < ncol(distances)) {
4          # precaution in case of empty cluster
5          clusters = set.random.clusters(nrow(distances), ncol(distances))
6      }
7      clusters
8  }
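
Still with the illustrative example, we can reassign each case to its closest centroid and count how many cases end up in each cluster (note, in passing, that this function masks R's built-in assign() function for the rest of the session):

cl2 = assign(d)
table(cl2[, 1])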

Tasks performed by the main function

We now have almost everything we need for our basic k-means implementation. We finally need to wrap it all together in a main function, which we call kay.means(). Here we set the initial cluster assignments (line 2) and then iterate over the computation of centroids (line 7), the computation of distances (line 8), and the re-assignment of clusters (line 11). Notice that the code block is contained in a while loop, which stops when the sum of the squared distances of the cases to the centroids, our convergence measure ss (set on line 10), is the same twice in a row (that is, when ss.old is equal to ss); we then output the clustering solution (line 14).

1  kay.means = function (df, k) {
2      clusters = set.random.clusters(nrow(df),k)
3      ss.old = 1e100
4      ss = 1e99
5      while(ss!=ss.old) {
6     
7          centroids = compute.centroids(df, clusters)
8          distances = euclid.sqrd(df, centroids)
9          ss.old=ss
10          ss = sum(distances)
11          clusters = assign(distances)
12      }
13      names(clusters) = "Clusters"
14      clusters
15  }

Let's try this using a very popular dataset, which we already encountered in the previous chapter: the iris dataset, whose observations are 150 iris flowers. The attributes are the species of the flowers and the length and width of their petals and sepals. So, in this case, we know the groups beforehand and are interested in knowing whether k-means can recover them from the other attributes. Let's start by having a look at the correct classification, which is in the 5th column of the dataset.

iris[5]

We will not display the full output here, but you will see in your console that cases 1 to 50 are of species Setosa, cases 51 to 100 are of species Versicolor, and cases 101 to 150 of species Virginica.

We will use our knowledge of the dataset to determine the number of clusters. We select three clusters as we know there are three species. We will talk later about determining the number of clusters when this information is not available.

Internal validation

Our goal now is to assess the quality of our clustering solution. We are lucky, as we already have the right answer (which is not always the case). In order to check how well our clustering did, we first append a column containing the clustering solution of our implementation of k-means to the iris dataset, and then observe the correspondence between the clusters and the species by creating a cross-tabulation of the data.

set.seed(1)
irisClust = cbind(iris, kay.means(iris[1:4], 3))
tableClust=table(unlist(irisClust[5]), unlist(irisClust[6]))
tableClust

The output is provided here:

[Output: cross-tabulation of species (rows) by cluster (columns)]

Our implementation of k-means did a good job on this dataset: flowers of the Setosa species are classified in cluster 3, flowers of the Versicolor species are almost all classified in cluster 2, while flowers of the Virginica species are mostly classified in cluster 1, with about a third misclassified into cluster 2. As we have information about the correct solution, we can compute indices of the correctness of our cluster solution. For instance, we can compute Cohen's kappa, which assesses the agreement of our clustering solution with the correct one; see the paper A coefficient of agreement for nominal scales, by Cohen (1960), for a description of the measure. The closer its value is to 1, the better the agreement. This index is available in the psych package, which needs installing and loading before we proceed with the analysis.
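
Before we do, it may help to see what the unweighted statistic computes. The following function is only an illustrative sketch of Cohen's formula (observed agreement corrected for the agreement expected by chance); it assumes a confusion table whose rows and columns use the same categories in the same order, which is precisely why we recode the species below before calling cohen.kappa():

manual.kappa = function(tab) {
    tab = tab / sum(tab)                          # cell proportions
    p.observed = sum(diag(tab))                   # observed agreement
    p.chance = sum(rowSums(tab) * colSums(tab))   # agreement expected by chance
    (p.observed - p.chance) / (1 - p.chance)
}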

install.packages("psych"); library(psych)

The index requires that the categories are given the same names in both our clustering solution and the original classification. We therefore need to recode our data before we proceed:

irisClust = cbind(irisClust, rep(0,nrow(irisClust)))
names(irisClust)[7] = "Species.recode"
irisClust[7][irisClust[5]=="setosa"] = 3
irisClust[7][irisClust[5]=="versicolor"] = 2
irisClust[7][irisClust[5]=="virginica"] = 1

We now have both our clustering solution (in column 6) and the correct solution (in column 7) in the same format. We can now apply Cohen's kappa to our data:

kappa = cohen.kappa(cbind(irisClust[6], irisClust[7]))
kappa

The value of Cohen's kappa, 0.83 for the unweighted kappa (which we rely upon), shows good agreement, whereas the weighted value shows even better agreement:

Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha)

[Output: Cohen's kappa and weighted kappa estimates with their lower and upper confidence boundaries]

We can also compute the Rand index, which is another measure of agreement between two partitions. It is available in the flexclust package; please install and load it before you proceed:

install.packages("flexclust"); library(flexclust)

Here we can simply reuse the cross-tabulation we produced earlier:

randIndex(tableClust)

The value of the index is .716, also indicating good agreement (note that, by default, randIndex() returns the Rand index corrected for chance agreement, also known as the adjusted Rand index). Try using the kay.means() function with a different seed, such as 3 for instance, and notice the differences.
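
A possible way to run that experiment is sketched below (the object names clusters3 and tableClust3 are simply chosen for this sketch, and the agreement you obtain will differ from the value above):

set.seed(3)
clusters3 = kay.means(iris[1:4], 3)
tableClust3 = table(unlist(iris[5]), unlist(clusters3))
randIndex(tableClust3)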

While clustering solutions obtained with k-means always depend on the initial centroid values, our implementation is even more sensitive to them. It is always a good idea to perform a cluster analysis with different seeds, which we will do later using the nstart argument of kmeans(). Nevertheless, our custom function fulfilled its purpose, which is to give the reader a sense of how k-means works.
