Using k-means with public datasets

In what follows, we are going to learn more about partition clustering with k-means while exploring a dataset from the cluster.datasets package. This package contains datasets that were published, with examples of analyses, in the book Clustering Algorithms by Hartigan (1975). So let's start by installing this package on your machine and loading it:

install.packages("cluster.datasets")
library(cluster.datasets)

Understanding the data with the all.us.city.crime.1970 dataset

We will first focus on getting to know the data, scaling the data to a common metric, and cluster interpretability. Our first exploration concerns crime rates in different US cities in 1970. The all.us.city.crime.1970 dataset affords such an investigation:

data(all.us.city.crime.1970)
crime = all.us.city.crime.1970

Let's investigate the attributes in the dataset:

ncol(crime)
names(crime)
summary(crime)
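As a quick aside (not part of the original walkthrough), the overall crime figure discussed in the next paragraph can be checked directly from the data, assuming, as stated below, that attributes 5 to 10 hold the crime rates per 100,000 residents:

# Mean rate of each crime attribute across the 24 cities, and their sum
colMeans(crime[5:10])
sum(colMeans(crime[5:10]))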

There are 10 attributes. A look at the R manual page (type ?all.us.city.crime.1970) allows us to understand what these variables are about. Most of them are fairly self-explanatory given their names, so we will not comment on them further here.

Looking at the descriptive statistics, one can notice that a considerable amount of crime was recorded in the 24 cities for which data is available in this dataset: summing over murder, rape, robbery, assault, burglary, and car.theft, around 2,500 crimes took place per 100,000 residents, which means that about 2.5 percent of the population was the victim of a crime that year (assuming that each person could only be the victim of one crime). It might be interesting to know whether cities differ with regard to the crimes committed there. We will manually explore several clustering solutions, considering only the dimensions related to crime, that is, attributes 5 to 10. Before we run kmeans(), let's have a look at the relationships between these attributes.

plot(crime[5:10])

The resulting image is not displayed here, because it will be updated a few lines below. As you can see on your screen, there is visibly a strong positive association between the rates of some crimes (such as burglary and rape), and a weaker one for others (such as murder and burglary). Overall, it seems that the more of one type of crime is committed, the more the others are committed as well. We can confirm this intuition by looking at the correlation matrix (rounded to three decimals).

round(cor(crime[5:10]),3)

Here is the output:

The correlation matrix of the six crime attributes

Yet, the relatively modest values of some of the correlations leave room for a degree of crime specialization in some cities.

We will run kmeans() on this dataset with an increasing number of clusters (from 2 to 5), and will examine the solutions visually and concurrently. We will have a detailed look at the output of only the first and last clustering models, at the end. We will let the reader modify the code with regard to the number of clusters. We could have implemented a loop to do this, but we think it is more interesting if you have a look at each solution individually at your own pace. In all our models, we will ask k-means to repeat the procedure 25 times with different random starting centroids (using the nstart argument) in order to be sure to obtain a good clustering solution. We will of course start by standardizing our data, so that no single attribute is given more weight than the others when the distances are computed.

crime.scale = data.frame(scale(crime[5:10]))
set.seed(234)
TwoClusters = kmeans(crime.scale, 2, nstart = 25)
plot(crime[5:10], col = as.factor(TwoClusters$cluster),
   main = "2-cluster solution")
ThreeClusters = kmeans(crime.scale, 3, nstart = 25)
plot(crime[5:10], col = as.factor(ThreeClusters$cluster),
   main = "3-cluster solution")
FourClusters = kmeans(crime.scale, 4, nstart = 25)
plot(crime[5:10], col = as.factor(FourClusters$cluster),
   main = "4-cluster solution")
FiveClusters = kmeans(crime.scale, 5, nstart = 25)
plot(crime[5:10], col = as.factor(FiveClusters$cluster),
   main = "5-cluster solution")


The relationship between several types of crimes and cluster membership for k=2 to k=5

An important aspect of cluster analysis is the interpretation of the clusters. As can be seen in the preceding screenshot, the interpretation of the clusters in the 2-cluster solution is quite straightforward: cities with low criminality make up the black cluster, whereas the red cluster is composed of cities with higher criminality.

The pattern is more complex in the model with three clusters. At first sight, it seems that the clusters correspond to low, average, and high criminality. But closer inspection contradicts this: burglary and car.theft can be high in the green cluster, rape and murder can be low to average, while assault and robbery are low. The black cluster seems to contain cities with average criminality. But looking more closely, murder can be higher in this cluster than in the red one; this is true to a lesser extent for rape and car.theft. We could consider this cluster as representing cities with a high murder rate and an average rate of other crimes. The red cluster is the most dispersed of the three, yet it is the easiest to interpret: cities in this cluster have average to high values on all the crime dimensions studied here. The solutions with four and five clusters are even more difficult to interpret. It is usually advised to consider a number of clusters that is manageable for interpretation (not hundreds of clusters) and that is meaningful, even if a larger number of clusters explains the data better.
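A possible complement to this visual interpretation (a small optional check, not part of the original code) is to tabulate the average of each crime attribute per cluster on the original scale, here for the three-cluster solution:

# Mean crime rates (original units) for each cluster of the 3-cluster solution
aggregate(crime[5:10], by = list(cluster = ThreeClusters$cluster), FUN = mean)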

Let's now examine the textual output of R for our first (TwoClusters) solution.

TwoClusters

Here is the output:

K-means clustering with 2 clusters of sizes 11, 13
Cluster means:
      murder       rape    robbery    assault   burglary  car.theft
1 -0.9128346 -0.6991864 -0.8438639 -0.8328348 -0.5708682 -0.7166146
2  0.7723985  0.5916192  0.7140387  0.7047064  0.4830424  0.6063662
Clustering vector:
 [1] 1 2 1 1 2 1 2 2 2 2 2 2 1 1 2 2 1 1 1 2 2 1 1 2
Within cluster sum of squares by cluster:
[1] 18.39421 47.16265
 (between_SS / total_SS =  52.5 %)
Available components:
[1] "cluster"      "centers"    "totss"     "withinss"    "tot.withinss"
[6] "betweenss"    "size"       "iter"      "ifault"      

The Cluster means section reports the centroids at the final iteration of the algorithm, usually once convergence has been achieved. This information confirms our visual interpretation of the clustering solution: one cluster has high means on all crime dimensions, whereas the other has low means. This section is directly accessible as data by typing TwoClusters$centers.
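Because these centroids are expressed in standardized units, a small optional step (a sketch, assuming the scaling performed earlier with scale()) is to convert them back to the original units by multiplying each column by its standard deviation and adding back its mean:

# Express the centroids in the original units (crimes per 100,000 residents)
round(sweep(sweep(TwoClusters$centers, 2, apply(crime[5:10], 2, sd), "*"),
            2, colMeans(crime[5:10]), "+"), 1)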

The Clustering vector reports the cluster membership of each observation: for instance, the first observation is part of cluster 1 (low criminality), whereas the last is part of cluster 2 (higher criminality). This section is directly accessible as data by typing TwoClusters$cluster.
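If you want to see which cities fall into each cluster, one possibility (assuming, which is not stated above, that the first column of the dataset holds the city names) is:

# List the cities by cluster; the first column is assumed to contain city names
split(as.character(crime[[1]]), TwoClusters$cluster)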

The Within cluster sum of squares by cluster section reports, for each cluster, the sum of the squared distances between the data points and their centroid. We can also see the ratio of the between-cluster sum of squares (BSS) to the total sum of squares (TSS). The BSS is the sum, over all data points, of the squared distance between the centroid of the point's cluster and the overall mean. The TSS is the sum of the squared distances of the data points to the overall mean.

We can also see (under Available components) that we can examine other values we have not yet seen: totss is the total sum of squares, tot.withinss is the total of the within-cluster sums of squares, betweenss is the between-cluster sum of squares, size is the number of cases classified in each of the clusters, iter is the number of iterations required for convergence, and ifault signals warnings and problems (with a value of 0 if there is no issue).
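As a quick check of how these quantities relate (the total sum of squares decomposes into the between-cluster part plus the total within-cluster part), we can compare the components directly:

# totss should equal betweenss plus tot.withinss
TwoClusters$totss
TwoClusters$betweenss + TwoClusters$tot.withinss
# The ratio reported in parentheses in the output above
TwoClusters$betweenss / TwoClusters$totss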

We are now going to plot the differences in the BSS/TSS value between successive clustering solutions. Basically, this value shows how much of the variation in the data is explained by the clustering solution, as it divides the BSS by the TSS. We assign the ratio differences to a vector and then use the plot() function. The first value is the ratio for the TwoClusters model itself.

v = rep(0, 4)
# BSS/TSS ratio for the 2-cluster solution ([[6]] is betweenss, [[3]] is totss)
v[1] = TwoClusters[[6]] / TwoClusters[[3]]
# Increase in the ratio when adding one more cluster each time
v[2] = (ThreeClusters[[6]] / ThreeClusters[[3]]) - v[1]
v[3] = (FourClusters[[6]] / FourClusters[[3]]) - sum(v[1:2])
v[4] = (FiveClusters[[6]] / FiveClusters[[3]]) - sum(v[1:3])
plot(v, xlab = "Number of clusters",
   ylab = "Ratio difference")

Following is the output:


Differences between the between and total sum of squares among our models

We can see in the preceding graph that the ratio is around 0.5 for the TwoClusters solution, and that it does not increase much when more clusters are added. The TwoClusters solution should therefore be preferred. Moreover, we have seen that solutions with more than two clusters are difficult to interpret. A BSS/TSS ratio of 1 is the best possible value, yet it will seldom be reached.
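If you prefer reading the raw BSS/TSS ratios rather than the plotted differences, they can be recovered as the cumulative sums of the vector v computed above:

# BSS/TSS ratios for the 2-, 3-, 4-, and 5-cluster solutions
round(cumsum(v), 3)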

Finding the best number of clusters in the life.expectancy.1971 dataset

We will now use the life.expectancy.1971 dataset, which deals with life expectancy in several countries around 1971. It includes 10 attributes: the country where the data was collected, the year of data collection, and the remaining life expectancy of male and female individuals at ages 0, 25, 50, and 75. As with the previous dataset, this one does not specify the membership of our cases to categories. So again, we will have to decide on the number of clusters ourselves. This time, we will examine how to do so more precisely, and we will create a function for this purpose. Before we do that, let's discover the dataset we will use.

Let's start by loading and examining the dataset.

data(life.expectancy.1971)
life.expectancy.1971

A partial view of the output is provided below:

The first rows of the life.expectancy.1971 dataset

Even without computing the means and standard deviations of the variables, we can notice that there is quite some variation in life expectancy (please refer to the complete output on your screen as well). A first observation, which is broadly documented, is that women have a longer remaining life expectancy than men at all ages. One country stands out in this list: in Madagascar, at the time of data collection, women apparently did not have a longer life expectancy than men in their young and old years. Further, the mean life expectancy at birth there was only 38 years for both women and men. This is also the life expectancy at birth of females in Cameroon at that time, whereas males there were expected to live even a little less (34 years).

Looking at the table, we can notice that Trinidad and the US are entered several times, as data collection was carried out more than once. We will therefore discard case 23 (the second entry for Trinidad), as well as cases 24 and 27 (US, data collected in 1966 and 1967), because cases 25 and 26 are more specific: they provide estimates for White and Nonwhite individuals separately. Let's create a new dataset without these cases before we proceed with the cluster analysis.

life = life.expectancy.1971[-c(23,24,27),]
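As a quick verification (an optional check, not in the original code), we can display the country and year of the discarded rows, which are stored in the first two columns, and confirm that three rows were removed:

# Country and year of the discarded entries
life.expectancy.1971[c(23, 24, 27), 1:2]
# The new dataset should have three fewer rows
nrow(life.expectancy.1971); nrow(life)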

Here we will scale the data. The importance of scaling data has been discussed in the first section of this chapter. We also add some attributes to the dataset, corresponding to the ratio of male to female life expectancy at each age, as the difference between males and females would otherwise be lost in the scaling (all means will be 0).

life.temp = cbind(life, life$m0/life$f0, life$m25/life$f25,
   life$m50/life$f50, life$m75/life$f75)

If you run this, you will notice an error. It turns out that the attribute f50 is composed of strings instead of numeric values (type mode(life$f50) to check this). This is the type of problem you might encounter when dealing with data you have not prepared yourself (and sometimes even with your own data). The solution is obviously to convert the attribute to numeric values before computing the ratios.

# Convert the character values in f50 to numeric
life$f50 = unlist(lapply(life$f50, as.numeric))

We can now repeat our assignment to life.temp with a successful result, and scale the data frame (omitting columns 1 and 2: the name of the country and the year of data collection). We first convert to a data frame to get rid of the information about means and standard deviations that is contained in the object returned by scale(); we then convert to a matrix again.
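For completeness, here is the same cbind() call as before, which now runs without an error:

# Recompute life.temp now that f50 is numeric
life.temp = cbind(life, life$m0/life$f0, life$m25/life$f25,
   life$m50/life$f50, life$m75/life$f75)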

life.scaled = as.matrix(data.frame(scale(life.temp[-c(1,2)])))
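An optional sanity check is that every scaled attribute now has a mean of 0 and a standard deviation of 1:

# Each column of the scaled matrix should have mean 0 and standard deviation 1
round(colMeans(life.scaled), 3)
round(apply(life.scaled, 2, sd), 3)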

We continue discussing this data next.

External validation

When examining the iris dataset, we had the correct solution regarding the number of clusters and the classification of cases. This is not the case here: we cannot tell, before running the analyses, how many groups there are in our data. We will therefore rely on computational trickery to discover them; cluster analysis will be performed iteratively, and the clustering solutions will be compared using several indexes in order to determine the ideal number of clusters. More information about such indexes can be found in the paper Experiments for the number of clusters in k-means, by Chiang and Mirkin (2007). Here we rely on the NbClust() function from the NbClust package, which we install and load:

install.packages("NbClust"); library(NbClust)

We simply call the function specifying the data and clustering algorithm to be used. By default, the function will perform clustering using the Euclidean distance and compute all available indexes. The reader is advised to consult the documentation for more information about customization.

NbClust(life.scaled, method = "kmeans")

Part of the output is provided below. This shows that three clusters is the most appropriate solution.

******************************************************************* 
* Among all indices:                                                
* 3 proposed 2 as the best number of clusters 
* 10 proposed 3 as the best number of clusters 
* 1 proposed 8 as the best number of clusters 
* 1 proposed 12 as the best number of clusters 
* 1 proposed 13 as the best number of clusters 
* 7 proposed 15 as the best number of clusters 
                   ***** Conclusion *****                            
* According to the majority rule, the best number of clusters is 3
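Beyond this printed summary, the object returned by NbClust() also contains, according to the package documentation, a Best.nc component with the number of clusters suggested by each index, and a Best.partition component with the cluster memberships under the majority-rule solution. A minimal sketch of how to store and inspect them:

# Store the result so that its components can be inspected
nb = NbClust(life.scaled, method = "kmeans")
nb$Best.nc          # best number of clusters according to each index
nb$Best.partition   # cluster membership for the recommended 3-cluster solution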