In the previous chapter, we concentrated on how to compress the information found in a number of continuous variables into a smaller set of numbers, but these statistical methods are somewhat limited when we are dealing with categorical data, for example, when analyzing surveys.
Although some methods try to convert discrete variables into numeric ones, such as by using a number of dummy or indicator variables, in most cases it is better to think about our research design goals instead of trying to force previously learned methods onto the analysis.
We can replace a categorical variable with a number of dummy variables by creating a new variable for each label of the original discrete variable, then assigning 1 to the related column and 0 to all the others. Such values can be used as numeric variables in statistical analysis, especially with regression models.
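As a minimal sketch of this technique (using the gear column of mtcars as if it were a plain categorical variable), the model.matrix function can generate the indicator columns for us:

```r
## treat the number of gears as a categorical variable
f <- factor(mtcars$gear)
## one indicator column per level; "+ 0" drops the intercept
## so that all three dummies are kept
dummies <- model.matrix(~ f + 0)
head(dummies, 3)
```

Each row now contains a single 1 in the column of its original label, and these columns can enter a regression model directly.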
When we analyze a sample and target population via categorical variables, usually we are not interested in individual cases, but instead in similar elements and groups. Similar elements can be defined as rows in a dataset with similar values in the columns.
In this chapter, we will discuss different supervised and unsupervised ways of identifying similar cases in a dataset.
Clustering is an unsupervised data analysis method that is used in diverse fields, such as pattern recognition, the social sciences, and pharmacy. The aim of cluster analysis is to form homogeneous subgroups, called clusters, where the objects in the same cluster are similar, while the clusters differ from each other.
Cluster analysis is one of the most well-known and popular pattern recognition methods; thus, there are many clustering models and algorithms analyzing the distribution, density, possible center points, and so on in the dataset. In this section, we are going to examine some hierarchical clustering methods.
Hierarchical clustering

Hierarchical clustering can be either agglomerative or divisive. In agglomerative methods, every case starts out as an individual cluster, then the closest clusters are merged together iteratively, until they finally merge into one single cluster that includes all elements of the original dataset. The biggest problem with this approach is that the distances between the clusters have to be recalculated at each iteration, which makes it extremely slow on large data; I would not suggest trying to run the following commands on the hflights dataset.

Divisive methods, on the other hand, take a top-down approach: they start from a single cluster, which is then iteratively divided into smaller groups until they are all singletons.
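For a quick divisive counterpart to the agglomerative functions used below, the diana function of the cluster package (a package we will use later in this chapter anyway) implements exactly this top-down logic; a short sketch:

```r
library(cluster)
## DIvisive ANAlysis clustering: starts from one all-inclusive
## cluster and splits the most heterogeneous group at each step
dv <- diana(mtcars)
plot(dv, which.plots = 2)  # show the dendrogram (1 is the banner)
```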
The stats package contains the hclust function for hierarchical clustering, which takes a distance matrix as input. To see how it works, let's use the mtcars dataset that we already analyzed in Chapter 3, Filtering and Summarizing Data and Chapter 9, From Big to Smaller Data. The dist function is also familiar from the latter chapter:
> d <- dist(mtcars)
> h <- hclust(d)
> h

Call:
hclust(d = d)

Cluster method   : complete 
Distance         : euclidean 
Number of objects: 32 
Well, this output is way too brief, showing only that our distance matrix included 32 elements and the clustering method used. A visual representation of the results will be a lot more useful for such a small dataset:
> plot(h)
By plotting this hclust object, we obtained a dendrogram, which shows how the clusters were formed. It can be useful for determining the number of clusters, although in datasets with numerous cases it becomes difficult to interpret. A horizontal line can be drawn at any given height on the y axis, so that the number n of its intersections with the tree provides an n-cluster solution.
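To illustrate this rule with the h object created above, cutree can also cut the tree at a given height instead of asking for a fixed number of groups (the height of 200 used here is an arbitrary example):

```r
## groups defined by a horizontal cut at height 200 on the y axis
table(cutree(h, h = 200))
## the cutting line itself can be drawn on an open plot(h) device:
## abline(h = 200, col = 'blue')
```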
R can provide very convenient ways of visualizing the clusters on the dendrogram. In the following plot, the red boxes show the cluster membership of a three-cluster solution on top of the previous plot:
> plot(h)
> rect.hclust(h, k = 3, border = "red")
Although this graph looks nice and it is extremely useful to have similar elements grouped together, it becomes hard to see through for bigger datasets. Instead, we might rather be interested in the actual cluster membership represented as a vector:
> (cn <- cutree(h, k = 3))
          Mazda RX4       Mazda RX4 Wag          Datsun 710 
                  1                   1                   1 
     Hornet 4 Drive   Hornet Sportabout             Valiant 
                  2                   3                   2 
         Duster 360           Merc 240D            Merc 230 
                  3                   1                   1 
           Merc 280           Merc 280C          Merc 450SE 
                  1                   1                   2 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
                  2                   2                   3 
Lincoln Continental   Chrysler Imperial            Fiat 128 
                  3                   3                   1 
        Honda Civic      Toyota Corolla       Toyota Corona 
                  1                   1                   1 
   Dodge Challenger         AMC Javelin          Camaro Z28 
                  2                   2                   3 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2 
                  3                   1                   1 
       Lotus Europa      Ford Pantera L        Ferrari Dino 
                  1                   3                   1 
      Maserati Bora          Volvo 142E 
                  3                   1 
And the number of elements in the resulting clusters as a frequency table:
> table(cn)
cn
 1  2  3 
16  7  9 
It seems that Cluster 1, the third cluster on the preceding plot, has the most elements. Can you guess how this group differs from the other two clusters? Well, those readers who are familiar with car names might be able to guess the answer, but let's see what the numbers actually show:
> round(aggregate(mtcars, FUN = mean, by = list(cn)), 1)
  Group.1  mpg cyl  disp    hp drat  wt qsec  vs  am gear carb
1       1 24.5 4.6 122.3  96.9  4.0 2.5 18.5 0.8 0.7  4.1  2.4
2       2 17.0 7.4 276.1 150.7  3.0 3.6 18.1 0.3 0.0  3.0  2.1
3       3 14.6 8.0 388.2 232.1  3.3 4.2 16.4 0.0 0.2  3.4  4.0
There's a really spectacular difference in the average performance and gas consumption between the clusters! What about the standard deviation inside the groups?
> round(aggregate(mtcars, FUN = sd, by = list(cn)), 1)
  Group.1 mpg cyl disp   hp drat  wt qsec  vs  am gear carb
1       1 5.0   1 34.6 31.0  0.3 0.6  1.8 0.4 0.5  0.5  1.5
2       2 2.2   1 30.2 32.5  0.2 0.3  1.2 0.5 0.0  0.0  0.9
3       3 3.1   0 58.1 49.4  0.4 0.9  1.3 0.0 0.4  0.9  1.7
These values are pretty low compared to the standard deviations in the original dataset:
> round(sapply(mtcars, sd), 1)
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
  6.0   1.8 123.9  68.6   0.5   1.0   1.8   0.5   0.5   0.7   1.6 
And the same applies when compared to the standard deviation between the groups as well:
> round(apply(
+   aggregate(mtcars, FUN = mean, by = list(cn)),
+   2, sd), 1)
Group.1     mpg     cyl    disp      hp    drat      wt    qsec 
    1.0     5.1     1.8   133.5    68.1     0.5     0.8     1.1 
     vs      am    gear    carb 
    0.4     0.4     0.6     1.0 
This means that we achieved our original goal of identifying similar elements of our data and organizing them into groups that differ from each other. But why did we split the original data into exactly three artificially defined groups? Why not two, four, or even more?
The NbClust package offers a very convenient way to do some exploratory data analysis on our data before running the actual cluster analysis. The main function of the package can compute 30 different indices, all designed to determine the ideal number of groups.
After loading the package, let's start with a visual method representing the possible number of clusters in our data on a knee plot, which might be familiar from Chapter 9, From Big to Smaller Data, where you can also find some more information on the elbow rule used in the following:
> library(NbClust)
> NbClust(mtcars, method = 'complete', index = 'dindex')
In the preceding plots, we would traditionally look for the elbow, but the second differences plot on the right might be more straightforward for most readers: there, we are interested in where the most significant peak can be found, which suggests that choosing three groups would be ideal when clustering the mtcars dataset.
Unfortunately, running all NbClust methods fails on such a small dataset. Thus, for demonstration purposes, we now run only a few standard methods and filter the results for the suggested number of clusters via the related list element:
> NbClust(mtcars, method = 'complete', index = 'hartigan')$Best.nc
All 32 observations were used.

Number_clusters     Value_Index 
         3.0000         34.1696 

> NbClust(mtcars, method = 'complete', index = 'kl')$Best.nc
All 32 observations were used.

Number_clusters     Value_Index 
         3.0000          6.8235 
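If we would rather poll a handful of indices in one go without resorting to index = 'all', something like the following sketch works as well (the index names are listed in ?NbClust; note that some of the remaining indices may still fail on this small dataset):

```r
library(NbClust)
## collect the suggested number of clusters from several indices
indices <- c('hartigan', 'kl', 'cindex', 'db', 'silhouette')
sapply(indices, function(i)
    NbClust(mtcars, method = 'complete', index = i)$Best.nc[1])
```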
Both the Hartigan and the Krzanowski-Lai indices suggest sticking to three clusters. Let's look at the iris dataset as well, which includes a lot more cases with fewer numeric columns, so we can run all available methods:
> NbClust(iris[, -5], method = 'complete', index = 'all')$Best.nc[1,]
All 150 observations were used. 
*******************************************************************
* Among all indices:                                                
* 2 proposed 2 as the best number of clusters 
* 13 proposed 3 as the best number of clusters 
* 5 proposed 4 as the best number of clusters 
* 1 proposed 6 as the best number of clusters 
* 2 proposed 15 as the best number of clusters 

                  ***** Conclusion *****                            
* According to the majority rule, the best number of clusters is 3 
*******************************************************************
        KL         CH   Hartigan        CCC      Scott    Marriot 
         4          4          3          3          3          3 
    TrCovW     TraceW   Friedman      Rubin     Cindex         DB 
         3          3          4          6          3          3 
Silhouette       Duda   PseudoT2      Beale  Ratkowsky       Ball 
         2          4          4          3          3          3 
PtBiserial       Frey    McClain       Dunn     Hubert    SDindex 
         3          1          2         15          0          3 
    Dindex       SDbw 
         0         15 
The output summarizes that the ideal number of clusters is three, as 13 methods returned that number, five further methods suggested four clusters, and the other cluster numbers were proposed by only a much smaller number of methods.
These methods are not only useful with the previously discussed hierarchical clustering, but are generally used with k-means clustering as well, where the number of clusters has to be defined before running the analysis, unlike with the hierarchical method, where we cut the dendrogram after the heavy computations have already been run.
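The elbow rule mentioned earlier can also be computed by hand for k-means: run the algorithm for a range of cluster numbers and look for the point where the total within-cluster sum of squares stops dropping sharply (the range of 1 to 10 and the nstart value below are arbitrary choices):

```r
set.seed(42)  # k-means starts from random centers
wss <- sapply(1:10, function(k)
    kmeans(mtcars, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = 'b',
     xlab = 'Number of clusters', ylab = 'Total within-cluster SS')
```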
K-means clustering is a non-hierarchical method first described by MacQueen in 1967. Its big advantage over hierarchical clustering is its great performance.
In a nutshell, the algorithm iterates the following steps:

1. Select k initial cluster centers.
2. Assign each case to the closest cluster center.
3. Recompute the centers as the means of the cases assigned to them.
4. Repeat steps 2 and 3 until the cluster assignments no longer change.
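These steps can be sketched in a few lines of R; this is only a toy illustration of the iteration (assuming numeric input, a fixed number of iterations, and no cluster ever becoming empty), not a replacement for stats::kmeans:

```r
naive_kmeans <- function(X, k, iter = 10) {
    ## step 1: pick k random rows as the initial centers
    centers <- X[sample(nrow(X), k), , drop = FALSE]
    for (i in seq_len(iter)) {
        ## step 2: assign each row to its closest center
        d  <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
        cl <- apply(d, 1, which.min)
        ## step 3: recompute the centers as the group means
        centers <- apply(X, 2, function(col) tapply(col, cl, mean))
    }
    cl
}
```

For example, naive_kmeans(mtcars, 3) returns a membership vector comparable to the ones seen in this chapter.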
We are going to use the kmeans function from the stats package. As k-means clustering requires a prior decision on the number of clusters, we can either use the NbClust function described previously, or we can come up with an arbitrary number that fits the goals of the analysis.

In line with the optimal cluster number determined in the previous section, we will stick to three groups, where the within-cluster sum of squares ceases to drop significantly:
> (k <- kmeans(mtcars, 3))
K-means clustering with 3 clusters of sizes 16, 7, 9

Cluster means:
       mpg      cyl     disp       hp     drat       wt     qsec
1 24.50000 4.625000 122.2937  96.8750 4.002500 2.518000 18.54312
2 17.01429 7.428571 276.0571 150.7143 2.994286 3.601429 18.11857
3 14.64444 8.000000 388.2222 232.1111 3.343333 4.161556 16.40444
         vs        am     gear     carb
1 0.7500000 0.6875000 4.125000 2.437500
2 0.2857143 0.0000000 3.000000 2.142857
3 0.0000000 0.2222222 3.444444 4.000000

Clustering vector:
          Mazda RX4       Mazda RX4 Wag          Datsun 710 
                  1                   1                   1 
     Hornet 4 Drive   Hornet Sportabout             Valiant 
                  2                   3                   2 
         Duster 360           Merc 240D            Merc 230 
                  3                   1                   1 
           Merc 280           Merc 280C          Merc 450SE 
                  1                   1                   2 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
                  2                   2                   3 
Lincoln Continental   Chrysler Imperial            Fiat 128 
                  3                   3                   1 
        Honda Civic      Toyota Corolla       Toyota Corona 
                  1                   1                   1 
   Dodge Challenger         AMC Javelin          Camaro Z28 
                  2                   2                   3 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2 
                  3                   1                   1 
       Lotus Europa      Ford Pantera L        Ferrari Dino 
                  1                   3                   1 
      Maserati Bora          Volvo 142E 
                  3                   1 

Within cluster sum of squares by cluster:
[1] 32838.00 11846.09 46659.32
 (between_SS / total_SS =  85.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"    
[5] "tot.withinss" "betweenss"    "size"         "iter"        
[9] "ifault"      
The cluster means show some really important characteristics for each cluster, which we generated manually for the hierarchical clusters in the previous section. We can see that, in the first cluster, the cars have high mpg (low gas consumption), on average four cylinders (in contrast to six or eight), rather low performance and so on. The output also automatically reveals the actual cluster numbers.
Let's compare these to the clusters defined by the hierarchical method:
> all(cn == k$cluster)
[1] TRUE
The results seem to be pretty stable, right?
The cluster numbers have no meaning in themselves and their order is arbitrary; in other words, cluster membership is a nominal variable. Based on this, the preceding R command might return FALSE instead of TRUE when the cluster numbers happen to be allocated in a different order, but comparing the actual cluster memberships will verify that we have found the very same groups. Run, for example, cbind(cn, k$cluster) to generate a table including both cluster memberships.
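A label-order-independent way of doing this comparison (using the cn and k objects created above) is cross-tabulating the two membership vectors: identical partitions produce exactly one non-zero cell in each row and column:

```r
## cross-table of the hierarchical and the k-means memberships
table(hierarchical = cn, kmeans = k$cluster)
## a single agreement score, the adjusted Rand index, is available
## for example in the (optional) flexclust package:
## flexclust::randIndex(cn, k$cluster)
```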
Plotting these clusters is also a great way to understand groupings. To this end, we will use the clusplot function from the cluster package. For easier interpretation, this function reduces the number of dimensions to two, in a similar way to when we conduct a PCA or MDS (described in Chapter 9, From Big to Smaller Data):
> library(cluster)
> clusplot(mtcars, k$cluster, color = TRUE, shade = TRUE, labels = 2)
As you can see, after the dimension reduction, the two components explain 84.17 percent of the variance, so this small information loss is a great trade-off in favor of an easier understanding of the clusters.
Visualizing the relative density of the ellipses with the shade parameter can also help us realize how similar the elements of the same group are, and we used the labels argument to show both the point and the cluster labels. Be sure to stick to the default of 0 (no labels) or 4 (only ellipse labels) when visualizing a large number of elements.