Chapter 10. Classification and Clustering

In the previous chapter, we concentrated on how to compress the information found in a number of continuous variables into a smaller set of numbers, but these statistical methods are of limited use when we are dealing with categorical data, for example, when analyzing surveys.

Although some methods try to convert discrete variables into numeric ones, for example by using a number of dummy or indicator variables, in most cases it is simply better to think about our research design goals instead of forcing previously learned methods onto the analysis.

Note

We can replace a categorical variable with a number of dummy variables by creating a new variable for each level of the original discrete variable, and then assigning 1 to the related column and 0 to all the others. Such values can be used as numeric variables in statistical analyses, especially in regression models.
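As a minimal illustration in base R (using the Species factor of the built-in iris dataset as a stand-in for any categorical variable), the model.matrix function can generate such indicator columns:

> dummies <- model.matrix(~ Species - 1, data = iris)   # one 0/1 column per level
> head(cbind(iris['Species'], dummies))                 # compare with the original factor

Regression functions such as lm do this conversion automatically behind the scenes when a factor is included in the model formula.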

When we analyze a sample or target population via categorical variables, we are usually not interested in individual cases, but rather in similar elements and groups. Similar elements can be defined as rows of a dataset with similar values in their columns.

In this chapter, we will discuss different supervised and unsupervised ways to identify similar cases in a dataset, such as:

  • Hierarchical clustering
  • K-means clustering
  • Some machine learning algorithms
  • Latent class model
  • Discriminant analysis
  • Logistic regression

Cluster analysis

Clustering is an unsupervised data analysis method that is used in diverse fields, such as pattern recognition, the social sciences, and pharmacy. The aim of cluster analysis is to form homogeneous subgroups, called clusters, where the objects in a cluster are similar to each other, and the clusters themselves differ from one another.

Hierarchical clustering

Cluster analysis is one of the most well-known and popular pattern recognition methods; thus, there are many clustering models and algorithms that analyze the distribution, density, possible center points, and so on of a dataset. In this section, we are going to examine some hierarchical clustering methods.

Hierarchical clustering can be either agglomerative or divisive. In agglomerative methods, every case starts out as an individual cluster; the closest clusters are then merged iteratively until they end up in one single cluster that includes all the elements of the original dataset. The biggest problem with this approach is that the distances between clusters have to be recalculated at each iteration, which makes it extremely slow on large data; I would not suggest running the following commands on the hflights dataset.

Divisive methods, on the other hand, take a top-down approach: they start from a single cluster containing all cases, which is then iteratively split into smaller groups until every case ends up as a singleton.
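Divisive clustering is less commonly used, but it is readily available in R as well, for example via the diana function of the cluster package. A minimal sketch on the mtcars dataset (which the rest of this section relies on as well) might look like this:

> library(cluster)
> dv <- diana(mtcars)            # DIvisive ANAlysis clustering
> plot(dv, which.plots = 2)      # dendrogram of the divisive solution
> cutree(as.hclust(dv), k = 3)   # extract cluster memberships, just like with hclust

The resulting object can thus be handled much like the agglomerative results discussed below.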

The stats package contains the hclust function for hierarchical clustering that takes a distance matrix as an input. To see how it works, let's use the mtcars dataset that we already analyzed in Chapter 3, Filtering and Summarizing Data and Chapter 9, From Big to Smaller Data. The dist function is also familiar from the latter chapter:

> d <- dist(mtcars)
> h <- hclust(d)
> h

Call:
hclust(d = d)

Cluster method   : complete 
Distance         : euclidean 
Number of objects: 32

Well, this output is way too brief, showing only that our distance matrix was computed from 32 objects, along with the clustering method and the distance metric used. A visual representation of the results will be a lot more useful for such a small dataset:

> plot(h)
[Figure: dendrogram of the hierarchical clustering of mtcars]

By plotting this hclust object, we obtained a dendrogram, which shows how the clusters were formed. It can be useful for determining the number of clusters, although in datasets with numerous cases it becomes difficult to interpret. A horizontal line can be drawn at any given height on the y axis, and the number of branches it intersects, n, yields an n-cluster solution.
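This rule of thumb has a direct counterpart in code: besides a fixed number of groups, the cutree function (used with its k argument a bit later) also accepts a height via its h argument, which is equivalent to drawing the horizontal line at that value. The 150 below is an arbitrary height chosen purely for illustration:

> cutree(h, h = 150)          # the first h is our hclust object, the second is the height
> table(cutree(h, h = 150))   # number of elements per resulting cluster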

R can provide very convenient ways of visualizing the clusters on the dendrogram. In the following plot, the red boxes show the cluster membership of a three-cluster solution on top of the previous plot:

> plot(h)
> rect.hclust(h, k=3, border = "red")
[Figure: the same dendrogram with red rectangles marking the three-cluster solution]

Although this graph looks nice, and it is extremely useful to have similar elements grouped together, for bigger datasets it becomes hard to read. Instead, we might rather be interested in the actual cluster membership, represented as a vector:

> (cn <- cutree(h, k = 3))
          Mazda RX4       Mazda RX4 Wag          Datsun 710 
                  1                   1                   1 
     Hornet 4 Drive   Hornet Sportabout             Valiant 
                  2                   3                   2 
         Duster 360           Merc 240D            Merc 230 
                  3                   1                   1 
           Merc 280           Merc 280C          Merc 450SE 
                  1                   1                   2 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
                  2                   2                   3 
Lincoln Continental   Chrysler Imperial            Fiat 128 
                  3                   3                   1 
        Honda Civic      Toyota Corolla       Toyota Corona 
                  1                   1                   1 
   Dodge Challenger         AMC Javelin          Camaro Z28 
                  2                   2                   3 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2 
                  3                   1                   1 
       Lotus Europa      Ford Pantera L        Ferrari Dino 
                  1                   3                   1 
      Maserati Bora          Volvo 142E 
                  3                   1

And the number of elements in the resulting clusters as a frequency table:

> table(cn)
 1  2  3 
16  7  9

It seems that Cluster 1, the third cluster on the preceding plot, has the most elements. Can you guess how this group differs from the other two clusters? Well, those readers who are familiar with car names might be able to guess the answer, but let's see what the numbers actually show:

Note

Please note that we use the round function in the following examples to limit the number of decimal places to 1 or 4 in the code output, so that it fits the page width.

> round(aggregate(mtcars, FUN = mean, by = list(cn)), 1)
  Group.1  mpg cyl  disp    hp drat  wt qsec  vs  am gear carb
1       1 24.5 4.6 122.3  96.9  4.0 2.5 18.5 0.8 0.7  4.1  2.4
2       2 17.0 7.4 276.1 150.7  3.0 3.6 18.1 0.3 0.0  3.0  2.1
3       3 14.6 8.0 388.2 232.1  3.3 4.2 16.4 0.0 0.2  3.4  4.0

There's a really spectacular difference in the average performance and gas consumption between the clusters! What about the standard deviation inside the groups?

> round(aggregate(mtcars, FUN = sd, by = list(cn)), 1)
  Group.1 mpg cyl disp   hp drat  wt qsec  vs  am gear carb
1       1 5.0   1 34.6 31.0  0.3 0.6  1.8 0.4 0.5  0.5  1.5
2       2 2.2   1 30.2 32.5  0.2 0.3  1.2 0.5 0.0  0.0  0.9
3       3 3.1   0 58.1 49.4  0.4 0.9  1.3 0.0 0.4  0.9  1.7

These values are pretty low compared to the standard deviations in the original dataset:

> round(sapply(mtcars, sd), 1)
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
  6.0   1.8 123.9  68.6   0.5   1.0   1.8   0.5   0.5   0.7   1.6

And the same applies when they are compared to the standard deviations of the cluster means, that is, to the spread between the groups:

> round(apply(
+   aggregate(mtcars, FUN = mean, by = list(cn)),
+   2, sd), 1)
Group.1     mpg     cyl    disp      hp    drat      wt    qsec 
    1.0     5.1     1.8   133.5    68.1     0.5     0.8     1.1 
     vs      am    gear    carb 
    0.4     0.4     0.6     1.0
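If we want a single number per variable summarizing how well the three clusters separate the cases, one handy option is the proportion of variance in each column explained by the cluster membership, which can be sketched as the R-squared of regressing each variable on the cluster factor (the exact values are left for the reader to verify):

> round(sapply(mtcars, function(x)
+     summary(lm(x ~ factor(cn)))$r.squared), 2)

Values close to 1 indicate that the given variable is almost fully determined by the cluster membership.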

This means that we achieved our original goal of identifying similar elements in our data and organizing them into groups that differ from each other. But why did we split the original data into exactly three artificially defined groups? Why not two, four, or even more?

Determining the ideal number of clusters

The NbClust package offers a very convenient way to do some exploratory data analysis before running the actual cluster analysis. The main function of the package can compute 30 different indices, all designed to determine the ideal number of groups, and it can do so for a variety of clustering methods, such as:

  • Single link
  • Average
  • Complete link
  • McQuitty
  • Centroid (cluster center)
  • Median
  • K-means
  • Ward

After loading the package, let's start with a visual method that represents the possible number of clusters in our data on a knee plot, which might be familiar from Chapter 9, From Big to Smaller Data, where you can also find some more information on the elbow rule applied below:

> library(NbClust)
> NbClust(mtcars, method = 'complete', index = 'dindex')
[Figure: NbClust knee plot showing the Dindex values and their second differences]

In the preceding plots, we traditionally look for the elbow, but the second differences plot on the right might be more straightforward for most readers: there, we are interested in where the most significant peak can be found, which suggests that choosing three groups would be ideal when clustering the mtcars dataset.
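The underlying idea can also be reproduced without NbClust. A minimal sketch of the classic elbow rule computes the total within-cluster sum of squares for a range of cluster numbers via kmeans (covered in detail later in this chapter) and plots the results; the range of 1 to 8 clusters is an arbitrary choice for illustration:

> set.seed(42)
> wss <- sapply(1:8, function(i)
+     kmeans(mtcars, centers = i, nstart = 25)$tot.withinss)
> plot(1:8, wss, type = 'b',
+     xlab = 'Number of clusters', ylab = 'Total within-cluster SS')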

Unfortunately, running NbClust with all the indices fails on such a small dataset. Thus, for demonstration purposes, we now run only a few standard indices and filter the results for the suggested number of clusters via the related list element:

> NbClust(mtcars, method = 'complete', index = 'hartigan')$Best.nc
All 32 observations were used. 
 
Number_clusters     Value_Index 
         3.0000         34.1696 
> NbClust(mtcars, method = 'complete', index = 'kl')$Best.nc
All 32 observations were used. 
 
Number_clusters     Value_Index 
         3.0000          6.8235

Both the Hartigan and Krzanowski-Lai indices suggest sticking to three clusters. Let's look at the iris dataset as well, which includes a lot more cases but fewer numeric columns, so we can run all the available indices:

> NbClust(iris[, -5], method = 'complete', index = 'all')$Best.nc[1,]
All 150 observations were used. 
 
******************************************************************* 
* Among all indices:                                                
* 2 proposed 2 as the best number of clusters 
* 13 proposed 3 as the best number of clusters 
* 5 proposed 4 as the best number of clusters 
* 1 proposed 6 as the best number of clusters 
* 2 proposed 15 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  3 
 
 ******************************************************************* 
        KL         CH   Hartigan        CCC      Scott    Marriot 
         4          4          3          3          3          3 
    TrCovW     TraceW   Friedman      Rubin     Cindex         DB 
         3          3          4          6          3          3 
Silhouette       Duda   PseudoT2      Beale  Ratkowsky       Ball 
         2          4          4          3          3          3 
PtBiserial       Frey    McClain       Dunn     Hubert    SDindex 
         3          1          2         15          0          3 
    Dindex       SDbw 
         0         15

The output summarizes that the ideal number of clusters is three, as 13 of the indices proposed that number, five further indices suggested four clusters, and the remaining cluster numbers were each proposed by only one or two indices.
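The same tally can also be extracted programmatically, which might be handy in scripts, for example by storing the result and tabulating the first row of the Best.nc element. Note that the following sketch re-runs the computation and redraws the index plots:

> res <- NbClust(iris[, -5], method = 'complete', index = 'all')
> table(res$Best.nc[1, ])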

These methods are not only useful with the previously discussed hierarchical clustering, but are generally used with k-means clustering as well, where the number of clusters has to be defined before running the analysis, unlike the hierarchical method, where we cut the dendrogram after the heavy computations have already been run.

K-means clustering

K-means clustering is a non-hierarchical method first described by MacQueen in 1967. Its big advantage over hierarchical clustering is its great performance.
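To get a feel for this difference, a rough benchmark on simulated data might look like the following sketch; the data size is arbitrary, and the actual timings will of course depend on your machine:

> set.seed(42)
> m <- matrix(rnorm(5000 * 10), ncol = 10)  # 5,000 random observations
> system.time(kmeans(m, 3))
> system.time(hclust(dist(m)))  # needs a distance matrix with ~12.5 million entries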

Note

Unlike hierarchical cluster analysis, k-means clustering requires you to determine the number of clusters before running the actual analysis.

In a nutshell, the algorithm runs the following steps (a rough sketch in code follows the list):

  1. Initialize a predefined (k) number of randomly chosen centroids in space.
  2. Assign each object to the cluster with the closest centroid.
  3. Recalculate centroids.
  4. Repeat the second and third steps until convergence.
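These steps are easy to prototype in a few lines of base R. The following simple_kmeans function is only an illustrative, inefficient sketch of the above loop with a fixed number of iterations, and not how the kmeans function used below is actually implemented:

> simple_kmeans <- function(x, k, iter = 10) {
+     x <- as.matrix(x)
+     # step 1: pick k random rows as the initial centroids
+     centers <- x[sample(nrow(x), k), , drop = FALSE]
+     for (i in seq_len(iter)) {
+         # step 2: assign every row to the cluster of its closest centroid
+         d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
+         cluster <- apply(d, 1, which.min)
+         # step 3: recompute the centroids as per-cluster column means
+         centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
+     }
+     # step 4 is simplified to a fixed number of iterations here;
+     # empty clusters and convergence checks are not handled
+     list(cluster = cluster, centers = centers)
+ }
> simple_kmeans(mtcars, 3)$cluster

Due to the random initialization, the resulting labels (and occasionally the memberships) can differ from run to run, which is one reason why the real kmeans implementation in the stats package supports multiple random starts via its nstart argument.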

We are going to use the kmeans function from the stats package. As k-means clustering requires a prior decision on the number of clusters, we can either use the NbClust function described previously, or we can come up with an arbitrary number that fits the goals of the analysis.

Based on the optimal cluster number determined in the previous section, we are going to stick to three groups, which is where the within-cluster sum of squares ceases to drop significantly:

> (k <- kmeans(mtcars, 3))
K-means clustering with 3 clusters of sizes 16, 7, 9

Cluster means:
       mpg      cyl     disp       hp     drat       wt     qsec
1 24.50000 4.625000 122.2937  96.8750 4.002500 2.518000 18.54312
2 17.01429 7.428571 276.0571 150.7143 2.994286 3.601429 18.11857
3 14.64444 8.000000 388.2222 232.1111 3.343333 4.161556 16.40444
         vs        am     gear     carb
1 0.7500000 0.6875000 4.125000 2.437500
2 0.2857143 0.0000000 3.000000 2.142857
3 0.0000000 0.2222222 3.444444 4.000000

Clustering vector:
          Mazda RX4       Mazda RX4 Wag          Datsun 710 
                  1                   1                   1 
     Hornet 4 Drive   Hornet Sportabout             Valiant 
                  2                   3                   2 
         Duster 360           Merc 240D            Merc 230 
                  3                   1                   1 
           Merc 280           Merc 280C          Merc 450SE 
                  1                   1                   2 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
                  2                   2                   3 
Lincoln Continental   Chrysler Imperial            Fiat 128 
                  3                   3                   1 
        Honda Civic      Toyota Corolla       Toyota Corona 
                  1                   1                   1 
   Dodge Challenger         AMC Javelin          Camaro Z28 
                  2                   2                   3 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2 
                  3                   1                   1 
       Lotus Europa      Ford Pantera L        Ferrari Dino 
                  1                   3                   1 
      Maserati Bora          Volvo 142E 
                  3                   1 

Within cluster sum of squares by cluster:
[1] 32838.00 11846.09 46659.32
 (between_SS / total_SS =  85.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"    
[5] "tot.withinss" "betweenss"    "size"         "iter"        
[9] "ifault"      

The cluster means show some really important characteristics of each cluster, similar to what we generated manually for the hierarchical clusters in the previous section. We can see that the cars in the first cluster have a high mpg (low gas consumption), around four cylinders on average (in contrast to six or eight), rather low performance, and so on. The output also automatically reveals the cluster sizes and the membership of each case.

Let's compare these to the clusters defined by the hierarchical method:

> all(cn == k$cluster)
[1] TRUE

The results seem to be pretty stable, right?

Tip

The cluster numbers carry no meaning and their order is arbitrary; in other words, cluster membership is a nominal variable. Because of this, the preceding R command might return FALSE instead of TRUE when the cluster numbers happen to be allocated in a different order, but comparing the actual cluster memberships will verify that we have found the very same groups. See, for example, cbind(cn, k$cluster), which generates a matrix containing both cluster membership vectors.
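A cross-tabulation of the two membership vectors is probably even easier to read for this purpose; a quick sketch:

> table(hierarchical = cn, kmeans = k$cluster)

If the two solutions agree, each row of this table contains a single non-zero cell, whatever the actual label permutation happens to be.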

Visualizing clusters

Plotting these clusters is also a great way to understand groupings. To this end, we will use the clusplot function from the cluster package. For easier understanding, this function reduces the number of dimensions to two, similar to what happens when we conduct a PCA or MDS (described in Chapter 9, From Big to Smaller Data):

> library(cluster) 
> clusplot(mtcars, k$cluster, color = TRUE, shade = TRUE, labels = 2)
[Figure: clusplot of the three k-means clusters projected onto two components]

As you can see, after the dimension reduction, the two components explain 84.17 percent of the variance, so this small information loss is a great trade-off for an easier understanding of the clusters.

Visualizing the relative density of the ellipses with the shade parameter can also help us see how similar the elements of the same group are, and we used the labels argument to show both the point and the cluster labels. Be sure to stick to the default of 0 (no labels) or 4 (only ellipse labels) when visualizing a large number of elements.
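If you prefer to stay with base graphics, a rough alternative (just a sketch, not a replacement for clusplot) is to project the data onto its first two principal components manually and color the points by cluster membership; note that clusplot performs its own dimension reduction, so the two pictures are not guaranteed to match exactly:

> pc <- prcomp(mtcars)
> plot(pc$x[, 1:2], col = k$cluster, pch = 19)
> legend('topright', legend = 1:3, col = 1:3, pch = 19, title = 'Cluster')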
