K-means clustering

As we did with hierarchical clustering, we can also use NbClust() to determine the optimum number of clusters for k-means. All you need to do is specify kmeans as the method in the function. Let's also raise the maximum number of clusters to 15. I've abbreviated the following output to just the conclusion:

    > numKMeans <- NbClust::NbClust(wine_df,
                                    min.nc = 2,
                                    max.nc = 15,
                                    method = "kmeans")

    ***** Conclusion *****

    * According to the majority rule, the best number of clusters is 3

Once again, three clusters appears to be the optimum solution.
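I've abbreviated the output above, but the full vote tally is recoverable from the returned object: NbClust() stores each index's recommendation in its Best.nc element. A minimal sketch, assuming the numKMeans object created in the preceding call:

    > table(numKMeans$Best.nc["Number_clusters", ])

Each count shows how many of the indices voted for that number of clusters, which is the basis of the majority rule.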

In R, we can use the kmeans() function to do this analysis. In addition to the input data, we have to specify the number of clusters we are solving for and the nstart argument, which is the number of random initial configurations to try; kmeans() keeps the best of those runs. We will also need to set a random seed:

    > set.seed(1234)

    > km <- kmeans(wine_df, 3, nstart = 25)
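Because k-means starts from random centroids, it can converge to a poor local optimum; nstart = 25 runs the algorithm from 25 different random starts and keeps the best solution. A minimal sketch of the effect, comparing a single start against the best of 25 (the extra seed here is arbitrary and purely illustrative):

    > set.seed(5678)
    > km_single <- kmeans(wine_df, 3, nstart = 1)

    > km_single$tot.withinss # total within-cluster sum of squares, one start

    > km$tot.withinss # best of 25 starts; should be no higher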

Creating a table of the clusters gives us a sense of the distribution of the observations between them:

    > table(km$cluster)

     1  2  3
    62 65 51

The number of observations per cluster is well balanced. I have seen, on a number of occasions with larger datasets and many more features, that no value of k yields a promising and compelling result. Another way to analyze the clustering is to look at a matrix of the cluster centers for each variable in each cluster:

    > km$centers
         Alcohol  MalicAcid        Ash    Alk_ash   magnesium   T_phenols
    1  0.8328826 -0.3029551  0.3636801 -0.6084749  0.57596208  0.88274724
    2 -0.9234669 -0.3929331 -0.4931257  0.1701220 -0.49032869 -0.07576891
    3  0.1644436  0.8690954  0.1863726  0.5228924 -0.07526047 -0.97657548
       Flavanoids    Non_flav    Proantho C_Intensity        Hue  OD280_315
    1  0.97506900 -0.56050853  0.57865427   0.1705823  0.4726504  0.7770551
    2  0.02075402 -0.03343924  0.05810161  -0.8993770  0.4605046  0.2700025
    3 -1.21182921  0.72402116 -0.77751312   0.9388902 -1.1615122 -1.2887761
         Proline
    1  1.1220202
    2 -0.7517257
    3 -0.4059428
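With thirteen standardized variables, it helps to know which ones drive the separation. A quick heuristic of my own (not part of the original analysis) is to rank the variables by the spread of their center values across the three clusters:

    > sort(apply(km$centers, 2, function(x) max(x) - min(x)),
           decreasing = TRUE)

From the centers printed above, Flavanoids and OD280_315 would top this list.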

Note that cluster one has, on average, a higher alcohol content. Let's produce a box plot to look at the distribution of alcohol content and compare it to Ward's:

    > par(mfrow = c(1, 2))

    > boxplot(Alcohol ~ km$cluster, data = wine,
              main = "Alcohol Content, K-Means")

    > boxplot(Alcohol ~ ward_clusters, data = wine,
              main = "Alcohol Content, Ward's")

This is the output:

The alcohol content distribution within each cluster is almost exactly the same under the two methods. On the surface, this tells me that three clusters is the proper latent structure for the wines and that there is little difference between using k-means and hierarchical clustering. Finally, let's compare the k-means clusters against the cultivars:

    > table(km$cluster, wine$Class)

         1  2  3
      1 59  3  0
      2  0 65  0
      3  0  3 48

This is very similar to the distribution produced by Ward's method, and either one would probably be acceptable to our hypothetical sommelier.
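To put a number on that agreement, the adjusted Rand index compares two partitions: 1 means perfect agreement, and values near 0 mean no better than chance. A short sketch, assuming the mclust package is installed and that ward_clusters holds the Ward's assignments from the earlier section:

    > mclust::adjustedRandIndex(km$cluster, wine$Class)

    > mclust::adjustedRandIndex(km$cluster, ward_clusters)

The first call scores the k-means solution against the cultivars; the second measures how closely the two clustering methods agree with each other.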

However, to demonstrate how you can cluster on data with both numeric and non-numeric values, let's work through some more examples.
