A comparative analysis of clustering algorithms

Gaussian mixture models are trained mainly via expectation maximization (EM), which is an example of an iterative optimization algorithm. Bisecting K-means is faster than regular K-means, but it also produces slightly different clustering results. Below we compare these three algorithms. We will show a performance comparison in terms of model building time and the computational cost of each algorithm. As shown in the following code, we can compute the cost in terms of the within-cluster sum of squares (WCSS). The following lines of code compute the WCSS for the K-means and bisecting K-means models:

val WCSS = model.computeCost(landRDD) // landRDD is the training set
println("Within-Cluster Sum of Squares = " + WCSS) // Less is better

For the dataset we used throughout this chapter, we got the following values of WCSS:

Within-Cluster Sum of Squares of Bisecting K-means = 2.096980212594632E11 
Within-Cluster Sum of Squares of K-means = 1.455560123603583E12

Since a lower WCSS is better, this means that bisecting K-means actually shows the better performance in terms of compute cost: its WCSS is roughly seven times lower than that of K-means. Unfortunately, MLlib does not provide a WCSS-like computeCost metric for the GMM algorithm.
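A common substitute for evaluating a GMM is its log-likelihood on the data (higher is better). MLlib's GaussianMixtureModel does not expose this directly, but it can be computed from the model's weights and gaussians. The following is a minimal sketch, assuming gmmModel is the GaussianMixtureModel trained earlier in this chapter (the variable name is illustrative):

val logLikelihood = landRDD.map { point =>
  // Mixture density at this point: weighted sum of the component densities
  val density = gmmModel.weights.zip(gmmModel.gaussians).map {
    case (weight, gaussian) => weight * gaussian.pdf(point)
  }.sum
  math.log(density)
}.sum()
println("Total log-likelihood = " + logLikelihood) // Higher is better

Now let's observe the model building time for these three algorithms. We can start the system clock before starting model training and stop it immediately after the training has finished, as follows (for K-means):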

val start = System.currentTimeMillis()
val numClusters = 5
val numIterations = 20
val seed = 12345
// Note: the old 'runs' parameter is deprecated in recent Spark versions,
// so we pass the initialization mode and the seed explicitly instead
val model = KMeans.train(landRDD, numClusters, numIterations, KMeans.K_MEANS_PARALLEL, seed)
val end = System.currentTimeMillis()
println("Model building time: " + (end - start) + "ms")

For the training set we used throughout this chapter, we got the following values of model building time:

Model building time for Bisecting K-means: 2680ms
Model building time for Gaussian Mixture: 2193ms
Model building time for K-means: 3741ms

Different research articles have found that the bisecting K-means algorithm results in better cluster assignments for data points. Moreover, compared to K-means, bisecting K-means also converges better towards the global minimum, whereas K-means tends to get stuck in local minima. In other words, by using the bisecting K-means algorithm, we can avoid the local minima that K-means can suffer from.
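To see K-means' sensitivity to initialization in practice, the following minimal sketch (assuming the same landRDD and parameters as before; the seed values are arbitrary) trains the model with a few different seeds and compares the resulting WCSS values; noticeably different costs indicate convergence to different local minima:

// Train K-means with several random seeds and compare the costs
val seeds = Seq(1L, 42L, 12345L)
seeds.foreach { s =>
  val m = KMeans.train(landRDD, numClusters, numIterations, KMeans.K_MEANS_PARALLEL, s)
  println("Seed " + s + " => WCSS = " + m.computeCost(landRDD))
}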

Note that you might observe different values for the preceding metrics depending on your machine's hardware configuration and the random nature of the dataset.

A more detailed theoretical analysis is left to the reader. Interested readers should also refer to the Spark MLlib-based clustering documentation at https://spark.apache.org/docs/latest/mllib-clustering.html for more insights.