Determining the number of clusters

The beauty of clustering algorithms such as K-means is that they can cluster data with an arbitrary number of features. They are a great tool to use when you have raw data and would like to discover the patterns in that data. However, deciding the number of clusters before running the experiment may not be straightforward and can sometimes lead to an overfitting or underfitting problem. On the other hand, one thing common to all three algorithms (that is, K-means, bisecting K-means, and Gaussian mixture) is that the number of clusters must be determined in advance and supplied to the algorithm as a parameter. Hence, informally, determining the number of clusters is a separate optimization problem to be solved.

In this section, we will use a heuristic approach based on the Elbow method. We start from K = 2 clusters and then run the K-means algorithm on the same data set while increasing K, observing the value of the cost function, the Within-Cluster Sum of Squares (WCSS). At some point, a big drop in the cost function can be observed, after which the improvement becomes marginal as K increases further. As suggested in the cluster analysis literature, we can pick the K right after the last big drop in WCSS as the optimal one.
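As a minimal sketch of this heuristic, the loop below fits K-means for a range of K values and records the WCSS for each. scikit-learn is assumed here purely for illustration (its `inertia_` attribute is the WCSS of a fitted model), and the synthetic matrix `X` is a stand-in for whatever feature matrix you are actually clustering:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real feature matrix; replace X with your own data.
X, _ = make_blobs(n_samples=500, centers=5, n_features=4, random_state=42)

wcss = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss[k] = model.inertia_  # Within-Cluster Sum of Squares for this K

for k, cost in wcss.items():
    print(f"K = {k:2d}  WCSS = {cost:10.2f}")
```

Scanning the printed WCSS values from K = 2 upward makes the "last big drop" easy to spot before committing to a final K.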

By analyzing the following metrics, you can assess the performance of K-means:

  • Betweenness: This is the between-cluster sum of squares, a measure of intercluster separation.
  • Withinness: This is the within-cluster sum of squares, a measure of intracluster similarity (one value per cluster).
  • Totwithinss: This is the sum of the withinness values of all the clusters, also called the total intracluster similarity.

It is to be noted that a robust and accurate clustering model will have a lower value of withinness and a higher value of betweenness. However, these values depend on the number of clusters, that is, on the K chosen before building the model. A sketch of how these quantities can be computed follows.
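The three quantities above can be recovered from a fitted model. In this rough sketch (again assuming scikit-learn and a NumPy feature matrix `X`; the helper name `cluster_quality` is hypothetical), the total within-cluster sum of squares equals the model's `inertia_`, the total sum of squares is the squared distance of every point from the overall mean, and the between-cluster sum of squares is their difference:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_quality(X, k):
    """Return (withinness per cluster, totwithinss, betweenness) for a K-means fit."""
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels = model.labels_

    # Within-cluster sum of squares, one value per cluster
    withinness = np.array([
        ((X[labels == c] - model.cluster_centers_[c]) ** 2).sum()
        for c in range(k)
    ])
    tot_withinss = withinness.sum()  # equals model.inertia_ (up to rounding)

    # Total sum of squares around the global mean; betweenness is the remainder
    totss = ((X - X.mean(axis=0)) ** 2).sum()
    betweenness = totss - tot_withinss
    return withinness, tot_withinss, betweenness
```

Comparing these values across different K makes the trade-off explicit: adding clusters always lowers the within-cluster sum of squares, so the goal is to find the K beyond which the gain becomes marginal.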

Now let us discuss how to take advantage of the Elbow method to determine the number of clusters. As shown in the following figure, we calculated the cost function WCSS as a function of the number of clusters for the K-means algorithm applied to the home data based on all the features. It can be observed that a big drop occurs at K = 5; therefore, we chose the number of clusters as 5, as shown in Figure 14. Basically, this is the value right after the last big drop.
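A short plotting snippet along these lines can reproduce such an elbow curve; it reuses the `wcss` dictionary from the earlier sketch and uses matplotlib only as an illustrative choice, since the original home data set is not reproduced here:

```python
import matplotlib.pyplot as plt

ks = sorted(wcss)  # wcss computed in the earlier elbow loop
plt.plot(ks, [wcss[k] for k in ks], marker="o")
plt.axvline(x=5, linestyle="--", label="chosen K = 5 (after the last big drop)")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS")
plt.legend()
plt.show()
```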

Figure 14: WCSS as a function of the number of clusters