7.6 Cluster Analysis Shortcomings and Solutions

The basic K-means cluster analysis algorithm is easy to implement, but it can lead to a number of problems. We briefly describe a few of these problems here and leave the solutions for you as exercises.

Step 2 of our algorithm required that we pick k data points to serve as the initial centroids. Our solution was to use a random selection, but this meant that two runs of the program would likely produce different results. It seems intuitive that by choosing the centroids in a more intentional manner, we can guide the way in which the clusters are ultimately constructed. This determination could be based either on user input or on data analysis.
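One way to make the initialization intentional rather than random is a farthest-first heuristic: start from one data point and repeatedly add the point farthest from all centroids chosen so far, so the initial centroids are spread across the data. The sketch below assumes data points are tuples of numbers and uses Euclidean distance; the function names are our own, not part of the chapter's implementation.

```python
import math

def euclid_d(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def farthest_first_centroids(points, k):
    """Pick k initial centroids deterministically: begin with the first
    data point, then repeatedly add the point whose nearest chosen
    centroid is farthest away (a farthest-first heuristic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        next_pt = max(points,
                      key=lambda p: min(euclid_d(p, c) for c in centroids))
        centroids.append(next_pt)
    return centroids
```

Because the selection is deterministic, two runs on the same data now produce the same initial centroids, and therefore the same final clusters.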

It is possible that clusters may become empty as the iteration process continues. In our implementation, once a cluster becomes empty, there is no way to repopulate it, since it no longer has a centroid. As an alternative, when a cluster becomes empty, we might like to create a new centroid so that data points can be added in the next iteration. Of course, it is always possible to leave the cluster empty and produce fewer clusters than originally specified—that is what would happen in our implementation.
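One plausible repair, sketched below, is to run after each assignment step: for every cluster that received no points, move its centroid to the data point that is farthest from the centroid it is currently assigned to, so the empty cluster can attract members on the next iteration. The representation (a list of points, a parallel list of cluster indices) is an assumption for illustration, not the chapter's exact data structures.

```python
import math

def euclid_d(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def reseed_empty_clusters(points, centroids, assignments):
    """Return a new centroid list in which every empty cluster's
    centroid has been replaced by the data point farthest from its
    currently assigned centroid.  assignments[i] is the cluster
    index of points[i]."""
    new_centroids = list(centroids)
    used = set(assignments)
    for j in range(len(centroids)):
        if j not in used:  # cluster j got no points this iteration
            worst = max(range(len(points)),
                        key=lambda i: euclid_d(points[i],
                                               centroids[assignments[i]]))
            new_centroids[j] = points[worst]
    return new_centroids
```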

Sometimes a cluster can get too large or can encompass data points that are seemingly not related. This can happen when some data points in the data set are clearly different from the rest (such points are referred to as outliers). When an outlier is found, we may want to provide some special processing to create an additional cluster for the outlier, or to exclude the outlier from any cluster, thereby nullifying the impact the outlier might have on the centroid calculations.
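A simple way to detect such outliers, sketched here under the same assumed representation as above, is a distance test: flag any point whose distance to its assigned centroid exceeds the mean of all such distances by more than some number of standard deviations. The threshold value is a tunable assumption, not a prescribed constant.

```python
import math
import statistics

def euclid_d(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def find_outliers(points, centroids, assignments, threshold=2.0):
    """Return the indices of points whose distance to their assigned
    centroid exceeds the mean distance by more than `threshold`
    standard deviations."""
    dists = [euclid_d(points[i], centroids[assignments[i]])
             for i in range(len(points))]
    cutoff = statistics.mean(dists) + threshold * statistics.pstdev(dists)
    return [i for i, d in enumerate(dists) if d > cutoff]
```

Points flagged this way could then be placed in a cluster of their own or left unassigned, so they no longer pull the centroids of the remaining clusters toward them.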
