Comparing clustering algorithms

There are around thirteen different clustering algorithms in the sklearn library. With that many choices, the natural question is: which clustering algorithm should you use? The answer lies in your data. The type of data you have, and the kind of clustering you want to apply to it, determine which algorithm to select. That said, several algorithms may be useful for the same problem and data. Many of sklearn's clustering classes are specialized for specific tasks (such as co-clustering and bi-clustering, or clustering features instead of data points). An algorithm specializing in text clustering, for example, would be the right choice for clustering text data. Hence, if you know enough about the data you are dealing with, you can narrow your options to the clustering algorithm that best suits that data, its essential properties, or the type of clustering you need to perform.

However, what if you have no insight into your data at all? For example, suppose you are merely observing it and performing some Exploratory Data Analysis (EDA); on that basis, choosing a specialized algorithm is not easy. So the question becomes: which algorithm is best suited to exploring the data?

To understand what a good EDA clustering algorithm needs to do, some ground rules should be in place:

  • When performing EDA, you are essentially trying to study the data and gain some insight into it. In that case, it is better to get no result at all than an incorrect one. Bad results lead to false intuitions, which send you down an ultimately wrong path and leave you misunderstanding the dataset. For that reason, a good EDA clustering algorithm should be conservative in its clustering: it should be willing to leave points unassigned. If points do not belong to a cluster, they should not be forced into one.
  • Clustering algorithms usually expose many different parameters, and to tune their behavior, some of that control needs to be in your hands. But how do you choose sensible settings for that many parameters in practice? Even with a minimal idea of the data at hand, it can still be very hard to decide on a value for each one. So the parameters should be intuitive enough that you can set them without deep knowledge of your data.
  • Running the same clustering algorithm more than once with different random initializations should produce approximately the same clusters. Likewise, when sampling the data, drawing a different random subset should not entirely change the resulting cluster structure (unless the sampling strategy itself has intricacies). The clustering should change in a reasonably stable, predictable way as you tweak the algorithm's parameters.
  • Finally, datasets keep getting bigger day by day as compute power increases. You can work with a subset of the sample data, but ultimately your clustering algorithm should scale to larger datasets. A clustering algorithm is not much use if it can only run on a subset that does not represent the data at large!
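To make the first rule concrete, here is a small sketch using sklearn's density-based DBSCAN; the toy data and parameter values (`eps`, `min_samples`) are illustrative assumptions. DBSCAN labels points it cannot confidently place in any cluster as `-1` ("noise") instead of forcing them into one:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Three tight, well-separated blobs plus scattered uniform background noise
# (illustrative toy data, not from any real dataset)
rng = np.random.RandomState(0)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6]],
                  cluster_std=0.4, random_state=0)
noise = rng.uniform(-4, 10, size=(30, 2))
data = np.vstack([X, noise])

labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(data)

# Points DBSCAN refuses to assign are labelled -1
print("clusters found:", sorted(set(labels) - {-1}))
print("unclustered points:", int(np.sum(labels == -1)))
```

The scattered background points mostly end up with the label `-1`, which is exactly the conservative behavior the first ground rule asks for.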

Soft clustering and overlapping clusters are valuable additional features, but the preceding requirements are more than enough to get started. Strangely, only a handful of clustering algorithms satisfy them all!
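For completeness, soft clustering is easy to sketch with sklearn's `GaussianMixture` (which lives in `sklearn.mixture` rather than `sklearn.cluster`); the toy data here is an illustrative assumption:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two well-separated toy blobs (illustrative data)
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [6, 6]],
                  cluster_std=1.0, random_state=1)

gm = GaussianMixture(n_components=2, random_state=1).fit(X)

# Soft clustering: each point gets a membership probability per component,
# and each row of probabilities sums to 1.0
probs = gm.predict_proba(X)
print(probs.shape)
```

Instead of a single hard label, each point carries a degree of membership in every cluster, which is what "soft" clustering means here.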

Congratulations! Another wonderful journey comes to an end.
