Bisecting K-means with Spark MLlib

Bisecting K-means can often be much faster than regular K-means, but it will generally produce a different clustering. A bisecting K-means algorithm is based on the paper, A comparison of document clustering techniques by Steinbach, Karypis, and Kumar, with modification to fit with Spark MLlib.

Bisecting K-means is a kind of divisive algorithm that starts from a single cluster that contains all the data points. Iteratively, it then finds all the divisible clusters on the bottom level and bisects each of them using K-means until there are K leaf clusters in total or no leaf clusters divisible. After that, clusters on the same level are grouped together to increase the parallelism. In other words, bisecting K-means is computationally faster than the regular K-means algorithm. Note that if bisecting all the divisible clusters on the bottom level results in more than K leaf clusters, larger clusters will always get higher priority.

Note that if the bisecting of all the divisible clusters on the bottom level results in more than K leaf clusters, larger clusters will always get higher priority. The following parameters are used in the Spark MLlib implementation:

  • K: This is the desired number of leaf clusters. However, the actual number could be smaller if there are no divisible leaf clusters left during the computation. The default value is 4.
  • MaxIterations: This is the max number of K-means iterations to split the clusters. The default value is 20.
  • MinDivisibleClusterSize: This is the minimum number of points. The default value is set as 1.
  • Seed: This is a random seed that disallows random clustering and tries to provide almost similar result in each iteration. However, it is recommended to use a long seed value like 12345 and so on.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.217.20.90