K-means++

We have said that a good choice for the initial centroids can improve the convergence speed and lead to a minimum that is closer to the global optimum of the inertia S. Arthur and Vassilvitskii (in k-means++: The Advantages of Careful Seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007) proposed a method called K-means++, which improves the quality of the initial centroid guess by taking into account the most likely final configuration.

In order to present the algorithm, it's useful to introduce a function, D(x, i), which is defined as:

D(x, i) = \min_{j=0,\ldots,i-1} \lVert x - \mu_j \rVert_2

D(x, i) is the shortest distance between each sample and one of the centroids already selected. As the process is incremental, this function must be recomputed after each step. For our purposes, let's also define an auxiliary probability distribution (we omit the index variable for simplicity):

G(x) = \frac{D(x)^2}{\sum_{x \in X} D(x)^2}

The first centroid μ0 is sampled from X using a uniform distribution. The next steps (sketched in code after this list) are:

  1. Compute D(x, i) for all x ∈ X considering the centroids already selected
  2. Compute G(x)
  3. Select the next centroid μi from X with probability G(x)
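
To make the selection procedure concrete, the following is a minimal NumPy sketch of K-means++ seeding. The function name (kmeans_pp_init) and the vectorized distance computation are illustrative choices, not taken from the original paper or from any library:

```python
import numpy as np


def kmeans_pp_init(X, k, rng=None):
    # Hypothetical helper: selects k initial centroids from the
    # (n_samples, n_features) array X using K-means++ seeding
    rng = np.random.default_rng() if rng is None else rng
    n_samples = X.shape[0]

    # The first centroid mu_0 is sampled uniformly from X
    centroids = [X[rng.integers(n_samples)]]

    for _ in range(1, k):
        # D(x, i): distance between each sample and the closest
        # centroid already selected (recomputed at every step)
        diffs = X[:, None, :] - np.asarray(centroids)[None, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=2), axis=1)

        # G(x): probability proportional to the squared distance
        g = d2 / np.sum(d2)

        # Sample the next centroid mu_i from X with probability G(x)
        centroids.append(X[rng.choice(n_samples, p=g)])

    return np.asarray(centroids)


# Example usage on random data (values are arbitrary)
X = np.random.default_rng(1000).normal(size=(500, 2))
initial_centroids = kmeans_pp_init(X, k=4)
print(initial_centroids.shape)  # (4, 2)
```

In practice, there's no need to implement this step manually, because Scikit-Learn already exposes it through the init='k-means++' parameter of KMeans (an example is shown at the end of this section).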

In the aforementioned paper, the authors showed a very important property. If we define S* as the global optimum of S, a K-means++ initialization determines an upper bound for the expected value of the actual inertia:

E[S] \leq 8(\log k + 2) S^*

This condition is often expressed by saying that K-means++ is O(log k)-competitive. When k is sufficiently small, the probability of finding a local minimum close to the global one increases. However, K-means++ is still a probabilistic approach, and different initializations on the same dataset lead to different initial configurations. A good practice is to run a limited number of initializations (for example, ten) and pick the one associated with the smallest inertia. When training complexity is not a primary issue, this number can be increased, but different experiments have shown that the improvement achievable with a very large number of trials is negligible when compared to the actual computational cost. The default value in Scikit-Learn is ten, and the author suggests keeping this value in the majority of cases. If the result is still poor, it's preferable to pick another method. Moreover, there are problems that cannot be solved using K-means (even with the best possible initialization), because one of the assumptions of the algorithm is that each cluster is a hypersphere and the distances are measured with the Euclidean metric. In the following sections, we're going to analyze other algorithms that are not constrained by such limitations and can easily solve clustering problems with asymmetric cluster geometries.
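
Before moving on, the snippet below shows how these choices map to Scikit-Learn: init='k-means++' and n_init=10 correspond to the strategy discussed above, and the run with the smallest final inertia is kept automatically. The synthetic blob dataset and its parameters are arbitrary and only for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic dataset (parameters chosen only for illustration)
X, _ = make_blobs(n_samples=1000, centers=5, cluster_std=1.5, random_state=1000)

# K-means++ seeding with ten independent initializations; the model
# with the smallest inertia is retained
km = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=1000)
km.fit(X)

print('Final inertia: {}'.format(km.inertia_))
```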
