How does the K-means algorithm work?

Suppose we have n data points xi, i = 1...n that need to be partitioned into k clusters. The goal here is to assign a cluster to each data point. K-means aims to find the positions μi, i = 1...k of the cluster centers that minimize the distance from each data point to its cluster center. Mathematically, the K-means algorithm tries to achieve this goal by solving the following optimization problem:

arg min over c of Σi=1..k Σx∈ci d(x, μi)
In the preceding equation, ci is the set of data points assigned to cluster i, and d(x, μi) = ||x − μi||₂² is the squared Euclidean distance (we will explain why we use this distance measure shortly). Minimizing this objective exactly is an NP-hard optimization problem. This also means that although the K-means algorithm seeks the global minimum, in practice it often gets stuck in a local minimum that depends on the initialization.
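As a minimal sketch of the objective above (the function name `kmeans_objective` is our own, not from the book), the sum of squared distances from each point to its assigned centroid can be computed with NumPy:

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    return sum(np.sum((X[labels == i] - mu) ** 2) for i, mu in enumerate(centroids))

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])                   # first two points in cluster 0, last in cluster 1
centroids = np.array([[0.0, 0.5], [10.0, 10.0]])
print(kmeans_objective(X, labels, centroids))  # 0.5
```

K-means iteratively lowers this value; comparing it across runs with different initializations is one way to pick the best of several local minima.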

 

Now, let's see how we can formulate the algorithm before feeding the data to the K-means model. First, we need to decide the number of tentative clusters, k, a priori. Then, typically, you need to follow these steps:

1. Initialize the k centroids μ1...μk, for example by picking k random data points.
2. Assign each data point x to the cluster ci whose centroid μi is closest to it.
3. Recompute each centroid as the mean of its assigned points: μi = (1/|ci|) Σx∈ci x.
4. Repeat steps 2 and 3 until the centroids stop changing (or a maximum number of iterations is reached).

Here, |ci| is the number of elements in ci.
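The update rule μi = (1/|ci|) Σx∈ci x is just the arithmetic mean of the points in the cluster. A small sketch (the helper name `update_centroid` is ours):

```python
import numpy as np

def update_centroid(cluster_points):
    # mu_i = (1 / |c_i|) * sum of all points x in cluster c_i
    return cluster_points.sum(axis=0) / len(cluster_points)

pts = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 1.0]])
print(update_centroid(pts))  # [2. 1.]
```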

Clustering with the K-means algorithm begins by initializing the centroid coordinates. On every pass of the algorithm, each point is assigned to its nearest centroid based on some distance metric, usually the Euclidean distance.

Distance calculation: Note that there are other ways to calculate the distance too, for example:

  • Chebyshev distance measures the distance along only the dimension with the largest difference.
  • Hamming distance measures the number of positions at which two strings differ.
  • Mahalanobis distance normalizes by the covariance matrix, which makes the metric scale-invariant.
  • Manhattan distance measures the distance along axis-aligned directions only.
  • Minkowski distance generalizes the Euclidean, Manhattan, and Chebyshev distances through a single parameter p.
  • Haversine distance measures the great-circle distance between two points on a sphere, given their longitudes and latitudes.
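A few of these metrics can be sketched directly in NumPy (function names are ours); note how Minkowski reduces to Manhattan at p = 1 and Euclidean at p = 2:

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def chebyshev(a, b):
    return np.max(np.abs(a - b))

def minkowski(a, b, p):
    # p = 1 gives Manhattan, p = 2 gives Euclidean; p -> infinity approaches Chebyshev
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(euclidean(a, b), manhattan(a, b), chebyshev(a, b))  # 5.0 7.0 4.0
```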

Considering these distance measures, the Euclidean distance is the most appropriate for our purpose in the K-means algorithm. The centroids are then updated to be the means of all the points assigned to them in that iteration. This repeats until there is minimal change in the centers. In short, K-means is an iterative algorithm that works in two steps:

  • Cluster assignment step: K-means goes through each of the m data points and assigns it to the cluster represented by the closest of the k centroids. For each point, the distance to every centroid is calculated, and the nearest centroid is picked.
  • Update step: For each cluster, a new centroid is calculated as the mean of all the points assigned to it in the previous step; this mean becomes the new centroid of the cluster.
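The two steps above can be combined into a minimal NumPy implementation (a sketch, not the book's code; initializing centroids from randomly chosen data points is one common choice among several):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Cluster assignment step: find the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: minimal change in the centers
        centroids = new_centroids
    return labels, centroids

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans(X, 2)
```

Because the result depends on the random initialization, practical implementations typically run the algorithm several times and keep the solution with the lowest objective value.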
