K-means clustering algorithm

The k-means clustering algorithm gets its name from the fact that it tries to partition the data into a number of clusters, k, using the mean of the points in each cluster to measure how close the data points are to one another. Its clustering approach is relatively simple, but it remains popular because of its scalability and speed. Algorithmically, k-means uses an iterative logic that moves the cluster centers until each center reflects the most representative data point of the grouping it belongs to.
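The iterative logic described above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name is hypothetical, distances are squared Euclidean on 2-D points, and for simplicity the first k points seed the centers (real implementations typically use random or k-means++ initialization).

```python
def k_means(points, k, iterations=10):
    """A minimal k-means sketch: repeatedly assign each point to its
    nearest center, then move each center to the mean of its cluster."""
    # Simplified deterministic initialization: the first k points.
    centers = list(points[:k])
    for _ in range(iterations):
        # Assignment step: group points by their nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: (p[0] - centers[j][0]) ** 2
                                      + (p[1] - centers[j][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        for j, cluster in enumerate(clusters):
            if cluster:
                centers[j] = (sum(x for x, _ in cluster) / len(cluster),
                              sum(y for _, y in cluster) / len(cluster))
    return centers

# Two well-separated 2-D groups; after a few iterations each center
# settles near the mean of one group.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
        (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
centers = sorted(k_means(data, k=2))
```

Note that each iteration only moves the centers; the number of clusters, k, is fixed up front, which is exactly the limitation discussed next.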

It is important to note that the k-means algorithm lacks one of the very basic functionalities needed for clustering: for a given dataset, it cannot determine the most appropriate number of clusters. The most appropriate number of clusters, k, depends on the number of natural groupings in the dataset. The philosophy behind this omission is to keep the algorithm as simple as possible, maximizing its performance; this lean-and-mean design makes k-means suitable for larger datasets. The assumption is that an external mechanism will be used to calculate k, and the best way to determine k depends on the problem we are trying to solve. In some cases, k is directly specified by the clustering problem's context. For example, if we want to divide a class of data science students into two clusters, one consisting of students with data science skills and the other of students with programming skills, then k will be two. In other problems, the value of k may not be obvious; in such cases, an iterative trial-and-error procedure or a heuristic-based algorithm has to be used to estimate the most appropriate number of clusters for a given dataset.
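One common trial-and-error heuristic of the kind mentioned above is the elbow method: run k-means for a range of candidate k values, compute the within-cluster sum of squared distances (often called inertia) for each, and pick the k where the curve bends, after which adding clusters yields little improvement. The sketch below is a self-contained illustration under those assumptions; the helper name is hypothetical and the k-means loop inside it is deliberately basic.

```python
import random

def kmeans_inertia(points, k, iterations=10, seed=0):
    """Run a basic k-means and return the within-cluster sum of
    squared distances (inertia); lower means tighter clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct points as seeds
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: (p[0] - centers[j][0]) ** 2
                                + (p[1] - centers[j][1]) ** 2)
            clusters[j].append(p)
        for j, cluster in enumerate(clusters):
            if cluster:
                centers[j] = (sum(x for x, _ in cluster) / len(cluster),
                              sum(y for _, y in cluster) / len(cluster))
    # Inertia: each point's squared distance to its nearest center.
    return sum(min((p[0] - cx) ** 2 + (p[1] - cy) ** 2
                   for cx, cy in centers)
               for p in points)

# Data with two natural groupings: inertia drops sharply from
# k=1 to k=2, then barely improves, so the "elbow" is at k=2.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
        (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
inertias = {k: kmeans_inertia(data, k) for k in (1, 2, 3)}
```

In practice the elbow is often judged from a plot of inertia versus k, and alternative heuristics such as the silhouette score can be used when the bend is not clear-cut.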
