Distance-based clustering algorithms

The best-known representatives of this family of methods are the k-means and k-medoids algorithms. Both take an input parameter k and partition the data space into k clusters such that the similarity between objects within a cluster is maximal, while the similarity between objects of different clusters is minimal. The similarity value is calculated as the distance from an object to the cluster center, so the main difference between these methods lies in how the cluster center is defined.

In the k-means algorithm, an object's similarity to a cluster is measured by its distance to the cluster's center of mass, which is the average of the coordinates of the cluster's objects in the data space. The algorithm can be briefly described by the following steps. First, k random objects are selected, and each is defined as a cluster prototype representing that cluster's center of mass. Each remaining object is then assigned to the cluster with the nearest center. After that, the center of mass of each cluster is recalculated. For each resulting partition, an evaluation function is calculated, whose values at successive steps form a converging series; the process continues until the series converges to its limit value.
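The steps above can be sketched as follows. This is a minimal illustrative implementation (the function name, parameters, and convergence check are assumptions, not taken from the text); here, the evaluation function converging corresponds to the centers no longer moving:

```python
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    """Minimal k-means sketch: data is an (n, d) array, k is the cluster count."""
    rng = np.random.default_rng(seed)
    # Step 1: select k random objects as cluster prototypes (centers of mass).
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: attach every object to the cluster with the nearest center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate the center of mass of each cluster
        # (an empty cluster keeps its old center).
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the partition no longer changes, i.e., the series converged.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Because the centers are arithmetic means, each iteration can only decrease the within-cluster sum of squared distances, which is why the evaluation-function values form a converging series.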

In other words, moving objects between clusters ends when the clusters remain unchanged. Minimizing the evaluation function makes the resulting clusters as compact and well separated as possible. The k-means method works well when clusters are compact clouds that are significantly separated from each other. It is useful for processing large amounts of data but is not suitable for detecting clusters of non-convex shapes or clusters of very different sizes. Moreover, the method is sensitive to noise and isolated points, since even a small number of such points can significantly affect the calculation of a cluster's center of mass.

To reduce the influence of noise and isolated points on the clustering result, the k-medoids algorithm, in contrast to k-means, uses one of the cluster's own objects (called the representative object) as the cluster center. As in the k-means method, k representative objects are selected at random, and each of the remaining objects is assigned to the cluster of the nearest representative object. Then, each representative object is iteratively replaced with an arbitrary non-representative object from the data space, and a replacement is kept as long as it improves the quality of the resulting clusters. The clustering quality is determined by the sum of deviations between objects and the representative object of the corresponding cluster, which the method tries to minimize. The iterations therefore continue until the representative object in each cluster becomes the medoid, that is, the object closest to the cluster center. The algorithm scales poorly to large amounts of data, but this problem is addressed by the CLARANS (Clustering Large Applications based on RANdomized Search) algorithm, which extends the k-medoids method. For high-dimensional clustering, the PROCLUS (Projected Clustering) algorithm was developed.
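A PAM-style swap procedure matching this description can be sketched as follows (the function names and the exhaustive swap loop are assumptions for illustration, not the book's implementation):

```python
import numpy as np

def total_deviation(data, medoid_idx):
    # Clustering quality: sum of distances from every object
    # to its nearest representative object (medoid).
    dists = np.linalg.norm(data[:, None, :] - data[medoid_idx][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def k_medoids(data, k, seed=0):
    """Minimal k-medoids sketch: data is an (n, d) array, k is the cluster count."""
    rng = np.random.default_rng(seed)
    # Step 1: select k random objects as representative objects.
    medoids = list(rng.choice(len(data), size=k, replace=False))
    cost = total_deviation(data, medoids)
    improved = True
    while improved:
        improved = False
        # Step 2: try replacing each representative object with each
        # non-representative object; keep a swap as long as it lowers
        # the total deviation.
        for m in range(k):
            for o in range(len(data)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = o
                new_cost = total_deviation(data, candidate)
                if new_cost < cost:
                    medoids, cost = candidate, new_cost
                    improved = True
    # Final assignment: each object joins its nearest medoid's cluster.
    dists = np.linalg.norm(data[:, None, :] - data[medoids][None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    return labels, medoids
```

Because a medoid must be an actual data object, a single outlier shifts the center far less than it shifts a k-means mean, which is the robustness property described above. The exhaustive swap loop is also what makes the method scale poorly, motivating CLARANS.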
