Clustering algorithms can be categorized in different ways based on the techniques, the outputs, the process, and other considerations. In this topic, we will present some of the most widely used clustering algorithms.
There is a rich set of clustering techniques in use today for a wide variety of applications. This section presents some of them, explaining how they work, what kind of data they can be used with, and what their advantages and drawbacks are. These include algorithms that are prototype-based, density-based, probabilistic partition-based, hierarchy-based, graph-theory-based, and those based on neural networks.
k-means is a centroid- or prototype-based iterative algorithm that employs partitioning and relocation methods (References [10]). The shape of the clusters k-means finds depends on the distance metric used; with Euclidean distance, it finds clusters of roughly spherical shape.
k-means handles mostly numeric features. Many tools provide categorical-to-numeric transformations, but a large number of categorical features in the computation can lead to non-optimal clusters. The two basic inputs are the user-defined k, the number of clusters to find, and the distance metric used to compute closeness. As output, k-means produces the clusters, the assignment of each data point to a cluster, and the cluster centroids.
The most common variant known as Lloyd's algorithm initializes k centroids for the given dataset by picking data elements randomly from the set. It assigns each data element to the centroid it is closest to, using some distance metric such as Euclidean distance. It then computes the mean of the data points for each cluster to form the new centroid and the process is repeated until either the maximum number of iterations is reached or there is no change in the centroids.
Mathematically, each step of the clustering can be seen as an optimization step, where the objective to minimize is given by:

J = \sum_{i=1}^{k} \sum_{x \in C_i} \left\| x - \mu_i \right\|^2

Here, C_i is the set of all points belonging to cluster i and \mu_i is its centroid. Minimizing this objective exactly is NP-hard, and hence k-means has a tendency to get stuck in a local optimum.
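A minimal sketch, assuming scikit-learn's KMeans (which runs Lloyd's algorithm with Euclidean distance) and a synthetic dataset generated with make_blobs; the parameter values below are illustrative choices:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with three roughly spherical groups (illustrative)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# k (n_clusters) and the distance metric (Euclidean in KMeans) are the basic inputs;
# n_init restarts Lloyd's algorithm from several random initializations to reduce
# the risk of stopping in a poor local optimum
km = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=42).fit(X)

print(km.cluster_centers_)  # centroids of the clusters
print(km.labels_[:10])      # assignment of data points to clusters
print(km.inertia_)          # within-cluster sum of squares, the objective J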
Density-based spatial clustering of applications with noise (DBSCAN) is a density-based partitioning algorithm. It separates dense regions of the space from sparse regions (References [14]).
Only numeric features are used in DBSCAN. The user-defined parameters are MinPts and the neighborhood radius, given by ϵ.
The algorithm first finds the ϵ-neighborhood of every point p, given by N_ϵ(p) = { q ∈ D | d(p, q) ≤ ϵ }. A high-density region is one where the number of points in an ϵ-neighborhood is greater than or equal to the given MinPts; the point around which such an ϵ-neighborhood is defined is called a core point. Points within the ϵ-neighborhood of a core point are said to be directly reachable from it. All core points that can be reached by hopping from one core point to a directly reachable core point, and from that point to another directly reachable core point, and so on, are considered to be in the same cluster. Further, any point that has fewer than MinPts points in its ϵ-neighborhood but is directly reachable from a core point belongs to the same cluster as that core point; such points at the edge of a cluster are called border points. A noise point is any point that is neither a core point nor a border point.
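A minimal sketch, assuming scikit-learn's DBSCAN and a synthetic two-moons dataset; eps and min_samples below are illustrative choices for ϵ and MinPts:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two interleaved half-moons: dense, non-spherical regions separated by sparse space
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius (epsilon) and min_samples plays the role of MinPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_  # points labeled -1 are noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", int(np.sum(labels == -1)))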
Mean shift is a very effective clustering algorithm in many image, video, and motion detection based datasets (References [11]).
Only numeric features are accepted as data input in the mean shift algorithm. The choice of kernel and the bandwidth of the kernel are user-driven choices that affect the performance. Mean shift generates modes of data points and clusters data around the modes.
Mean shift is based on the statistical concept of kernel density estimation (KDE), which is a probabilistic method to estimate the underlying data distribution from the sample.
A kernel density estimate for a kernel K(x) with bandwidth h, computed over n points of dimensionality d, is given by:

\hat{f}(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\left( \frac{x - x_i}{h} \right)

The mean shift algorithm works by moving each data point in the direction of locally increasing density. To estimate this direction, the gradient of the KDE is taken, which leads to the mean shift vector:

m(x) = \frac{\sum_{i=1}^{n} x_i \, g\left( \left\| \frac{x - x_i}{h} \right\|^2 \right)}{\sum_{i=1}^{n} g\left( \left\| \frac{x - x_i}{h} \right\|^2 \right)} - x
Here, g(x) = −K′(x) is the negative of the derivative of the kernel. The vector m(x) is called the mean shift vector, and it is used to move each point in the direction of increasing density:

x^{(t+1)} = x^{(t)} + m(x^{(t)})
The procedure is guaranteed to converge, and it stops at points where the gradient of the density function is zero, that is, at the modes. Points that end up at the same mode are marked as belonging to the same cluster.
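A minimal sketch, assuming scikit-learn's MeanShift with its estimate_bandwidth helper to pick the bandwidth; the quantile value is an illustrative choice:

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=0)

# the bandwidth h controls the width of the kernel; estimate_bandwidth derives a
# value from the data (the quantile below is an illustrative choice)
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth).fit(X)

print(ms.cluster_centers_)   # the modes that the points converged to
print(len(set(ms.labels_)))  # the number of clusters is discovered, not specified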
The Gaussian mixture model (GMM), fitted using expectation maximization (EM), is a probabilistic partition-based method that partitions the data into k clusters using probability distributions (References [13]).
Only numeric features are allowed in EM/GMM. The model parameter is the number of mixture components, given by k.
GMM is a generative method that assumes there are k Gaussian components, each with a mean µ_i and a covariance Σ_i. The following expression represents the probability of the dataset given the k Gaussian components with mixture weights π_i:

p(D) = \prod_{j=1}^{n} \sum_{i=1}^{k} \pi_i \, \mathcal{N}(x_j \mid \mu_i, \Sigma_i)
The task of finding the means {µ_1, µ_2, ..., µ_k} of the k Gaussian components, such that the data points assigned to each component maximize its probability, is carried out with the two-step Expectation Maximization (EM) process.
The iterative process alternates between an E-step, which at iteration t computes the expected (soft) cluster membership of every data point x_j for every component i:

\gamma_{ji}^{(t)} = \frac{\pi_i \, \mathcal{N}(x_j \mid \mu_i^{(t)}, \Sigma_i^{(t)})}{\sum_{l=1}^{k} \pi_l \, \mathcal{N}(x_j \mid \mu_l^{(t)}, \Sigma_l^{(t)})}

and an M-step, which maximizes the expected likelihood to compute the new means µ^{(t+1)} from the points softly assigned to each component:

\mu_i^{(t+1)} = \frac{\sum_{j=1}^{n} \gamma_{ji}^{(t)} x_j}{\sum_{j=1}^{n} \gamma_{ji}^{(t)}}
Like k-means, the EM process can cause the GMM to converge to a local optimum.
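A minimal sketch, assuming scikit-learn's GaussianMixture fitted with EM on synthetic data; the number of components and the covariance type are illustrative choices:

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.2, random_state=7)

# k Gaussian components, each with its own mean and full covariance matrix;
# n_init reruns EM from several initializations because it can end in a local optimum
gmm = GaussianMixture(n_components=3, covariance_type="full", n_init=5, random_state=7).fit(X)

print(gmm.means_)                # estimated component means
print(gmm.predict(X)[:10])       # hard cluster assignments
print(gmm.predict_proba(X)[:3])  # soft, probabilistic memberships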
Hierarchical clustering is a connectivity-based method of clustering that is widely used to analyze and explore data, even more than it is used as a pure clustering technique (References [12]). The idea is to iteratively build binary trees, either from the top down or from the bottom up, such that similar points are grouped together. Each level of the tree provides an interesting summarization of the data.
Hierarchical clustering works on similarity- or distance-based transformations, so both categorical and continuous data are accepted. It needs only a similarity or distance metric and, unlike k-means or GMM, does not require the number of clusters as an input.
There are many variants of hierarchical clustering, but we will discuss agglomerative clustering. Agglomerative clustering works by first putting every data element in its own group. It then iteratively merges the groups, based on the chosen similarity metric, until a single group remains. Each level of the tree provides a different segmentation of the data, and it is up to the analyst to choose the level that best fits the problem domain. Agglomerative clustering is normally visualized with a dendrogram plot, which shows the order in which groups are merged and the similarity level at which each merge occurs. The popular choices of similarity (linkage) methods are:
- Single linkage: the distance between two groups is the smallest distance between any pair of their members.
- Complete linkage: the largest distance between any pair of members.
- Average linkage: the average distance over all pairs of members.
- Ward's method: the merge chosen is the one producing the smallest increase in total within-group variance.
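A minimal sketch, assuming SciPy's hierarchical clustering routines on synthetic data; average linkage and the cut into three flat clusters are illustrative choices:

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

# build the full merge tree bottom-up; Z records which groups were merged
# at each step and the dissimilarity at which the merge happened
Z = linkage(X, method="average", metric="euclidean")

# the analyst picks a level of the tree, here by asking for three flat clusters;
# scipy.cluster.hierarchy.dendrogram(Z) draws the corresponding dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)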
SOM is a neural network based method that can be viewed as dimensionality reduction, manifold learning, or clustering technique (References [17]). Neurobiological studies show that our brains map different functions to different areas, known as topographic maps, which form the basis of this technique.
Only numeric features are used in SOM. The model parameters consist of the distance function (generally Euclidean distance) and the lattice parameters: its width and height, or the number of cells in the lattice.
SOM, also known as the Kohonen network, can be thought of as a two-layer neural network in which the output layer is a two-dimensional lattice arranged in rows and columns, and each neuron in the lattice is fully connected to the input layer.
As in other neural networks, the weights are initially set to random values. The process has three distinct training phases:
- Competition: for each input sample, every neuron computes the distance between its weight vector and the input; the closest neuron wins and is called the best matching unit (BMU).
- Cooperation: the BMU defines a neighborhood of neurons on the lattice that will also be updated, and the size of this neighborhood shrinks over time.
- Adaptation: the weights of the BMU and its neighbors are moved towards the input sample, scaled by the learning rate and by the neighborhood function.
The neighborhood size is defined so that it decreases with time, using a well-known decay function such as the exponential, defined as follows:

\sigma(t) = \sigma_0 \exp\left( -\frac{t}{\lambda} \right)

where \sigma_0 is the initial neighborhood size and \lambda is a time constant. Here, the learning rate η(t) is again defined as an exponential decay, like the neighborhood size.
SOM visualization using the unified distance matrix (U-Matrix) creates a single metric, the average distance between a neuron's weights and those of its neighbors, which can then be visualized using different color intensities. This helps identify groups of similar neurons on the map.
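A minimal sketch of the training loop using only NumPy; the lattice size, decay constant, and learning rate are illustrative choices, and real applications typically rely on a dedicated SOM library:

import numpy as np

def train_som(data, grid_h=8, grid_w=8, n_iter=2000, sigma0=3.0, lr0=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n, d = data.shape
    weights = rng.random((grid_h, grid_w, d))            # randomly initialized weights
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    coords = np.stack([ys, xs], axis=-1).astype(float)   # lattice position of each neuron
    lam = n_iter / np.log(sigma0)                        # illustrative decay constant
    for t in range(n_iter):
        x = data[rng.integers(n)]
        # competition: the best matching unit (BMU) is the neuron closest to x
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.array(np.unravel_index(np.argmin(dists), dists.shape), dtype=float)
        # cooperation: exponentially decaying neighborhood sigma(t) and learning rate
        sigma = sigma0 * np.exp(-t / lam)
        lr = lr0 * np.exp(-t / lam)
        h = np.exp(-np.sum((coords - bmu) ** 2, axis=-1) / (2 * sigma ** 2))
        # adaptation: pull the BMU and its lattice neighbors towards x
        weights += lr * h[..., None] * (x - weights)
    return weights

# points mapping to the same (or neighboring) lattice cells form one region/cluster
W = train_som(np.random.default_rng(1).normal(size=(200, 3)))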
Spectral clustering is a partition-based clustering technique using graph theory as its basis (References [15]). It converts the dataset into a connected graph and does graph partitioning to find the clusters. This is a popular method in image processing, motion detection, and some unstructured data-based domains.
Only numeric features are used in spectral clustering. Model parameters such as the choice of kernel, the kernel parameters, the number of eigenvectors to select, and the partitioning algorithm (such as k-means) must be chosen correctly for optimum performance.
The following steps describe how the technique is used in practice:
1. Build an affinity matrix A from pairwise similarities between data points, for example using a Gaussian kernel.
2. Compute a graph Laplacian from the affinity matrix; a simple Laplacian matrix is L = D − A, where D is the degree matrix.
3. Compute the first k eigenvectors of the Laplacian and stack them as columns of a new matrix, giving each data point a k-dimensional spectral embedding.
4. Run a partitioning algorithm such as k-means on the rows of that matrix and assign each original data point to the cluster of its row.
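A minimal sketch, assuming scikit-learn's SpectralClustering on a synthetic two-circles dataset; the nearest-neighbors affinity (in place of a Gaussian kernel) and the parameter values are illustrative choices:

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

# two concentric circles: not linearly separable, a classic case for spectral methods
X, _ = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

# a nearest-neighbors graph is used here as the affinity; affinity="rbf" with a gamma
# value would build a Gaussian-kernel affinity matrix instead; assign_labels="kmeans"
# runs k-means on the rows of the eigenvector matrix
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors", n_neighbors=10,
                        assign_labels="kmeans", random_state=0)
labels = sc.fit_predict(X)
print(labels[:20])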
Affinity propagation can be viewed as an extension of the k-medoids method because of its similarity in picking exemplars from the data (References [16]). Affinity propagation works on a graph built from the distance or similarity matrix and initially treats every point in the training data as a candidate exemplar. Iterative passing of affinity messages between data points then automatically determines the clusters, the exemplars, and even the number of clusters.
Typically, other than the maximum number of iterations, which is common to most algorithms, no input parameters are required.
Two kinds of messages are exchanged between the data points:
- r(i, k), the responsibility: how well suited point k is to serve as the exemplar for point i, taking into account other candidate exemplars for i.
- a(i, k), the availability: how appropriate it would be for point i to choose point k as its exemplar, given the support k receives from other points.
The algorithm can be summed up as follows:
1. Initialize all availabilities a(i, k) to zero.
2. Update the responsibilities: r(i, k) ← s(i, k) − max_{k' ≠ k} { a(i, k') + s(i, k') }, where s(i, k) is the similarity between points i and k.
3. Update the availabilities: for i ≠ k, a(i, k) ← min( 0, r(k, k) + Σ_{i' ∉ {i, k}} max(0, r(i', k)) ); the self-availability is a(k, k) ← Σ_{i' ≠ k} max(0, r(i', k)).
4. Repeat steps 2 and 3 with damping until the messages converge; the exemplar of point i is the point k that maximizes a(i, k) + r(i, k), and points sharing an exemplar form a cluster.
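A minimal sketch, assuming scikit-learn's AffinityPropagation on synthetic data; the damping value is an illustrative choice, and the preference is left at its default (the median similarity):

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# damping stabilizes the message passing; the default preference (median similarity)
# influences how many exemplars, and hence how many clusters, emerge
ap = AffinityPropagation(damping=0.9, max_iter=500, random_state=0).fit(X)

print(len(ap.cluster_centers_indices_))  # number of clusters, discovered automatically
print(ap.labels_[:10])                   # cluster of each point, defined by its exemplar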
Clustering validation and evaluation is one of the most important mechanisms for determining the usefulness of a clustering algorithm's results (References [18]). These topics can be broadly classified into two categories: internal evaluation measures and external evaluation measures.
Internal evaluation uses only the clusters and the data itself to gather metrics about how good the clustering results are. The application may have some influence over the choice of measures, and some algorithms are biased towards particular evaluation metrics, so care must be taken in choosing the right metrics, algorithms, and parameters. Internal evaluation measures are generally based on two qualities: compactness, which captures how close together the points within each cluster are, and separation, which captures how well distinguished each cluster is from the others.
Here is a compact explanation of the notation used in what follows: D is the dataset containing all data elements; n is the number of data elements; d is the number of dimensions or features of each data element; c is the center of the entire dataset D; NC is the number of clusters; C_i is the ith cluster; n_i is the number of data elements in the ith cluster; c_i is the center of the ith cluster; σ(C_i) is the variance of the ith cluster; and d(x, y) is the distance between two points x and y.
The goal is to measure the degree of difference between clusters using the ratio of the between-cluster sum of squares to the total sum of squares over the whole dataset. The formula is given as follows:

\frac{\sum_{x \in D} d(x, c)^2 \;-\; \sum_{i=1}^{NC} \sum_{x \in C_i} d(x, c_i)^2}{\sum_{x \in D} d(x, c)^2}

Values closer to 1 indicate that most of the total variation is explained by the separation between clusters.
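A minimal sketch of the between-to-total sum-of-squares ratio described above, assuming NumPy arrays X (the data) and labels (the cluster assignments):

import numpy as np

def between_to_total_ss_ratio(X, labels):
    c = X.mean(axis=0)                               # center of the whole dataset
    ss_total = ((X - c) ** 2).sum()                  # total sum of squares
    ss_within = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                    for k in np.unique(labels))      # within-cluster sum of squares
    return (ss_total - ss_within) / ss_total         # between-cluster share; closer to 1 is better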
The goal is to identify dense and well-separated clusters. The measure compares the smallest distance between points of different clusters with the largest within-cluster distance, and better clusterings correspond to maximal values of the following formula:

\min_{1 \le i \le NC} \; \min_{j \ne i} \left( \frac{d(C_i, C_j)}{\max_{1 \le k \le NC} \operatorname{diam}(C_k)} \right)
The external evaluation measures of clustering are similar to classification metrics, using either elements of the confusion matrix or information-theoretic quantities computed from the data and the labels. Some of the most commonly used measures are as follows.
The Rand index measures the fraction of correct decisions made by the clustering algorithm using the following formula:

RI = \frac{TP + TN}{TP + FP + FN + TN}

Here, TP counts the pairs of elements of the same class placed in the same cluster, TN counts the pairs from different classes placed in different clusters, and FP and FN count the corresponding incorrect pairwise decisions.
F-Measure combines the precision and recall measures applied to clustering, as given in the following formulas:

\text{prec}(i, j) = \frac{n_{ij}}{n_j}, \qquad \text{rec}(i, j) = \frac{n_{ij}}{n_i}, \qquad F(i, j) = \frac{2 \cdot \text{prec}(i, j) \cdot \text{rec}(i, j)}{\text{prec}(i, j) + \text{rec}(i, j)}

F = \sum_{i} \frac{n_i}{n} \max_{j} F(i, j)

Here, n_{ij} is the number of data elements of class i in cluster j, n_j is the number of data elements in cluster j, and n_i is the number of data elements in class i. The higher the F-Measure, the better the clustering quality.
NMI is one of many entropy-based measures applied to clustering. The entropy associated with a clustering C is a measure of the uncertainty about which cluster a randomly picked data element falls into:

H(C) = -\sum_{i=1}^{NC} p(C_i) \log p(C_i)

where p(C_i) = \frac{n_i}{n} is the probability of an element getting picked in cluster C_i.
The mutual information between two clusterings C and C' is given by:

I(C, C') = \sum_{i} \sum_{j} p(C_i \cap C'_j) \log \frac{p(C_i \cap C'_j)}{p(C_i)\, p(C'_j)}

Here, p(C_i \cap C'_j) = \frac{\lvert C_i \cap C'_j \rvert}{n} is the probability of an element being picked by both cluster C_i and cluster C'_j.
Normalized mutual information (NMI) has many forms; one of them is:

NMI(C, C') = \frac{I(C, C')}{\sqrt{H(C)\, H(C')}}

Values close to 1 indicate strong agreement between the two clusterings.
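A minimal sketch, assuming scikit-learn's metrics applied to labels produced by k-means over synthetic data (where the generator's labels stand in for class labels); rand_score requires a reasonably recent scikit-learn version:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import rand_score, normalized_mutual_info_score

# ground-truth labels y_true come from the synthetic generator, standing in for class labels
X, y_true = make_blobs(n_samples=500, centers=4, random_state=0)
y_pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("Rand index:", rand_score(y_true, y_pred))              # fraction of correct pairwise decisions
print("NMI:", normalized_mutual_info_score(y_true, y_pred))   # entropy-based agreement in [0, 1]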