Mixture models and clustering

Clustering belongs to the unsupervised family of statistical and machine learning tasks. It is similar to classification, but a little more difficult because we do not know the correct labels!

Clustering or cluster analysis is the data analysis task of grouping objects in such a way that objects in a given group are closer to each other than to those in the other groups. The groups are called clusters and the degree of closeness can be computed in many different ways; for example, by using metrics, such as the Euclidean distance. If instead we take the probabilistic route, then a mixture model arises as a natural candidate to solve clustering tasks.
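As a rough illustration of distance-based grouping, the following sketch assigns each point to the closest of two reference points using the Euclidean distance. The data and the two centers are made up for this example and fixed by hand; an actual clustering algorithm would have to estimate the centers from the data.

```python
import numpy as np

# Hypothetical 2D data: two points near the origin, two near (5, 5)
points = np.array([[0.0, 0.2], [0.3, -0.1], [4.8, 5.1], [5.2, 4.9]])

# Hypothetical cluster centers, fixed by hand for illustration only
centers = np.array([[0.0, 0.0], [5.0, 5.0]])

# distances[i, j] is the Euclidean distance from point i to center j
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# Each point is grouped with its closest center
labels = distances.argmin(axis=1)
print(labels)
```

Swapping the Euclidean distance for another metric changes only the line that computes `distances`; the grouping rule stays the same.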

Performing clustering using probabilistic models is usually known as model-based clustering. Using a probabilistic model allows us to compute the probability of each data point belonging to each one of the clusters. This is known as soft-clustering, as opposed to hard-clustering, where each data point belongs to a cluster with a probability of 0 or 1. We can turn soft-clustering into hard-clustering by introducing some rule or boundary. In fact, you may remember that this is exactly what we do to turn logistic regression into a classification method, where we use 0.5 as the default boundary. For clustering, a reasonable choice is to assign each data point to the cluster with the highest probability.
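To make the step from soft-clustering to hard-clustering concrete, here is a minimal sketch assuming a mixture of two one-dimensional Gaussians whose weights, means, and standard deviations are already known. All the numbers and the helper names `normal_pdf` and `soft_assignments` are hypothetical, chosen only for illustration; in practice the parameters would be inferred from the data.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def soft_assignments(x, weights, mus, sigmas):
    """Probability of each point belonging to each cluster (rows sum to 1)."""
    x = np.asarray(x, dtype=float)[:, None]
    # Weighted density of every point under every component
    weighted = weights * normal_pdf(x, mus, sigmas)
    return weighted / weighted.sum(axis=1, keepdims=True)

# Assumed (not estimated) mixture parameters
weights = np.array([0.5, 0.5])
mus = np.array([0.0, 5.0])
sigmas = np.array([1.0, 1.0])

data = np.array([-0.3, 2.4, 2.6, 5.2])

# Soft-clustering: one probability per point and cluster
probs = soft_assignments(data, weights, mus, sigmas)

# Hard-clustering: pick the cluster with the highest probability
hard_labels = probs.argmax(axis=1)
print(probs.round(3))
print(hard_labels)
```

The points at 2.4 and 2.6 sit close to the middle of the two components, so their probabilities are far less extreme than those of the other two points, yet the highest-probability rule still gives each of them a single label.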

In summary, when people talk about clustering, they are generally talking about grouping objects, and when people talk about mixture models, they are talking about using a mix of simple distributions to model a more complex distribution, either to identify subgroups or just to have a more flexible model to describe the data.
