Clustering

Both clustering and dimensionality reduction summarize the data. As just discussed in detail, dimensionality reduction compresses the data by representing it using new, fewer features that capture the most relevant information. Clustering algorithms, by contrast, assign existing observations to subgroups that consist of similar data points.

Clustering can serve to better understand the data through the lens of categories learned from continuous variables. It also permits automatically categorizing new objects according to the learned criteria. Examples of related applications include hierarchical taxonomies, medical diagnostics, and customer segmentation.

Alternatively, clusters can be used to represent groups as prototypes, using (for example) the midpoint of a cluster as the best representative of the learned grouping. Image compression is an example application.
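The prototype idea can be sketched with a minimal k-means (Lloyd's algorithm) implementation in numpy: each observation is replaced by the centroid of its cluster, so the data is summarized by k prototype values. The toy "image" below is an assumed example, not from the text:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's algorithm: returns centroids and cluster assignments."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen observations
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Toy "image": six pixel intensities compressed to k=2 prototype values
pixels = np.array([[0.1], [0.15], [0.12], [0.8], [0.85], [0.9]])
centroids, labels = kmeans(pixels, k=2)
compressed = centroids[labels]  # each pixel replaced by its cluster prototype
```

Storing only the k centroids plus one small integer label per pixel, instead of the full-precision values, is the essence of prototype-based compression.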

Clustering algorithms differ with respect to their strategies for identifying groupings:

  • Combinatorial algorithms select the most coherent of different groupings of observations
  • Probabilistic modeling estimates distributions that most likely generated the clusters
  • Hierarchical clustering finds a sequence of nested clusters that optimizes coherence at any given stage
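To make the hierarchical strategy concrete, the following sketch records the merge sequence of single-linkage agglomerative clustering (one common linkage choice; the example data is illustrative). Each step fuses the two clusters whose closest members are nearest, yielding the nested sequence described above:

```python
import numpy as np

def single_linkage(X):
    """Return the merge sequence of single-linkage agglomerative clustering."""
    clusters = [[i] for i in range(len(X))]  # start: every point is its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest minimum point-to-point distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]  # fuse the closest pair
        del clusters[b]
    return merges

X = np.array([[0.0], [0.1], [1.0], [1.1], [5.0]])
merge_seq = single_linkage(X)  # nested clusters, from tightest to loosest
```

Reading the merge distances off this sequence is exactly how a dendrogram is built: cutting it at any distance threshold yields one of the nested clusterings.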

Algorithms also differ in their notion of what constitutes a useful collection of objects, which needs to match the data characteristics, the domain, and the goals of the application. Types of groupings include the following:

  • Clearly separated groups of various shapes
  • Prototype-based or center-based compact clusters
  • Density-based clusters of arbitrary shape
  • Connectivity-based or graph-based clusters

Important additional aspects of a clustering algorithm include the following:

  • Whether it requires exclusive cluster membership
  • Whether it makes hard (binary) or soft (probabilistic) assignment
  • Whether it is complete and assigns all data points to clusters
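The hard-versus-soft distinction can be illustrated with a small numpy sketch. For two assumed cluster centers (chosen for illustration, not fitted), a hard assignment picks the nearest center, while a soft assignment distributes membership probabilities across centers, here via normalized Gaussian-kernel weights in the style of mixture-model responsibilities:

```python
import numpy as np

# Two assumed cluster centers (illustrative, not fitted to data)
centers = np.array([[0.0], [1.0]])
points = np.array([[0.1], [0.5], [0.9]])

# Squared distance of each point to each center
sq_dist = ((points[:, None] - centers[None]) ** 2).sum(axis=2)

# Soft (probabilistic) assignment: normalized Gaussian-kernel weights
weights = np.exp(-sq_dist)
soft = weights / weights.sum(axis=1, keepdims=True)  # rows sum to 1

# Hard (binary) assignment: each point belongs entirely to its nearest center
hard = sq_dist.argmin(axis=1)
```

Note how the middle point, equidistant from both centers, receives a 50/50 soft assignment but is forced into a single cluster under hard assignment.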

The following sections introduce key algorithms, including k-Means, hierarchical, and density-based clustering, as well as Gaussian mixture models. The clustering_algos notebook compares the performance of these algorithms on different, labeled datasets to highlight their strengths and weaknesses. It uses mutual information (see Chapter 6, The Machine Learning Process) to measure the congruence of cluster assignments and labels.
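As a sketch of the evaluation idea, mutual information can be computed from the contingency table of cluster assignments and true labels; the label vectors below are made-up examples. (Libraries such as scikit-learn provide this, along with adjusted variants that correct for chance.)

```python
import numpy as np

def mutual_info(labels_a, labels_b):
    """Mutual information (in nats) between two label assignments."""
    a_vals, a_idx = np.unique(labels_a, return_inverse=True)
    b_vals, b_idx = np.unique(labels_b, return_inverse=True)
    # Joint distribution estimated from the contingency table
    joint = np.zeros((a_vals.size, b_vals.size))
    for i, j in zip(a_idx, b_idx):
        joint[i, j] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0  # skip empty cells to avoid log(0)
    return (joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz])).sum()

true_labels = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]      # same grouping, clusters merely relabeled
unrelated = [0, 1, 0, 1]    # grouping carries no information about the labels
```

Because mutual information is invariant to relabeling, the permuted-but-identical grouping scores the maximum (log 2 nats for two balanced clusters), while the unrelated grouping scores zero.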
