Clustering

Both clustering and dimensionality reduction summarize the data. As just discussed in detail, dimensionality reduction compresses the data by representing it using new, fewer features that capture the most relevant information. Clustering algorithms, by contrast, assign existing observations to subgroups that consist of similar data points.

Clustering can serve to better understand the data through the lens of categories learned from continuous variables. It also permits automatically categorizing new objects according to the learned criteria. Examples of related applications include hierarchical taxonomies, medical diagnostics, and customer segmentation.

Alternatively, clusters can be used to represent groups as prototypes, using (for example) the midpoint of a cluster as the best representative of the learned grouping. Image compression is an example application.
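The prototype idea can be sketched with a minimal k-means (Lloyd's algorithm) implementation in numpy: each observation is replaced by the centroid of its cluster, so the data is summarized by k prototype values. The toy "image" below is an assumed example, not from the text:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's algorithm: returns centroids and cluster assignments."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen observations
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Toy "image": six pixel intensities compressed to k=2 prototype values
pixels = np.array([[0.1], [0.15], [0.12], [0.8], [0.85], [0.9]])
centroids, labels = kmeans(pixels, k=2)
compressed = centroids[labels]  # each pixel replaced by its cluster prototype
```

Storing only the k centroids plus one small integer label per pixel, instead of the full-precision values, is the essence of prototype-based compression.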

Clustering algorithms differ with respect to their strategies for identifying groupings:

  • Combinatorial algorithms select the most coherent of different groupings of observations
  • Probabilistic modeling estimates distributions that most likely generated the clusters
  • Hierarchical clustering finds a sequence of nested clusters that optimizes coherence at any given stage
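To make the hierarchical strategy concrete, the following sketch records the merge sequence of single-linkage agglomerative clustering (one common linkage choice; the example data is illustrative). Each step fuses the two clusters whose closest members are nearest, yielding the nested sequence described above:

```python
import numpy as np

def single_linkage(X):
    """Return the merge sequence of single-linkage agglomerative clustering."""
    clusters = [[i] for i in range(len(X))]  # start: every point is its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest minimum point-to-point distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]  # fuse the closest pair
        del clusters[b]
    return merges

X = np.array([[0.0], [0.1], [1.0], [1.1], [5.0]])
merge_seq = single_linkage(X)  # nested clusters, from tightest to loosest
```

Reading the merge distances off this sequence is exactly how a dendrogram is built: cutting it at any distance threshold yields one of the nested clusterings.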

Algorithms also differ in their notion of what constitutes a useful collection of objects, which needs to match the data characteristics, the domain, and the goals of the application. Types of groupings include the following:

  • Clearly separated groups of various shapes
  • Prototype-based or center-based compact clusters
  • Density-based clusters of arbitrary shape
  • Connectivity-based or graph-based clusters

Important additional aspects of a clustering algorithm include the following:

  • Whether it requires exclusive cluster membership
  • Whether it makes hard (binary) or soft (probabilistic) assignment
  • Whether it is complete and assigns all data points to clusters
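The hard-versus-soft distinction can be illustrated with a small numpy sketch. For two assumed cluster centers (chosen for illustration, not fitted), a hard assignment picks the nearest center, while a soft assignment distributes membership probabilities across centers, here via normalized Gaussian-kernel weights in the style of mixture-model responsibilities:

```python
import numpy as np

# Two assumed cluster centers (illustrative, not fitted to data)
centers = np.array([[0.0], [1.0]])
points = np.array([[0.1], [0.5], [0.9]])

# Squared distance of each point to each center
sq_dist = ((points[:, None] - centers[None]) ** 2).sum(axis=2)

# Soft (probabilistic) assignment: normalized Gaussian-kernel weights
weights = np.exp(-sq_dist)
soft = weights / weights.sum(axis=1, keepdims=True)  # rows sum to 1

# Hard (binary) assignment: each point belongs entirely to its nearest center
hard = sq_dist.argmin(axis=1)
```

Note how the middle point, equidistant from both centers, receives a 50/50 soft assignment but is forced into a single cluster under hard assignment.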

The following sections introduce key algorithms, including k-Means, hierarchical, and density-based clustering, as well as Gaussian mixture models. The clustering_algos notebook compares the performance of these algorithms on different, labeled datasets to highlight their strengths and weaknesses. It uses mutual information (see Chapter 6, The Machine Learning Process) to measure the congruence of cluster assignments and labels.
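As a sketch of the evaluation idea, mutual information can be computed from the contingency table of cluster assignments and true labels; the label vectors below are made-up examples. (Libraries such as scikit-learn provide this, along with adjusted variants that correct for chance.)

```python
import numpy as np

def mutual_info(labels_a, labels_b):
    """Mutual information (in nats) between two label assignments."""
    a_vals, a_idx = np.unique(labels_a, return_inverse=True)
    b_vals, b_idx = np.unique(labels_b, return_inverse=True)
    # Joint distribution estimated from the contingency table
    joint = np.zeros((a_vals.size, b_vals.size))
    for i, j in zip(a_idx, b_idx):
        joint[i, j] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0  # skip empty cells to avoid log(0)
    return (joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz])).sum()

true_labels = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]      # same grouping, clusters merely relabeled
unrelated = [0, 1, 0, 1]    # grouping carries no information about the labels
```

Because mutual information is invariant to relabeling, the permuted-but-identical grouping scores the maximum (log 2 nats for two balanced clusters), while the unrelated grouping scores zero.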
