Identifying groups using clustering methods

Clustering methods are designed to find hidden patterns or groupings in a dataset. Unlike the supervised learning methods covered in previous sections (regression and classification), these algorithms identify groupings without any labels to learn from. They do so by assigning elements to clusters on the basis of their mutual similarity.

This is an unsupervised learning technique that groups statistical units so as to minimize the intra-group distance and maximize the inter-group distance. These distances are quantified by means of similarity/dissimilarity measures defined between the statistical units. In the following graph, four clusters are identified in a specific data distribution:

To perform cluster analysis, no prior interpretative model is required. Unlike other multivariate statistical techniques, cluster analysis makes no a priori assumptions about the fundamental typologies that may characterize the observed sample. It plays an exploratory role, looking for structures that exist in the data but have not yet been identified, in order to infer the most plausible grouping. Cluster analysis is a purely empirical method of classification and, as such, first and foremost an inductive technique.

In clustering, as in classification, we are interested in finding the rule that allows us to assign observations to the correct class. Unlike in classification, however, we must also find a plausible subdivision into classes.

While in classification we have some help from the target (the labels provided in the training set), in clustering we cannot rely on any additional information and must deduce the classes by studying the spatial distribution of the data.

The regions where the data is denser correspond to groups of similar observations. If we can identify observations that are similar to each other and, at the same time, different from those of another cluster, we can assume that the two clusters correspond to different conditions. At this point, there are two things we need to examine more deeply:

  • How to measure similarity?
  • How to define a grouping?

The concept of distance and how to define a group are the two ingredients that describe a clustering algorithm. To approach a clustering problem, several methods are available; some of these are listed here:

  • Hierarchical clustering
  • K-means method
  • K-medoids method
  • Gaussian mixture models

Clustering involves identifying groupings of data. This is made possible by measuring the proximity between elements. The term proximity is used to refer to either similarity or dissimilarity. A group of data can therefore be defined once you have chosen how to define the concept of similarity or dissimilarity. In many approaches, this proximity is conceived in terms of distance in a multidimensional space.

By the term similarity between two objects, we refer to a numerical measure of the degree to which the two objects are alike. Conversely, by the term dissimilarity between two objects, we refer to a numerical measure of the degree to which they are unlike.

Similarities/dissimilarities between data objects can be measured by distance. Distances are dissimilarities that satisfy certain properties. Euclidean distance is the most common choice, but distance can be measured in many other ways; some of them are listed here:

  • Minkowski distance metric
  • Manhattan distance metric
  • Cosine distance metric
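The metrics listed above can all be computed with `scipy.spatial.distance`; the two sample vectors below are arbitrary assumptions chosen only to show the calls:

```python
# Computing common distance metrics with SciPy; the vectors a and b
# are arbitrary illustrative values.
from scipy.spatial import distance

a = [1.0, 2.0, 3.0]
b = [4.0, 0.0, 3.0]

print(distance.euclidean(a, b))        # straight-line distance
print(distance.minkowski(a, b, p=3))   # generalizes Euclidean (p=2) and Manhattan (p=1)
print(distance.cityblock(a, b))        # Manhattan (city-block) distance
print(distance.cosine(a, b))           # 1 - cosine similarity
```

Note that Minkowski distance is a family: setting `p=1` recovers the Manhattan metric and `p=2` the Euclidean metric.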

Once you have chosen a way to calculate the distance, you have to decide how to form groups. Two main algorithm families can be identified:

  • Hierarchical clustering works by creating a hierarchy of the data: The data is described through a taxonomic tree, similar to those used in biology
  • Partitioning clustering works by creating a partition of the data space: The space is divided into many subzones, whose union gives the full space and which do not overlap with one another
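The hierarchical family can be sketched with SciPy, which builds exactly the taxonomic tree described above and then lets us cut it into groups; the linkage method (`ward`) and the synthetic data are illustrative choices:

```python
# A minimal sketch of hierarchical (agglomerative) clustering with SciPy;
# linkage method and data are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(0.0, 0.3, size=(20, 2)),   # first blob
    rng.normal(4.0, 0.3, size=(20, 2)),   # second blob
])

tree = linkage(points, method='ward')               # builds the taxonomic tree
labels = fcluster(tree, t=2, criterion='maxclust')  # cut the tree into 2 groups
print(sorted(set(labels.tolist())))                 # the distinct group labels
```

Because the tree records every merge, the same `tree` can be cut at different levels to obtain coarser or finer groupings without re-running the algorithm, which is the main practical advantage over partitioning methods.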

To summarize, in cluster analysis we try to maximize intra-cluster similarity (internal homogeneity) and to minimize inter-cluster similarity (external separation).
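This intra-/inter-cluster criterion can be made concrete with a small NumPy-only computation; the two synthetic clusters and the mean-pairwise-distance measure are illustrative assumptions:

```python
# Comparing internal homogeneity with external separation on two
# synthetic clusters (NumPy only; data and measure are illustrative).
import numpy as np

rng = np.random.default_rng(42)
cluster_a = rng.normal(0.0, 0.5, size=(30, 2))
cluster_b = rng.normal(5.0, 0.5, size=(30, 2))

def mean_pairwise(x, y):
    """Mean Euclidean distance between every point of x and every point of y."""
    diffs = x[:, None, :] - y[None, :, :]
    return np.linalg.norm(diffs, axis=-1).mean()

intra = (mean_pairwise(cluster_a, cluster_a) + mean_pairwise(cluster_b, cluster_b)) / 2
inter = mean_pairwise(cluster_a, cluster_b)

print(intra < inter)  # → True: a good clustering keeps intra-cluster distances small
```

Many cluster-quality indices, such as the silhouette coefficient, formalize exactly this comparison between within-cluster and between-cluster distances.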
