Chapter 4. Cluster Analysis

Unsupervised cluster analysis refers to algorithms that aim to produce homogeneous groups of cases from unlabeled data. The algorithm doesn't know the group memberships beforehand; its goal is to discover the structure of the data from similarities (or differences) between the cases. A cluster is a group of cases, observations, individuals, or other units that are similar to each other on the considered characteristics. These characteristics can be anything measurable or observable. The choice of characteristics, or attributes, is important, as different attributes will lead to different clusters.

In this chapter, we will discuss the following topics:

  • Distance measures
  • Partition clustering with k-means, including the steps in the computations of clusters, and the selection of the best number of clusters
  • Applications of k-means clustering

Clustering algorithms use distance measures between the cases in order to create these homogeneous groups (we will discuss distance measures next). It is therefore important to transform the data on all dimensions to a similar scale before performing partition clustering with tools such as kmeans(). This is important because the distances are computed from all the dimensions we consider. If one dimension has a much larger range of values than the others (for example, values in centimeters compared to meters), differences in this dimension will be given much more weight than the others. This is also true for agglomerative clustering, with tools such as hclust(), or any algorithm using distance measures, such as nearest neighbor classification, which we will discover in an upcoming chapter.

There are several ways of scaling the dimensions for this purpose. Examples include:

  • In the case of convertible units (for example, centimeters and meters), converting one unit to the other (for example, dividing an attribute measured in centimeters by 100).
  • Normalizing, that is, subtracting the minimum value from each observation and dividing the result by the difference between the maximum and the minimum. The mathematical equation is provided next:
    $$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
  • The use of z scores, which can be computed by subtracting the mean from the value of each observation, and dividing the result by the standard deviation. The mathematical equation is provided next:
    $$z = \frac{x - \bar{x}}{s}$$

    This should be preferred in most cases, and can be done easily in R using the scale() function (see the sketch after this list).

  • When using scores based on textual data, one can compute, for each term and each document, the frequency of appearance of the term (relative to the total number of words in the document) multiplied by the logarithm of the ratio of the total number of documents to the number of documents featuring the considered term. This is known as the tf-idf weighting. In the equation, t is the term under consideration, d is a specific document, D is the corpus containing all documents, and N is the number of documents in D:
    $$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \log\frac{N}{|\{d \in D : t \in d\}|}$$
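To make these scaling options concrete, here is a minimal R sketch; the data frame, the normalize() helper, and the toy tf-idf counts are all made up for illustration, while scale() is the base R function mentioned above:

    # Illustrative data (made-up values): height in centimeters, weight in kilograms
    measures <- data.frame(height = c(180, 170, 165, 192),
                           weight = c(90, 100, 60, 81))

    # Min-max normalization: (x - min) / (max - min), applied column by column
    normalize <- function(x) (x - min(x)) / (max(x) - min(x))
    measures_norm <- as.data.frame(lapply(measures, normalize))

    # z-scores: subtract the mean, divide by the standard deviation; scale() does both
    measures_z <- scale(measures)

    # tf-idf for a single term in a single document (toy counts, purely illustrative)
    term_count_in_doc <- 4      # occurrences of the term in the document
    words_in_doc      <- 200    # total number of words in the document
    n_docs            <- 1000   # N: total number of documents in the corpus
    docs_with_term    <- 50     # number of documents in which the term appears
    (term_count_in_doc / words_in_doc) * log(n_docs / docs_with_term)

Note that normalized values are bounded between 0 and 1, whereas z-scores are centered on 0 with a standard deviation of 1.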

Distance measures

Partitioning clustering algorithms iteratively define k cluster centers and assign cluster membership (or a probability of membership) to cases based on the distances between each case and the cluster centers. Agglomerative clustering algorithms also create clusters based on distances, starting with each individual in its own cluster and then merging clusters two by two. The k-nearest neighbors algorithm also uses distance measures.

Consider only one attribute, for instance the height of individuals. The distance between someone measuring 180 cm and someone measuring 170 cm is 10 on this sole dimension, taking the algebraic difference between the two measures as our distance metric. Things get a little more complicated when we add more attributes, such as weight (we will not consider variable scaling here). Let's say the first individual has a normal weight (90 kg), and the second is clearly overweight (100 kg). Considering only the sum of the signed differences between the measures as our distance metric, the difference between the individuals would be: (180-170) + (90-100) = 0. This clearly doesn't reflect the substantial differences between these individuals; one is taller and slimmer than the other. Several distance metrics are available. Here are some examples:

The metrics closest to the sum of differences measure we just examined are the Euclidean and the Manhattan distances.

The Manhattan distance sums the absolute value of the differences on all considered dimensions. Its mathematical equation is provided next:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

Take the case of our previous example: abs(180-170) + abs(90-100) = 20. For a single dimension, it is equivalent to the absolute difference between the two observations: abs(180-170) = 10. The Manhattan distance can be used with hclust(), which we will discover later (via a distance matrix computed with dist()), but not with kmeans().
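As a quick check of the arithmetic above, the Manhattan distance between the two individuals of our example can be computed with dist() from the stats package (a standalone sketch):

    # Two individuals: height (cm) and weight (kg) from the example above
    individuals <- rbind(c(180, 90), c(170, 100))

    # Manhattan distance: the sum of absolute differences across dimensions
    dist(individuals, method = "manhattan")
    # abs(180 - 170) + abs(90 - 100) = 20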

The Euclidean distance sums the squares of the differences and then takes the square root of the result. In the case of only one dimension, the result is equal to the absolute difference between the observations. Its mathematical equation is provided below:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Considering our previous example, sqrt((170-180)^2) = 10, just like 180-170 = 10. The Euclidean distance is the only distance available in kmeans() from the stats package (provided by default in R). We will only work with this function here (along with one of our own, in order to better understand how k-means works). But, in case you need to run k-means using other distances, you might be happy to know that the Kmeans() function from the amap package allows you to select other distances.
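Here is the same toy example with the Euclidean distance, followed by a minimal kmeans() call on scaled data (the iris dataset is used only because it ships with R; the choice of three clusters is arbitrary here):

    individuals <- rbind(c(180, 90), c(170, 100))

    # Euclidean distance: square root of the sum of squared differences
    dist(individuals, method = "euclidean")
    # sqrt((180 - 170)^2 + (90 - 100)^2) = sqrt(200), about 14.14

    # kmeans() always relies on Euclidean distances; remember to scale first
    set.seed(42)
    kmeans(scale(iris[, 1:4]), centers = 3)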

The cosine similarity is another important measure (a similarity rather than a distance). It is widely used in information retrieval and text mining. It is computed as the ratio of the dot product of the two vectors to the product of their norms, that is, the square roots of the sums of their squared values. The mathematical equation is provided below:

$$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$
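Base R has no dedicated cosine similarity function, but it is easy to compute directly from the definition above; the cosine_similarity() helper below is our own, shown as a minimal sketch:

    # Cosine similarity: dot product divided by the product of the vector norms
    cosine_similarity <- function(x, y) {
      sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
    }

    cosine_similarity(c(180, 90), c(170, 100))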

Tip

The skmeans() function from the skmeans package allows one to run k-means with the cosine similarity as the distance metric; kmeans(), the standard implementation of k-means in R, does not support it.

The correlation coefficient, an association measure, can also be used as a distance measure. Its mathematical formula is presented below.

$$r(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Note that the correlation coefficient can only be computed when at least two dimensions are considered.
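In R, the correlation between two cases measured on the same attributes is given by cor(); a common way to turn it into a dissimilarity, shown here as a sketch with made-up measurements, is to use 1 minus the correlation:

    # Made-up measurements for two cases described by three attributes
    x <- c(180, 90, 40)
    y <- c(170, 100, 38)

    cor(x, y)      # Pearson correlation by default
    1 - cor(x, y)  # one common correlation-based dissimilarity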

The Jaccard index measures the similarity between two sets. It is used for categorical attributes. It is computed by dividing the number of elements common to both sets by the sum of the common elements, the elements only in the first set, and the elements only in the second set. For instance, if A, B, C, and D are in set 1; C, D, E, and F are in set 2; and X, Y, Z, and A are in set 3, the similarity of set 1 and set 2 will be 0.33: 2/(2+2+2). With this metric, the similarity between set 1 and set 3 will be about 0.14: 1/(1+3+3). Its mathematical formula for two sets (A and B) is presented below:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
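The set example above can be checked with base R's set operations (the jaccard() helper is our own, written for illustration):

    set1 <- c("A", "B", "C", "D")
    set2 <- c("C", "D", "E", "F")
    set3 <- c("X", "Y", "Z", "A")

    # Jaccard index: size of the intersection divided by size of the union
    jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

    jaccard(set1, set2)  # 2/6 = 0.33
    jaccard(set1, set3)  # 1/7 = about 0.14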

The Jaccard index is not available in kmeans() either, but hclust() accepts a custom distance matrix, which can in this case be constructed with the vegdist() function from the vegan package (along with other distance measures). Distance matrices for the most common distance metrics can be computed using the dist() function from the stats package (provided with R).
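As a sketch of this workflow (assuming the vegan package is installed; the binary membership matrix below simply recodes the three sets from the example), a Jaccard distance matrix can be passed to hclust():

    library(vegan)  # provides vegdist()

    # Presence/absence data: rows are the sets, columns the elements A, B, C, D, E, F, X, Y, Z
    sets <- rbind(set1 = c(1, 1, 1, 1, 0, 0, 0, 0, 0),
                  set2 = c(0, 0, 1, 1, 1, 1, 0, 0, 0),
                  set3 = c(1, 0, 0, 0, 0, 0, 1, 1, 1))

    # Jaccard distances (1 minus the Jaccard index) between the rows
    d <- vegdist(sets, method = "jaccard", binary = TRUE)

    # Agglomerative clustering on the custom distance matrix
    plot(hclust(d))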

Readers interested in distance measures might find it worthwhile to have a look at the Encyclopedia of Distances by Michel Marie Deza and Elena Deza (second edition, 2013).
