Finding groups with clustering

Clustering is the process of grouping the data into classes or clusters so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. Dissimilarities are assessed based on the attribute values describing the objects.

There are a large number of clustering algorithms. The major methods can be classified into the following categories:

  • Partitioning methods: A partitioning method constructs K partitions of the data, which satisfy both of the following requirements:
    1. Each group must contain at least one object.
    2. Each object must belong to exactly one group.
    Given K, the number of partitions to construct, the method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. There are various criteria for judging the quality of the partitions. Two of the most popular partitioning methods are the K-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and the K-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster. You can see an example of K-means clustering in the previous chapter; that example used the scalable rxKmeans() function from the RevoScaleR package, which you can use on very large datasets as well. A short base-R sketch of K-means clustering follows this list.
  • Hierarchical methods: A hierarchical method creates a hierarchical decomposition of a given set of data objects. These methods can be agglomerative or divisive. The agglomerative (bottom-up) approach starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all groups are merged into one. The divisive (top-down) approach starts with all the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object forms its own cluster or until a termination condition holds.
  • Density-based methods: Methods based on the distance between objects can find only spherical-shaped clusters and encounter difficulty in discovering clusters of arbitrary shapes. So other methods have been developed based on the notion of density. The general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the "neighborhood" exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
  • Model-based methods: Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based technique might locate clusters by constructing a density function that reflects the spatial distribution of the data points. Unlike conventional clustering, which primarily identifies groups of like objects, this conceptual clustering goes one step further by also finding characteristic descriptions for each group, where each group represents a concept or a class.
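
The partitioning approach is easy to try directly in base R as well. The following is a minimal sketch of K-means clustering with the kmeans() function from the stats package; it assumes the TM data frame used in the previous chapter, and the choice of the two numeric variables and of two clusters is for illustration only:

# K-means on two numeric variables from TM (illustrative choice) 
TMnum <- TM[, c("YearlyIncome", "Age")]; 
KM <- kmeans(TMnum, centers = 2, nstart = 10); 
KM$centers;         # cluster means 
table(KM$cluster);  # cluster sizes 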

Hierarchical clustering model training typically starts by calculating a distance matrix, a matrix with the distances between data points in a multidimensional hyperspace, where each input variable defines one dimension of that hyperspace. The distance measure can be a geometric distance or some other, more complex measure. A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are also often used in computational biology to illustrate the clustering of genes or samples. The following figure shows the process of building an agglomerative hierarchical clustering dendrogram from six cases in a two-dimensional space (two input variables) in six steps:

Hierarchical clustering process

The following code creates a subset of the TM data with 50 randomly selected rows and the numerical columns only; you cannot show a readable dendrogram with thousands of cases, so hierarchical clustering is suitable for small datasets only:

# Random sample of 50 rows, numeric columns only 
TM50 <- TM[sample(1:nrow(TM), 50, replace = FALSE), 
           c("TotalChildren", "NumberChildrenAtHome",  
             "HouseOwnerFlag", "NumberCarsOwned",  
             "BikeBuyer", "YearlyIncome", "Age")]; 

Then you can calculate the distance matrix with the dist() function and create a hierarchical clustering model with the hclust() function:

# Euclidean distance matrix and Ward linkage hierarchical clustering 
ds <- dist(TM50, method = "euclidean"); 
TMCL <- hclust(ds, method = "ward.D2"); 
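
If you want a numeric view of the merges before plotting, you can inspect the merging heights stored in the model; a large jump between consecutive heights suggests a natural place to cut the tree. This is just an optional check on the TMCL object created above:

# Last few merging heights; a large jump suggests a natural cut 
tail(TMCL$height); 
diff(tail(TMCL$height)); 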

In the dendrogram, you can easily see how many clusters you should use: you make the cut where the increase in the merging distance between successive steps is the largest. You can plot the model to see the dendrogram. The following code plots the model, then uses the cutree() function to define two clusters and the rect.hclust() function to draw two rectangles around the two clusters:

# Plot the dendrogram, cut the tree into two clusters, 
# and mark the clusters with rectangles 
plot(TMCL, xlab = NULL, ylab = NULL); 
groups <- cutree(TMCL, k = 2); 
rect.hclust(TMCL, k = 2, border = "red"); 

You can see the final result in the following screenshot:

A dendrogram with two clusters

The dendrogram shows how the clusters grew and how the cases were associated together. Please note that the decision to cut the population into two clusters is arbitrary: a cut into three or five clusters would work as well. You decide how many clusters to use from your business perspective, because the clusters must be meaningful for you.
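
If you are not sure how many clusters to use, you can cut the same tree at several values of k and compare the results before deciding. The following is a minimal sketch, using only the TMCL model and the TM50 data created above:

# Compare cuts into two and three clusters 
groups2 <- cutree(TMCL, k = 2); 
groups3 <- cutree(TMCL, k = 3); 
table(groups2);  # cluster sizes for k = 2 
table(groups3);  # cluster sizes for k = 3 

# Average yearly income per cluster for the three-cluster cut 
tapply(TM50$YearlyIncome, groups3, mean); 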
