Cluster analysis

The idea behind clustering is very simple: group similar things together. Nonetheless, there are many different ways to perform this simple task, and none of them is one-size-fits-all. It is hard to narrow down the playing field for clustering: there are countless real-world applications, and many more still to be unveiled.

In 2005, the scientists Garibaldi and Wang published a paper showing how clustering could aid cancer diagnosis. Industry, for its part, has long been using clustering to draw recommendations, segment markets, and detect fraud. Social media analysis also sits among the traditional uses of clustering.

In this section, we will look into the practical concerns of running a hierarchical clustering with R. Unlike k-means clustering, hierarchical clustering is better suited to smaller datasets. Hierarchical clustering also does not depend on seemingly random processes: the same input yields the same result no matter how many times the code is run.
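
To see the difference in practice, here is a minimal sketch with made-up numbers (the vector x below is purely illustrative): k-means may regroup or relabel points across runs, while hclust() always returns the same tree.

x <- c(2, 3, 4, 10, 11, 12, 20, 21, 22)  # illustrative data only
# k-means starts from random centers, so repeated runs may disagree
identical(kmeans(x, centers = 3)$cluster,
          kmeans(x, centers = 3)$cluster)  # may be FALSE
# hierarchical clustering is deterministic: same input, same tree
identical(cutree(hclust(dist(x)), k = 3),
          cutree(hclust(dist(x)), k = 3))  # always TRUE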

The Ckmeans.1d.dp package is designed to perform optimal univariate (one-dimensional) clustering.
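
As a quick sketch of how it could be applied to our counts (assuming the package is installed; the choice of k = 3 here is arbitrary):

library(Ckmeans.1d.dp)
# dynamic programming guarantees the optimal one-dimensional partition
res <- Ckmeans.1d.dp(clean_dt$n[1:25], k = 3)
res$cluster  # optimal cluster assignment for each count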

To make things easier to understand, let's stick to the top 25 tweeted packages only. Hierarchical clustering can be performed with the following code:

# compute pairwise distances between the top 25 counts and cluster them
tw_clust <- hclust(dist(clean_dt$n[1:25]))
# plot the dendrogram, labeling the leaves with the package names
plot(tw_clust, labels = clean_dt$word[1:25])

The first line feeds hclust() with a distance matrix obtained through dist(), and the result is the hierarchical cluster. The results are stored in an object called tw_clust (Twitter cluster), which is then plotted using plot(). This example demonstrates how hard it is to keep visualization and analysis apart. Even though the section Visualizing data was focused on visualization, it was also drawing a frequency analysis. Likewise, this section's analysis could hardly be interpreted without the aid of a dendrogram:

Figure 4.9: Cluster dendrogram
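
For the record, the call above relies on defaults: dist() computes Euclidean distances and hclust() uses complete linkage. Writing them out, as in the equivalent sketch below, makes those choices explicit and easy to change:

# equivalent to the earlier call, with the defaults spelled out
tw_dist <- dist(clean_dt$n[1:25], method = 'euclidean')
tw_clust <- hclust(tw_dist, method = 'complete')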

There are two reasons why Figure 4.9 should not be taken at face value:

  • The data might be polluted: many words, such as fun or population, are being counted even though they never refer to R packages
  • There are 25 packages and hence up to 25 groups: depending on where the tree is cut, Figure 4.9 shows any number of clusters between 1 and 25

Both puzzles can be tackled with either simpler or more complex techniques. Since point number one was already discussed before, let's move on to point number two. The way Figure 4.9 is drawn, the reader can't tell how many clusters (groups) there are. For the analysis to be complete, we need to decide how many groups to accept.

A very common procedure is to pick how many groups (k) are wanted and then cut the tree at the height that produces that many groups. The dendextend package provides useful methods to handle this division while also drawing the visualization. Install and load the package:

# install dendextend if it is not already available, then load it
if(!require(dendextend)){ install.packages('dendextend') }
library(dendextend)

Once dendextend is properly installed, it's time for us to craft a brand new dendrogram. Try the following code:

tw_clust %>%
  as.dendrogram() %>%
  # label the leaves with the package names
  set('labels', as.character(clean_dt$word[1:25])) %>%
  # color the labels according to three groups
  set('labels_col', k = 3,
      value = c('#e66101', '#fdb863', '#b2abd2')) %>%
  # color the branches to match the labels
  set('branches_k_color', k = 3,
      value = c('#e66101', '#fdb863', '#b2abd2')) %>%
  # enlarge the labels slightly
  set('labels_cex', 1.1) %>%
  plot()

In practice, this package works by chaining set() calls after a dendrogram object; that is why the first step of the chain converts our tw_clust object into a dendrogram. The first set() defines the labels. The next ones respectively change the labels' colors, the branches' colors, and the labels' size. The following is the result:

Figure 4.10: Colored dendrogram (k = 3)

A very common rule of thumb sets k, the number of groups, to 3. Our dendrogram relies on this rule, but what matters most is how your clusters perform in the designated task: set aside test and validation datasets and check how the model performs on them. Use the cutree() function to do the separation. The cut can be made either by number of groups (k) or by height (h):

cutree(tw_clust, k = 3)    # cut into exactly three groups
cutree(tw_clust, h = 250)  # cut at height 250
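
To read the assignments more comfortably, the package names can be listed by group; a quick sketch using the same objects as before:

# show which of the 25 package names fall into each of the three groups
groups <- cutree(tw_clust, k = 3)
split(as.character(clean_dt$word[1:25]), groups)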

How would you interpret the results given in Figure 4.10? Don't surrender to the urge to see them as packages that tend to be used together or something like that. Nor are they clustered by what they can do. The packages are grouped by how popular they seemed, since the clustering stems from a term frequency analysis.
