CHAPTER 10: Unsupervised Machine Learning


One of the fundamentals of data science is finding patterns in the data at hand. As you may remember from Chapter 5, this is accomplished mainly during the data exploration part of the pipeline. Although the various statistical methods and plots yield much useful insight, sometimes we must employ a more sophisticated approach. This is particularly useful if we have certain expectations about the data (such as different groups or patterns that should exist) and wish to explore them further.

This lends itself to what is known as unsupervised learning: the kind of machine learning where labels are unknown or nonexistent altogether. The aim of this strategy is to find these labels in a way that makes sense for the data at hand, while also uncovering the structure of the dataset. In this chapter we’ll explore this topic in some depth to examine how unsupervised learning can be done in Julia effectively.

In this chapter we will examine the following topics:

  • The basics of unsupervised learning, including the underlying distance metrics
  • The K-means method for clustering
  • The density concept and the DBSCAN clustering technique
  • Hierarchical clustering
  • Validation metrics for clustering
  • Tips for applying unsupervised learning effectively

Before we continue, make sure that the Clustering package is installed on your system.

Unsupervised Learning Basics

Unsupervised learning is extremely important for a variety of reasons. First of all, it provides insight about patterns in the data that you would not be able to pinpoint with other methods. This is particularly useful during the data exploration stage where you are uncertain what signals the data contains. You could find these signals with other means, but it would take a significant amount of additional effort.

This brings us to the next point: the amount of time unsupervised learning saves. This extends beyond data exploration to something even more important when it comes to putting your work into production. A classic example is figuring out the labels of a dataset, which is possible with one of the methods of unsupervised learning. There are numerous ways of finding labels (the most obvious of which is hiring people to do it manually), but these methods don’t scale well. Unsupervised learning is fast and scalable, and comparatively cheap.

Finally, this machine learning approach allows us to organize the data at hand and understand it better. At the same time we can make more meaningful plots that can shed additional light on the problem we are trying to solve. And unlike conventional plots, unsupervised learning techniques provide you with a variety of metrics. There are several unsupervised learning techniques, all of which have their niche. The most important of these are the following:

  • Clustering, a group of processes that identify groups in the data
  • Association rule discovery, a method that identifies collections of feature values that often appear together
  • Mixture decomposition, a statistical methodology that aims to identify parametric densities of individual populations constituting a “superpopulation.”

There are several other methods that qualify as unsupervised learning, though they are seldom encountered in data science projects. In this chapter we will focus on the first type of unsupervised learning, as it is the most commonly used and is crucial in providing actionable information about the data at hand.

Clustering types

There are two main types of clustering, depending on the inner workings of the clustering method: partitional and hierarchical. The first type splits the data into a number of mutually exclusive groups (usually denoted as K), while the second one creates a tree where the most similar elements are grouped together gradually. Partitional clustering is particularly useful for creating a labels vector that could be used later on for classification. Both partitional and hierarchical clustering are excellent tools for data exploration.

Apart from this categorization, there are other ways to differentiate clustering methods. One is hard versus fuzzy, depending on whether a data point belongs to a single cluster exclusively or to several clusters to some degree. Another is deterministic versus stochastic, depending on whether the algorithm employs randomness.

In all the clustering methods, the points compared are evaluated as to how similar (or dissimilar) they are to each other, in order to determine whether they belong to the same group or not. For this we usually use a distance metric, such as Euclidean distance. However, there are several other alternatives, as we’ll see in the following section.

Distance metrics

There are several metrics that can be used to express the distance between two data points, the most important of which are listed below. For all the examples we make use of two float vectors x and y of equal dimensionality. In order to utilize these metrics in Julia, you’ll need to add and load the Distances package (check http://bit.ly/29vRs0A for more information).

  • Euclidean distance. This is the default distance metric, which works well for small dimensionalities. It is always non-negative and without an upper limit. It is good practice to normalize its inputs, to avoid biased results.
  • Cosine distance. This is the corresponding distance metric to the cosine similarity measure we have seen previously: cos_dist = 1 - cos_sim. It is good for cases where there are vast differences in particular parts of the vectors compared, and works well with all kinds of dimensionalities. Cosine distance takes values between 0 and 2, inclusive, and doesn’t require normalization beforehand.
  • Manhattan distance (aka City-Block distance). This is very much like the Euclidean distance but it uses absolute values instead of squares, when computing the distances in each dimension. It takes non-negative values without an upper limit, and just like Euclidean distance, it is a good idea to normalize its inputs before calculating this distance metric.
  • Jaccard distance. This is an interesting metric that works not only on numeric data but also on Boolean data. It is defined as JD = 1 - J(x,y), where J is the Jaccard similarity we have seen in a previous chapter. The above formula is equivalent to JD = 1 - sum(min.(x, y)) / sum(max.(x, y)), where min. and max. are applied element-wise. Jaccard distance takes values between 0 and 1, inclusive. Just like cosine distance, this distance metric doesn’t require normalization beforehand.
  • Other. There are several other distances you can use, depending on the data you have. You can view them in the documentation of the Distances.jl package: http://bit.ly/29vRs0A.

So, if we have a couple of vectors x = [1.0, 2.0, 3.0] and y = [-1.0, 2.5, 0.0], we could calculate their distance in Julia using some of the above metrics:

In[1]: using Distances

    x = [1.0, 2.0, 3.0]; y = [-1.0, 2.5, 0.0]

    d = evaluate(Euclidean(), x, y)

Out[1]: 3.640054944640259

In[2]: d = evaluate(CosineDist(), x, y)

Out[2]: 0.6029666664116279

In[3]: d = evaluate(Cityblock(), x, y)  

Out[3]: 5.5

In[4]: d = evaluate(Jaccard(), x, y)  

Out[4]: 0.8461538461538461

Grouping Data with K-means

K-means is probably the most commonly used clustering technique, due to its simplicity and speed. Using a single parameter, K, it splits the data into K different groups (referred to as clusters), based on how close each data point is to the cluster center. The cluster centers change in every iteration of the algorithm, based on the data points that belong to these clusters; they are usually calculated as the means of the values of the data points in them (hence the name of the algorithm). The initial cluster centers are chosen randomly, so every run of the K-means algorithm can yield slightly different results. The algorithm stops once the cluster centers no longer change significantly, or once a maximum number of iterations has been reached.
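
To make the procedure concrete, here is a minimal sketch of the core K-means loop, written for illustration only (in practice you would call the kmeans() function of the Clustering package, as we do later in this section); the function name and keyword defaults are our own:

    using Statistics, Random

    # A bare-bones K-means: X has one data point per row, K is the number of clusters.
    function simple_kmeans(X::Matrix{Float64}, K::Int; maxiter::Int = 100, tol::Float64 = 1e-6)
        n = size(X, 1)
        centers = X[randperm(n)[1:K], :]    # K distinct random data points as the initial centers
        labels = zeros(Int, n)
        for iter in 1:maxiter
            # assignment step: each point joins the cluster with the nearest center
            for i in 1:n
                labels[i] = argmin([sum((X[i,:] .- centers[k,:]).^2) for k in 1:K])
            end
            # update step: each center becomes the mean of its members
            new_centers = copy(centers)
            for k in 1:K
                members = findall(labels .== k)
                if !isempty(members)
                    new_centers[k,:] = vec(mean(X[members,:], dims = 1))
                end
            end
            # stop once the centers barely move
            if maximum(abs.(new_centers .- centers)) < tol
                centers = new_centers
                break
            end
            centers = new_centers
        end
        return labels, centers
    end

The Clustering package’s implementation is considerably more refined (better initialization and convergence handling), so treat the above purely as a way to build intuition about the algorithm.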

Although in the original version of K-means the cluster centers are calculated based on the mean of the data points that belong to them, there are a few variants of the method where different averaging metrics (such as the median) are used. Also, there is a powerful variant of K-means that uses fuzzy logic, so each data point belongs to all possible clusters with some membership value, known as C-means. Although these variations are tempting, we’ll limit ourselves to the most basic method of K-means in this section, as this is the one most often used in practice. In Figure 10.1 you can see a typical output of the K-means algorithm (which is representative of other partitional clustering algorithms, too).


Figure 10.1 An example of a typical K-means clustering output.

K-means using Julia

For K-means in Julia we’ll be using the Clustering package, which also includes some other clustering techniques (see http://bit.ly/29vRs0A for details on the package’s contents). Once you add it and load it into memory you can start using it on your data.

Let’s look at how we can apply it to the OnlineNewsPopularity dataset. Here we will work under the hypothesis that there are two distinct categories of news articles (the popular ones and the not-so-popular ones), so we’ll use K = 2:

In[5]: using Clustering

In[6]: data = readdlm("d://data//OnlineNewsPopularity//OnlineNewsPopularity.csv", ',');

In[7]: F = map(float, data[2:end,2:60]) #make all features floats

In[8]: F = normalize(F, "linear")

In[9]: Z = kmeans(F', 2)

We need to transpose the inputs (features) since the kmeans() function takes its data in a different form (each row being a different feature and each column a data point). Also, since there are several outputs, it is best to put them all in a single data structure (variable Z) so that later we can retrieve them individually through that variable. Specifically, if we were interested in the cluster labels, we would type:

In[10]: labels = assignments(Z); show(labels[1:10])

    [1,1,1,1,1,1,1,1,1,2]

Of course, there are several more values in the labels vector; here we just view the first ten to get an idea of what it looks like. The numbers that correspond to the two clusters are chosen arbitrarily by the algorithm and may differ in the next run: even if the clusters themselves are more or less identical, the data points shown here as 1s may be labeled as 2s another time.

We could manually count all these labels to get an idea of how large each cluster is. Thankfully, the package has a specialized function, counts(), that you can apply to accomplish this task easily:

In[11]: c = counts(Z); show(c)

    [31208,8436]

What about the centers of the clusters? We can obtain them from the outputs data structure with the .centers identifier:

In[12]: C = Z.centers; show(C[1:5,:])

    [0.505558031543492 0.3821391822047006

    0.39736453412433603 0.40946961999592846

    0.06286506180190116 0.07051594966020513

    0.0007965795130288157 0.00072829082065273

    0.0009634138252223355 0.0009300011157573742]

Naturally, there are several more rows in the centers variable (C): one for each feature of the dataset. However, due to space limitations, we show here just the first five rows, for the two cluster centers. By examining the cluster centers more closely we can see that certain variables (e.g. the first one, timedelta) play a more important role in keeping the clusters distant, as the difference in the centers’ coordinates in the corresponding dimensions is larger. This may provide additional insight, as those variables seem to divulge more useful information related to the dataset’s structure. We would expect, then, that these variables will work better as predictors later on.
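
For instance, a quick way to rank the features by how far apart the two cluster centers lie in each dimension (a small illustration of the above point, not part of the original notebook) is the following:

    # Absolute per-feature gap between the two cluster centers, largest gaps first
    gaps = abs.(C[:, 1] .- C[:, 2])
    show(sortperm(gaps, rev = true)[1:5])    # indices of the five most separating features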

K-means tips

Although the K-means method works well in the majority of situations, it fails to produce meaningful results in some cases. Specifically, when the number of clusters (K) is off, the resulting labels may be nonsensical. So, before you run this algorithm (or any of its variants), be sure to have a good idea of what you are looking for.

In a sentiment analysis problem, for example, a good value for K would be 2 (positive and negative sentiment), or 3 (positive, negative, and neutral sentiment). Asking K-means to provide clusters for K = 10 for this problem may not make much sense, which would be reflected in the very small differences in the coordinates of the cluster centers. It is a good idea to carefully consider how you will interpret the results, since Julia can only help you so much with your partitional clustering endeavors.
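
If you are genuinely unsure about the right K, one common heuristic (not covered in the text, but straightforward with the Clustering package) is to run kmeans() for a range of K values and look for an “elbow” in the total within-cluster cost, which the package exposes through the totalcost field of the result:

    # Assumes the Clustering package and the normalized feature matrix F from the previous section
    costs = Float64[]
    for K in 2:10
        push!(costs, kmeans(F', K).totalcost)
    end
    show(costs)    # a sharp drop followed by a plateau suggests a sensible value for K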

Density and the DBSCAN Approach

Although it is one of the most fundamental concepts in physics, density hasn’t been properly implemented in the world of data science, with many people still confusing it with probability density. In essence, density kernels are heuristics that approximate the true density of a particular data point. Unless you already know a lot about the dataset, finding the optimum parameters of such a kernel is a challenge.

That being said, even crude implementations of this concept work decently and have made a new generation of clustering algorithms possible (and density has the potential to enhance the data analytics field even further).

The aim of density is to provide a reliable measure of how crowded a particular part of the feature-space is. Naturally, the more data points there are in a particular area, the higher the density and the lower the entropy of the dataset, as it is closely linked to order. Density is basically the most direct way of pinpointing a signal in the data, and it is based on tangible measurements.

In clustering, density is used in algorithms like DBSCAN, which we will cover in more depth in the next section. As density is closely linked to distances, you need to have a distance matrix calculation function handy, like the one we provide in the corresponding notebook of this chapter.
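
A minimal version of such a function, assuming Euclidean distances and a reasonably recent version of the Distances package (the notebook’s helper may differ in its details), could look like this:

    using Distances

    # Pairwise Euclidean distance matrix for a dataset X with one data point per row.
    # Note that the result grows quadratically with the number of data points,
    # which is why we work with a subset of the dataset in the next section.
    dmatrix(X::Matrix{Float64}) = pairwise(Euclidean(), X'; dims = 2)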

DBSCAN algorithm

DBSCAN is probably the most well-known alternative to K-means. It stands for Density-Based Spatial Clustering of Applications with Noise, and it is a very robust partitional clustering method. The fundamental premise behind this algorithm is that we expect every cluster to be drawn from a probability distribution that is sparse near its edges and dense toward its center.

This translates into a higher probability of picking a point near the center of a cluster than near a cluster’s boundaries. This simple idea has powerful implications, as it can filter out the noisy elements of the dataset and construct the clusters based on the most concrete (i.e. dense) ones.

DBSCAN takes as its input the distance matrix of the dataset’s points, along with two parameters, d and K. The first parameter corresponds to how far around a data point we want to look for the density estimation. (There is an objective way to pick this parameter for all density estimations, but as there is no scientific reference for it, we’ll leave it out.) Naturally, d is a float.

The second parameter represents the density threshold that distinguishes dense from sparse data points (this too can be objectively defined, but there is no scientific canon for it either). Specifically, it refers to the number of data points in the area examined (i.e. K is an integer). Both d and K relate to the data at hand; there are no default values recommended by the creators of the algorithm.

Unlike K-means, DBSCAN does not require the number of clusters as a parameter, as it finds the optimum partitioning based on the density values of the various data points.

Applying DBSCAN in Julia

Let us now see how we can apply DBSCAN in Julia (using the Clustering package as before). Due to potential memory limitations, we are going to work with just the first 10,000 data points of the dataset. First, let’s calculate the distance matrix of these points:

In[13]: D = dmatrix(F[1:10000,:]);

Now, let’s find their average distance:

In[14]: mean(D)

Out[14]: 4.874257821045708

It would be a good idea to use a lower d parameter (e.g. 3.0). As for the K parameter, a value around the total number of dimensions should work well (e.g. 60). So, we can apply DBSCAN as follows:

In[15]: Z = dbscan(D, 3.0, 60)  

Let’s see what the labels look like by using the same functions as in the K-means example:

In[16]: labels = assignments(Z); show(labels[1:10])

    [1,1,1,1,1,1,1,1,1,1]

It looks like the first cluster is dominant. Let’s see what the cluster sizes are like:

In[17]: c = counts(Z); show(c)

    [8699,1149]

So, for this two-cluster model, the results are similar to the results of the K-means method. Should we change the parameters, though, we may discover a different cluster setup.

Hierarchical Clustering

Unlike most partitional clustering algorithms, hierarchical clustering doesn’t assume a specific number of clusters. A valuable data exploration tool, it examines how the data points relate to each other and constructs a tree showing how these relationships apply to larger and larger groups, starting from the individual data points. This yields a graph with the various points of the dataset connected in pairs, with the most similar ones linked together. You can see a typical hierarchical clustering output in Figure 10.2.


Figure 10.2 A typical hierarchical clustering tree output.

Although the output of a hierarchical clustering algorithm is not all that useful for finding the labels of the data at hand, it can be useful for understanding the data. It can potentially indicate the optimum number of clusters (i.e. pinpoint the optimal K parameter for K-means). Due to the limitations of the plotting package used on the back end (PyPlot), visuals of the cluster hierarchy are only available for two-dimensional data. Let’s see how we can apply this clustering technique in Julia using the corresponding package.

Applying hierarchical clustering in Julia

To perform hierarchical clustering we’ll need to make use of the QuickShiftClustering package. First let’s load this package, as well as the PyPlot package, which is used for plotting the hierarchical tree:

In[18]: using QuickShiftClustering

    using PyPlot

Since we’ll need to do some dimensionality reduction in order to view a meaningful plot based on the clustering, we’ll also make use of the MultivariateStats package:

In[19]: using MultivariateStats

Now we can create the tree for the hierarchical clustering. To avoid spending too much time waiting for the results (particularly in the final stage of the process), we’ll limit the data used to a sample of 300 data points and 2 features. First, we’ll get the first two meta-features using a PCA model:

In[20]: M = fit(PCA, F'; maxoutdim = 2)

Next, we’ll create a transformed sample of the dataset, having two dimensions (the above features of the PCA model) and 300 data points:

In[21]: G = transform(M, F[1:300,:]')'

Afterward, we can run it through the quickshift() function to get a hierarchical clustering model:

In[22]: Z = quickshift(G')

We can get the labels of the clustering using the quickshiftlabels() function:

In[23]: labels = quickshiftlabels(Z);

The output of this function is not particularly meaningful, since it is used mainly in creating the plot that follows. To create this plot, we’ll make use of the quickshiftplot() function:

In[24]: quickshiftplot(Z, G', labels)

You can see the various links of the data points and how they all merge together to form larger and larger groups of data, with ever-bigger links forming, until everything in the dataset is connected (Figure 10.3). It becomes clear that the groupings we saw in the previous clustering approaches are meaningful after all, since there are two distinct clusters of data points (possibly corresponding to the normal and viral news, respectively).


Figure 10.3 An output plot for the QuickShiftClustering clustering model.

A word of caution: although the Clustering package also includes methods to handle hierarchical clustering (hclust() and cutree() functions), they are still works-in-progress; at the time of this writing they don’t perform as expected. We recommend you avoid using both until they have become fully functional and have some documentation on the package web page.

When to use hierarchical clustering

Although not as popular as K-means or DBSCAN, this is a particularly useful clustering strategy for a few use cases where partitional methods cannot deliver. Namely, it is ideal for data exploration and for parameter searching for other clustering algorithms. For example, if you are not sure how many clusters you need to take into account (K), hierarchical clustering can prove useful. Also, by taking pairs of features at a time, you can see how the different features affect the cluster structure and therefore get an idea of which ones are more informative.

Validation Metrics for Clustering

Although the objective of clustering is usually to acquire some insight from the available data so as to perform more efficient analytics later on, sometimes we need to evaluate the performance of a clustering system. To do this we can employ a number of metrics, the most popular of which are silhouettes (sometimes referred to as Silhouette Width) and Variation of Information. In this section we’ll examine how the first one works, as it is more useful.

Silhouettes

Silhouettes are popular and robust metrics for assessing how well-defined clusters are, with applications extending to classification as well. A silhouette is basically a ratio of distances that takes values between -1 and 1, with higher values indicating that a data point (or a whole cluster) fits well in the cluster it has been assigned to. Values below 0 are particularly concerning, as they hint at a complex dataset or a bad clustering. You can read more about silhouettes in the research article where the concept was first introduced: http://bit.ly/28QQ8aq.
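
To make this ratio concrete, here is a minimal sketch of how such scores can be computed from a labels vector and a pairwise distance matrix; it mirrors the idea behind the sil() helper used next, though the notebook’s implementation may differ in its details:

    using Statistics

    # Silhouette scores from cluster labels and a pairwise distance matrix D
    # (assumes at least two clusters are present)
    function silhouettes_sketch(labels::Vector{Int}, D::Matrix{Float64})
        n = length(labels)
        s = zeros(n)
        for i in 1:n
            own = findall(labels .== labels[i])
            if length(own) == 1
                s[i] = 0.0                      # convention for singleton clusters
                continue
            end
            a = mean(D[i, setdiff(own, i)])     # mean distance to the rest of its own cluster
            b = minimum(mean(D[i, findall(labels .== c)])
                        for c in unique(labels) if c != labels[i])   # nearest other cluster
            s[i] = (b - a) / max(a, b)
        end
        return mean(s), s                       # overall silhouette and per-point silhouettes
    end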

We have implemented this metric in Julia with a function called sil(), which takes the cluster assignments (labels) and the pairwise distance matrix (D) as its inputs. Taking our previous example from the DBSCAN clustering, we have:

In[25]: sil(labels, D)

Out[25]: (-0.23285061303777332, [-0.159633,-0.207085,-0.17854,-0.180267,-0.226897,-0.185112,-0.231703,-0.195578,-0.13228,-0.243666 … -0.250058,-0.264992,-0.185843,-0.23381,-0.22416,-0.163046,-0.278264,-0.229035,-0.250064,-0.24455])

The outputs of this method are the silhouette of the whole dataset and the silhouettes of the various data points (the latter are stored in a vector). It appears that this particular clustering is not ideal. This may be caused by the cluster imbalance and/or the geometry of the dataset.

Clustering validation metrics tips

Although the aforementioned metrics for evaluating clustering performance have been implemented in the Clustering package, we recommend you avoid using them as they are not yet completely functional. Though well-documented, these metrics may throw errors at random, even with inputs that come straight out of the clustering algorithms of this package.

Overall, clustering performance is more of an academic research topic than something that is used in practice, particularly if you care only about the data exploration application of clustering. However, if you are dealing with a complex dataset and you wish to delve deep into its geometry, or have a data product that depends heavily on clustering, several clustering experiments are in order. To make sure that these experiments are as useful as possible, it would behoove you to carry out proper validation procedures as well. We recommend you implement the Variation of Information as an exercise, once you feel comfortable with Julia programming.

Effective Clustering Tips

Although clustering is as straightforward as it gets in machine learning, it takes a bit of skill to get the most out of it. There are certain things that you need to keep in mind in order to obtain useful insights and avoid getting a diluted signal. Specifically, things like handling high dimensionality and performing normalization can be crucial when it comes to getting your data ready for a clustering algorithm, no matter how effective the algorithm.

Dealing with high dimensionality

High dimensionality can be detrimental to clustering, since distance metrics (particularly Euclidean distance and its variants) don’t behave as expected in high-dimensional spaces. All the distances become large and their relative differences shrink, making the clusters sparse and practically meaningless. Also, any signal that exists in the dataset gets diluted and is not clearly reflected in the created clusters.

The best way to handle this issue is to remove excessive dimensions (features with weak signals) and/or extract meta-features that capture all the information of the original features of the dataset, as we saw in the previous chapter and in the hierarchical clustering example. Generally, a feature space that allows for large variance in the data points of the dataset also allows for a robust clustering. This often results in feature sets having less than 100 features, though the exact number depends on the nature of the dataset.
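
As a small illustration of the first remedy, one crude proxy for a weak signal is low variance after normalization; a filter along these lines (the 0.01 cutoff is an arbitrary, dataset-dependent choice of ours) would keep only the more variable features:

    using Statistics

    # Keep only the (normalized) features whose variance exceeds a chosen cutoff
    vars = vec(var(F, dims = 1))          # variance of each column (feature) of F
    keep = findall(vars .> 0.01)          # arbitrary threshold; tune it for your dataset
    F_reduced = F[:, keep]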

Normalization

Keep in mind that data needs to be normalized in the majority of cases when running a clustering algorithm. Unless you are using cosine distance, or some other distance metric that works with all kinds of data, you need to ensure that what you put into the clustering algorithm is of the same scale. Otherwise, you may end up with biased and counter-intuitive outputs.

We recommend you use a normalization method that yields values between 0 and 1. This is particularly useful if you plan to include binary features, which are by their nature in this value spectrum.
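
The normalize(F, "linear") call used earlier in this chapter refers to the custom min-max scaling helper developed earlier in the book. In case you don’t have that helper at hand, a minimal version along the following lines (a sketch under that assumption, not the book’s exact code) achieves this kind of [0, 1] scaling:

    # Scale every column (feature) of X to the [0, 1] range (min-max scaling)
    function normalize(X::Matrix{Float64}, method::AbstractString = "linear")
        method == "linear" || error("only the linear (min-max) method is sketched here")
        Y = similar(X)
        for j in 1:size(X, 2)
            lo, hi = minimum(X[:, j]), maximum(X[:, j])
            # guard against constant columns to avoid division by zero
            Y[:, j] = hi > lo ? (X[:, j] .- lo) ./ (hi - lo) : zeros(size(X, 1))
        end
        return Y
    end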

Visualization tips

It is important to always create a visualization of the output of a partitional clustering algorithm, even if this means reducing the dimensionality further (e.g. through PCA, as we saw previously). Examining the cluster centers can be useful as well, since these are the “typical” elements of each cluster, even if data points having these exact values may not exist in the dataset. All this is bound to be helpful for interpreting the results and making the most of the whole process.
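
As a quick illustration of the first point, assuming you have a two-column reduced version of the data (such as the PCA output G from the hierarchical clustering example) and a matching vector of cluster labels, a color-coded scatter plot takes a single call via PyPlot:

    using PyPlot

    # Color each (dimensionality-reduced) data point by its assigned cluster;
    # G is assumed to be an n-by-2 matrix and labels a vector of n cluster assignments.
    scatter(G[:, 1], G[:, 2], c = labels, s = 10)
    title("Cluster assignments on the first two principal components")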

Summary

  • Unsupervised learning involves data analytics that take place without a target variable.
  • Unsupervised learning yields useful insights about the dataset and helps us understand it better, making supervised learning much more manageable and oftentimes more fruitful.
  • There are various types of unsupervised learning, including clustering, discovery of association rules, and mixture decomposition.
  • Clustering is by far the most popular type of unsupervised learning and has been thoroughly researched.
  • There are many ways to categorize clustering methods, the most common of which is partitional vs. hierarchical, which focuses on the objective of the clustering process. Other categorizations consider the various aspects of the algorithms involved, such as deterministic vs. stochastic.
  • Partitional clustering involves generating a number of groups (partitions) that are mutually exclusive, each containing the most similar data points, while being as dissimilar from the other groups as possible. The number of groups is a parameter of most partitional clustering methods.
  • The vast majority of these methods are stochastic in nature. Partitional clustering is useful not only for data exploration but for creating a target variable for classification. Partitional clustering methods are available in the Clustering package of Julia.
  • K-means is the most popular partitional clustering algorithm, splitting the dataset into K clusters based on the distances among the data points involved. It is extremely fast and works well with relatively simple datasets. You can apply K-means in Julia using the code: kmeans(data, K), where data is the dataset (features are depicted as rows) and K is the number of clusters.
  • DBSCAN is a more powerful partitional clustering algorithm that makes use of the density concept to handle more challenging datasets. It is not as fast as K-means but is more effective overall, and it doesn’t require the number of clusters as an input. You can use Julia to run DBSCAN by typing: dbscan(D, d, K), where D is the pairwise distance matrix, d is the radius of the neighborhood examined around each data point, and K is the minimum number of nearby data points needed for a point to be considered dense.
  • Hierarchical clustering is an alternative clustering methodology that involves the gradual grouping of the data points until a single cluster is reached. This is in the form of a tree structure that shows all possible meaningful groupings on a 2D plane.
  • Hierarchical clustering is useful for exploratory purposes mainly. You can make use of hierarchical clustering in Julia by using the QuickShiftClustering package as follows: quickshift(data), where data is the dataset (features are depicted as rows, like before). You can view the resulting tree structure using the code: quickshiftplot(Z, data, labels), where Z is the output of the quickshift() function, and labels is the assigned labels of the clustering algorithm (output of quickshiftlabels(Z)).
  • All clustering methods make use of some kind of dissimilarity metric such as Euclidean distance. There are several such metrics available in the Distances package of Julia.
  • Validating a clustering system’s output is possible through various metrics, the most popular of which are silhouettes and Variation of Information. Currently only the silhouettes method is fully functional in Julia (see notebook).
  • Silhouette is a distance-based metric that ranges from -1 to 1 and shows how close a data point is to its assigned cluster, compared to other clusters. The higher its value the better. You can make use of it in Julia through the custom function sil(): sil(labels, D), where labels is a vector containing the assigned labels from the clustering model, and D is the pairwise distance matrix.
  • In order to perform clustering more effectively, consider the following:
  • Keep the number of features relatively low (as low as possible without losing a lot of information).
  • Normalize all the features and meta-features used in the clustering process.
  • Create visuals to better comprehend the various clusters and how they relate to each other.
  • Examine the cluster centers to gain additional insight about the nature of the clusters.

Chapter Challenge

  1. Why are distances important for clustering?
  2. Which clustering method would you use for a very complex dataset?
  3. Why can’t the metrics covered in Chapter 9 be used for evaluating the output of a clustering system?
  4. Can all types of data be clustered? What do you need to keep in mind before doing this?
  5. How is partitional clustering different from t-SNE (Chapter 7)?
  6. Is clustering essential for data science? Why?
  7. How does dimensionality affect clustering performance? What remedies are there for this?
  8. How can you emphasize a particular feature in a normalized dataset so that it plays a more pronounced role in the clustering process?