One of the fundamentals of data science is finding patterns in the data at hand. As you may remember from Chapter 5, this is accomplished mainly during the data exploration part of the pipeline. Although the various statistical methods and plots yield much useful insight, sometimes we must employ a more sophisticated approach. This is particularly useful if we have certain expectations about the data (such as different groups or patterns that should exist) and wish to explore them further.
This lends itself to what is known as unsupervised learning: the kind of machine learning where labels are unknown or nonexistent altogether. The aim of this strategy is to find these labels in a way that makes sense for the data at hand, while also uncovering the structure of the dataset. In this chapter we’ll explore this topic in some depth to examine how unsupervised learning can be done in Julia effectively.
In this chapter we will examine the following topics:
Before we continue, make sure that the Clustering package is installed on your system.
Unsupervised Learning Basics
Unsupervised learning is extremely important for a variety of reasons. First of all, it provides insight about patterns in the data that you would not be able to pinpoint with other methods. This is particularly useful during the data exploration stage where you are uncertain what signals the data contains. You could find these signals with other means, but it would take a significant amount of additional effort.
This brings us to the next point, which is the amount of time unsupervised learning saves. This extends beyond data exploration to something that’s even more important when it comes to productionalizing your work. A classic example is figuring out the labels of a dataset, which is possible with one of the methods of unsupervised learning. There are numerous ways of finding labels (the most obvious of which is hiring some people to do it manually), but these methods don’t scale well. Unsupervised learning is both fast and scalable, while it is cheap comparatively.
Finally, this machine learning approach allows us to organize the data at hand and understand it better. At the same time we can make more meaningful plots that can shed additional light on the problem we are trying to solve. And unlike conventional plots, unsupervised learning techniques provide you with a variety of metrics. There are several unsupervised learning techniques, all of which have their niche. The most important of these are the following:
There are several other methods that qualify as unsupervised learning, though they are seldom encountered in data science projects. In this chapter we will focus on the first type of unsupervised learning, as it is the most commonly used and is crucial in providing actionable information about the data at hand.
Clustering types
There are two main types of clustering, depending on the inner workings of the clustering method: partitional and hierarchical. The first type splits the data into a number of mutually exclusive groups (usually denoted as K), while the second one creates a tree where the most similar elements are grouped together gradually. Partitional clustering is particularly useful for creating a labels vector that could be used later on for classification. Both partitional and hierarchical clustering are excellent tools for data exploration.
Apart from this categorization, there are other ways to differentiate the clustering methods. One way to categorize clustering methods is as hard or as fuzzy, depending on whether a data point belongs to a cluster exclusively or to some degree. Another descriptor is deterministic or stochastic, depending on whether the algorithm used employs randomness or not.
In all the clustering methods, the points compared are evaluated as to how similar (or dissimilar) they are to each other, in order to determine whether they belong to the same group or not. For this we usually use a distance metric, such as Euclidean distance. However, there are several other alternatives, as we’ll see in the following section.
Distance metrics
There are several metrics that can be used to express the distance between two data points, the most important of which are listed below. For all the examples we make use of two float vectors x and y of equal dimensionality. In order to utilize these metrics in Julia, you’ll need to add and load the Distances package (check http://bit.ly/29vRs0A for more information).
So, if we have a couple of vectors x = [1.0, 2.0, 3.0] and y = [-1.0, 2.5, 0.0], we could calculate their distance in Julia using some of the above metrics:
In[1]: d = evaluate(Euclidean(), x, y)
Out[1]: 3.640054944640259
In[2]: d = evaluate(CosineDist(), x, y)
Out[2]: 0.6029666664116279
In[3]: d = evaluate(Cityblock(), x, y)
Out[3]: 5.5
In[4]: d = evaluate(Jaccard(), x, y)
Out[4]: 0.8461538461538461
Grouping Data with K-means
K-means is probably the most commonly used clustering technique, due to its simplicity and speed. Using a single parameter, K, it splits the data into K different groups (referred to as clusters), based on how close each data point is to the cluster center. The cluster centers change in every iteration of the algorithm, based on the data points that belong to these clusters; they are usually calculated as the means of the values of the data points in them (hence the name of the algorithm). The original clusters are defined randomly, making every run of the K-means algorithm yield slightly different results. The algorithm stops once the cluster centers don’t change significantly or until a maximum number of iterations has been reached.
Although in the original version of K-means the cluster centers are calculated based on the mean of the data points that belong to them, there are a few variants of the method where different averaging metrics (such as the median) are used. Also, there is a powerful variant of K-means that uses fuzzy logic, so each data point belongs to all possible clusters with some membership value, known as C-means. Although these variations are tempting, we’ll limit ourselves to the most basic method of K-means in this section, as this is the one most often used in practice. In Figure 10.1 you can see a typical output of the K-means algorithm (which is representative of other partitional clustering algorithms, too).
Figure 10.1 An example of a typical K-means clustering output.
K-means using Julia
For K-means in Julia we’ll be using the Clustering package, which also includes some other clustering techniques (see http://bit.ly/29vRs0A for details on the package’s contents). Once you add it and load it into memory you can start using it on your data.
Let’s look at how we can apply it to the OnlineNewsPopularity dataset. Here we will work under the hypothesis that there are two distinct categories of news articles (the popular ones and the not-so-popular ones), so we’ll use K = 2:
In[5]: using Clustering
In[6]: data = readdlm(“d://data//OnlineNewsPopularity//OnlineNewsPopularity.csv”, ‘,’);
In[7]: F = map(float, data[2:end,2:60]) #make all features floats
In[8]: F = normalize(F, “linear”)
In[9]: Z = kmeans(F’, 2)
We need to transpose the inputs (features) since the kmeans() function takes its data in a different form (each row being a different feature and each column a data point). Also, since there are several outputs, it is best to put them all in a single data structure (variable Z) so that later we can retrieve them individually through that variable. Specifically, if we were interested in the cluster labels, we would type:
In[10]: labels = assignments(Z); show(labels[1:10])
[1,1,1,1,1,1,1,1,1,2]
Of course, there are several more values in the labels vector, but here we just view the first ten to get an idea of what the labels vector looks like. The numbers that correspond to the two clusters are arbitrarily chosen by the algorithm and may not be the same in the next run. It is often the case that even if the clusters are more or less identical, the same data points that are shown here as 1s may be later labeled as 2s.
We could manually count all these labels to get an idea of how large each cluster is. Thankfully, the package has a specialized function, counts(), that you can apply to accomplish this task easily:
In[11]: c = counts(Z); show(c)
[31208,8436]
What about the centers of the clusters? We can obtain them from the outputs data structure with the .centers identifier:
In[12]: C = Z.centers; show(C[1:5,:])
[0.505558031543492 0.3821391822047006
0.39736453412433603 0.40946961999592846
0.06286506180190116 0.07051594966020513
0.0007965795130288157 0.00072829082065273
0.0009634138252223355 0.0009300011157573742]
Naturally, there are several more rows in the centers variable (C): one for each feature of the dataset. However, due to space limitations, we show here just the first five rows, for the two cluster centers. By examining the cluster centers more closely we can see that certain variables (e.g. the first one, timedelta) play a more important role in keeping the clusters distant, as the difference in the centers’ coordinates in the corresponding dimensions is larger. This may provide additional insight, as those variables seem to divulge more useful information related to the dataset’s structure. We would expect, then, that these variables will work better as predictors later on.
K-means tips
Although the K-means method works well in the majority of situations, it fails to produce meaningful results in some cases. Specifically, when the number of clusters (K) is off, the resulting labels may be nonsensical. So, before you run this algorithm (or any of its variants), be sure to have a good idea of what you are looking for.
In a sentiment analysis problem, for example, a good value for K would be 2 (positive and negative sentiment), or 3 (positive, negative, and neutral sentiment). Asking K-means to provide clusters for K = 10 for this problem may not make much sense, which would be reflected in the very small differences in the coordinates of the cluster centers. It is a good idea to carefully consider how you will interpret the results, since Julia can only help you so much with your partitional clustering endeavors.
Density and the DBSCAN Approach
Although it is one of the most fundamental concepts in physics, density hasn’t been properly implemented in the world of data science, with many people still confusing it with probability density. In essence, density kernels are heuristics that approximate the true density of a particular data point. Unless you already know a lot about the dataset, finding the optimum parameters of such a kernel is a challenge.
With that being said, even the crude implementations of this concept manage to work decently and make a new generation of clustering algorithms possible (although density has the potential to enhance the data analytics field).
The aim of density is to provide a reliable measure of how crowded a particular part of the feature-space is. Naturally, the more data points there are in a particular area, the higher the density and the lower the entropy of the dataset, as it is closely linked to order. Density is basically the most direct way of pinpointing a signal in the data, and it is based on tangible measurements.
In clustering, density is used in algorithms like DBSCAN, which we will cover in more depth in the next section. As density is closely linked to distances, you need to have a distance matrix calculation function handy, like the one we provide in the corresponding notebook of this chapter.
DBSCAN algorithm
DBSCAN is probably the most well-known alternative to K-means. It stands for Density Based Spatial Clustering of Applications with Noise, and is a very robust partitional clustering method. The fundamental premise behind this algorithm is that we expect every cluster to be defined by a probability distribution that is lighter near the edges and darker in the center.
This translates into a higher probability of picking a point near the center of a cluster than near a cluster’s boundaries. This simple idea has powerful implications, as it can filter out the noisy elements of the dataset and construct the clusters based on the most concrete (i.e. dense) ones.
DBSCAN takes as its input the distance matrix of the dataset’s points, along with two parameters, d and K. The first parameter corresponds to how far around a data point we want to look for the density estimation. (There is an objective way to pick this parameter for all density estimations, but as there is no scientific reference for it, we’ll leave it out.) Naturally, d is a float.
The second parameter represents the density threshold that distinguishes dense from sparse data points (this too can be objectively defined, but there is not scientific canon for it either). Specifically, it refers to the number of data points in the area examined (i.e. K is an integer). Both d and K relate to the data at hand; there are no default values recommended by the creators of the algorithm.
Unlike K-means, DBSCAN does not require as a parameter the number of clusters, as it finds the optimum partitioning based on the density values of the various data points.
Applying DBSCAN in Julia
Let us now see how we can apply DBSCAN in Julia (using the Clustering package as before). Due to potential memory limitations, we are going to work with just the first 10,000 data points of the dataset. First, let’s calculate the distance matrix of these points:
In[13]: D = dmatrix(F[1:10000,:]);
Now, let’s find their average distance:
In[14]: mean(D)
Out[14]: 4.874257821045708
It would be a good idea to use a lower d parameter (e.g. 3.0). As for the K parameter, a value around the total number of dimensions should work well (e.g. 60). So, we can apply DBSCAN as follows:
In[15]: Z = dbscan(D, 3.0, 60)
Let’s see what the labels look like by using the same functions as in the K-means example:
In[16]: labels = assignments(Z); show(labels[1:10])
[1,1,1,1,1,1,1,1,1,1]
It looks like the first cluster is dominant. Let’s see what the cluster sizes are like:
In[17]: c = counts(Z); show(c)
[8699,1149]
So, for this two-cluster model, the results are similar to the results of the K-means method. Should we change the parameters, though, we may discover a different cluster setup.
Hierarchical Clustering
Hierarchical clustering doesn’t assume a specific number of clusters, like most algorithms of partitioning clustering. A valuable data exploration tool, it examines how the data points relate to each other and constructs a tree showing how these relationships apply to larger and larger groups, starting from a single data point. This yields a graph with the various points of the dataset connected in pairs, with the most similar ones linked together. You can see a typical hierarchical clustering output in Figure 10.2.
Figure 10.2 A typical hierarchical clustering tree output.
Although the output of a hierarchical clustering algorithm is not all that useful in finding the labels of the data at hand, it can be useful in understanding the data. It can potentially indicate the optimum number of clusters (i.e. pinpointing the optimal K parameter in K-means). Due to the limitations of the plotting package used on the back end (PyPlot), visuals of the cluster hierarchy are visible only for two dimensional data. Let’s see how we can apply this clustering technique in Julia using the corresponding package.
Applying hierarchical clustering in Julia
To perform hierarchical clustering we’ll need to make use of the QuickShiftClustering package. First let’s load this package, as well as the PyPlot package, which is used for plotting the hierarchical tree:
In[18]: using QuickShiftClustering
using PyPlot
Since we’ll need to do some dimensionality reduction in order to view a meaningful plot based on the clustering, we’ll also make use of the MultivariateStats package:
In[19]: using MultivariateStats
Now we can create the tree for the hierarchical clustering. To avoid spending too much time waiting for the results (particularly in the final stage of the process), we’ll limit the data used to a sample of 300 data points and 2 features. First, we’ll get the first two meta-features using a PCA model:
In[20]: M = fit(PCA, F’; maxoutdim = 2)
Next, we’ll create a transformed sample of the dataset, having two dimensions (the above features of the PCA model) and 300 data points:
In[21]: G = transform(M, F[1:300,:]’)’
Afterward, we can run it through the quickshift() function to get a hierarchical clustering model:
In[22]: Z = quickshift(G’)
We can get the labels of the clustering using the quickshiftlabels() function:
In[23]: labels = quickshiftlabels(Z);
The output of this function is not particularly meaningful, since it is used mainly in creating the plot that follows. To create this plot, we’ll make use of the quickshiftplot() function:
In[24]: quickshiftplot(Z, G’, labels)
You can see the various links of the data points and how they all merge together to form larger and larger groups of data, with ever-bigger links forming, until everything in the dataset is connected (Figure 10.3). It becomes clear that the groupings we saw in the previous clustering approaches are meaningful after all, since there are two distinct clusters of data points (possibly corresponding to the normal and viral news, respectively).
Figure 10.3 An output plot for the QuickShiftClustering clustering model.
A word of caution: although the Clustering package also includes methods to handle hierarchical clustering (hclust() and cutree() functions), they are still works-in-progress; at the time of this writing they don’t perform as expected. We recommend you avoid using both until they have become fully functional and have some documentation on the package web page.
When to use hierarchical clustering
Although not as popular as K-means or DBSCAN, this is a particularly useful clustering strategy for a few use cases where partitional methods cannot deliver. Namely, it is ideal for data exploration and for parameter searching for other clustering algorithms. For example, if you are not sure how many clusters you need to take into account (K), hierarchical clustering can prove useful. Also, by taking pairs of features at a time, you can see how the different features affect the cluster structure and therefore get an idea of which ones are more informative.
Validation Metrics for Clustering
Although the objective of clustering is usually to acquire some insight from the available data so as to perform more efficient analytics later on, sometimes we need to evaluate the performance of a clustering system. To do this we can employ a number of metrics, the most popular of which are silhouettes (sometimes referred to as Silhouette Width) and Variation of Information. In this section we’ll examine how the first one works, as it is more useful.
Silhouettes
Silhouettes are popular and robust metrics for assessing how well-defined clusters are, with applications extending to classification as well. A silhouette is basically a ratio of distances that takes values between -1 and 1, with higher values showing that a data point or a cluster is better off being in the cluster it is in. Values below 0 are particularly concerning as they hint toward a complex dataset, or a bad clustering. You can read more about silhouettes in the research article where the concept was first introduced: http://bit.ly/28QQ8aq.
We have implemented this metric in Julia with a function called sil(), which takes the cluster assignments (labels) and the pairwise distance matrix (D) as its inputs. Taking our previous example from the DBSCAN classification we have:
In[25]: sil(labels, D)
Out[25]: (-0.23285061303777332, [-0.159633,-0.207085,-0.17854, -0.180267,-0.226897,-0.185112,-0.231703,-0.195578,-0.13228, -0.243666 … -0.250058,-0.264992,-0.185843,-0.23381,-0.22416, -0.163046,-0.278264,-0.229035,-0.250064,-0.24455])
The outputs of this method are the silhouette of the whole dataset and the silhouettes of the various data points (the latter are stored in a vector). It appears that this particular clustering is not ideal. This may be caused by the cluster imbalance and/or the geometry of the dataset.
Clustering validation metrics tips
Although the aforementioned metrics for evaluating clustering performance have been implemented in the Clustering package, we recommend you avoid using them as they are not yet completely functional. Though well-documented, these metrics may throw errors at random, even with inputs that come straight out of the clustering algorithms of this package.
Overall, clustering performance is more of an academic research topic than something that is used in practice, particularly if you care only about the data exploration application of clustering. However, if you are dealing with a complex dataset and you wish to delve deep into its geometry, or have a data product that depends heavily on clustering, several clustering experiments are in order. To make sure that these experiments are as useful as possible, it would behoove you to carry out proper validation procedures as well. We recommend you implement the Variation of Information as an exercise, once you feel comfortable with Julia programming.
Effective Clustering Tips
Although clustering is as straightforward as it gets in machine learning, it takes a bit of skill to get the most out of it. There are certain things that you need to keep in mind in order to obtain useful insights and avoid getting a diluted signal. Specifically, things like handling high dimensionality and performing normalization can be crucial when it comes to getting your data ready for a clustering algorithm, no matter how effective the algorithm.
Dealing with high dimensionality
High dimensionality can be detrimental to clustering since distance metrics (particularly Euclidean distance and its variants) don’t work on high-dimensional space as expected. All the distances become very large and their range diminishes, making the clusters sparse and practically meaningless. Also, any signal that exists in the dataset gets diluted and is not clearly reflected in the created clusters.
The best way to handle this issue is to remove excessive dimensions (features with weak signals) and/or extract meta-features that capture all the information of the original features of the dataset, as we saw in the previous chapter and in the hierarchical clustering example. Generally, a feature space that allows for large variance in the data points of the dataset also allows for a robust clustering. This often results in feature sets having less than 100 features, though the exact number depends on the nature of the dataset.
Normalization
Keep in mind that data needs to be normalized in the majority of cases when running a clustering algorithm. Unless you are using cosine distance, or some other distance metric that works with all kinds of data, you need to ensure that what you put into the clustering algorithm is of the same scale. Otherwise, you may end up with biased and counter-intuitive outputs.
We recommend you use a normalization method that yields values between 0 and 1. This is particularly useful if you plan to include binary features, which are by their nature in this value spectrum.
Visualization tips
It is important to always create a visualization of the output of a partitional clustering algorithm, even if this means reducing the dimensionality further (e.g. through PCA, as we saw previously). Examining the cluster centers can be useful as well, since these are the “typical” elements of each cluster, even if data points having these exact values may not exist in the dataset. All this is bound to be helpful for interpreting the results and making the most of the whole process.
Summary
Chapter Challenge
18.119.136.235