Clustering is a form of unsupervised learning in which the task of the learning algorithm is to find structure in a given dataset. In particular, a notion of similarity or distance between different instances of the dataset is used to learn the clusters. Spark provides K-Means, Gaussian mixture models trained via expectation-maximization (EM), power iteration clustering (PIC), Latent Dirichlet Allocation (LDA), and streaming K-Means.
K-Means is one of the most popular clustering algorithms, in which we pre-determine the parameter K, the number of clusters. More formally, we can define K-Means as a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K), each represented by its centroid. In K-Means, the centroid of a cluster is the prototype that best represents the whole cluster. Since K-Means partitions points based on the distance between them, we need a data representation in which it is easy to calculate the distance between two instances; there must be a distance metric. Apache Spark uses Lloyd's K-Means algorithm with the Euclidean distance metric; we cannot specify any other distance function at the moment. We must also note that K-Means always produces clusters in such a way that an instance belongs to exactly one of the K clusters; the clusters are distinct.
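The hard-assignment behavior described above can be sketched with the spark.mllib API. Note this is only an illustrative sketch: the points and K = 2 are made-up values, not from the chapter's dataset.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansSketch").setMaster("local[*]"))

    // Hypothetical two-dimensional points forming two loose groups
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.2, 8.8)
    )).cache()

    // K = 2 clusters, at most 20 iterations of Lloyd's algorithm
    val model = KMeans.train(points, k = 2, maxIterations = 20)

    // Each point is assigned to exactly one cluster (hard assignment)
    points.collect().foreach(p => println(s"$p -> cluster ${model.predict(p)}"))
    sc.stop()
  }
}
```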
The Gaussian Mixture Model (GMM) allows us to model the data so that we can produce fuzzy clusters: in fuzzy clustering, a data point can belong to multiple clusters. Apache Spark provides the expectation-maximization (EM) algorithm for this fuzzy clustering, with a fixed number of clusters, K. Note that the EM algorithm performs maximum-likelihood estimation (MLE) of the model parameters, maximizing the likelihood of the observed data under the mixture; cluster memberships are then expressed as per-cluster probabilities for each data point.
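A minimal sketch of fuzzy clustering with the spark.mllib GaussianMixture API might look as follows; the variable points (an RDD[Vector], as in the K-Means sketch) and K = 2 are assumptions for illustration:

```scala
import org.apache.spark.mllib.clustering.GaussianMixture

// points is assumed to be an existing RDD[Vector]
val gmm = new GaussianMixture()
  .setK(2)              // fixed number of mixture components (clusters)
  .run(points)

// predictSoft returns, for every point, one probability per cluster --
// the fuzzy membership discussed above
gmm.predictSoft(points).collect().foreach(probs => println(probs.mkString(", ")))
```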
In contrast with the other clustering algorithms, power iteration clustering (PIC) works on graphs: for the graph vertices, we have to provide pairwise similarity scores as edge weights. The higher the edge weight, the stronger the similarity; edge weights must be non-negative, and an edge must not be listed more than once. This algorithm also takes a parameter K, the number of clusters to generate.
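The graph input format can be sketched as follows; the similarity triples here are hypothetical values, and sc is assumed to be an existing SparkContext:

```scala
import org.apache.spark.mllib.clustering.PowerIterationClustering

// Hypothetical similarity graph as (srcId, dstId, similarity) triples;
// similarities must be non-negative and each edge appears only once
val similarities = sc.parallelize(Seq(
  (0L, 1L, 0.9), (1L, 2L, 0.9), (2L, 3L, 0.1), (3L, 4L, 0.9)
))

val picModel = new PowerIterationClustering()
  .setK(2)                // number of clusters to generate
  .setMaxIterations(10)
  .run(similarities)

picModel.assignments.collect().foreach(a =>
  println(s"vertex ${a.id} -> cluster ${a.cluster}"))
```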
Latent Dirichlet Allocation (LDA) is a topic modeling algorithm for text documents. It also takes a parameter K, the number of topics to generate, and is well suited to clustering text documents such as news articles, research papers, books, and so on.
All the clustering algorithms we saw earlier take a whole set of data points and then generate clusters. However, when we want our clusters to be updated dynamically as new data arrives, we can use the streaming K-Means algorithm. It is essentially the same K-Means algorithm with two additions: it performs clustering on mini-batches, and it has an update rule that moves the cluster centers toward the newly learned results.
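The mini-batch training loop can be sketched with the spark.mllib StreamingKMeans API; ssc and trainingData are assumptions here (an existing StreamingContext and a DStream[Vector] built from whatever source feeds the application):

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans

// trainingData is assumed to be a DStream[Vector], e.g. from
// ssc.textFileStream(...).map(org.apache.spark.mllib.linalg.Vectors.parse)
val model = new StreamingKMeans()
  .setK(2)                         // fixed number of clusters, as in batch K-Means
  .setDecayFactor(1.0)             // 1.0 weights all batches equally; < 1.0 forgets older data
  .setRandomCenters(dim = 2, weight = 0.0)

model.trainOn(trainingData)        // the update rule runs on every mini-batch
model.predictOn(trainingData).print()
ssc.start()
ssc.awaitTermination()
```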
We will now take one example of clustering; for this, we will use the BBC news dataset available at http://mlg.ucd.ie/datasets/bbc.html:
Since this is a text clustering problem, we first need to extract features from the text data. We can do this by tokenizing the text and then cleaning the resulting terms. This step can also involve advanced NLP techniques (NER, stemming, acronym expansion, and so on), but we will use simple rules and regular expressions to perform tokenization, using the following code:
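A tokenizer in this rule-based spirit might look like the following sketch; the exact rules (lowercasing, splitting on non-letters, dropping very short tokens) are an assumption, not the chapter's exact listing:

```scala
// A hypothetical tokenizer using simple rules: lowercase the text,
// split on runs of non-letter characters, and drop very short tokens.
def tokenize(text: String): Seq[String] =
  text.toLowerCase
    .split("[^a-z]+")     // regular expression: one or more non-letters
    .filter(_.length > 2) // drop one- and two-character fragments
    .toSeq

// On an RDD of raw documents this would be applied as:
//   val tokens: RDD[Seq[String]] = docs.map(tokenize)
```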
After we have mapped our text documents into tokens, we need to calculate term frequencies (TFs). For this we will use the HashingTF class provided by Apache Spark. HashingTF expects an RDD of Seq[String], which we can obtain from the tokenized documents. Its code is as follows:
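The term-frequency step can be sketched as follows; tokens is assumed to be the RDD[Seq[String]] produced by the tokenization step:

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// tokens is assumed to be the RDD[Seq[String]] of tokenized documents
val hashingTF = new HashingTF()           // default dimensionality of 2^20 features
val tf: RDD[Vector] = hashingTF.transform(tokens)
tf.cache()                                // LDA makes several passes over the data
```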
Finally, we run the LDA algorithm on the term-frequency vectors and obtain a topic-term distribution. In the following code, we print only the top 10 term IDs with their scores:
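A minimal sketch of this step with the spark.mllib LDA API might look like this; the variable tf carries over from the HashingTF step, and K = 5 is an assumption matching the five topics in the output:

```scala
import org.apache.spark.mllib.clustering.LDA

// LDA expects an RDD[(Long, Vector)] of (document id, term-count vector);
// tf is assumed to be the RDD[Vector] from the HashingTF step
val corpus = tf.zipWithIndex().map { case (counts, id) => (id, counts) }.cache()

val ldaModel = new LDA().setK(5).run(corpus)   // K = 5 topics

// For each topic, print the top 10 term indices with their weights
ldaModel.describeTopics(maxTermsPerTopic = 10).zipWithIndex.foreach {
  case ((termIds, weights), topic) =>
    val pairs = termIds.zip(weights).map { case (t, w) => s"($t,$w)" }
    println(s"Topic $topic: " + pairs.mkString(" "))
}
```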
Here is the output, showing the top five terms per topic with their scores as calculated by the LDA algorithm. These scores are drawn from the term distribution of each topic:
$ sbt "run-main chapter04.TextLDA"
... OUTPUT SKIPPED ...
Topic 0: (3365,3913.65720852964) (3543,4326.574428888365) (3707,4535.491948332051) (96727,4778.215907939703) (114801,11964.705438792551)
Topic 1: (3365,2396.035833693271) (96727,3298.3282136942335) (3543,3527.026301333385) (3707,4862.134970235111) (114801,8261.274505002908)
Topic 2: (96727,2850.4724248980783) (3543,2952.916293771327) (3707,3792.9462669647196) (3365,3998.609192912204) (114801,9039.540687387185)
Topic 3: (96727,4008.45003946005) (3365,4435.302735745526) (3543,5767.906047678759) (3707,6886.8060016365625) (114801,14050.285613710528)
Topic 4: (3365,2877.395029119364) (3543,3382.5769283281697) (96727,3627.5334140079353) (3707,4913.620812831554) (114801,9254.193755106819)