Clustering

Clustering is a form of unsupervised learning in which the task of the learning algorithm is to find structure in the given dataset. In particular, a notion of similarity or distance between the instances of the dataset is used to group them into clusters. Spark provides K-Means, expectation-maximization (EM), power iteration clustering (PIC), Latent Dirichlet Allocation (LDA), and streaming K-Means.

K-Means

K-Means is one of the most popular clustering algorithms, in which we pre-determine the parameter K, the number of clusters. More formally, K-Means is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K), each represented by its centroid. In K-Means, the centroid of a cluster is the prototype that best represents the whole cluster. Since K-Means partitions the points based on the distances between them, we need a data representation in which it is easy to compute the distance between two instances; in other words, there must be a distance metric. Apache Spark uses Lloyd's K-Means algorithm with the Euclidean distance metric; at the moment we cannot specify any other distance function. We must also note that K-Means always assigns each instance to exactly one of the K clusters; the clusters are disjoint.
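
As a quick illustration, here is a minimal sketch of K-Means with the RDD-based MLlib API, assuming a SparkContext named sc is already available (for example, in spark-shell); the input path and the values chosen for K and the number of iterations are placeholder assumptions:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical input file: each line holds space-separated numeric features
val points = sc.textFile("data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// K = 3 clusters, 20 iterations of Lloyd's algorithm with the Euclidean distance
val model = KMeans.train(points, 3, 20)

// Every point belongs to exactly one of the K clusters
model.clusterCenters.foreach(println)

// Sum of squared distances of points to their nearest center
println(s"Within-cluster sum of squared errors: ${model.computeCost(points)}")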

Expectation-maximization

The Gaussian Mixture Model (GMM) allows us to model clustering in a way that produces fuzzy clusters: an instance can belong to multiple clusters with different probabilities. Apache Spark provides the expectation-maximization (EM) algorithm for this fuzzy clustering, with a fixed number of clusters K. Note that the EM algorithm performs a maximum-likelihood estimation (MLE) of the model parameters, and cluster membership is then expressed as the probability of each component given a data point.
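
A minimal sketch of fitting a Gaussian mixture with Spark's GaussianMixture class follows; it again assumes a SparkContext named sc, and the input path and parameter values are illustrative:

import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

// Same hypothetical space-separated input as before
val data = sc.textFile("data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// Fit a mixture of K = 3 Gaussians with the EM algorithm
val gmm = new GaussianMixture().setK(3).setMaxIterations(100).run(data)

// Fuzzy membership: for every point, a probability of belonging to each cluster
gmm.predictSoft(data).take(5).foreach(p => println(p.mkString(", ")))

// The learned mixture weights and component means
gmm.weights.zip(gmm.gaussians).foreach { case (w, g) =>
  println(s"weight=$w, mean=${g.mu}")
}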

Power iteration clustering

In contrast with the other clustering algorithms, power iteration clustering (PIC) works on a graph: for pairs of vertices we provide a similarity score, which becomes the edge weight. The higher the edge weight, the stronger the similarity; edge weights must be non-negative, and each edge must appear only once. This algorithm also takes a parameter K, the number of clusters to generate.
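
The following sketch shows how PIC can be invoked through the PowerIterationClustering class, assuming a SparkContext named sc; the tiny hand-built similarity graph is purely illustrative:

import org.apache.spark.mllib.clustering.PowerIterationClustering

// Each tuple is (srcVertexId, dstVertexId, similarity); similarities must be
// non-negative and each pair should appear only once
val similarities = sc.parallelize(Seq(
  (0L, 1L, 0.9), (0L, 2L, 0.8), (1L, 2L, 0.7),
  (3L, 4L, 0.9), (3L, 5L, 0.8), (4L, 5L, 0.7)
))

val pic = new PowerIterationClustering()
  .setK(2)               // number of clusters to generate
  .setMaxIterations(20)

val model = pic.run(similarities)

// Each vertex ID is assigned to one of the K clusters
model.assignments.collect().foreach { a =>
  println(s"vertex ${a.id} -> cluster ${a.cluster}")
}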

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a topic modeling algorithm for text documents. It also takes a parameter K, the number of topics to generate, and is suitable for clustering text documents such as news articles, research papers, books, and so on.

Streaming K-Means

All the clustering algorithms we have seen so far take the whole set of data points and then generate the clusters. However, when we want the clusters to be updated dynamically as new data arrives, we can use the streaming K-Means algorithm. It is essentially the same K-Means algorithm with two additions: it performs clustering on mini-batches, and it has an update rule that adjusts the cluster centers with the newly learned results.
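
A minimal sketch of streaming K-Means with the StreamingKMeans class is shown below, assuming a SparkContext named sc; the batch interval, directory path, and parameter values are illustrative assumptions:

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Mini-batches arrive every 10 seconds
val ssc = new StreamingContext(sc, Seconds(10))

// Each new file dropped into the directory holds space-separated feature vectors
val trainingData = ssc.textFileStream("data/streaming/train")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

val model = new StreamingKMeans()
  .setK(3)                   // number of clusters
  .setDecayFactor(1.0)       // how much weight older mini-batches retain
  .setRandomCenters(2, 0.0)  // 2-dimensional random initial centers (must match the data)

// Cluster centers are updated as each new mini-batch arrives
model.trainOn(trainingData)

ssc.start()
ssc.awaitTermination()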

LDA example

We will take one example of clustering; for this we will use the BBC news dataset available at http://mlg.ucd.ie/datasets/bbc.html:

  • Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005
  • There are five class labels (business, entertainment, politics, sport, and tech)

Since this is a text clustering problem, we first need to extract features from the text data. We can do this by tokenizing the documents and then cleaning up the terms. This step can also include advanced NLP techniques (NER, stemming, acronym expansion, and so on), but we will use simple rules and regular expressions to perform the tokenization.
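
A minimal sketch of such a tokenizer is shown in the following code; it assumes a SparkContext named sc and that the BBC dataset has been unpacked under data/bbc, and the path, regular expression, and stop-word list are illustrative choices:

// Illustrative stop-word list; a real one would be much longer
val stopWords = Set("the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on")

// Read every article as a whole file and map it to a cleaned Seq[String] of tokens
val tokenized = sc.wholeTextFiles("data/bbc/*/*.txt")
  .map { case (_, text) =>
    text.toLowerCase
      .split("""\W+""")                                      // split on non-word characters
      .filter(t => t.length > 2 && !stopWords.contains(t))   // drop short tokens and stop words
      .toSeq
  }
  .cache()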


After we have mapped our text documents to tokens, we need to calculate the term frequencies (TFs). For this we will use the HashingTF class provided by Apache Spark. HashingTF expects an RDD of Seq[String], which we can obtain from the tokenized documents.
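
Continuing from the tokenization sketch above, the term-frequency vectors can be computed as follows; the default number of hash buckets (2^20) is kept:

import org.apache.spark.mllib.feature.HashingTF

// HashingTF maps each Seq[String] of tokens to a sparse term-frequency vector
val hashingTF = new HashingTF()
val termFrequencies = hashingTF.transform(tokenized)
termFrequencies.cache()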


Finally, we run the LDA algorithm on the term-frequency vectors and obtain a topic-term distribution. In the code below, we write out only the top 10 term IDs with their scores.
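
Here is a sketch of this step, continuing from the previous snippets; K is set to 5 to match the five BBC categories, and the number of iterations is an illustrative choice:

import org.apache.spark.mllib.clustering.LDA

// LDA expects an RDD of (documentId, termFrequencyVector) pairs
val corpus = termFrequencies.zipWithIndex().map { case (vector, id) => (id, vector) }.cache()

// K = 5 topics, one per BBC category
val ldaModel = new LDA().setK(5).setMaxIterations(50).run(corpus)

// For each topic, print the top 10 term IDs together with their scores
ldaModel.describeTopics(10).zipWithIndex.foreach { case ((termIds, weights), topic) =>
  val terms = termIds.zip(weights).map { case (id, w) => s"($id,$w)" }.mkString(" ")
  println(s"Topic $topic: $terms")
}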


Here is the output showing the top five terms and their scores as calculated by the LDA algorithm. These scores are drawn from the term distribution of each topic:

$ sbt "run-main chapter04.TextLDA"
... OUTPUT SKIPPED ...
Topic 0: (3365,3913.65720852964) (3543,4326.574428888365) (3707,4535.491948332051) (96727,4778.215907939703) (114801,11964.705438792551)
Topic 1: (3365,2396.035833693271) (96727,3298.3282136942335) (3543,3527.026301333385) (3707,4862.134970235111) (114801,8261.274505002908)
Topic 2: (96727,2850.4724248980783) (3543,2952.916293771327) (3707,3792.9462669647196) (3365,3998.609192912204) (114801,9039.540687387185)
Topic 3: (96727,4008.45003946005) (3365,4435.302735745526) (3543,5767.906047678759) (3707,6886.8060016365625) (114801,14050.285613710528)
Topic 4: (3365,2877.395029119364) (3543,3382.5769283281697) (96727,3627.5334140079353) (3707,4913.620812831554) (114801,9254.193755106819)