Clustering with Apache Mahout

Clustering is a technique for dividing a dataset into related partitions. In this recipe, we are going to use a specific clustering method called k-means. K-means clustering attempts to divide a dataset into k clusters by assigning each point to its nearest centroid and repeatedly recomputing each centroid as the mean of the points assigned to it, thereby minimizing the distance between the points in a cluster and the cluster's central point.
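For reference, the k-means objective can be written as minimizing the within-cluster sum of squared distances; this is standard background, not anything Mahout-specific:

$$ \underset{S_1,\dots,S_k}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 $$

where $\mu_i$ is the centroid (mean) of cluster $S_i$.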

In this recipe, we will use the Apache Mahout k-means implementation to cluster the words found in Shakespeare's tragedies.

Getting ready

You will need to download, compile, and install the following:

  • Apache Hadoop (http://hadoop.apache.org)
  • Apache Mahout (http://mahout.apache.org)

Extract the contents of shakespeare.zip into a folder named shakespeare_text. The shakespeare.zip archive should contain six works by Shakespeare. Put the shakespeare_text folder and its contents into HDFS:

$ mkdir shakespeare_text
$ cd shakespeare_text
$ unzip shakespeare.zip
$ cd ..
$ hadoop fs -put shakespeare_text /user/hadoop
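
To verify the upload before moving on, list the folder in HDFS; you should see the six text files from the archive:

$ hadoop fs -ls /user/hadoop/shakespeare_text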

How to do it...

Carry out the following steps to perform clustering in Mahout:

  1. Convert the Shakespeare text documents into the Hadoop SequenceFile format:
    mahout seqdirectory --input /user/hadoop/shakespeare_text --output /user/hadoop/shakespeare-seqdir --charset utf-8
  2. Convert the text contents of the SequenceFiles into document vectors (a quick way to inspect the output of steps 1 and 2 appears after these steps):
    mahout seq2sparse --input /user/hadoop/shakespeare-seqdir --output /user/hadoop/shakespeare-sparse --namedVector -ml 80 -ng 2 -x 70 -md 1 -s 5 -wt tfidf -a org.apache.lucene.analysis.WhitespaceAnalyzer
  3. Run the k-means clustering algorithm on the document vectors. This command will launch up to ten MapReduce jobs. Also, since we are using k-means clustering, we need to specify the number of clusters we want:
    mahout kmeans --input /user/hadoop/shakespeare-sparse/tfidf-vectors --output /user/hadoop/shakespeare-kmeans/clusters --clusters /user/hadoop/shakespeare-kmeans/initialclusters --maxIter 10 --numClusters 6 --clustering --overwrite
  4. To check the clusters identified by Mahout, use the following command:
    mahout clusterdump --seqFileDir /user/hadoop/shakespeare-kmeans/clusters/clusters-1-final --numWords 5 --dictionary /user/hadoop/shakespeare-sparse/dictionary.file-0 --dictionaryType sequencefile
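
Before examining the clusterdump output, it can help to sanity-check the intermediate results of steps 1 and 2 in HDFS. The following sketch lists the SequenceFile and vector folders, and dumps the term dictionary; note that the seqdumper tool's input flag differs between Mahout releases (-s/--seqFile in older versions, -i/--input in newer ones), so adjust to your installation:

$ hadoop fs -ls /user/hadoop/shakespeare-seqdir
$ hadoop fs -ls /user/hadoop/shakespeare-sparse
$ mahout seqdumper -s /user/hadoop/shakespeare-sparse/dictionary.file-0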

The output of the clusterdump tool can be overwhelming; look for the Top Terms: section. For example, the following are the top terms for the Romeo and Juliet cluster identified by the k-means algorithm:

        Top Terms:
                ROMEO                        =>   29.15485382080078
                JULIET                       =>   25.78818130493164
                CAPULET                      =>  21.401729583740234
                the                          =>  20.942245483398438
                Nurse                        =>  20.129182815551758
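
If the full report is hard to read in a terminal, clusterdump can also write it to a file on the local filesystem. The --output flag below exists in the Mahout releases we are aware of, but confirm with mahout clusterdump --help on your installation; the /tmp path is just an example:

$ mahout clusterdump --seqFileDir /user/hadoop/shakespeare-kmeans/clusters/clusters-1-final --numWords 5 --dictionary /user/hadoop/shakespeare-sparse/dictionary.file-0 --dictionaryType sequencefile --output /tmp/shakespeare-clusters.txt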

How it works...

The initial steps perform some pre-processing on the raw text data prior to running the k-means algorithm with Mahout. The seqdirectory tool simply converts the contents of an HDFS folder into SequenceFiles. Next, the seq2sparse tool converts the newly created SequenceFiles (which still contain text) into document vectors. The arguments to seq2sparse are described in the following list; a variant of the command illustrating them appears after the list:

  • --input: A folder in HDFS containing SequenceFiles formatted for Mahout.
  • --output: The output HDFS folder where the document vectors will be stored.
  • --namedVector: A flag that makes Mahout emit named vectors, so each document vector retains the name of the document it was built from.
  • -ml: The minimum log-likelihood ratio (LLR) threshold for accepting an n-gram. We set this to a high number because we only want to keep the most significant n-grams.
  • -ng: The n-gram size.
  • -x: The maximum document frequency, as a percentage of documents, above which a term is discarded. In this recipe we chose 70, meaning that any term appearing in more than 70 percent of the documents will be discarded. Use this setting to discard meaningless words (for example, words such as at, a, and the).
  • -md: The minimum number of documents a term should occur in before it will be considered for processing. In this recipe, we used 1, which means that a term only needs to appear in one document to be processed.
  • -s: The minimum number of times a term must appear in a document before it will be considered for processing.
  • -wt: The weighting algorithm to use. Here we chose TF-IDF. The other option is TF, which does not down-weight terms that appear in nearly every document and so would not help us identify key n-grams.
  • -a: The type of analyzer that should be used. An analyzer is used to transform a text document. The WhitespaceAnalyzer splits a document on whitespace into tokens. The tokens will be kept, combined, or discarded based on the other flags provided to the seq2sparse application.
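
As a concrete illustration of these flags, here is a hypothetical variant of the step 2 command that extracts trigrams instead of bigrams and tightens the maximum document frequency; only the -ng and -x values (and the output path) change, everything else is the same as before:

$ mahout seq2sparse --input /user/hadoop/shakespeare-seqdir --output /user/hadoop/shakespeare-sparse-trigrams --namedVector -ml 80 -ng 3 -x 60 -md 1 -s 5 -wt tfidf -a org.apache.lucene.analysis.WhitespaceAnalyzer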

Finally, we ran the k-means clustering algorithm on the Shakespeare dataset. Mahout launches a series of MapReduce jobs, one per iteration. The k-means job completes when either the k-means clusters converge or the maximum allowed number of iterations has been reached. The following are definitions of the parameters we used to configure the k-means Mahout job; a short sketch for inspecting the resulting clusters follows the list:

  • --input: The folder in HDFS containing the document vectors.
  • --output: The output folder in HDFS of the k-means job.
  • --maxIter: The maximum number of k-means iterations; each iteration is a separate MapReduce job.
  • --numClusters: The number of clusters we want to identify. We chose 6 because there were six Shakespeare documents, and we wanted to identify significant bi-grams around those documents.
  • --clusters: The HDFS folder holding the initial cluster centroids. Because --numClusters is also specified, Mahout seeds this folder with randomly selected initial centroids before iterating.
  • --clustering: A flag that tells Mahout to assign each input vector to its nearest cluster once the final centroids have been computed.
  • --overwrite: A flag that tells Mahout to overwrite the output folder.
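
To see which documents landed in which cluster, you can dump the point-to-cluster assignments. The clusteredPoints folder below is an assumption based on Mahout's usual k-means output layout (it is created under the --output path when --clustering is passed), and as noted earlier the seqdumper input flag varies across releases:

$ hadoop fs -ls /user/hadoop/shakespeare-kmeans/clusters
$ mahout seqdumper -s /user/hadoop/shakespeare-kmeans/clusters/clusteredPoints/part-m-00000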

See also

  • Sentiment classification with Apache Mahout