Sentiment classification with Apache Mahout

Sentiment classification tries to determine a person's propensity to like or dislike certain items. In this recipe, we will use a naive Bayes classifier from Apache Mahout to determine whether the terms found in a movie review indicate that the movie had a negative or a positive reception.

Getting ready

You will need to download, compile, and install the following:

Extract the movie review dataset review_polarity.tar.gz into your current working folder. You should see a newly created folder named txt_sentoken. Within that folder, there should be two more folders named pos and neg, which hold text files containing the written reviews of movies. The pos folder contains positive movie reviews, and the neg folder contains negative reviews.

How to do it...

  1. Run the reorg_data.py script from the folder you are currently working on to transform the data into training and test sets for the Mahout classifier:
    $ ./reorg_data.py txt_sentoken train test
  2. Prepare the dataset for the Mahout classifier:

    This tool reads from and writes to the local filesystem, not HDFS.

    $ mahout prepare20newsgroups -p train -o train_formated -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
    $ mahout prepare20newsgroups -p test -o test_formated -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
  3. Place the train_formated and test_formated folders into HDFS:
    $ hadoop fs -put train_formated /user/hadoop/
    $ hadoop fs -put test_formated /user/hadoop/
  4. Train the naive Bayes classifier using the train_formated dataset:
    $ mahout trainclassifier -i /user/hadoop/train_formated -o /user/hadoop/reviews/naive-bayes-model -type bayes -ng 2 -source hdfs
  5. Test the classifier using the test_formated dataset:
    $ mahout testclassifier -m /user/hadoop/reviews/naive-bayes-model -d /user/hadoop/test_formated -type bayes -ng 2 -source hdfs -method sequential
  6. The testclassifier tool should return a summary and confusion matrix similar to the following. Your numbers will not match these exactly:
    Summary
    -------------------------------------------------------
    Correctly Classified Instances       :     285         71.25%
    Incorrectly Classified Instances     :     115         28.75%
    Total Classified Instances           :     400
    
    =======================================================
    Confusion Matrix
    -------------------------------------------------------
    a       b       <--Classified as
    97      103      |  200         a     = pos
    12      188      |  200         b     = neg

How it works...

The first two steps prepared the data for the Mahout naive Bayes classifier. The reorg_data.py script distributed the positive and negative reviews from the txt_sentoken folder into a training set and a test set. 80 percent of the reviews were placed into the training set, and the remaining 20 percent were used as the test set. Next, we used the prepare20newsgroups tool to convert the training and test datasets into a format compatible with the Mahout classifier. The example dataset included in Mahout has a layout similar to the data produced by the reorg_data.py script, which is why we can use the prepare20newsgroups tool. All the prepare20newsgroups tool does is combine all of the files in the pos and neg folders into a single file per dataset class (negative or positive). So, instead of the original 1000 positive and 1000 negative files, where each file contained a single review, the training and test sets now each have two files named pos.txt and neg.txt, which contain all of that set's positive and negative reviews respectively.
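The reorg_data.py script itself is not reproduced in this recipe, so the following is only a minimal sketch of the kind of 80/20 split described above. The function name split_reviews, its parameters, and the seeded shuffle are assumptions made for illustration and may differ from what the actual script does.

```python
import random
import shutil
from pathlib import Path


def split_reviews(src_dir, train_dir, test_dir, train_fraction=0.8, seed=0):
    """Copy each class folder under src_dir (for example pos and neg)
    into matching class folders under train_dir and test_dir.

    Hypothetical helper: not the actual contents of reorg_data.py.
    Returns {class_name: (train_count, test_count)}.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    counts = {}
    for class_dir in sorted(p for p in Path(src_dir).iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)
        cutoff = int(len(files) * train_fraction)
        subsets = ((train_dir, files[:cutoff]), (test_dir, files[cutoff:]))
        for dest_root, subset in subsets:
            dest = Path(dest_root) / class_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            for f in subset:
                shutil.copy(f, dest / f.name)
        counts[class_dir.name] = (cutoff, len(files) - cutoff)
    return counts
```

Calling split_reviews("txt_sentoken", "train", "test") on the extracted dataset would, under these assumptions, leave 80 percent of each class in train and the remaining 20 percent in test, keeping the pos and neg subfolders that prepare20newsgroups expects.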

Next, we trained a naive Bayes classifier with an n-gram size of 2, specified with the -ng flag, using the train_formated dataset in HDFS. Mahout trains the classifier by launching a series of MapReduce jobs.
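To make the n-gram size concrete, here is a small sketch of how word pairs (bigrams) can be formed from a tokenized review. It illustrates the general idea only. The ngrams function is a hypothetical helper, and the sketch makes no claim about how Mahout tokenizes text internally or whether it also keeps single terms alongside the pairs.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


review = "this movie was not good".split()
print(ngrams(review))
# ['this movie', 'movie was', 'was not', 'not good']
```

Pairs such as "not good" carry sentiment that the individual words "not" and "good" would lose, which is one reason an n-gram size larger than 1 can help in a task like this.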

Finally, we ran the testclassifier tool to test the classifier we created in step 4 against the test_formated data in HDFS. As we can see from step 6, we correctly classified 71.25 percent of the test data. It is important to note that this statistic only describes performance on this particular test set; it does not mean the classifier will be 71.25 percent accurate on movie reviews in general. There are a number of ways in which classifiers can be trained and validated, but those techniques go beyond the scope of this book.
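The summary figure can be reproduced from the confusion matrix shown in step 6. The sketch below assumes that each row of the matrix is the actual class and each column is the class the classifier assigned; the dictionary layout and function names are illustrative only.

```python
# Confusion matrix from step 6: outer key = actual class,
# inner key = class assigned by the classifier (assumed orientation).
confusion = {
    "pos": {"pos": 97, "neg": 103},
    "neg": {"pos": 12, "neg": 188},
}


def accuracy(matrix):
    """Fraction of all instances whose assigned class equals the actual class."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


def recall(matrix, label):
    """Fraction of the instances of one actual class that were classified correctly."""
    row = matrix[label]
    return row[label] / sum(row.values())
```

With these numbers, accuracy(confusion) is 285 / 400 = 0.7125, matching the 71.25 percent in the summary. The per-class view is more revealing: recall(confusion, "neg") is 188 / 200 = 0.94, while recall(confusion, "pos") is only 97 / 200 = 0.485, so in this run the classifier labeled more than half of the positive reviews as negative.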

There's more...

The testclassifier tool we ran in step 5 did not launch a MapReduce job; it tested the classifier in local mode. To test the classifier using MapReduce instead, we just need to change the -method parameter to mapreduce:

$ mahout testclassifier -m /user/hadoop/reviews/naive-bayes-model -d /user/hadoop/test_formated -type bayes -ng 2 -source hdfs -method mapreduce