Chapter 8. Applying Machine Learning to Sentiment Analysis

In the age of the Internet and social media, people's opinions, reviews, and recommendations have become a valuable resource for political science and businesses. Modern technologies allow us to collect and analyze such data efficiently. In this chapter, we will delve into a subfield of natural language processing (NLP) called sentiment analysis and learn how to use machine learning algorithms to classify documents based on their polarity, that is, the attitude of the writer. The topics that we will cover in the following sections include:

  • Cleaning and preparing text data
  • Building feature vectors from text documents
  • Training a machine learning model to classify positive and negative movie reviews
  • Working with large text datasets using out-of-core learning

Obtaining the IMDb movie review dataset

Sentiment analysis, sometimes also called opinion mining, is a popular subdiscipline of the broader field of NLP that is concerned with analyzing the polarity of documents. A common task in sentiment analysis is the classification of documents based on the expressed opinions or emotions of the authors with regard to a particular topic.

In this chapter, we will be working with a large dataset of movie reviews from the Internet Movie Database (IMDb) that has been collected by Maas et al. (A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning Word Vectors for Sentiment Analysis. In the proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics). The movie review dataset consists of 50,000 polar movie reviews that are labeled as either positive or negative; here, positive means that a movie was rated with more than six stars on IMDb, and negative means that a movie was rated with fewer than five stars on IMDb. In the following sections, we will learn how to extract meaningful information from a subset of these movie reviews to build a machine learning model that can predict whether a certain reviewer liked or disliked a movie.
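
The labeling rule described above can be written down as a small function. This is only an illustrative sketch: the name polarity_from_stars is hypothetical and not part of the dataset or its tooling, and the downloaded reviews already come sorted into pos and neg folders, so no such conversion is needed in practice.

```python
def polarity_from_stars(stars):
    """Map an IMDb star rating to a class label, following the rule in the text.

    Returns 1 (positive) for more than six stars, 0 (negative) for fewer than
    five stars, and None for ratings that fall under neither label.
    """
    if stars > 6:
        return 1
    if stars < 5:
        return 0
    return None


# Print the label for a low, a middling, and a high rating.
for stars in (2, 5, 9):
    print(stars, polarity_from_stars(stars))
```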

The movie review dataset (84.1 MB) can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/ as a gzip-compressed tarball archive:

  • If you are working with Linux or Mac OS X, you can open a new terminal window, use cd to go into the download directory, and execute tar -zxf aclImdb_v1.tar.gz to decompress the dataset
  • If you are working with Windows, you can download a free archiver such as 7-Zip (http://www.7-zip.org) to extract the files from the download archive
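
As a platform-independent alternative to the two options above, the archive can also be unpacked from Python with the standard library's tarfile module. The following is a minimal sketch; the helper name extract_archive is hypothetical, and the example assumes that aclImdb_v1.tar.gz was downloaded into the current working directory.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir='.'):
    """Unpack a gzip-compressed tarball into dest_dir.

    Returns the sorted top-level names contained in the archive.
    """
    # Newer Python versions support extraction filters; use the safe 'data'
    # filter when it is available and fall back to the default otherwise.
    kwargs = {'filter': 'data'} if hasattr(tarfile, 'data_filter') else {}
    with tarfile.open(archive_path, 'r:gz') as tar:
        members = tar.getmembers()
        tar.extractall(path=dest_dir, **kwargs)
    return sorted({member.name.split('/')[0] for member in members})


if __name__ == '__main__':
    # Assumption: the archive sits in the current working directory.
    if os.path.exists('aclImdb_v1.tar.gz'):
        print(extract_archive('aclImdb_v1.tar.gz'))
```

This does the same job as tar -zxf, so either route should leave an aclImdb directory next to the archive.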

Having successfully extracted the dataset, we will now assemble the individual text documents from the decompressed download archive into a single CSV file. In the following code section, we will be reading the movie reviews into a pandas DataFrame object, which can take up to 10 minutes on a standard desktop computer. To visualize the progress and estimated time until completion, we will use the PyPrind (Python Progress Indicator, https://pypi.python.org/pypi/PyPrind/) package that I developed several years ago for such purposes. PyPrind can be installed by executing the command: pip install pyprind.

>>> import pyprind
>>> import pandas as pd
>>> import os
>>> pbar = pyprind.ProgBar(50000)
>>> labels = {'pos':1, 'neg':0}
>>> df = pd.DataFrame()
>>> for s in ('test', 'train'):
...    for l in ('pos', 'neg'):
...        path = './aclImdb/%s/%s' % (s, l)
...        for file in os.listdir(path):
...            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
...                txt = infile.read()
...            # DataFrame.append was removed in pandas 2.0; pd.concat appends the row instead
...            df = pd.concat([df, pd.DataFrame([[txt, labels[l]]])], ignore_index=True)
...            pbar.update()
>>> df.columns = ['review', 'sentiment']
0%                          100%
[##############################] | ETA[sec]: 0.000 
Total time elapsed: 725.001 sec

In the preceding code, we first initialized a new progress bar object pbar with 50,000 iterations, which is the number of documents we were going to read in. Using nested for loops, we iterated over the train and test subdirectories in the main aclImdb directory and read the individual text files from the pos and neg subdirectories. We appended each review to the DataFrame df together with an integer class label (1 = positive and 0 = negative).
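
Growing a DataFrame one row at a time copies the existing data on every iteration, which contributes to the long running time. A common alternative is to collect the rows in a plain Python list first. The following is a minimal sketch of that idea, assuming the same aclImdb directory layout; the helper name read_reviews and the LABELS constant are hypothetical, and the progress bar is left out for brevity.

```python
import os

LABELS = {'pos': 1, 'neg': 0}


def read_reviews(base_dir='./aclImdb'):
    """Collect (review text, label) pairs from base_dir/{test,train}/{pos,neg}."""
    rows = []
    for split in ('test', 'train'):
        for label_name in ('pos', 'neg'):
            path = os.path.join(base_dir, split, label_name)
            # sorted() makes the reading order reproducible across platforms
            for filename in sorted(os.listdir(path)):
                file_path = os.path.join(path, filename)
                with open(file_path, 'r', encoding='utf-8') as infile:
                    rows.append((infile.read(), LABELS[label_name]))
    return rows
```

The resulting list can then be turned into a DataFrame in a single step, for example with pd.DataFrame(rows, columns=['review', 'sentiment']).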

Since the class labels in the assembled dataset are sorted, we will now shuffle the DataFrame using the permutation function from the np.random submodule. This will be useful for splitting the dataset into training and test sets in later sections, when we will stream the data directly from our local drive. For our own convenience, we will also store the assembled and shuffled movie review dataset as a CSV file:

>>> import numpy as np
>>> np.random.seed(0)
>>> df = df.reindex(np.random.permutation(df.index))
>>> df.to_csv('./movie_data.csv', index=False)
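
To see what the shuffling step does, it can help to try it on a tiny example. The sketch below uses a made-up five-row DataFrame (the review texts are placeholders, not taken from the dataset) and applies the same reindex and permutation calls as above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the review DataFrame; the texts are made up for illustration.
toy = pd.DataFrame({'review': ['good', 'great', 'fine', 'bad', 'awful'],
                    'sentiment': [1, 1, 1, 0, 0]})

np.random.seed(0)                          # fixed seed makes the shuffle repeatable
order = np.random.permutation(toy.index)   # shuffled copy of the index labels
shuffled = toy.reindex(order)              # rows rearranged into that order

print(shuffled)
```

Each review stays paired with its label because reindex moves whole rows; only their order changes. Since to_csv is called with index=False, the shuffled index labels are not written to the file, so reading the CSV back in yields the rows in shuffled order with a fresh index starting at 0.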

Since we are going to use this dataset later in this chapter, let us quickly confirm that we successfully saved the data in the right format by reading in the CSV and printing an excerpt of the first three samples:

>>> df = pd.read_csv('./movie_data.csv')
>>> df.head(3)

If you are running the code examples in IPython Notebook, you should now see the first three samples of the dataset, as shown in the following table:

[Table: the first three rows of the movie review DataFrame, with the columns review and sentiment]