The Naïve Bayes classifier

We now have the necessary tools to learn about our first and simplest graphical model, the Naïve Bayes classifier. This is a directed graphical model that contains a single parent node and a series of child nodes representing random variables that are dependent only on this node with no dependencies between them. Here is an example:

[Figure: The Naïve Bayes classifier network, with a single Sentiment parent node and child nodes for individual words such as sad and fun]

We usually interpret our single parent node as the causal node, so in our particular example, the value of the Sentiment node will influence the value of the sad node, the fun node, and so on. As this is a Bayesian network, the local Markov property can be used to explain the core assumption of the model. Given the Sentiment node, all other nodes are independent of each other.

In practice, we use the Naïve Bayes classifier in a context where we can observe and measure the child nodes and attempt to estimate the parent node as our output. Thus, the child nodes will be the input features of our model, and the parent node will be the output variable. For example, the child nodes may represent various medical symptoms and the parent node might be whether a particular disease is present.

To understand how the model works in practice, we turn to Bayes' Theorem, where C is the parent node and the F_i are the children, or feature nodes:

P(C \mid F_1, \ldots, F_n) = \frac{P(C)\, P(F_1, \ldots, F_n \mid C)}{P(F_1, \ldots, F_n)}

We can simplify this using the conditional independence assumptions of the network:

P(C \mid F_1, \ldots, F_n) = \frac{P(C) \prod_{i=1}^{n} P(F_i \mid C)}{P(F_1, \ldots, F_n)}

To make a classifier out of this probability model, our objective is to choose the class C_i that maximizes the posterior probability P(C_i | F_1, ..., F_n); that is, the posterior probability of that class given the observed features. The denominator is the joint probability of the observed features, which is not influenced by the class that is chosen. Consequently, maximizing the posterior class probability amounts to maximizing the numerator of the previous equation:

\hat{C} = \underset{C_i}{\arg\max}\ P(C_i) \prod_{j=1}^{n} P(F_j \mid C_i)

Given some data, we can estimate the probabilities P(F_i | C_j), for all the different values of the feature F_i, as the relative proportion of the observations of class C_j that have each different value of feature F_i. We can also estimate P(C_j) as the relative proportion of the observations that are assigned to class C_j. These are the maximum likelihood estimates.
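
As a quick illustration of these estimates, the following sketch computes them on a tiny, made-up data set with a single binary feature named fun; the data and the variable names are purely hypothetical:

> # Toy data: five observations of a class label and a single binary feature
> toy <- data.frame(sentiment = c("pos", "pos", "pos", "neg", "neg"),
                    fun       = c(1, 1, 0, 0, 0))
> # Estimate of P(C_j): the relative proportion of each class
> prop.table(table(toy$sentiment))
> # Estimate of P(F_i | C_j): the proportion of each value of the
> # feature within each class (rows sum to one)
> prop.table(table(toy$sentiment, toy$fun), margin = 1)

In the next section, we will see how the Naïve Bayes classifier works on a real example.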

Predicting the sentiment of movie reviews

In a world of online reviews, forums, and social media, a task that has received, and continues to receive, a growing amount of interest is the task of sentiment analysis. Put simply, the task is to analyze a piece of text to determine the sentiment that is being expressed by the author. A typical scenario involves collecting online reviews, blog posts, or tweets and building a model that predicts whether the user is trying to express a positive or a negative feeling. Sometimes, the task can be framed to capture a wider variety of sentiments, such as a neutral sentiment or the degree of sentiment, such as mildly negative versus very negative.

In this section, we will limit ourselves to the simpler task of discerning positive from negative sentiments. We will do this by modeling sentiment, using a similar Bayesian network to the one that we saw in the previous section. The sentiment is our target output variable, which is either positive or negative. Our input features are all binary features that describe whether a particular word is present in a movie review. The key idea here is that users expressing a negative sentiment will tend to choose from a characteristic set of words in their review that is different from the characteristic set that users would pick from when writing a positive review.

By using the Naïve Bayes model, our assumption will be that if we know the sentiment being expressed, the presence of each word in the text is independent from all the other words. Of course, this is a very strict assumption to use and doesn't speak at all to the process of how real text is written. Nonetheless, we will show that even under these strict assumptions, we can build a model that performs reasonably well.

We will use the Large Movie Review Data Set, first presented in the paper titled Learning Word Vectors for Sentiment Analysis by Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts, published in The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011). The data is hosted at http://ai.stanford.edu/~amaas/data/sentiment/ and consists of a training set of 25,000 movie reviews and a test set of another 25,000 movie reviews.

In order to demonstrate how the model works, we would like to keep the training time of our model low. For this reason, we are going to partition the original training set into a new training and test set, but the reader is very strongly encouraged to repeat the exercise with the larger test data set that is part of the original data. When downloaded, the data is organized into a train folder and a test folder. The train folder contains a folder called pos that has 12,500 positive movie reviews, each inside a separate text file, and similarly, a folder called neg with 12,500 negative movie reviews.

Our first task is to load all this information into R and perform some necessary preprocessing. To do this, we are going to install and use the tm package, which is a specialized package for performing text-mining operations. This package is very useful when working with text data and we will use it again in a subsequent chapter.

When working with the tm package, the first task is to organize the various sources of text into a corpus. In linguistics, this commonly refers to a collection of documents. In the tm package, it is just a collection of strings representing individual sources of text, along with some metadata that describes some information about them, such as the names of the files from which they were retrieved.

With the tm package, we build a corpus using the Corpus() function, to which we must provide a source for the various documents we want to import. One option is to create a vector of strings and pass it to Corpus() via the VectorSource() function; a brief sketch of this approach follows. As our data source is instead a series of text files in a directory, we will use the DirSource() function.
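
Purely for illustration, here is a minimal sketch of the VectorSource() route, using two made-up review strings (the variable names toy_reviews and toy_corpus are hypothetical and not used again):

> library("tm")
> toy_reviews <- c("An inspiring and moving film", "Boring and full of cliches")
> toy_corpus <- Corpus(VectorSource(toy_reviews))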

To build our corpus from the data set, we first create two string variables that contain the absolute paths to the aforementioned neg and pos folders on our machine (these will depend on where the data set was downloaded). Then, we can use the Corpus() function twice to create two corpora for positive and negative reviews, which will then be merged into a single corpus:

> path_to_neg_folder <- "~/aclImdb/train/neg"
> path_to_pos_folder <- "~/aclImdb/train/pos"
> library("tm")
> nb_pos <- Corpus(DirSource(path_to_pos_folder), 
                   readerControl = list(language = "en"))
> nb_neg <- Corpus(DirSource(path_to_neg_folder), 
                  readerControl = list(language = "en"))
> nb_all <- c(nb_pos, nb_neg, recursive = T)

The second argument to the Corpus() function, readerControl, is a list of optional parameters. We used this to specify that the language of our text files is English. The recursive parameter in the c() function used to merge the two corpora is necessary to maintain the metadata information stored in the corpus objects.

Note that we can merge the two corpora without actually losing the sentiment label. Each text file representing a movie review is named using the format <counter>_<score>.txt, and this information is stored in the metadata portion of the corpus object created by the Corpus() function. We can see the metadata for the first review in our corpus using the meta() function:

> meta(nb_all[[1]])
Metadata:
  author       : character(0)
  datetimestamp: 2015-04-19 09:17:48
  description  : character(0)
  heading      : character(0)
  id           : 0_9.txt
  language     : en
  origin       : character(0)

The meta() function thus retrieves a metadata object for each entry in our corpus. The ID attribute in this object contains the name of the file. The score part of the name is a number between 0 and 10, where higher numbers denote positive reviews, and low numbers denote negative reviews. In the training data, we only have polar reviews; that is, reviews that are in the ranges 0-4 and 7-10. We can thus use this information to create a vector of document names:

> ids <- sapply( 1 : length(nb_all),
                 function(x) meta(nb_all[[x]], "id"))
> head(ids)
[1] "0_9.txt"     "1_7.txt"     "10_9.txt"    "100_7.txt"
[5] "1000_8.txt"  "10000_8.txt"

From this list of document names, we'll extract the score component using the sub() function with an appropriate regular expression. Since only polar reviews appear in the data, we label a review as positive if its score is 5 or greater, and as negative otherwise:

> scores <- as.numeric(sapply(ids,
            function(x) sub("[0-9]+_([0-9]+)\\.txt", "\\1", x)))
> scores <- factor(ifelse(scores >= 5, "positive", "negative"))
> summary(scores)
negative positive
   12500    12500

Tip

The sub() function is just one of R's functions that uses regular expressions. For readers unfamiliar with the concept, a regular expression is essentially a pattern language for describing strings. Online tutorials for regular expressions are easy to find. An excellent resource for learning about regular expressions as well as text processing more generally is Speech and Language Processing Second Edition, Jurafsky and Martin.
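
As a quick illustration of the pattern used above, applied here to a single made-up file name, the parenthesized group captures the digits of the score, and \\1 in the replacement string refers back to that captured group:

> sub("[0-9]+_([0-9]+)\\.txt", "\\1", "123_8.txt")
[1] "8"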

The features of our model will be binary features that describe the presence or absence of specific words in the dictionary. Intuitively, we should expect that a movie review containing words such as boring, cliché, and horrible is likely to be a negative review. A movie review with words such as inspiring, enjoyable, moving, and excellent is likely to be a good review.

When working with text data, we almost always need to perform a series of preprocessing steps. For example, we tend to convert all the words to a lowercase format because we don't want to have two separate features for the words Excellent and excellent. We also want to remove anything from our text that will likely be uninformative as features. For this reason, we tend to remove punctuation, numbers, and stop words. Stop words are words like the, and, in, and he, which are very frequently used in the English language and are bound to appear in nearly all of the movie reviews. Finally, because we are removing words from sentences and creating repeated spaces, we will want to remove these in order to assist the process of tokenization (the process of splitting up the text into words).

The tm package has two functions, tm_map() and content_transformer(), which together can be used to apply text transformations to the content of every entry in our corpus:

> nb_all <- tm_map(nb_all, content_transformer(removeNumbers))
> nb_all <- tm_map(nb_all, content_transformer(removePunctuation))
> nb_all <- tm_map(nb_all, content_transformer(tolower))
> nb_all <- tm_map(nb_all, content_transformer(removeWords), 
                           stopwords("english"))
> nb_all <- tm_map(nb_all, content_transformer(stripWhitespace))

Now that we have preprocessed our corpus, we are ready to compute our features. Essentially, what we need is a data structure known as a document term matrix. The rows of the matrix are the documents. The columns of the matrix are the words in our dictionary. Each entry in the matrix is a binary value, with 1 representing the fact that the word represented by the column number was found inside the review represented by the row number. For example, if the first column corresponds to the word action, the fourth row corresponds to the fourth movie review, and the value of the matrix at position (4,1) is 1, this signifies that the fourth movie review contains the word action.

The tm package provides us with the DocumentTermMatrix() function that takes in a corpus object and builds a document term matrix. The particular matrix built has numerical entries that represent the total number of times a particular word is seen inside a particular text, so we will have to convert these into a binary factor afterwards.

> nb_dtm <- DocumentTermMatrix(nb_all)
> dim(nb_dtm)
[1]  25000 117473

Our document term matrix in this case has 117,473 columns, indicating that we have found this number of different words in the corpus. This matrix is very sparse, meaning that most of the entries are 0. This is a very typical scenario when building document term matrices for text documents, especially text documents that are as short as movie reviews. Any particular movie review will only feature a tiny fraction of the words in the vocabulary. Let's examine our matrix to see just how sparse it is:

> nb_dtm
<<DocumentTermMatrix (documents: 25000, terms: 117473)>>
Non-/sparse entries: 2493414/2934331586
Sparsity           : 100%
Maximal term length: 64
Weighting          : term frequency (tf)

From the ratio of non-sparse to sparse entries, we can see that of the 2,936,825,000 entries in the matrix (25000 × 117473), only 2,493,414 are nonzero. At this point, we should reduce the number of columns of this matrix for two reasons. On the one hand, because the words in our vocabulary will become the features in our model, we don't want to build a model that uses 117,473 features. This would take a very long time to train and at the same time is unlikely to provide us with a decent fit using only 25,000 data points.

Another significant reason for us to want to reduce the number of columns is that many words will appear only once or twice in the whole corpus, and will be as uninformative about the user's sentiment as words that occur in nearly all the documents. Given this, we have a natural way to reduce the dimensions of the document term matrix, namely by dropping the columns (that is, removing certain words from the feature set) that are the sparsest. We can remove all columns that have a certain percentage of sparse elements using the removeSparseTerms() function. The first argument that we must provide this with is a document term matrix, and the second is the maximum degree of column sparseness that we will allow. Choosing the degree of sparseness is tricky, because we don't want to throw away too many of the columns that will become our features. We will proceed by running our experiments with 99 percent sparseness, but encourage the reader to repeat with different values to see the effect this has on the number of features and model performance.
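
As a quick way to gauge the effect of the sparseness threshold on the number of surviving terms, we can tabulate a few candidate values before committing to one. This is only an exploratory sketch, and the particular thresholds chosen here are arbitrary:

> sapply(c(0.95, 0.99, 0.995), function(s)
         ncol(removeSparseTerms(nb_dtm, sparse = s)))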

We have 25,000 rows in the matrix corresponding to the total number of documents in our corpus. If we allow a maximum of 99 percent sparseness, we are effectively removing words that do not occur in at least 1 percent of those 25,000 documents; that is, in at least 250 documents:

> nb_dtm <- removeSparseTerms(x = nb_dtm, sparse = 0.99)
> dim(nb_dtm)
[1] 25000  1603

We have now significantly reduced the number of columns down to 1,603. This is a substantially more reasonable number of features for us to work with. Next, we convert all entries to binary, using another function of tm, weightBin().

> nb_dtm <- weightBin(nb_dtm)

As the document term matrix is in general a very sparse matrix, R uses a compact data structure to store the information. To peek inside this matrix and examine the first few terms, we will use the inspect() function on a small slice of this matrix:

> inspect(nb_dtm[10:16, 1:6])
<<DocumentTermMatrix (documents: 7, terms: 6)>>
Non-/sparse entries: 2/40
Sparsity           : 95%
Maximal term length: 10
Weighting          : binary (bin)

             Terms
Docs          ability able absolute absolutely absurd academy
  10004_8.txt       0    1        0          0      0       0
  10005_7.txt       0    0        0          0      0       0
  10006_7.txt       0    0        0          0      0       0
  10007_7.txt       0    0        0          0      0       0
  10008_7.txt       0    0        0          0      0       1
  10009_9.txt       0    0        0          0      0       0
  1001_8.txt        0    0        0          0      0       0

It looks like the word ability does not appear in any of the seven documents shown, whereas the word able appears in the document 10004_8.txt and the word academy appears in 10008_7.txt. We now have both our features and our output vector. The next step is to convert our document term matrix into a data frame, as this is what the function that will train our Naïve Bayes model expects. Then, before we train the model, we will split our data into a training set with 80 percent of the documents and a test set with the remaining 20 percent, as follows:

> nb_df <- as.data.frame(as.matrix(nb_dtm))
> library(caret)
> set.seed(443452342)
> nb_sampling_vector <- createDataPartition(scores, p = 0.80, 
                                            list = FALSE)
> nb_df_train <- nb_df[nb_sampling_vector,]
> nb_df_test <- nb_df[-nb_sampling_vector,]
> scores_train <- scores[nb_sampling_vector]
> scores_test <- scores[-nb_sampling_vector]

To train a Naïve Bayes model, we will use the naiveBayes() function in the e1071 package that we have seen earlier. The first argument we will provide it with is our feature data frame, and the second argument is our vector of output labels:

> library("e1071")
> nb_model <- naiveBayes(nb_df_train, scores_train)

We can use the predict() function to obtain predictions on our training data:

> nb_train_predictions <- predict(nb_model, nb_df_train)
> mean(nb_train_predictions == scores_train)
[1] 0.83015
> table(actual = scores_train, predictions = nb_train_predictions)
          predictions
actual     negative positive
  negative     8442     1558
  positive     1839     8161

We have hit over 83 percent training accuracy with our simple Naïve Bayes model, which, admittedly, is not bad for such a simple model with an independence assumption that we know is not realistic for our data. Let's repeat the same on our test data:

> nb_test_predictions <- predict(nb_model, nb_df_test)
> mean(nb_test_predictions == scores_test)
[1] 0.8224
> table(actual = scores_test, predictions = nb_test_predictions)
          predictions
actual     negative positive
  negative     2090      410
  positive      478     2022

The test accuracy of over 82 percent is comparable to what we saw on our training data. There are a number of potential avenues for improvement here. The first involves noticing that words such as movie and movies are treated differently, even though they are the same word but inflected. In linguistics, inflection is the process by which the base form or lemma of a word is modified to agree with another word on attributes such as tense, case, gender, and number. For example, in English, verbs must agree with their subject. The tm package supports stemming, a process of removing the inflected part of a word in order to keep just a stem or root word. This is not always the same as retrieving what is known as the morphological lemma of a word, which is what we look up in a dictionary, but is a rough approximation. The tm package uses the well-known Porter Stemmer.

Note

Martin Porter, the author of the Porter Stemmer, maintains a website at http://tartarus.org/martin/PorterStemmer/, which is a great source of information on his famous algorithm.
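
To get a feel for what the stemmer does to individual words, we can call the wordStem() function from the SnowballC package, which recent versions of tm rely on for stemming; the words below are arbitrary examples:

> library("SnowballC")
> wordStem(c("movies", "enjoyable", "inspiring", "horrible"), language = "english")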

To apply stemming, we add a final transformation to our corpus using tm_map() and then recompute the document term matrix, as its columns (the word features) will now be word stems. We then rebuild our data frame and the training and test splits from this new matrix:

> nb_all <- tm_map(nb_all, stemDocument, language = "english")
> nb_dtm <- DocumentTermMatrix(nb_all)
> nb_dtm <- removeSparseTerms(x = nb_dtm, sparse = 0.99)
> nb_dtm <- weightBin(nb_dtm)
> nb_df <- as.data.frame(as.matrix(nb_dtm))
> nb_df_train <- nb_df[nb_sampling_vector,]
> nb_df_test <- nb_df[-nb_sampling_vector,]
> dim(nb_dtm)
[1] 25000  1553

Note that we have fewer columns that match our criterion of 99 percent maximum sparsity. We can use this new document term matrix to train another Naïve Bayes classifier and then measure the accuracy on our test set:

> nb_model_stem <- naiveBayes(nb_df_train, scores_train)
> nb_test_predictions_stem <- predict(nb_model_stem, nb_df_test)
> mean(nb_test_predictions_stem == scores_test)
[1] 0.8
> table(actual = scores_test, predictions = 
                              nb_test_predictions_stem)
          predictions
actual     negative positive
  negative     2067      433
  positive      567     1933

The result, 80 percent, is slightly lower than what we observed without stemming, although we are using slightly fewer features than before. Stemming is not always guaranteed to be a good idea, as in some problems it may improve performance whereas in others it will make no difference or even make things worse. It is, however, a common transformation that is worth trying when working with text data.

A second possible improvement is to use additive smoothing (also known as Laplace smoothing) during the training of our Naïve Bayes model. This is a form of regularization that works by adding a fixed number to all the counts of feature and class combinations during training. Using our original document term matrix, we can compute a Naïve Bayes model with additive smoothing by specifying the laplace parameter of the naiveBayes() function.
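
As a sketch of what this looks like in code, assuming nb_df_train, nb_df_test, scores_train, and scores_test still refer to the original, unstemmed data, and using a smoothing constant of 1 (an arbitrary but common choice):

> nb_model_laplace <- naiveBayes(nb_df_train, scores_train, laplace = 1)
> nb_laplace_predictions <- predict(nb_model_laplace, nb_df_test)
> mean(nb_laplace_predictions == scores_test)

For our particular data set, however, we did not witness any improvements by doing this.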

There are a few more avenues of approach that we might try with a Naïve Bayes model, and we will propose them here for the reader to experiment with. The first of these is that it is often worth manually curating the list of words used as features for the model. When we study the terms selected by our document term matrix, we may find that some words are frequent in our training data but we do not expect them to be frequent in general, or representative of the overall population. Furthermore, we may only want to experiment with words that we know are suggestive of emotion and sentiment. This can be done by specifying a specific dictionary of terms to use when constructing our document term matrix. Here is an example:

> emotion_words <- c("good", "bad", "enjoyed", "hated", "like")
> nb_dtm <- DocumentTermMatrix(nb_all, list(dictionary = 
                                            emotion_words))

It is relatively straightforward to find examples of such lists on the Internet. Another common preprocessing step that is used with a Naïve Bayes model is to remove correlations between features. One way of doing this is to perform PCA, as we saw in Chapter 1, Gearing Up for Predictive Modeling. Furthermore, this method also allows us to begin with a slightly more sparse document term matrix with a larger number of terms, as we know we will be reducing the overall number of features with PCA.
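
The following is only a rough sketch of the PCA idea, assuming nb_dtm is the binary document term matrix from earlier; the choice of 300 components and the variable names are purely illustrative. Note that with continuous principal component scores, naiveBayes() will model each feature with a Gaussian distribution rather than with discrete counts.

> nb_pca <- prcomp(as.matrix(nb_dtm), center = TRUE, scale. = FALSE)
> nb_pca_features <- as.data.frame(nb_pca$x[, 1:300])
> nb_pca_train <- nb_pca_features[nb_sampling_vector, ]
> nb_pca_test <- nb_pca_features[-nb_sampling_vector, ]
> nb_model_pca <- naiveBayes(nb_pca_train, scores_train)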

Potential model improvements notwithstanding, it is important to be aware of the limitations that the Naïve Bayes model imposes, which impede our ability to train a highly accurate sentiment analyzer. The assumption that, once we know the sentiment involved, all the words in a movie review are independent of each other is quite unrealistic. Our model completely disregards sentence structure and word order. For example, the phrase not bad in a review might indicate a positive sentiment, but because we look at words in isolation, we will tend to associate the word bad with a negative sentiment. Negation in general is one of the hardest problems to handle in text processing. Our model also cannot handle common patterns of language, such as sarcasm, irony, quoted passages that include other people's thoughts, and other such linguistic devices.

The next section will introduce a more powerful graphical model.

Note

A good reference to study for the Naïve Bayes classifier is An empirical study of the Naïve Bayes classifier, I. Rish, presented in the 2001 IJCAI workshop on Empirical Methods in AI. For sentiment analysis, we recommend the slides from Bing Liu's AAAI 2011 tutorial at http://www.cs.uic.edu/~liub/FBS/Sentiment-Analysis-tutorial-AAAI-2011.pdf.
