Case study 2 – Naive Bayes classifier

In the previous chapter, we described Naive Bayes as a type of classifier, that is, a statistical model designed to estimate the group membership of observations. Given a sufficient amount of labeled training data, we can fit such a model and then use it to estimate the sentiment of other, unlabeled observations. The key assumption underlying this technique is that at least some words are used at different rates by those with positive and negative sentiments towards a particular target. This section walks through an implementation of Naive Bayes for sentiment classification.
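Formally, Naive Bayes applies Bayes' rule under the simplifying ("naive") assumption that features are conditionally independent given the class. For a document with word features $w_1, \ldots, w_n$ and sentiment class $c \in \{0, 1\}$, the model scores each class as

$$P(c \mid w_1, \ldots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)$$

and predicts whichever class scores higher. Both the class prior $P(c)$ and the conditional word probabilities $P(w_i \mid c)$ are estimated from the labeled training data.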

For demonstrative purposes, we have scraped about 4,000 tweets using the methods set out in Chapter 3, Mining Twitter with R. About half include the hashtag #prolife, and the other half include the hashtag #prochoice. These tweets plausibly satisfy the earlier assumption; that is, tweets using these opposing hashtags probably use different words and phrases to different extents. The following code sets up a data frame of the tweets and appends the variable "hash" (as in hashtag) to the data frame, where 1 denotes the use of #prochoice and 0 denotes the use of #prolife. As this is a supervised method, we use the hashtags as labels for the supervised learning process. The resulting model will allow us to categorize unlabeled tweets in the future.

Suppose that all the tweets are in a list called abortion_tweets, as would be output by the searchTwitter function introduced in Chapter 3, Mining Twitter with R. Also, suppose that we've created a vector named hash that contains a 1 for each #prochoice tweet followed by a 0 for each #prolife tweet.
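A minimal sketch of how such a vector might be built follows; the counts n.prochoice and n.prolife are hypothetical, and the sketch assumes that the #prochoice tweets come first in abortion_tweets:

# hypothetical counts of the tweets scraped with each hashtag
> n.prochoice <- 2000
> n.prolife <- 2000
# a 1 for each #prochoice tweet, then a 0 for each #prolife tweet;
# this assumes abortion_tweets lists the #prochoice tweets first
> hash <- c(rep(1, n.prochoice), rep(0, n.prolife))

With hash in hand, we can set up the data frame as follows: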

# generate a data frame from the list of tweets
> require(twitteR)
> twtsdf<- twListToDF(abortion_tweets)

> twtsdf$hash<- hash

# Drop unneeded variables from the data frame
> keeps <- c("text", "id", "retweetCount", "isRetweet", "screenName", "hash")
> twtsdf<- twtsdf[,keeps]

The following loop generates a list of the tweets, where each tweet is broken into a vector of separate words rather than kept as a single string. It also compiles a vector of the usernames, called names. Lastly, we separate and keep the vector of hashtag labels and name it outcome, as shown in the following code snippet:

> list.vector.words<- list()
> allwords<- NULL

> names<- NULL
> for (i in 1:dim(twtsdf)[1]){
    # split each tweet into a vector of words (note the space in split)
    each.vector <- strsplit(twtsdf$text[i], split=" ")[[1]]
    names <- c(names, twtsdf$screenName[i])
    allwords <- c(allwords, each.vector)
    list.vector.words[[i]] <- each.vector
  }

> outcome<- twtsdf$hash

Next, we use the tm package to create a corpus like the one shown in Chapter 3, Mining Twitter with R. We also remove the hashtag terms themselves; they serve as labels for the supervised learning process, so leaving them in the text would let the model predict the outcome trivially. The preprocessing steps are as follows:

> require(tm)
# make a corpus
> dat.tm <- Corpus(VectorSource(list.vector.words))
# convert all words to lowercase (content_transformer keeps the corpus structure intact)
> dat.tm <- tm_map(dat.tm, content_transformer(tolower))
# remove punctuation (this also strips the # from the hashtags)
> dat.tm <- tm_map(dat.tm, removePunctuation)
# remove the hashtag terms themselves
> dat.tm <- tm_map(dat.tm, removeWords, words=c("prochoice"))
> dat.tm <- tm_map(dat.tm, removeWords, words=c("prolife"))
# remove extra white space
> dat.tm <- tm_map(dat.tm, stripWhitespace)
# stem all words (requires the SnowballC package)
> dat.tm <- tm_map(dat.tm, stemDocument)

Next, we create a document-term matrix, with one twist. Instead of using single words, we use bigrams (ordered two-word pairs). Thus, instead of breaking the sentence "See spot run" into three single words, we make two bigrams, "see spot" and "spot run". Bigrams capture more nuance than single words, which helps the model deal with intensifiers and negation (for example, "not good" carries the opposite valence of "good"):

# create a bigram tokenizer using the RWeka package 
> require(RWeka)
> BigramTokenizer<- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# create the document-term matrix
> datmat<- DocumentTermMatrix(dat.tm, control = list(tokenize = BigramTokenizer))
> dat<- as.matrix(datmat)    
# Add user names as rownames to matrix
> rownames(dat) <- names
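
As a quick check, you can run the tokenizer directly on the example sentence from earlier; it should return the two bigrams "See spot" and "spot run":

# sanity check: tokenize the example sentence into bigrams
> BigramTokenizer("See spot run")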

Only bigrams that are used by a sufficient number of authors are useful. Thus, we need a bit of code to remove bigrams that are uncommon. The following lines create a vector of the column sums (that is, the total number of times each bigram is used) and then tabulate those sums:

> word.usage<- colSums(dat)
> table(word.usage)

There is no hard-and-fast rule about the lower limit on the number or proportion of uses of a bigram necessary to keep it. We recommend starting out with a low number, such as 9, and working upwards from there as shown in the following code:

# first, set all values in the matrix that are greater than 1 to 1, so each
# cell records whether a user used a bigram rather than how many times
> dat[dat>1] <- 1
# recompute the column sums, which now count the users of each bigram
> word.usage <- colSums(dat)
> threshold <- 9  # set a threshold
# find which bigrams are used by more than `threshold` users
> tokeep <- which(word.usage > threshold)
# keep all rows, and only the columns whose sums exceed the threshold
> dat.out <- dat[, tokeep]

The last processing step is to drop users who use a very small number of bigrams. The logic is the same as when we dropped the least common bigrams: users who use different words from all the other users are hard to model. Again, there is a bit of art here. We recommend only keeping users who used at least two bigrams, though you may want to increase this if the documents you are using are larger than tweets, as shown in the following code:

# Drop users with few words....
# find how many zeroes are in each row
> num.zero <- rowSums(dat.out==0)

# explore the data by making a table; this can inform the choice of cutoff
> table(num.zero)
# the number of columns of the document-bigram matrix
> num_cols <- dim(dat.out)[2]
# users must have used at least this many bigrams to be kept
> cutoff <- 2
# create a list of authors to keep
> authors_tokeep <- which(num_cols - num.zero >= cutoff)
# keep only users who used at least 2 bigrams
> dat.drop <- dat.out[authors_tokeep,]
# similarly, drop those users from the vector of hashtag labels
> outcome <- outcome[authors_tokeep]

Finally, we are ready to implement the model. To do so, we'll load the e1071 package, a general-purpose data-mining package. Then, we set up the data so that it is back in data frame format, with the outcome variable converted into a factor. Factors are R's data type for categorical variables, which is what the classifier expects the outcome to be, as shown in the following code:

> require(e1071)
# bind the outcome vector as the first column of dat.drop
> myDat <- cbind(outcome, dat.drop)
# turn the doc-term matrix into a data frame
> myDat <- as.data.frame(myDat)
# turn the outcome variable (first column) into a factor
> myDat[,1] <- as.factor(myDat[,1])
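
Optionally, a quick check confirms that the conversion worked:

# confirm the outcome column is a factor with levels "0" and "1"
> str(myDat[, 1])
> levels(myDat[, 1])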

Finally, a single line of code implements the model. We will save the model as an object called NBmod. The first argument to the naiveBayes function lists the predictors, while the second argument gives the outcome variable. We use a trick to capture all of the columns of myDat except the first one; a negative index in R means "all but this item", as shown in the following code:

# run the model; save the results to an object
> NBmod<- naiveBayes(myDat[,-1], myDat[,1])
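
The fitted object stores the ingredients of the Bayes' rule calculation shown earlier; you can inspect them directly:

# the class distribution of the outcome in the training data
> NBmod$apriori
# the conditional distribution of the first bigram given each class
> NBmod$tables[[1]]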

We should expect our model to perform well on the data on which it was trained, or "in sample". To get a sense of how our model performed, we can make a confusion matrix that compares actual values of the outcome variable to predicted values as follows:

# generate a vector of predictions
# arguments: the fitted model and the predictors
> NBpredictions <- predict(NBmod, myDat[,-1])
# pull out the actual outcomes
> actual<- myDat[,1]
# make the confusion matrix
> table(NBpredictions, actual, dnn=list("predicted", "actual"))

             actual
predicted      0     1
        0    339    24
        1    617   825
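
The percent correctly predicted can be computed from this table as the sum of the main diagonal divided by the total number of observations:

# in-sample percent correctly predicted (PCP)
> confusion <- table(NBpredictions, actual)
> sum(diag(confusion)) / sum(confusion)  # (339 + 825) / 1805, about 0.65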

Elements on the main diagonal of this table (the top-left and bottom-right cells) are correctly predicted. The table shows that, in sample, our percent correctly predicted (PCP) is about 65 percent. The accuracy of the model will be a function of several factors. First, the more training data we use, the more accurate our model will be (in sample). Second, the model's accuracy will increase with the divergence in word-use patterns between the two sentiment groups. Lastly, the larger the documents included in this type of analysis, the better the accuracy of the model. Thus, this model is a bit tenuous for Twitter data and only achieves a modest in-sample accuracy.

The point of this type of model is not to check its accuracy on existing sentiment data. Rather, we want to use the model to predict the sentiment of unlabeled observations. One simple way to accomplish this is to preprocess the unlabeled data along with the labeled data, estimate the model on only the labeled data, and then use the trained model to predict the values of the unlabeled observations. To simulate this, suppose we had preprocessed some data as we did earlier, but with 500 #prolife tweets, 500 #prochoice tweets, and an additional 100 tweets with the hashtag #abortion. Suppose further that we had hand-coded the sentiment of those 100 tweets so that we could check the quality of the predictions. Then, we could estimate the model and predict the values of the 100 unlabeled tweets with the following code:

# run the model on the 1,000 labeled instances
> NBmod <- naiveBayes(myDat[1:1000,-1], myDat[1:1000,1])
# predict outcomes for the last 100 unlabeled instances
> NBpredictions <- predict(NBmod, myDat[1001:1100,-1])
# compare the predictions to the hand-coded labels
> table(NBpredictions, myDat[1001:1100,1], dnn=list("predicted", "actual"))

             actual
predicted      0     1
        0     24     0
        1     27    49
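
As before, the out-of-sample accuracy is the diagonal sum divided by the table total:

# out-of-sample percent correctly predicted
> confusion.test <- table(NBpredictions, myDat[1001:1100,1])
> sum(diag(confusion.test)) / sum(confusion.test)  # (24 + 49) / 100 = 0.73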

Interestingly, this model predicts better on the test set than on the training set (73 percent accuracy). This is a bit of an anomaly; generally, we should not expect test results to be stronger than training results unless, by chance, the test data is better behaved than the training data. Overall, the Naive Bayes classifier is a useful tool for estimating sentiment valence. It is quick to estimate and has reasonable accuracy. However, as we saw in this example, it requires training data with binary labels already assigned.
