Building a text sentiment classifier with pretrained word2vec word embeddings based on the Reuters news corpus

Word2vec was developed by Tomas Mikolov et al. at Google in 2013 to make neural-network-based training of word embeddings more efficient, and it has since become the de facto standard for developing pretrained word embeddings.

Word2vec introduced the following two learning models for training word embeddings:

  • Continuous Bag-of-Words (CBOW): Learns the embedding by predicting the current word from its surrounding context.
  • Continuous Skip-Gram: Learns the embedding by predicting the surrounding words given the current word.

Both the CBOW and Skip-Gram approaches learn a word's representation from its local usage context, where the context is defined by a window of neighboring words. The window size is a configurable parameter of the model, as the short sketch below illustrates.
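
Here is a minimal illustrative sketch in R, using a made-up sentence and a window size of 2, of the (target word, context word) pairs such a window produces; CBOW would predict the target from its context words, while Skip-Gram would predict the context words from the target:

# illustrative only: enumerate (target, context) pairs for a window of size 2
sentence = c("the", "cat", "sat", "on", "the", "mat")
window = 2
for (i in seq_along(sentence)) {
  # indices of the neighboring words inside the window, excluding the target itself
  context_idx = setdiff(max(1, i - window):min(length(sentence), i + window), i)
  for (j in context_idx) {
    cat(sprintf("target = %-5s context = %s\n", sentence[i], sentence[j]))
  }
}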

The softmaxreg library in R ships a pretrained word2vec word embedding that we can use to build our sentiment analysis engine for the Amazon reviews data. The pretrained vectors were trained with the word2vec model on the Reuter_50_50 dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Reuter_50_50).

Without further delay, let's get into the code and review the approach it follows:

# including the required library
library(softmaxreg)
# importing the word2vec pretrained vector into memory
data(word2vec)

Let's examine the word2vec pretrained embeddings. It is just another data frame, and it can therefore be reviewed with the regular dim and View commands, as follows:

View(word2vec)

This displays the word2vec data frame in the data viewer, with one row per word: a column for the word itself and its embedding dimensions.

Next, let's check its dimensions with the dim command:

dim(word2vec)

This will result in the following output:

[1] 12853 21

From the preceding output, we can observe that the pretrained embedding contains word vectors for 12,853 words (the 21 columns are one column for the word itself plus 20 dimension columns). Each word is represented by 20 dimensions, and these dimensions capture the context of the word. In the next step, we look up the word vector for each word in the review comments. Because the pretrained embedding covers only 12,853 words, we may encounter a word that does not exist in it; such an unidentified word is represented by a 20-dimensional vector filled with zeros.
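
As a rough sketch of this lookup logic, assuming the first column of the word2vec data frame holds the word and the remaining 20 columns hold the dimensions (lookupWord is a hypothetical helper, not part of softmaxreg):

# illustrative lookup: return the 20-dimensional vector for a word,
# or a vector of zeros if the word is not in the pretrained embedding
lookupWord = function(w, embedding)
{
  row = embedding[embedding[, 1] == w, ]
  if (nrow(row) == 0) {
    # word not found: fall back to a zero vector of the embedding width
    return(rep(0, ncol(embedding) - 1))
  }
  as.numeric(unlist(row[1, -1]))
}
lookupWord("bright", word2vec)   # known word: its 20 dimension values
lookupWord("qwertyuiop", word2vec)   # unknown word: 20 zeros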

We also need to understand that word vectors are available only at the word level, so to represent an entire review we take the mean of the word vectors of all the words that make up the review. Let's review the concept of deriving a sentence vector from individual word vectors with an example.

Assume the sentence we want to get the word vector for is, it is very bright and sunny this morning. Individual words that comprise the sentence are it, is, very, bright, and, sunny, this, and morning.

Now, we can look up each of these words in the pretrained vector and get the corresponding word vectors as shown in the following table:

Word      dim1    dim2    dim3    ...     ...     dim19   dim20
it        -2.25    0.75    1.75   -1.25   -0.25   -3.25   -2.25
is         0.75    1.75    1.75   -2.25   -2.25    0.75   -0.25
very      -2.25    2.75    1.75   -0.25    0.75    0.75   -2.25
bright    -3.25   -3.25   -2.25   -1.25    0.75    1.75   -0.25
and       -0.25   -1.25   -2.25    2.75   -3.25   -0.25    1.75
sunny      0       0       0       0       0       0       0
this      -2.25   -3.25    2.75    0.75   -0.25   -0.25   -0.25
morning   -0.25   -3.25   -2.25    1.75    0.75    2.75    2.75

Now we have the word vectors that make up the sentence. Please note that these are not actual word vector values; they are made up to demonstrate the approach. Also observe that the word sunny is represented by zeros across all dimensions, indicating that it was not found in the pretrained word embedding. To get the vector for the sentence, we simply compute the mean of each dimension. The resulting 1 x 20 vector represents the sentence, as follows:

Sentence  -1.21875  -0.71875   0.15625   0.03125  -0.46875   0.28125  -0.09375
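
To make the averaging concrete, here is a small sketch that reproduces the sentence vector from the made-up table above (only the seven displayed columns are used; the real vectors have 20 dimensions):

# illustrative only: the sentence vector is the element-wise mean of its word vectors
word_vectors = rbind(
  it      = c(-2.25,  0.75,  1.75, -1.25, -0.25, -3.25, -2.25),
  is      = c( 0.75,  1.75,  1.75, -2.25, -2.25,  0.75, -0.25),
  very    = c(-2.25,  2.75,  1.75, -0.25,  0.75,  0.75, -2.25),
  bright  = c(-3.25, -3.25, -2.25, -1.25,  0.75,  1.75, -0.25),
  and     = c(-0.25, -1.25, -2.25,  2.75, -3.25, -0.25,  1.75),
  sunny   = c( 0,     0,     0,     0,     0,     0,     0   ),
  this    = c(-2.25, -3.25,  2.75,  0.75, -0.25, -0.25, -0.25),
  morning = c(-0.25, -3.25, -2.25,  1.75,  0.75,  2.75,  2.75)
)
# column-wise mean over the words gives the sentence vector
colMeans(word_vectors)
# -1.21875 -0.71875  0.15625  0.03125 -0.46875  0.28125 -0.09375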

The softmaxreg library offers the wordEmbed function, to which we can pass a sentence and have it compute the mean word vector for that sentence. The following code defines a custom function that applies wordEmbed to each of the Amazon reviews we have in hand. After applying this function to the reviews dataset, we expect an n x 20 matrix as the word vector representation of our reviews, where n is the number of reviews and 20 is the number of dimensions through which each review is represented, as seen in the following code:

# function to get the mean word vector for each review
docVectors = function(x)
{
  wordEmbed(x, word2vec, meanVec = TRUE)
}
# setting the working directory and reading the reviews dataset
setwd('/home/sunil/Desktop/sentiment_analysis/')
text = read.csv(file = 'Sentiment Analysis Dataset.csv', header = TRUE)
# applying the docVectors function to each of the reviews
# and storing the resulting matrix of word vectors as temp
temp = t(sapply(text$SentimentText, docVectors))
# visualizing the word vectors output
View(temp)

This displays the temp matrix in the data viewer, with one row of 20 values per review.

Then we review temp using the dim command, as follows:

dim(temp)

This will result in the following output:

[1] 1000   20

We can see from the output that word vectors have been created for each of the 1,000 reviews in our corpus. This matrix can now be used to build classification models with an ML algorithm. The following classification code is no different from the one we used for the BoW approach:

# splitting the dataset into train and test
temp_train = temp[1:800, ]
temp_test = temp[801:1000, ]
labels_train = as.factor(as.character(text[1:800, ]$Sentiment))
labels_test = as.factor(as.character(text[801:1000, ]$Sentiment))
# including the random forest library
library(randomForest)
# training a model with the random forest classifier on the training dataset
# observe that we are using 20 trees to create the model
rf_senti_classifier = randomForest(temp_train, labels_train, ntree = 20)
print(rf_senti_classifier)

This will result in the following output:

randomForest(x = temp_train, y = labels_train, ntree = 20)
               Type of random forest: classification
                     Number of trees: 20
No. of variables tried at each split: 4

        OOB estimate of error rate: 44.25%
Confusion matrix:
    1   2 class.error
1 238 172   0.4195122
2 182 208   0.4666667

The preceding output shows that the Random Forest model object was created successfully. Of course, the model can be improved further; however, we will not do that here, as the focus is on demonstrating the use of word embeddings rather than building the best-performing classifier.
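
If you do want to experiment, here is a minimal sketch of one possible direction, not part of the original exercise: grow more trees and let tuneRF (from the randomForest package) search for a better mtry using the OOB error; the ntree value, step factor, and improvement threshold below are arbitrary choices:

# one illustrative way to nudge the model: more trees and a tuned mtry
set.seed(2019)  # arbitrary seed, for reproducibility of the sketch
# tuneRF searches over mtry values using the OOB error as the criterion
tuned_mtry = tuneRF(temp_train, labels_train, ntreeTry = 100,
                    stepFactor = 1.5, improve = 0.01,
                    trace = FALSE, plot = FALSE)
# retrain with more trees and the mtry that gave the lowest OOB error
best_mtry = tuned_mtry[which.min(tuned_mtry[, 2]), 1]
rf_senti_classifier2 = randomForest(temp_train, labels_train,
                                    ntree = 100, mtry = best_mtry)
print(rf_senti_classifier2)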

Next, with the following code we make use of the Random Forest model to make predictions on the test data and then report out the performance:

# making predictions on the test dataset
rf_predicts <- predict(rf_senti_classifier, temp_test)
# including the rminer library to compute the accuracy metric
library(rminer)
print(mmetric(rf_predicts, labels_test, c("ACC")))

This will result in the following output:

[1] 62.5

We see that we get 62.5% accuracy using the pretrained word2vec embeddings built from the Reuters news corpus (the Reuter_50_50 dataset).
