Building a text sentiment classifier with GloVe word embeddings

Pennington et al. at Stanford University developed Global Vectors for Word Representation (GloVe), an extension of the word2vec method, for learning word vectors efficiently.

GloVe combines the global statistics of matrix factorization techniques, such as LSA, with the local context-based learning used in word2vec. However, rather than relying on a sliding window over local context alone, GloVe builds an explicit word-to-word co-occurrence matrix from statistics computed across the whole text corpus. As a result, the learning model generally yields better word embeddings.
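
For background, the weighted least-squares objective that GloVe minimizes over the entries of the co-occurrence matrix can be written as follows (notation follows the original Pennington et al. paper):

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2, \qquad f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}

Here, X_{ij} counts how often word j appears in the context of word i, w_i and \tilde{w}_j are the main and context word vectors, b_i and \tilde{b}_j are bias terms, and x_{\max} caps the weight given to very frequent co-occurrences; this is the same x_max argument we pass when training the embedding in the code that follows.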

The text2vec library in R provides a GloVe implementation that we can use to train word embeddings on our own training corpus. Alternatively, pretrained GloVe word embeddings can be downloaded and reused, just as we did with the pretrained word2vec embeddings in the previous section.
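
As a quick illustration of the second option, the following sketch shows one way to load a pretrained GloVe file into a data frame with a words column followed by the embedding columns (the same layout we build for our own embedding later). The file name glove.6B.50d.txt and its location are assumptions; the file has to be downloaded separately from the GloVe project page:

# a minimal sketch (file name and path are assumptions): loading pretrained
# 50-dimensional GloVe vectors from https://nlp.stanford.edu/projects/glove/
# each line of the file is a word followed by 50 space-separated numbers
pretrained = read.table('glove.6B.50d.txt', sep = " ", quote = "",
                        comment.char = "", stringsAsFactors = FALSE)
# name the first column "words" and the remaining columns X1..X50
colnames(pretrained) = c("words", paste0("X", 1:50))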

The following code block demonstrates how GloVe word embeddings can be created and used for sentiment analysis or, for that matter, any text classification task. We will not discuss the individual steps explicitly, since the code is heavily commented with detailed explanations of each step:

# including the required library
library(text2vec)
# setting the working directory
setwd('/home/sunil/Desktop/sentiment_analysis/')
# reading the dataset
text = read.csv(file='Sentiment Analysis Dataset.csv', header = TRUE)
# subsetting only the review text so as to create the GloVe word embeddings
wiki = as.character(text$SentimentText)
# Create iterator over tokens
tokens = space_tokenizer(wiki)
# Create vocabulary. Terms will be unigrams (simple words).
it = itoken(tokens, progressbar = FALSE)
vocab = create_vocabulary(it)
# consider a term in the vocabulary only if it has appeared at least three times in the dataset
vocab = prune_vocabulary(vocab, term_count_min = 3L)
# Use the filtered vocabulary
vectorizer = vocab_vectorizer(vocab)
# use a window of 5 for context words and create a term co-occurrence matrix
tcm = create_tcm(it, vectorizer, skip_grams_window = 5L)
# create the GloVe embedding for each word in the vocab;
# the dimension of the word embedding is set to 50
# x_max is the maximum number of co-occurrences to use in the weighting
# function
# note that training the word embedding is time consuming - be patient
glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 100)
wv_main = glove$fit_transform(tcm, n_iter = 10, convergence_tol = 0.01)

This will result in the following output:

INFO [2018-10-30 06:58:14] 2018-10-30 06:58:14 - epoch 1, expected cost 0.0231
INFO [2018-10-30 06:58:15] 2018-10-30 06:58:15 - epoch 2, expected cost 0.0139
INFO [2018-10-30 06:58:15] 2018-10-30 06:58:15 - epoch 3, expected cost 0.0114
INFO [2018-10-30 06:58:15] 2018-10-30 06:58:15 - epoch 4, expected cost 0.0100
INFO [2018-10-30 06:58:15] 2018-10-30 06:58:15 - epoch 5, expected cost 0.0091
INFO [2018-10-30 06:58:15] 2018-10-30 06:58:15 - epoch 6, expected cost 0.0084
INFO [2018-10-30 06:58:16] 2018-10-30 06:58:16 - epoch 7, expected cost 0.0079
INFO [2018-10-30 06:58:16] 2018-10-30 06:58:16 - epoch 8, expected cost 0.0074
INFO [2018-10-30 06:58:16] 2018-10-30 06:58:16 - epoch 9, expected cost 0.0071
INFO [2018-10-30 06:58:16] 2018-10-30 06:58:16 - epoch 10, expected cost 0.0068

The following code uses the GloVe model to obtain the combined word vectors:

# the GloVe model learns two sets of word vectors - main and context
# both matrices may be added to get the combined word vector
wv_context = glove$components
word_vectors = wv_main + t(wv_context)
# converting word_vectors to a dataframe for visualization
word_vectors=data.frame(word_vectors)
# the word for each embedding is set as the row name by default
# using the tibble library's rownames_to_column function, the row names are
# copied into the first column of the dataframe, which we name "words"
library(tibble)
word_vectors=rownames_to_column(word_vectors, var = "words")
View(word_vectors)

This displays the word_vectors dataframe in the viewer; each row contains a word in the first column, followed by its 50-dimensional embedding.
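
Before moving on, it can be worth sanity-checking the learned vectors. The following optional snippet uses text2vec's sim2() function to list the words closest (by cosine similarity) to a query word; the word good is only an illustrative choice and must exist in the pruned vocabulary:

# optional sanity check: cosine similarity against an illustrative query word
# ("good" is only an example; it must be present in the pruned vocabulary)
vec_mat = as.matrix(word_vectors[, -1])
rownames(vec_mat) = word_vectors$words
query = vec_mat["good", , drop = FALSE]
cos_sim = sim2(x = vec_mat, y = query, method = "cosine", norm = "l2")
# show the six most similar words (the query itself will rank first)
head(sort(cos_sim[, 1], decreasing = TRUE), 6)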

We make use of the softmaxreg library to obtain the mean word vector for each review. This is similar to what we did with the word2vec pretrained embedding in the previous section. Observe that we are passing our own trained word embedding, word_vectors, to the wordEmbed() function, as follows:

library(softmaxreg)
docVectors = function(x)
{
wordEmbed(x, word_vectors, meanVec = TRUE)
}
# applying the docVectors function to the entire reviews dataset
# this will result in a word embedding representation of the entire reviews
# dataset
temp=t(sapply(text$SentimentText, docVectors))
View(temp)

This displays the temp matrix in the viewer, with one row per review containing that review's mean word vector.
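
For clarity, the following is a rough equivalent of what the mean-vector step does, written without softmaxreg. It is only a sketch of the idea (simple whitespace tokenization, unknown words dropped), not the library's actual implementation:

# a simplified sketch of averaging word vectors for one review
# (not softmaxreg's implementation; whitespace tokenization, unknown words dropped)
mean_vector = function(review, embedding)
{
  tokens = unlist(strsplit(as.character(review), " "))
  vecs = embedding[embedding$words %in% tokens, -1]
  if (nrow(vecs) == 0) return(rep(0, ncol(embedding) - 1))
  colMeans(vecs)
}
# example usage: mean_vector(text$SentimentText[1], word_vectors)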

We will now split the dataset into train and test portions and use the randomForest library to build a model on the training data, as shown in the following lines of code:

# splitting the dataset into train and test portions
temp_train=temp[1:800,]
temp_test=temp[801:1000,]
labels_train=as.factor(as.character(text[1:800,]$Sentiment))
labels_test=as.factor(as.character(text[801:1000,]$Sentiment))
# using randomforest to build a model on train data
library(randomForest)
rf_senti_classifier = randomForest(temp_train, labels_train, ntree = 20)
print(rf_senti_classifier)

This will result in the following output:

Call:
 randomForest(x = temp_train, y = labels_train, ntree = 20)
               Type of random forest: classification
                     Number of trees: 20
No. of variables tried at each split: 7

        OOB estimate of error rate: 42.12%
Confusion matrix:
    1   2 class.error
1 250 160   0.3902439
2 177 213   0.4538462

Then, we use the random forest model we created to predict the labels, as follows:

# predicting labels using the randomforest model created
rf_predicts<-predict(rf_senti_classifier, temp_test)
# estimating the accuracy from the predictions
library(rminer)
print(mmetric(rf_predicts, labels_test, c("ACC")))

This will result in the following output:

[1] 66.5

With this method, we obtain an accuracy of 66.5%, despite the fact that the word embeddings are learned from just 1,000 text samples. The model may be improved further by using a pretrained embedding. The overall framework for using a pretrained embedding remains the same as in the word2vec project in the previous section, as sketched below.
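
As a sketch of that swap (assuming a pretrained embedding has been loaded into a data frame named pretrained with the same words-plus-vector-columns layout, as illustrated earlier), only the dictionary passed to wordEmbed() changes; every step after that stays the same:

# hypothetical swap: pass a pretrained embedding data frame ("pretrained",
# an assumed name) to wordEmbed() instead of our own word_vectors
docVectorsPretrained = function(x)
{
  wordEmbed(x, pretrained, meanVec = TRUE)
}
temp = t(sapply(text$SentimentText, docVectorsPretrained))
# the train/test split, randomForest training, and evaluation steps are unchanged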
