Building a text sentiment classifier with fastText

fastText is a library for word representation and text classification, created by the Facebook Research Team in 2016 as an extension of word2vec. While the word2vec and GloVe approaches treat each word as the smallest unit to train on, fastText breaks words into several character n-grams, that is, subwords. For example, the trigrams for the word apple are app, ppl, and ple, and the embedding for apple is the sum of the vectors of all its n-grams (see the short sketch after the following list). Because of this subword-based embedding generation, fastText is more resource-intensive than word2vec and takes additional time to train. Some of the advantages of fastText are as follows:

  • It generates better word embeddings for rare words (including misspelled words).
  • For out-of-vocabulary words, fastText can construct a word's vector from its character n-grams even if the word doesn't appear in the training corpus. This is not possible with either word2vec or GloVe.
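To make the subword idea concrete, here is a minimal sketch (not part of the project code) of how such character trigrams can be extracted in R; note that the actual fastText implementation additionally pads each word with boundary markers, producing n-grams such as <ap and le>:

# a minimal sketch of fastText-style character trigram extraction
# (illustrative only; not part of the project code)
char_ngrams <- function(word, n = 3) {
  sapply(seq_len(nchar(word) - n + 1), function(i) substr(word, i, i + n - 1))
}
char_ngrams("apple")
# [1] "app" "ppl" "ple"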

The fastTextR library provides an interface to fastText. Let's make use of it for our project to build a sentiment analysis engine on Amazon reviews. While it is possible to download pretrained fastText word embeddings and use them for our project, let's attempt to train word embeddings on the reviews dataset we have in hand. It should be noted that using fastText pretrained word embeddings follows the same approach as the word2vec-based project we dealt with earlier.

Similar to the project covered in the previous section, comments are included inline in the code; they explain each line and the approach taken to build the Amazon reviews sentiment analyzer. Let's look at the following code now:

# loading the required library
library(fastTextR)
# setting the working directory
setwd('/home/sunil/Desktop/sentiment_analysis/')
# reading the input reviews file
# recollect that fastText needs the file in a specific format and we created
# a compatible file in the "Understanding the Amazon Reviews Dataset" section
# of this chapter
text = readLines("Sentiment Analysis Dataset_ft.txt")
# viewing the text vector for confirmation
View(text)

This will result in the following output:
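Since each record in this file drives both training and evaluation, it is worth confirming the format programmatically as well. Based on the labels used later in this section, each line should start with a __label__ token followed by the review text; the sample output in the comment below is a hypothetical illustration, not an actual record from the dataset:

# quick sanity check: every line should begin with a __label__ token,
# which is the format fastText's supervised mode expects
# (the output shown is a made-up example, not from the actual dataset)
head(text, 1)
# [1] "__label__2 Great product, works exactly as described"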

Now let's divide the reviews into training and test datasets, and view them using the following lines of code:

# dividing the reviews into training and test datasets
temp_train = text[1:800]
temp_test = text[801:1000]
# viewing the train dataset for confirmation
View(temp_train)

This will give the following output:

Use the following code to view the test dataset:

View(temp_test)

This will give the following output:

We will now create .txt files for the train and test datasets using the following code:

# creating txt file for train and test dataset
# the fasttext function expects files to be passed for training and testing
fileConn<-file("/home/sunil/Desktop/sentiment_analysis/train.ft.txt")
writeLines(temp_train, fileConn)
close(fileConn)
fileConn<-file("/home/sunil/Desktop/sentiment_analysis/test.ft.txt")
writeLines(temp_test, fileConn)
close(fileConn)
# creating a test file with no labels
# recollect that the original test dataset has labels in it,
# as it is just a subset obtained from the full dataset
temp_test_nolabel<- gsub("__label__1", "", temp_test, perl=TRUE)
temp_test_nolabel<- gsub("__label__2", "", temp_test_nolabel, perl=TRUE)
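To see exactly what this substitution does, here is an illustrative one-liner on a made-up review line; note that removing the label token leaves a harmless leading space:

# illustrative example on a made-up review line (not project data):
# the label token is removed, leaving just the review text
gsub("__label__2", "", "__label__2 Great product", perl = TRUE)
# [1] " Great product"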

Now we will view the no-labels test dataset for confirmation using the following command:

View(temp_test_nolabel)

This will result in the following output:

Let's now write the no-labels test dataset to a file so that we can use it for testing, as follows:

fileConn<-file("/home/sunil/Desktop/sentiment_analysis/test_nolabel.ft.txt")
writeLines(temp_test_nolabel, fileConn)
close(fileConn)
# training a supervised classification model with training dataset file
model<-fasttext("/home/sunil/Desktop/sentiment_analysis/train.ft.txt",
method = "supervised", control = ft.control(nthreads = 3L))
# obtaining all the words from the previously trained model
words<-get_words(model)
# viewing the words for confirmation; these are the words
# present in our training data
View(words)

This will result in the following output:

Now we will obtain the word vectors from the trained model and view the vector for each word in our training dataset, as follows:

# Obtain word vectors from a previously trained model.
word_vec<-get_word_vectors(model, words)
# Viewing the word vectors for each word in our training dataset
# observe that the word embedding dimension is 5
View(word_vec)

This will result in the following output:
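Beyond inspecting the matrix visually, the embeddings can also be compared directly. The following is a minimal sketch (not from the original project) that computes the cosine similarity between two word vectors; it assumes the rows of word_vec line up with the words vector, and good and great are hypothetical examples that may not occur in your vocabulary:

# a minimal sketch (not project code): cosine similarity between two
# word embeddings; "good" and "great" are hypothetical vocabulary entries
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
v1 <- word_vec[match("good", words), ]
v2 <- word_vec[match("great", words), ]
cosine_sim(v1, v2)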

We will now predict the labels for the reviews in the no-labels test dataset and write them to a file for future reference. Then we will load the predictions into a data frame, compute the performance, and get an estimate of the accuracy using the following lines of code:

# predicting the labels for the reviews in the no-labels test dataset
# and writing them to a file for future reference
predict(model,
        newdata_file = "/home/sunil/Desktop/sentiment_analysis/test_nolabel.ft.txt",
        result_file = "/home/sunil/Desktop/sentiment_analysis/fasttext_result.txt")
# getting the predictions into a data frame so as to compute
# performance measurements
ft_preds<-predict(model,
        newdata_file = "/home/sunil/Desktop/sentiment_analysis/test_nolabel.ft.txt")
# reading the test file to extract the actual labels
reviewstestfile<-readLines("/home/sunil/Desktop/sentiment_analysis/test.ft.txt")
# extracting just the labels from each line
library(stringi)
actlabels<-stri_extract_first(reviewstestfile, regex="\\w+")
# converting the actual labels and predicted labels into factors
actlabels<-as.factor(as.character(actlabels))
ft_preds<-as.factor(as.character(ft_preds))
# getting the estimate of the accuracy
library(rminer)
print(mmetric(actlabels, ft_preds, c("ACC")))

This will result in the following output:

[1] 58
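The output above is a single overall number. As a hedged follow-up (not part of the original project, and assuming rminer's "CONF" metric, which returns a confusion matrix), we could also inspect how the predictions split across the two label classes:

# hedged follow-up: rminer's "CONF" metric returns the confusion matrix,
# showing the per-class breakdown behind the overall accuracy
print(mmetric(actlabels, ft_preds, c("CONF")))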

We have 58% accuracy with the fastText method on our reviews data. As a next step, we could check whether the accuracy can be further improved by making use of pretrained fastText word embeddings. As we already know, implementing a project with pretrained embeddings is not very different from the word2vec project described earlier in this chapter; the difference is just that the training step to obtain word embeddings is dropped, and the model variable in this project's code is initialized with the pretrained word embeddings.
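As a rough sketch of that direction (this is not code from the project), the publicly released .vec files are plain text, a header line with the vocabulary size and dimension followed by one word and its vector per line, so they can be read with base R. The file name below refers to one of Facebook's published 300-dimensional English models and is used purely for illustration; for large files you would read only a subset of lines:

# rough sketch (assumptions noted above): reading a pretrained fastText
# .vec file with base R; "cc.en.300.vec" is one of Facebook's published
# English models and is used here only for illustration
vec_lines <- readLines("cc.en.300.vec")[-1]  # drop the "vocab_size dim" header
tokens <- strsplit(vec_lines, " ", fixed = TRUE)
pretrained <- t(vapply(tokens, function(x) as.numeric(x[1 + seq_len(300)]),
                       numeric(300)))
rownames(pretrained) <- vapply(tokens, `[[`, character(1), 1)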
