Using NLP for sentiment analysis

The approach presented in this section is based on the use case of classifying a high-rate stream of incoming tweets. The task at hand is to extract the sentiments embedded in tweets about a chosen topic. Sentiment classification quantifies the polarity of each tweet in real time and then aggregates the sentiments from all tweets to capture the overall sentiment about the chosen topic. To cope with the challenges posed by the content and behavior of Twitter stream data and to perform the real-time analytics efficiently, we use NLP in the form of a trained classifier. The trained classifier is plugged into the Twitter stream to determine the polarity of each tweet (positive, negative, or neutral); the results are then aggregated to determine the overall polarity of all tweets about a certain topic. Let's see how this is done step by step.

First, we have to train the classifier. To do so, we need an already-prepared dataset that contains historical Twitter data and follows the patterns and trends of the real-time data. Therefore, we use a dataset from the website www.sentiment140.com, which comes with a labeled corpus (a large collection of texts upon which the analysis is based) of over 1.6 million tweets. The tweets within this dataset have been labeled with one of three polarities: zero for negative, two for neutral, and four for positive. In addition to the tweet text, the corpus provides the tweet ID, date, flag, and the user who tweeted. Now let's look at each of the operations that are performed on a live tweet before it reaches the trained classifier:
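As a rough sketch of how such a labeled corpus could be read, the following snippet maps the numeric polarities to names. The column order (polarity, tweet ID, date, flag, user, text) is an assumption about the file layout, the two sample rows are made up for illustration, and read_labeled_tweets is a hypothetical helper name:

```python
import csv
import io

# Polarity codes as described in the text: 0, 2, and 4.
POLARITY_NAMES = {"0": "negative", "2": "neutral", "4": "positive"}

# Assumed column order of the corpus file (no header row).
FIELDS = ["polarity", "tweet_id", "date", "flag", "user", "text"]

# Two made-up rows standing in for the real corpus.
SAMPLE_CSV = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","my flight got cancelled"\n'
    '"4","2","Mon Apr 06 22:20:01 PDT 2009","NO_QUERY","user_b","loving the new album"\n'
)

def read_labeled_tweets(lines):
    """Yield (text, polarity_name) pairs from CSV lines."""
    reader = csv.DictReader(lines, fieldnames=FIELDS)
    for row in reader:
        yield row["text"], POLARITY_NAMES[row["polarity"]]

labeled = list(read_labeled_tweets(io.StringIO(SAMPLE_CSV)))
for text, polarity in labeled:
    print(polarity, "|", text)
```

With the real corpus, an open file handle would take the place of the in-memory sample.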

Tweets are first split into individual words called tokens (tokenization).
The output from tokenization forms a bag-of-words (BoW), which is a collection of the individual words in the text.
These tweets are further filtered by removing numbers, punctuation, and stop words (stop word removal). Stop words are words that are extremely common, such as is, am, are, and the. As they hold no additional information, these words are removed.
Additionally, nonalphabetical characters, such as # and @, and numbers are removed using pattern matching, as they hold no relevance in the case of sentiment analysis. Regular expressions are used to match alphabetical characters only, and the rest are ignored. This helps to reduce the clutter from the Twitter stream.
The outcomes of the prior phase are taken to the stemming phase. In this phase, derived words are reduced to their roots. For example, fishing and fishes share the root fish. For this, we use a standard NLP library, which provides various algorithms, such as Porter stemming.
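The preprocessing steps above can be sketched using only the Python standard library. This is a minimal illustration, not the exact pipeline: the stop word list is a tiny assumed sample, and crude_stem is a naive suffix stripper standing in for a real algorithm such as Porter stemming. All function names are hypothetical:

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"is", "am", "are", "the", "a", "an", "and", "for", "i"}

def tokenize(tweet):
    """Split a tweet into lowercase, whitespace-separated tokens."""
    return tweet.lower().split()

def keep_alphabetic(tokens):
    """Strip non-alphabetic characters from each token and drop empty results."""
    cleaned = (re.sub(r"[^a-z]", "", token) for token in tokens)
    return [token for token in cleaned if token]

def remove_stop_words(tokens):
    """Drop very common words that carry no sentiment information."""
    return [token for token in tokens if token not in STOP_WORDS]

def crude_stem(word):
    """Naive suffix stripping; a stand-in for a real stemmer, not Porter itself."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(tweet):
    """Tokenize, keep alphabetic characters, remove stop words, then stem."""
    tokens = tokenize(tweet)
    tokens = keep_alphabetic(tokens)
    tokens = remove_stop_words(tokens)
    return [crude_stem(token) for token in tokens]

print(preprocess("I am fishing for 3 fishes #fish @lake!"))
```

In practice, the stemming step would be delegated to an NLP library rather than hand-written rules, since simple suffix stripping mishandles many words.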

Once the data is processed, it is converted into a structure called a term-document matrix (TDM). The TDM represents each term in the filtered corpus together with its frequency.
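To make the structure concrete, here is a minimal sketch of a TDM built from already-tokenized documents. It assumes one row per term and one column per document; build_tdm and the sample documents are hypothetical:

```python
from collections import Counter

def build_tdm(documents):
    """Return (vocabulary, matrix), where matrix[i][j] is the number of
    times vocabulary[i] occurs in documents[j]."""
    counts = [Counter(doc) for doc in documents]
    vocabulary = sorted(set().union(*documents))
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix

# Two made-up, already-preprocessed tweets.
docs = [["fish", "fish", "lake"], ["lake", "good"]]
vocabulary, matrix = build_tdm(docs)
for term, row in zip(vocabulary, matrix):
    print(term, row)
```

For a corpus of realistic size, a sparse representation would be preferable, since most terms do not appear in most tweets.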
From the TDM, the tweet reaches the trained classifier (as it is trained, it can process the tweet), which calculates the sentimental polarity importance (SPI) of each word, a number from -5 to +5. The positive or negative sign specifies the type of emotion represented by that particular word, and its magnitude represents the strength of the sentiment. This means that the tweet can be classified as either positive or negative (refer to the following image). Once we calculate the polarity of the individual tweets, we sum their overall SPI to find the aggregated sentiment of the source. For example, an overall polarity greater than one indicates that the aggregated sentiment of the tweets in our observed period of time is positive.
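The scoring and aggregation step can be illustrated with a small sketch. Here, a hand-written dictionary of word scores stands in for the output of the trained classifier; the words, their scores, and the function names are all assumptions made for illustration:

```python
# Hypothetical per-word SPI values in the range -5 to +5. In the approach
# described above, these come from the trained classifier.
SPI = {"love": 4, "great": 3, "good": 2, "bad": -2, "terrible": -4}

def tweet_spi(tokens, lexicon=SPI):
    """Sum the SPI of every known word; unknown words contribute 0."""
    return sum(lexicon.get(token, 0) for token in tokens)

def label(score):
    """Map a signed score to a polarity name using its sign."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def aggregate_spi(tweets, lexicon=SPI):
    """Sum the per-tweet SPI over all tweets in the observed period."""
    return sum(tweet_spi(tokens, lexicon) for tokens in tweets)

# Made-up, already-preprocessed tweets.
tweets = [["love", "movie"], ["terrible", "plot"], ["good", "great"]]
for tokens in tweets:
    print(tokens, tweet_spi(tokens), label(tweet_spi(tokens)))
print("aggregate SPI:", aggregate_spi(tweets))
```

The sign of each per-tweet score gives the polarity of that tweet, while the aggregate sum summarizes the observed period as a whole.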

To retrieve the real-time raw tweets, we use Twitter4J, a Java library that provides a package for the real-time Twitter Streaming API. The API requires the user to register a developer account with Twitter and fill in some authentication parameters. It allows you to either get random tweets or filter tweets using chosen keywords. We used filters to retrieve only the tweets related to our chosen keywords.
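The effect of keyword filtering can be mimicked locally without any credentials. The following sketch does not call Twitter4J or the Twitter API; it applies a case-insensitive keyword match to a simulated stream of made-up tweets, and filter_by_keywords is a hypothetical helper:

```python
def filter_by_keywords(stream, keywords):
    """Yield only the tweets that mention at least one keyword
    (case-insensitive substring match)."""
    wanted = [keyword.lower() for keyword in keywords]
    for tweet in stream:
        text = tweet.lower()
        if any(keyword in text for keyword in wanted):
            yield tweet

# A list stands in for the live stream here.
simulated_stream = ["I love Python", "rainy day", "python 3 is out"]
matching = list(filter_by_keywords(simulated_stream, ["python"]))
print(matching)
```

Because the function is a generator, it processes tweets one at a time as they arrive rather than waiting for the whole stream.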

The overall architecture is shown in the following figure:

Sentiment analysis has various applications. It can be applied to classify feedback from customers. Governments can use social media polarity analysis to gauge the effectiveness of their policies. It can also quantify the success of various advertising campaigns.

In the next section, we will learn how we can actually apply sentiment analysis to predict the sentiments of movie reviews.
