Sentiment analysis algorithms

Suppose we want to broadly classify the sentiment of a text as positive or negative. We might choose to model the opinion mining task as a classification problem, one that could be solved with supervised machine learning techniques such as a Naïve Bayes classifier (NBC). Given a set of positive text features and negative text features, an NBC strategy will allow us to take a new text and classify it as being more positive or more negative given the observations about other similar texts we have made in the past. The machine learning literature is replete with examples of supervised classification, and it is a very reliable approach for certain types of problems.
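As a sketch of the idea, here is a minimal Naïve Bayes classifier over bag-of-words features, written from scratch in plain Python. The training reviews and word counts are invented for illustration; a real system would train on thousands of labeled examples.

```python
from collections import Counter
from math import log

# Toy labeled training data (illustrative only).
train = [
    ("a wonderful touching film", "pos"),
    ("brilliant acting and a great script", "pos"),
    ("a dull boring mess", "neg"),
    ("terrible script and awful acting", "neg"),
]

# Count how often each word appears under each label.
word_counts = {"pos": Counter(), "neg": Counter()}
label_counts = Counter()
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = set(word_counts["pos"]) | set(word_counts["neg"])

def classify(text):
    """Return the label with the highest log-probability."""
    best_label, best_score = None, float("-inf")
    total_docs = sum(label_counts.values())
    for label in label_counts:
        score = log(label_counts[label] / total_docs)  # prior P(label)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            # Add-one (Laplace) smoothed likelihood P(word | label).
            score += log((word_counts[label][word] + 1) /
                         (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("a brilliant film"))     # pos
print(classify("boring and terrible"))  # neg
```

The Laplace smoothing step is what lets the classifier cope with words it never saw during training, rather than assigning them zero probability.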

The trick of course with this type of classification scheme is being able to count on the observations we have made in the past as reliable indicators of future observations. These training examples are critically important and are the basis for the success of the entire scheme. After all, if we choose training examples that are too generic or irrelevant to our domain, the system will not be able to correctly classify future cases. On the other hand, if we train a system too closely, we risk overfitting, or generating a classifier that only works with one specific set of data.

In the case of sentiment analysis of movie reviews, for example, suppose the training examples are 1,000 old movie reviews that have been pre-divided into roughly equal groups of positive and negative reviews. An NBC approach would learn which features were most important about these original reviews and look for these same features in the new reviews, giving each new review a percentage chance of being positive or negative. Some feature engineering is usually expected with a supervised classification scheme as well: for example, directing the classifier to take into account the particular language of the domain in question, or to pay more attention to parts of speech, word positions, or the frequency of certain words or phrases. Earlier we mentioned that the word unpredictable might be a positive word in movie reviews, so we could explicitly engineer that feature accordingly.
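A feature extractor encoding this kind of domain knowledge might look like the following sketch. The feature names and the unpredictable rule are illustrative choices, not part of any standard toolkit.

```python
def movie_review_features(text):
    """Hand-engineered features for a movie-review classifier (illustrative)."""
    words = text.lower().split()
    features = {}
    # Presence features for each word in the review.
    for word in words:
        features[f"has({word})"] = True
    # Domain knowledge: 'unpredictable' is usually praise in movie reviews,
    # even though general-purpose lexicons tend to list it as negative.
    features["domain_positive(unpredictable)"] = "unpredictable" in words
    # Word position can matter: reviewers often sum up their verdict at the end.
    features["last_word"] = words[-1] if words else ""
    return features

print(movie_review_features("An unpredictable delight"))
```

Dictionaries of this shape are the input format that NLTK's built-in classifiers, such as NaiveBayesClassifier, expect for each training or test example.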

Some of the general-purpose tools available for sentiment mining already have the classification scheme baked in, so they are ready for us to feed in sentences. In our example at the end of this chapter, we use some of the built-in sentiment classifiers that have been pre-trained with positive and negative words. In the next section, we will take a look at some of these pre-collected and pre-classified sentiment collections.

General-purpose data collections

The sentiment classification systems described previously rely on training examples in order to learn what kind of text is positive or negative. These training examples can be manually classified (a task sometimes called coding) by humans into positive and negative, or whatever the intended classes are. Once coded, the training set can be reused over and over again to train other classifiers. Therefore, this training process and the resulting classified word lists are very important to the overall sentiment analysis process as they save a lot of time and also let us compare results to each other. But where do the training examples come from? There are many lists of sentiment-classified words online, and even some canned datasets that we can use to test a classifier. Three of the more commonly used word lists, or lexicons, are described here.

Hu and Liu's sentiment analysis lexicon

Minqing Hu and Bing Liu's list of 6,800 words is one of the first sentiment lexicons that was made available for public use. The lexicon is still available on Liu's university website at https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon. This site also has a handful of datasets of product reviews that can be used to test your classifier. When uncompressed from RAR format, the word list is divided into two text files, positive.txt and negative.txt. Each file contains a simple list of words. There are about 4,800 negative words provided and the remaining 2,000 words appear in the positive list. The first 10 positive words are shown in the following example:

a+
abound
abounds
abundance
abundant
accessable
accessible
acclaim
acclaimed
acclamation

The authors point out that they have included many words that are commonly misspelled on social media; for example, the misspelling accessable appears in the preceding list for this reason. In contrast to the next two lexicon samples, there is no scale or ranking for the sentiment of each word. Words are simply listed as either positive or negative, without gradations or comparisons to each other.
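Assuming the two files have been downloaded and uncompressed, a minimal lexicon-based scorer might look like the following sketch. The file paths, the encoding, the comment-skipping rule, and the sample words are assumptions for illustration.

```python
def load_lexicon(path):
    """Read one word per line, skipping blank lines and any comment
    header lines (assumed here to start with ';')."""
    with open(path, encoding="latin-1") as f:  # files may not be UTF-8
        return {line.strip() for line in f
                if line.strip() and not line.startswith(";")}

def lexicon_polarity(text, positive_words, negative_words):
    """Classify by counting lexicon hits: more positive hits than
    negative means 'positive', fewer means 'negative', ties 'neutral'."""
    words = text.lower().split()
    score = (sum(w in positive_words for w in words) -
             sum(w in negative_words for w in words))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# With the downloaded files, we would load them like this:
# positive = load_lexicon("positive.txt")
# negative = load_lexicon("negative.txt")

# Tiny inline stand-ins for demonstration:
pos = {"abound", "abundant", "acclaimed"}
neg = {"abysmal", "awful"}
print(lexicon_polarity("acclaimed and abundant talent", pos, neg))  # positive
```

Note that a pure word-counting scheme like this ignores context entirely; "not acclaimed" would still count as a positive hit.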

SentiWordNet

SentiWordNet is another list of words that have been coded as positive or negative. The file is available for download; as of this writing, it is a gzipped file of about 13 MB. It is available at: http://sentiwordnet.isti.cnr.it. There are approximately 117,000 words, scores, and definitions (called glosses) in the file.

The header row and first two entries look like this:

# POS  ID  PosScore  NegScore  SynsetTerms  Gloss

a  00001740  0.125  0  able#1  (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"

a  00002098  0  0.75  unable#1  (usually followed by `to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"

The first word in the file shown previously is able. This word is given a unique number, 00001740, and is marked a for its POS, or part of speech, indicating an adjective. This word has a positive score of .125 and a negative score of 0. The second word is unable, which has a positive score of 0 and a negative score of .75. The SentiWordNet documentation also explains how to calculate an objectivity score for each word. The objectivity score is the sum of the positive and negative scores for a word, subtracted from 1. This score measures how neutral the word is. Words that have low scores for both positivity and negativity will end up scoring high on objectivity.
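The objectivity calculation can be sketched as follows. The parser assumes the tab-separated layout shown previously (the glosses are abbreviated here to keep the example short).

```python
def parse_sentiwordnet_line(line):
    """Split a tab-separated SentiWordNet entry and compute its
    objectivity score: 1 - (PosScore + NegScore)."""
    pos_tag, synset_id, pos_score, neg_score, terms, gloss = line.split("\t")
    pos_score, neg_score = float(pos_score), float(neg_score)
    return {"pos": pos_tag, "id": synset_id, "terms": terms,
            "positive": pos_score, "negative": neg_score,
            "objectivity": 1 - (pos_score + neg_score)}

able = parse_sentiwordnet_line(
    "a\t00001740\t0.125\t0\table#1\t(usually followed by `to') ...")
print(able["objectivity"])  # 0.875

unable = parse_sentiwordnet_line(
    "a\t00002098\t0\t0.75\tunable#1\t(usually followed by `to') ...")
print(unable["objectivity"])  # 0.25
```

So able, with modest positive and zero negative scores, comes out fairly objective (0.875), while unable, with a strong negative score, is much less so (0.25).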

Vader sentiment

The Vader sentiment tool, created by C.J. Hutto and Eric Gilbert at Georgia Tech, includes a lexicon and many test files. Vader is especially tuned for social media data, and in particular microblogging data such as tweets. As such, it includes many emoticons, such as :-), and acronyms, such as lol and wtf. The project is available on GitHub at https://github.com/cjhutto/vaderSentiment and the specific lexicon is available inside that project at the following URL: https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_sentiment_lexicon.txt.

Vader is also one of the built-in classifiers that comes with the Python Natural Language ToolKit (NLTK) and is therefore found in many sample projects online. When we look inside the vader_sentiment_lexicon.txt file, we see that a typical set of lines looks like this:

burdens   -1.5  0.5    [-2, -2, -1, -1, -2, -1, -1, -1, -2, -2]
burdensome -1.8  0.9798    [-1, -1, -3, -2, -1, -2, -2, -1, -4, -1]
bwahaha    0.4  1.0198    [0, 1, 0, 1, 0, 2, -1, -1, 2, 0]
bwahahah    2.5  0.92195 [3, 4, 2, 2, 2, 3, 1, 2, 2, 4]
calm    1.3  0.78102 [1, 1, 0, 1, 2, 3, 2, 1, 1, 1]
calmative    1.1  0.9434    [3, 2, -1, 1, 1, 1, 1, 1, 1, 1]

The word appears first, followed by its mean rating or score, the population standard deviation (calculated as stdevp), and a bracketed list of the individual ratings of 10 independent scorers. The Vader sentiment documentation explains the scoring:

"Features were rated on a scale from "[–4] Extremely Negative" to "[4] Extremely Positive", with allowance for "[0] Neutral (or Neither, N/A)".... We kept every lexical feature that had a non-zero mean rating, and whose standard deviation was less than 2.5 as determined by the aggregate of ten independent raters."

The result is a list of about 7,500 words, each of which is scored both on polarity, or whether it is a positive or negative word, along with the sentiment intensity, or how positive or how negative the word is.
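The mean and stdevp columns in the lexicon can be reproduced from the ten raw ratings using Python's standard statistics module; here, for example, are the ratings for calm from the sample lines shown previously:

```python
from statistics import mean, pstdev

# The ten raw ratings for 'calm' from the lexicon sample above.
ratings = [1, 1, 0, 1, 2, 3, 2, 1, 1, 1]

print(round(mean(ratings), 1))    # 1.3   -- the mean rating column
print(round(pstdev(ratings), 5))  # 0.78102 -- the population stdev column

# The documentation's retention rule: keep a feature only if its mean
# rating is non-zero and its standard deviation is below 2.5.
keep = mean(ratings) != 0 and pstdev(ratings) < 2.5
print(keep)  # True
```

Note that pstdev (the population standard deviation) is the right choice here, matching the stdevp calculation named in the Vader documentation, rather than the sample standard deviation stdev.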

In the next section, we will begin to construct a sentiment analysis application. We will learn how to use the NLTK Python package and the Vader tool to sentiment analyze a set of texts. We will walk through a typical sentiment analysis project, comparing results and various options for improving it.
