Chapter 6. Sentiment Analysis – Categorizing Hotel Reviews

People talk about a lot of things online. There are forums and communities for almost everything under the sun, and some of them may be about your product or service. People may complain, or they may praise, and you would want to know which of the two they're doing.

This is where sentiment analysis helps. It can automatically track whether the reviews and discussions are positive or negative overall, and it can pull out items from either category to make them easier to respond to or draw attention to.

We'll cover a lot of ground over the course of this chapter, and some of it only in outline. In general, here's the plan:

  • Exploring and preparing the data
  • Understanding the classifiers
  • Running the experiment
  • Examining the error rates

Before we go any further, let's learn what sentiment analysis is.

Understanding sentiment analysis

Sentiment analysis is a form of text categorization that works on opinions instead of topics. Often, texts are categorized according to the subject they discuss. Sentiment analysis, by contrast, attempts to categorize texts according to the opinions or emotions of their writers, whether the text is about cars or pets. Often, these are cast in binary terms: good or bad, like or dislike, positive or negative, and so on. Does this person love Toyotas or hate them? Are Pugs the best, or are German Shepherds? Would they go back to this restaurant? Questions like these have made sentiment analysis an important area of research, simply because so many companies want to know what people say about their goods and services online. It gives companies' marketing departments a way to monitor people's opinions about their products or services as they talk on Twitter and other public online forums. They can reach out to unhappy customers to provide better, more proactive customer service, or reach out to satisfied ones to strengthen those relationships and positive opinions.

As you can imagine, categorizing based on opinion rather than on topic is much more difficult. Even basic words tend to take on multiple meanings that are very dependent on their contexts.

For example, take the word good. In a review, I can say that something is good. I can also say that it's not good, no good, or so far from good that I can almost see it on a clear day. On the other hand, I can say that something's bad. Or I can say that it's not bad. Or, if I'm stuck in the '80s, I can say that "I love it, it's so bad."
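To make the difficulty concrete, here is a minimal sketch of a context-blind scorer that simply counts positive and negative words. The tiny word lists and the `naive_score` function are hypothetical and exist only for illustration; they are not the approach this chapter goes on to use.

```python
# Hypothetical toy lexicons; real sentiment lexicons are far larger.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def naive_score(text):
    """Count positive words minus negative words, ignoring all context."""
    # Crude tokenization: lowercase, drop commas and periods, split on spaces.
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)


examples = [
    "It is good.",
    "It is not good.",
    "It is not bad.",
    "I love it, it's so bad.",
]

scores = {text: naive_score(text) for text in examples}
for text, score in scores.items():
    print(f"{score:+d}  {text}")
```

Because the scorer ignores negation and slang, "It is not good." scores the same as "It is good.", "It is not bad." comes out negative, and the '80s-style praise cancels itself out to zero.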

This is a very important and interesting problem, so people have been working on it for a number of years. An early paper on this topic, "Thumbs up? Sentiment Classification using Machine Learning Techniques," was published in 2002 by Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. In this paper, they compared naive Bayes, maximum entropy, and support vector machine classifiers for categorizing movie reviews as positive or negative. They also compared a variety of feature types, such as unigrams, bigrams, and other combinations. In general, they found that support vector machines with single tokens (unigrams) performed best, although the difference wasn't usually huge.
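The feature types mentioned here are simple to picture: unigrams are single tokens, and bigrams are pairs of adjacent tokens. The following is a rough sketch only; the `ngrams` helper and the sample sentence are assumptions made for illustration and are not taken from the paper.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# A made-up review fragment, split on whitespace.
tokens = "this movie was not good".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)
print(bigrams)
```

Note that the bigram ("not", "good") keeps the negation attached to the word it modifies, which a unigram feature on its own cannot do.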

Together and separately, Bo Pang, Lillian Lee, and many others have extended sentiment analysis in interesting ways. They've attempted to go beyond simple binary classifications toward predicting finer-grained sentiments. For example, they've worked on systems to predict from a document the number of stars the author of the review would give the reviewed service or object on a four-star or five-star rating system.

Part of what makes this interesting is that the yardstick for these systems is how well they agree with the judgments of human raters. However, in research, human raters only agree with each other about 79 percent of the time, so a system that agrees with human raters 60 or 70 percent of the time is doing pretty well.
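An agreement rate like this is just the fraction of items on which two sets of labels match. Here is a small sketch of that arithmetic; the `agreement_rate` function and both label lists are invented for illustration and are not data from any study.

```python
def agreement_rate(labels_a, labels_b):
    """Return the fraction of positions where the two label lists match."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both raters must label the same items")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Made-up labels for ten reviews.
human = ["pos", "neg", "pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
system = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]

rate = agreement_rate(human, system)
print(f"Agreement: {rate:.0%}")
```

The same function works for comparing two human raters with each other, which is how a ceiling such as the 79 percent figure would be measured.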
