Chapter 6. Detecting Liars and the Confused in Contradictory Online Reviews

Jacob Perkins

Did you know that people lie for their own selfish reasons? Even if this is totally obvious to you, you may be surprised at how blatant this practice has become online, to the point where some people will explain their reasons for lying immediately after doing so.

I knew unethical people would lie in online reviews in order to inflate ratings or attack competitors, but what I didn’t know, and only learned by accident, is that individuals will sometimes write reviews that completely contradict their associated rating, without any regard to how it affects a business’s online reputation. And often they do this to businesses they actually like.

How did I learn this? By using ratings and reviews to create a sentiment corpus, I trained a sentiment analysis classifier that could reliably determine the sentiment of a review. While evaluating this classifier, I discovered that it could also detect discrepancies between the review sentiment and the corresponding rating, thereby finding liars and confused reviewers. Here’s the whole story of how I used text classification to identify an unexpected source of bad data...

Weotta

At my company, Weotta,[8] we produce applications and APIs for navigating local data in ways that people actually care about, so we can answer questions like: Is there a kid-friendly restaurant nearby? What’s the nearest hip yoga studio? What concerts are happening this weekend?

To do this, we analyze, aggregate, and organize local data in order to classify it along dimensions that we can use to answer these questions. This classification process enables us to know which restaurants are classy, which bars are divey, and where you should go on a first date. Online business reviews are one of the major input signals we use to determine these classifications. Reviews can tell us the positive or negative sentiment of the reviewer, as well as what they specifically care about, such as quality of service, ambience, and value. When we aggregate reviews, we can learn what’s popular about the place and why people like or dislike it. We use many other signals besides reviews, but with the proper application of natural language processing,[9] reviews are a rich source of significant information.

Getting Reviews

To get reviews, we use APIs where possible, but most reviews are found using good old-fashioned web scraping. If you can use an API like CityGrid[10] to get the data you need, it will make your life much easier, because while scraping isn’t necessarily difficult, it can be very frustrating. Website HTML can change without notice, and only the simplest or most advanced scraping logic will remain unaffected. But the majority of web scrapers will break on even the smallest of HTML changes, forcing you to continually monitor and maintain your scrapers. This is the dirty secret of web mining: the end result might be nice and polished data, but the process is more akin to janitorial work where every mess is unique and it never stays clean for long.

Once you’ve got reviews, you can aggregate ratings to calculate an average rating for a business. One problem is that many sources don’t include ratings with their reviews. So how can you accurately calculate an average rating? We wanted to do this for our data, as well as aggregate the overall positive sentiment from all the reviews for a business, independent of any average rating. With that in mind, I figured I could create a sentiment classifier,[11] using rated reviews as a training corpus. A classifier works by taking a feature set and determining a label. For sentiment analysis, a feature set is a piece of text, like a review, and the possible labels can be pos for positive text, and neg for negative text. Such a sentiment classifier could be run over a business’s reviews in order to calculate an overall sentiment, and to make up for any missing rating information.
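To make the feature set and label idea concrete, here is a toy sketch using NLTK’s Naive Bayes classifier with simple word-presence features; the training reviews are invented for illustration, not taken from our corpus.

# A toy illustration of the feature set -> label idea, using NLTK's
# Naive Bayes classifier with word-presence features. The reviews are made up.
from nltk.classify import NaiveBayesClassifier

def bag_of_words(text):
    # a feature set: each word in the review maps to True
    return {word: True for word in text.lower().split()}

train_feats = [
    (bag_of_words("great food and friendly service"), 'pos'),
    (bag_of_words("loved it, best pizza in town"), 'pos'),
    (bag_of_words("terrible service and the food was awful"), 'neg'),
    (bag_of_words("dirty tables, rude staff, never again"), 'neg'),
]

classifier = NaiveBayesClassifier.train(train_feats)
print(classifier.classify(bag_of_words("rude staff and awful food")))  # most likely 'neg'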

Sentiment Classification

NLTK,[12] Python’s Natural Language Toolkit, is a very useful programming library for doing natural language processing and text classification.[13] It also comes with many corpora that you can use for training and testing. One of these is the movie_reviews corpus,[14] and if you’re just learning how to do sentiment classification, this is a good corpus to start with. It is organized into two directories, pos and neg, each containing a set of movie review files. This corpus was created by Pang and Lee,[15] and they used the rating that came with each review to decide whether that review belonged in pos or neg. So in a 5-star rating system, reviews of 3.5 stars and higher went into the pos directory, while reviews of 2.5 stars and lower went into the neg directory. The assumption behind this is that high-rated reviews will have positive language, and low-rated reviews will have more negative language. Polarized language is ideal for text classification, because it lets the classifier learn much more precisely which words indicate pos and which indicate neg.
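For example, here is a short sketch of loading that corpus into (words, label) pairs, one per review file; it assumes the corpus has been downloaded with nltk.download('movie_reviews').

from nltk.corpus import movie_reviews

# one (list of words, label) pair per review file
documents = [
    (list(movie_reviews.words(fileid)), category)
    for category in movie_reviews.categories()      # ['neg', 'pos']
    for fileid in movie_reviews.fileids(category)
]

print(len(documents))  # 2000 files: 1000 per category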

Because I needed sentiment analysis for local businesses, not movies, I used a similar method to create my own sentiment training corpus for local business reviews. From a selection of businesses, I produced a corpus where the pos text came from 5 star reviews, and the neg text came from 1 star reviews. I actually started by using both 4 and 5 star reviews for pos, and 1 and 2 star reviews for neg, but after a number of training experiments, it was clear that the 2 and 4 star reviews had less polarizing language, and therefore introduced too much noise, decreasing the accuracy of the classifier. So my initial assumption was correct, though the implementation of it was not ideal. But because I created the training data, I had the power to change it in order to yield a more effective classifier. All training-based machine learning methods work on the general principle of “garbage in, garbage out,” so if your training data is no good, do whatever you can to make it better before you start trying to get fancy with algorithms.

Polarized Language

To illustrate the power of polarized language, what follows is a table showing some of the most polarized words used in the movie_reviews corpus, along with the occurrence count in each category, and the Chi-Squared information gain, calculated using NLTK’s BigramAssocMeasures.chi_sq() function in the nltk.metrics.association module.[16] This Chi-Square metric is a modification of the Phi-Square measure of the association between two variables, in this case a word and a category. Given the number of times a word appears in the pos category, the total number of times it occurs in all categories, the number of words in the pos category, and the total number of words in all categories, we get an association score between a word and the pos category. Because there are only two categories, the pos and neg scores for a given word are identical: the score measures how strongly the word is associated with one category over the other, while the word’s relative frequency in each category tells you which way that association points (a code sketch for computing these scores follows the second table below).

Word      Pos    Neg    Chi-Sq
bad       361    1034   399
worst     49     259    166
stupid    45     208    123
life      1057   529    126
boring    52     218    120
truman    152    11     108
why       317    567    99
great     751    397    76
war       275    94     71
awful     21     111    71

Some of the above words and numbers should be mostly obvious, but others not so much. For example, many people might think “war” is bad, but clearly that doesn’t apply to movies. And people tend to ask “why” more often in negative reviews compared to positive reviews. Negative adjectives are also more common, or at least provide more information for classification than the positive adjectives. But there can still be category overlap with these adjectives, such as “not bad” in a positive review, or “not great” in a negative review. Let’s compare this to some of the more common and less polarizing words:

Word        Pos    Neg    Chi-Sq
good        1248   1163   0.6361
crime       115    93     0.6180
romantic    137    117    0.1913
movies      635    571    0.0036
hair        57     52     0.0033
produced    67     61     0.0026
bob         96     86     0.0024
where       785    707    0.0013
face        188    169    0.0013
take        476    429    0.0003

You can see from this that “good” is one of the most overused adjectives and confers very little information. And you shouldn’t name a movie character “Bob” if you want a clear and strong audience reaction. When training a text classifier, these low-information words are harmful noise, and as such should either be discarded or weighted down, depending on the classification algorithm you use. Naive Bayes[17] in particular does not do well with noisy data, while a Logistic Regression classifier (also known as Maximum Entropy)[18] can weight noisy features down to the point of insignificance.
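If you want to compute scores like those in the tables above for your own corpus, the following sketch shows one way to do it with NLTK’s FreqDist, ConditionalFreqDist, and BigramAssocMeasures.chi_sq(). It uses the movie_reviews corpus, but any two-category corpus works the same way.

from nltk.corpus import movie_reviews
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

word_fd = FreqDist()                    # overall word counts
label_word_fd = ConditionalFreqDist()   # word counts per category

for label in ['pos', 'neg']:
    for word in movie_reviews.words(categories=[label]):
        word_fd[word.lower()] += 1
        label_word_fd[label][word.lower()] += 1

pos_word_count = label_word_fd['pos'].N()
total_word_count = word_fd.N()

# chi-squared association between each word and the pos category;
# with only two categories, the neg score works out the same
word_scores = {
    word: BigramAssocMeasures.chi_sq(
        label_word_fd['pos'][word], (freq, pos_word_count), total_word_count)
    for word, freq in word_fd.items()
}

for word, score in sorted(word_scores.items(), key=lambda ws: ws[1], reverse=True)[:10]:
    print(word, score)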

Corpus Creation

Here’s a little more detail on the corpus creation: we counted each review as a single instance. The simplest way to do this is to replace all newlines in a review with a single space, thereby ensuring that each review looks like one paragraph. Then separate each review/paragraph by a blank line, so that it is easy to identify where one review ends and the next begins. Depending on the number of reviews per category, you may want to have multiple files per category, each containing some reasonable number of reviews separated by blank lines. With multiple files, you should either have separate directories for pos and neg, like the movie_reviews corpus, or use easily identifiable filename patterns. The easiest approach is to copy an organizational pattern that already exists, so you can reuse any code that recognizes it. The number of reviews per file is up to you; what really matters is the number of reviews per category. Ideally you want at least 1,000 reviews in each category; I try to aim for at least 10,000 if possible. You want enough reviews to reduce the bias of any individual reviewer or item being reviewed, and to ensure that you get a good number of significant words in each category, so the classifier can learn effectively.
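Here is a minimal sketch of that layout, assuming a hypothetical reviews iterable of (text, rating) pairs and keeping only the polarized 1-star and 5-star reviews.

import os

def write_corpus(reviews, corpus_dir, reviews_per_file=1000):
    # bucket the polarized reviews: 5-star -> pos, 1-star -> neg
    buckets = {'pos': [], 'neg': []}
    for text, rating in reviews:
        if rating == 5:
            buckets['pos'].append(text)
        elif rating == 1:
            buckets['neg'].append(text)

    for label, texts in buckets.items():
        label_dir = os.path.join(corpus_dir, label)
        os.makedirs(label_dir, exist_ok=True)
        for i in range(0, len(texts), reviews_per_file):
            chunk = texts[i:i + reviews_per_file]
            path = os.path.join(label_dir, '%s_%d.txt' % (label, i // reviews_per_file))
            with open(path, 'w') as f:
                # one review per paragraph, paragraphs separated by blank lines
                f.write('\n\n'.join(' '.join(t.split()) for t in chunk))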

The other thing you need to be concerned about is category balance. After producing a corpus of 1-star and 5-star reviews, I had to limit the number of pos reviews significantly in order to balance the pos and neg categories, because it turns out that there are far more 5-star reviews than 1-star reviews. It seems that online, most businesses are above average, as you can see in this table showing the percentage of each rating.

Rating   Percent
5        32%
4        35%
3        17%
2        9%
1        7%

People are clearly biased towards giving higher ratings; there are nearly five times as many 5-star reviews as 1-star reviews. So it might make sense for a sentiment classifier to be biased the same way and, all else being equal, favor pos classifications over neg. But there’s a design problem here: if a sentiment classifier is biased towards the pos class, it will produce more false positives. And if you plan on surfacing these positive reviews, showing them to people who have no insight into how a sentiment classifier works, you really don’t want to show a false positive review. There’s a lot of cognitive dissonance when you claim that a business is highly rated and most people like it, while at the same time showing a negative review. One of the worst things you can do when designing a user interface is to show conflicting messages at the same time. So by balancing the pos and neg categories, I was able to reduce that bias and decrease false positives. This was accomplished by simply pruning the number of pos reviews until it equaled the number of neg reviews.
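That pruning can be as simple as randomly downsampling the larger category. A minimal sketch, assuming pos_reviews and neg_reviews are lists of review texts:

import random

def balance(pos_reviews, neg_reviews, seed=1234):
    # randomly downsample both categories to the size of the smaller one
    random.seed(seed)
    n = min(len(pos_reviews), len(neg_reviews))
    return random.sample(pos_reviews, n), random.sample(neg_reviews, n)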

Training a Classifier

Now that I had a polarized and balanced training corpus, it was trivial to train a classifier using a classifier training script from nltk-trainer.[19] nltk-trainer is an open source library of scripts I created for training and analyzing NLTK models. For text classification, the appropriate script is train_classifier.py. Just a few hours of experimentation led to a highly accurate classifier. Below is an example of how to use train_classifier.py, and the kind of stats I saw:

nltk-trainer$ ./train_classifier.py review_sentiment --no-pickle \
    --classifier MEGAM --ngrams 1 --ngrams 2 --instances paras \
    --fraction 0.75
loading review_sentiment
2 labels: ['neg', 'pos']
22500 training feats, 7500 testing feats
[Found megam: /usr/local/bin/megam]
training MEGAM classifier
accuracy: 0.913465
neg precision: 0.891415
neg recall: 0.931725
pos precision: 0.947058
pos recall: 0.910265

With these arguments, I’m using the MEGAM algorithm to train a MaxentClassifier, treating each review paragraph as a single instance and looking at both single words (unigrams) and pairs of words (bigrams). The MaxentClassifier (or Logistic Regression classifier) uses an iterative algorithm to determine a weight for every feature. These weights can be positive or negative for a category, meaning that the presence of a word can imply that a feature set belongs to a category and/or that it does not belong to a different category. So referring to the previous word tables, we can expect that “worst” will have a positive weight for the neg category and a negative weight for the pos category. The MEGAM algorithm is just one of many available training algorithms; I prefer it for its speed, memory efficiency, and slight accuracy advantage over the alternatives.
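If you would rather train directly in Python instead of through nltk-trainer, here is a rough sketch of the same idea: unigram plus bigram presence features fed to NLTK’s MaxentClassifier. MEGAM requires an external binary, so this sketch falls back to the slower built-in IIS algorithm, and labeled_reviews is a hypothetical list of (review_text, label) pairs.

from nltk.classify import MaxentClassifier
from nltk.util import bigrams

def review_features(text):
    # unigram and bigram presence features for one review
    words = text.lower().split()
    feats = {word: True for word in words}
    feats.update((bigram, True) for bigram in bigrams(words))
    return feats

def train_maxent(labeled_reviews):
    train_feats = [(review_features(text), label) for text, label in labeled_reviews]
    classifier = MaxentClassifier.train(train_feats, algorithm='IIS',
                                        max_iter=10, trace=0)
    classifier.show_most_informative_features(10)  # features with the largest weights
    return classifier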

The other options used above are --no-pickle, which means to not save the trained classifier to disk, and --fraction, which specifies how much of the corpus is used for training, with the remaining fraction used for testing. train_classifier.py has many other options, which you can see by using the --help option. These include various algorithm-specific training options, what constitutes an instance, which ngrams to use, and many more.

If you’re familiar with classification algorithms, you may be wondering why I didn’t use Naive Bayes. This is because my tests showed that Naive Bayes was much less accurate than Maxent, and that even combining the two algorithms did not beat Maxent by itself. Naive Bayes does not weight its features, and therefore tends to be susceptible to noisy data, which I believe is the reason it did not perform well in this case. But your data is probably different, and you may find the opposite result when you run your own experiments.

I actually wrote the original code behind train_classifier.py for this project so that I could design and modify classifier training experiments very quickly. Instead of copy-and-paste coding and endless script modifications, I could simply tweak command-line arguments to try out a different set of training parameters. I encourage you to do the same, and to perform many training experiments in order to arrive at the best possible set of options.

After I’d created this script for text classification, I added training scripts for part-of-speech tagging[20] and chunking,[21] leading to the creation of the whole nltk-trainer project and its suite of training and analysis scripts. I highly recommend trying these out before attempting to create a custom NLTK-based classifier, or any NLTK model, unless you really want to know how the code works or have custom feature extraction methods you want to use.

Validating the Classifier

But back to the sentiment classifier: no matter what the statistics say, over the years I’ve learned not to fully trust trained models and their input data. Unless every training instance has been hand-verified by three professional reviewers, you can assume there’s some noise and/or inaccuracy in your training data. So once I had trained what appeared to be a highly accurate sentiment classifier, I ran it over my training corpus in order to see if I could find reviews that were misclassified by the classifier. My goal was to figure out where the classifier went wrong, and perhaps get some insight into how to tweak the training parameters for better results. To my surprise, I found reviews like this in the pos/5-star category, which the classifier was classifying as neg:

It was loud and the wine by the glass is soo expensive. Thats the only negative because it was good.

And in the neg/1-star category, there were pos reviews like this:

One of the best places in New York for a romantic evening. Great food and service at fair prices.

We loved it! The waiters were great and the food came quickly and it was delicious. 5 star for us!

The classifier actually turned out to be more accurate at detecting sentiment than the ratings used to create the training corpus! Apparently, one of the many bizarre things people do online is write reviews that completely contradict their rating. While trying to create a sentiment classifier, I had accidentally created a way to identify both liars and the confused. Here are some 1-star reviews by blatant liars:

This is the best BBQ ever! I’m not just saying that to keep you fools from congesting my favorite places.

Quit coming to my favorite Karaoke spot. I found it first.

While these at least have some logic behind them, I completely disagree with the motivation. By giving 1 star with their review, these reviewers are actively harming the reputation of their favorite businesses for their own selfish short-term gain.

On the other side of the sentiment divide, here’s a mixed sentiment comment from a negative 5-star review:

This place sucks, do not come here, dirty, unfriendly staff and bad workout equipment. MY club, do you hear me, MY, MY, MY club. STAY AWAY! One of the best clubs in the bay area. All jokes aside, this place is da bomb.

This kind of negative 5-star review could also be harming the business’s reputation. The first few sentences may be a joke, but those are also the sentences people are more likely to read, and this review is saying some pretty negative things, albeit jokingly. And then there are people who really shouldn’t be writing reviews:

I like to give A’s. I dont want to hurt anyones feelings. A- is the lowest I like to give. A- is the new F.

The whole point of reviews and ratings is to express your opinion, and yet this reviewer seems afraid to do just that. And here’s an actual negative opinion from a 5-star review:

My steak was way over-cooked. The menu is very limited. Too few choices

If the above review came with a 1 or 2 star rating, that’d make sense. But a 5-star rating for a limited menu and overcooked steak? I’m not the only one who’s confused.

Finally, just to show that my sentiment classifier isn’t perfect, here’s a 5-star review that’s actually positive, but the reviewer uses a double negative, which causes the classifier to give it a negative sentiment:

Never had a disappointing meal there.

Double negatives, negations such as “not good,” sarcasm, and other language idioms can often confuse sentiment analysis systems and are an area of ongoing research. Because of this, I believe that it’s best to exclude such reviews from any metrics about a business. If you want a clear signal, you often have to ignore small bits of contradictory information.
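For reference, the validation pass that surfaced these reviews amounts to running the trained classifier back over its own labeled corpus and collecting disagreements. A rough sketch, reusing the hypothetical review_features() helper and labeled_reviews list from the earlier training sketch:

def find_contradictions(classifier, labeled_reviews):
    # collect reviews whose predicted sentiment disagrees with the
    # label derived from their star rating
    contradictions = []
    for text, rating_label in labeled_reviews:
        predicted = classifier.classify(review_features(text))
        if predicted != rating_label:
            contradictions.append((rating_label, predicted, text))
    return contradictions

# Each (rating_label, predicted, text) mismatch is either a classifier error,
# a confused reviewer, or a liar, and is worth inspecting by hand.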

Designing with Data

We use the sentiment classifier in another way, too. As I mentioned earlier, we show reviews of places to provide our users with additional context and confirmation. And because we try to show only the best places, the reviews we show should reflect that. This means that every review we show needs to have a strong positive signal. And if there’s a rating included with the review, it needs to be high too, because we don’t want to show any confused high-rated reviews or duplicitous low-rated reviews. Otherwise, we’d just confuse our own users.
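In code, that selection rule might look roughly like the sketch below; the probability and rating thresholds are illustrative, not our production values, and review_features() is again the hypothetical helper from the training sketch.

def is_showable(classifier, text, rating=None, min_prob=0.9, min_rating=4):
    # require a confident pos classification...
    probs = classifier.prob_classify(review_features(text))
    if probs.max() != 'pos' or probs.prob('pos') < min_prob:
        return False
    # ...and, when a rating exists, require it to be high as well
    if rating is not None and rating < min_rating:
        return False
    return True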

Before adding the sentiment classifier as a critical component of our review selection method, we were simply choosing reviews based on rating. And when we didn’t have a rating, we were choosing the most recent reviews. Neither of these methods was satisfactory. As I’ve shown above, you cannot always trust ratings to accurately reflect the sentiment of a review. And for reviews without a rating, anyone could say anything and we had no signal to use for filtering out the negative reviews. Now some might think we should be showing negative reviews to provide a balanced view of a business. But our goal in this case is not to create a research tool; there are plenty of other sites and apps that are already great for that. Our goal is to show you the best, most relevant places for your occasion. If every other signal is mostly positive, then showing negative reviews is a disservice to our users and results in a poor experience. By choosing to show only positive reviews, the data, design, and user experience are all congruent, helping our users choose from the best options available based on their own preferences, without having to do any mental filtering of negative opinions.

Lessons Learned

One important lesson for machine learning and statistical natural language processing enthusiasts: it’s very important to train your own models on your own data. If I had used classifiers trained on the standard movie_reviews corpus, I would never have gotten these results. Movie reviews are simply different from local business reviews. In fact, it might be the case that you’d get even better results by segmenting businesses by type, and creating classifiers for each type of business. I haven’t run this experiment yet, but it might lead to interesting research. The point is, your models should be trained on the same kind of data they need to analyze if you want high-accuracy results. And when it comes to text classification and sentiment analysis in particular, the domain really matters. That requires creating a custom corpus[22] and spending at least a few hours on experiments and research to really learn about your data in order to produce good models.

You must then take a critical look at your training data, and validate your training models against it. This is the only way to know what your model is actually learning, and if your training data is any good. If I hadn’t done any model validation, I would never have discovered these bad reviews, nor realized that my sentiment classifier could detect inconsistent opinions and outright lying. In a sense, these bad reviews are a form of noise that has been maliciously injected into the data. So ask yourself, what forms of bad data might be lurking in your data stream?

Summary

The process I went through can be summarized as:

  1. Get relevant data.

  2. Create a custom training corpus.

  3. Train a model.

  4. Validate that model against the training corpus.

  5. Discover something interesting.

At steps 3-5, you may find that your training corpus is not good enough. It could mean you need to get more relevant data. Or that the data you have is too noisy. In my case, I found that 2 and 4 star reviews were not polarizing enough, and that there was an imbalance between the number of 5-star reviews and the number of 1-star reviews.

It’s also possible that your expectations for machine learning are too high, and you need to simplify the problem. Natural language processing and machine learning are imperfect methods that rely on statistical pattern matching. You cannot expect 100% accuracy, and the noisier the data is, the lower your accuracy is likely to be. This is why you should always aim for clearly distinct categories, polarizing language, and simple classification decisions.

Resources

All of my examples have used NLTK, Python’s Natural Language Toolkit, which you can find at http://nltk.org/. I also train all my models using the scripts I created in nltk-trainer at https://github.com/japerk/nltk-trainer. To learn how to do text classification and sentiment analysis with NLTK yourself, I wrote a series of posts on my blog, starting with http://bit.ly/X9sqWR. And for those who want to go beyond basic text classification, take a look at scikit-learn, which implements the latest and greatest machine learning algorithms in Python: http://scikit-learn.org/stable/. For Java people, there is Apache’s OpenNLP project at http://opennlp.apache.org/, and a commercial library called LingPipe, available at http://alias-i.com/lingpipe/.
