Did you know that people lie for their own selfish reasons? Even if this is totally obvious to you, you may be surprised at how blatant this practice has become online, to the point where some people will explain their reasons for lying immediately after doing so.
I knew unethical people would lie in online reviews in order to inflate ratings or attack competitors, but what I didn’t know, and only learned by accident, is that individuals will sometimes write reviews that completely contradict their associated rating, without any regard to how it affects a business’s online reputation. And often this is for businesses that an individual likes.
How did I learn this? By using ratings and reviews to create a sentiment corpus, I trained a sentiment analysis classifier that could reliably determine the sentiment of a review. While evaluating this classifier, I discovered that it could also detect discrepancies between the review sentiment and the corresponding rating, thereby finding liars and confused reviewers. Here’s the whole story of how I used text classification to identify an unexpected source of bad data...
At my company, Weotta,[8] we produce applications and APIs for navigating local data in ways that people actually care about, so we can answer questions like: Is there a kid-friendly restaurant nearby? What’s the nearest hip yoga studio? What concerts are happening this weekend?
To do this, we analyze, aggregate, and organize local data in order to classify it along dimensions that we can use to answer these questions. This classification process enables us to know which restaurants are classy, which bars are divey, and where you should go on a first date. Online business reviews are one of the major input signals we use to determine these classifications. Reviews can tell us the positive or negative sentiment of the reviewer, as well as what they specifically care about, such as quality of service, ambience, and value. When we aggregate reviews, we can learn what’s popular about the place and why people like or dislike it. We use many other signals besides reviews, but with the proper application of natural language processing,[9] reviews are a rich source of significant information.
To get reviews, we use APIs where possible, but most reviews are found using good old-fashioned web scraping. If you can use an API like CityGrid[10] to get the data you need, it will make your life much easier, because while scraping isn’t necessarily difficult, it can be very frustrating. Website HTML can change without notice, and only the simplest or most advanced scraping logic will remain unaffected. But the majority of web scrapers will break on even the smallest of HTML changes, forcing you to continually monitor and maintain your scrapers. This is the dirty secret of web mining: the end result might be nice and polished data, but the process is more akin to janitorial work where every mess is unique and it never stays clean for long.
Once you’ve got reviews, you can aggregate ratings to calculate an average rating for a business. One problem is that many sources don’t include ratings with their reviews. So how can you accurately calculate an average rating? We wanted to do this for our data, as well as aggregate the overall positive sentiment from all the reviews for a business, independent of any average rating. With that in mind, I figured I could create a sentiment classifier,[11] using rated reviews as a training corpus. A classifier works by taking a feature set and determining a label. For sentiment analysis, a feature set is a piece of text, like a review, and the possible labels can be pos for positive text, and neg for negative text. Such a sentiment classifier could be run over a business’s reviews in order to calculate an overall sentiment, and to make up for any missing rating information.
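As a minimal sketch of the idea above, here is what a feature set, a label, and an aggregated overall sentiment might look like in plain Python. The function names and the keyword-based `toy_classify` stand-in are hypothetical and only illustrate the shape of the data; a real classifier would be trained, as described later.

```python
def bag_of_words(text):
    """Turn a piece of text into a feature set: a dict mapping each word to True."""
    return {word: True for word in text.lower().split()}


def overall_sentiment(reviews, classify):
    """Return the fraction of reviews that `classify` labels 'pos'.

    `classify` is any callable that takes a feature set and returns 'pos' or 'neg'.
    """
    if not reviews:
        return None  # no reviews, so there is no sentiment to report
    labels = [classify(bag_of_words(review)) for review in reviews]
    return labels.count("pos") / len(labels)


def toy_classify(features):
    """Hypothetical stand-in for a trained classifier, for illustration only."""
    return "neg" if "awful" in features or "worst" in features else "pos"


reviews = [
    "Great food and service",
    "The worst brunch in town",
    "We loved it",
    "Friendly staff and fair prices",
]
print(overall_sentiment(reviews, toy_classify))  # 3 of 4 reviews are pos: 0.75
```

Because the score depends only on the review text, it can be computed for sources that provide no ratings at all.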
NLTK,[12] Python’s Natural Language ToolKit, is a very useful programming library for doing natural language processing and text classification.[13] It also comes with many corpora that you can use for training and testing. One of these is the movie_reviews corpus,[14] and if you’re just learning how to do sentiment classification, this is a good corpus to start with. It is organized into two directories, pos and neg. In each directory is a set of files containing movie reviews, with every review separated by a blank line. This corpus was created by Pang and Lee,[15] and they used ratings that came with each review to decide whether that review belonged in pos or neg. So in a 5-star rating system, 3.5 stars and higher reviews went into the pos directory, while 2.5 stars and lower reviews went into the neg directory. The assumption behind this is that high rated reviews will have positive language, and low rated reviews will have more negative language. Polarized language is ideal for text classification, because the classifier can learn much more precisely those words that indicate pos and those words that indicate neg.
Because I needed sentiment analysis for local businesses, not movies, I used a similar method to create my own sentiment training corpus for local business reviews. From a selection of businesses, I produced a corpus where the pos text came from 5 star reviews, and the neg text came from 1 star reviews. I actually started by using both 4 and 5 star reviews for pos, and 1 and 2 star reviews for neg, but after a number of training experiments, it was clear that the 2 and 4 star reviews had less polarizing language, and therefore introduced too much noise, decreasing the accuracy of the classifier. So my initial assumption was correct, though the implementation of it was not ideal. But because I created the training data, I had the power to change it in order to yield a more effective classifier. All training-based machine learning methods work on the general principle of “garbage in, garbage out,” so if your training data is no good, do whatever you can to make it better before you start trying to get fancy with algorithms.
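The labeling rule described above (5 star reviews become pos, 1 star reviews become neg, everything in between is dropped) can be sketched in a few lines. The function names and the `(text, stars)` input format are assumptions made for illustration, not the code actually used for the project.

```python
def label_for_rating(stars):
    """Map a star rating to a training label, or None if the review is skipped."""
    if stars == 5:
        return "pos"
    if stars == 1:
        return "neg"
    return None  # 2, 3 and 4 star reviews are left out of the training corpus


def build_corpus(rated_reviews):
    """Group review texts by label, given an iterable of (text, stars) pairs."""
    corpus = {"pos": [], "neg": []}
    for text, stars in rated_reviews:
        label = label_for_rating(stars)
        if label is not None:
            corpus[label].append(text)
    return corpus


rated_reviews = [
    ("Great food and service at fair prices.", 5),
    ("Decent, but nothing special.", 3),
    ("Rude staff and cold food.", 1),
    ("Pretty good overall.", 4),
]
corpus = build_corpus(rated_reviews)
print(len(corpus["pos"]), len(corpus["neg"]))  # 1 1
```

Keeping the rule in one small function makes it cheap to rerun experiments with a different cutoff, such as the earlier 4 and 5 star versus 1 and 2 star split.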
To illustrate the power of polarized language, what follows is a table showing some of the most polarized words used in the movie_reviews corpus, along with the occurrence count in each category, and the Chi-Squared information gain, calculated using NLTK’s BigramAssocMeasures.chi_sq() function in the nltk.metrics.association module.[16] This Chi-Square metric is a modification of the Phi-Square measure of the association between two variables, in this case a word and a category. Given the number of times a word appears in the pos category, the total number of times it occurs in all categories, the number of words in the pos category, and the total number of words in all categories, we get an association score between a word and the pos category. Because we only have two categories, the pos and neg scores for a given word will be the same because the word is equally significant either way, but the interpretation of that significance depends on the word’s relative frequency in each category.
Word | Pos | Neg | Chi-Sq |
---|---|---|---|
bad | 361 | 1034 | 399 |
worst | 49 | 259 | 166 |
stupid | 45 | 208 | 123 |
life | 1057 | 529 | 126 |
boring | 52 | 218 | 120 |
truman | 152 | 11 | 108 |
why | 317 | 567 | 99 |
great | 751 | 397 | 76 |
war | 275 | 94 | 71 |
awful | 21 | 111 | 71 |
Some of the above words and numbers should be mostly obvious, but others not so much. For example, many people might think “war” is bad, but clearly that doesn’t apply to movies. And people tend to ask “why” more often in negative reviews compared to positive reviews. Negative adjectives are also more common, or at least provide more information for classification than the positive adjectives. But there can still be category overlap with these adjectives, such as “not bad” in a positive review, or “not great” in a negative review. Let’s compare this to some of the more common and less polarizing words:
Word | Pos | Neg | Chi-Sq |
---|---|---|---|
good | 1248 | 1163 | 0.6361 |
crime | 115 | 93 | 0.6180 |
romantic | 137 | 117 | 0.1913 |
movies | 635 | 571 | 0.0036 |
hair | 57 | 52 | 0.0033 |
produced | 67 | 61 | 0.0026 |
bob | 96 | 86 | 0.0024 |
where | 785 | 707 | 0.0013 |
face | 188 | 169 | 0.0013 |
take | 476 | 429 | 0.0003 |
You can see from this that “good” is one of the most overused adjectives and confers very little information. And you shouldn’t name a movie character “Bob” if you want a clear and strong audience reaction. When training a text classifier, these low-information words are harmful noise, and as such should either be discarded or weighted down, depending on the classification algorithm you use. Naive Bayes[17] in particular does not do well with noisy data, while a Logistic Regression classifier (also known as Maximum Entropy)[18] can weigh noisy features down to the point of insignificance.
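To make the association score in the two tables more concrete, here is a plain-Python sketch of a chi-square score computed from the four counts described earlier. It is a reimplementation for illustration rather than NLTK's own code, and it assumes the score is the total word count multiplied by the phi-square of the 2x2 contingency table. The category totals below are made-up numbers, so the printed score will not match the table.

```python
def chi_sq(word_in_cat, word_total, cat_total, all_total):
    """Chi-square association between a word and a category.

    word_in_cat: times the word appears in the category
    word_total:  times the word appears in all categories
    cat_total:   number of words in the category
    all_total:   number of words in all categories
    """
    # Fill in the four cells of the 2x2 contingency table.
    word_elsewhere = word_total - word_in_cat
    others_in_cat = cat_total - word_in_cat
    others_elsewhere = all_total - word_total - cat_total + word_in_cat

    numerator = (word_in_cat * others_elsewhere - word_elsewhere * others_in_cat) ** 2
    denominator = (
        word_total
        * cat_total
        * (all_total - cat_total)
        * (all_total - word_total)
    )
    phi_sq = numerator / denominator
    return all_total * phi_sq


# Hypothetical category sizes; the word counts for "bad" come from the table.
POS_WORDS, NEG_WORDS = 800_000, 700_000
TOTAL_WORDS = POS_WORDS + NEG_WORDS

pos_score = chi_sq(361, 361 + 1034, POS_WORDS, TOTAL_WORDS)
neg_score = chi_sq(1034, 361 + 1034, NEG_WORDS, TOTAL_WORDS)
print(pos_score, neg_score)  # with two categories, both scores are the same
```

A word that is spread across the categories in proportion to their sizes scores close to zero, which is the kind of feature worth discarding or weighting down.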
Here’s a little more detail on the corpus creation: we counted each review as a single instance. The simplest way to do this is to replace all newlines in a review with a single space, thereby ensuring that each review looks like one paragraph. Then separate each review/paragraph by a blank line, so that it is easy to identify when one review ends and the next begins. Depending on the number of reviews per category, you may want to have multiple files per category, each containing some reasonable number of reviews separated by blank lines. With multiple files, you should either have separate directories for pos and neg, like the movie_reviews corpus, or you could use easily identified filename patterns. The simplest way to do it is to copy something that already exists, so you can reuse any code that recognizes that organizational pattern. The number of reviews per file is up to you; what really matters is the number of reviews per category. Ideally you want at least 1000 reviews in each category; I try to aim for at least 10,000, if possible. You want enough reviews to reduce the bias of any individual reviewer or item being reviewed, and to ensure that you get a good number of significant words in each category, so the classifier can learn effectively.
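The one-paragraph-per-review layout can be sketched as follows. The function names are hypothetical, and this version collapses every run of whitespace (not only newlines) into a single space, which is a slightly stronger normalization than the one described above.

```python
import re


def review_to_paragraph(review):
    """Collapse all whitespace runs, including newlines, into single spaces."""
    return re.sub(r"\s+", " ", review).strip()


def format_corpus_file(reviews):
    """Build a file body with one paragraph per review and a blank line between them."""
    paragraphs = [review_to_paragraph(review) for review in reviews]
    return "\n\n".join(p for p in paragraphs if p) + "\n"


def split_corpus_file(body):
    """Recover the individual reviews from a file body written by format_corpus_file."""
    return [p.strip() for p in body.split("\n\n") if p.strip()]


raw_reviews = [
    "We loved it!\n\nThe waiters were great.",
    "Never had a disappointing\nmeal there.",
]
body = format_corpus_file(raw_reviews)
print(split_corpus_file(body))
```

Writing `body` to files under separate pos and neg directories would then mirror the layout of the movie_reviews corpus.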
The other thing you need to be concerned about is category balance. After producing a corpus of 1 and 5 star reviews, I had to limit the number of pos reviews significantly in order to balance the pos and neg categories, because it turns out that there are far more 5 star reviews than there are 1 star reviews. It seems that online, most businesses are above average, as you can see in this table showing the percentage of each rating.
Rating | Percent |
---|---|
5 | 32% |
4 | 35% |
3 | 17% |
2 | 9% |
1 | 7% |
People are clearly biased towards higher rated reviews; there are nearly five times as many 5 star reviews as 1 star reviews. So it might make sense that a sentiment classifier should be biased the same way, and all else being equal, favor pos classifications over neg. But there’s a design problem here: if a sentiment classifier is more biased towards the pos class, it will produce more false positives. And if you plan on surfacing these positive reviews, showing them to normal people who have no insight into how a sentiment classifier works, you really don’t want to show a false positive review. There’s a lot of cognitive dissonance when you claim that a business is highly rated and most people like it, while at the same time showing a negative review. One of the worst things you can do when designing a user interface is to show conflicting messages at the same time. So by balancing the pos and neg categories, I was able to reduce that bias and decrease false positives. This was accomplished by simply pruning the number of pos reviews until it was equal to the number of neg reviews.
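A minimal sketch of that pruning step is shown below. Choosing which reviews to keep by random sampling with a fixed seed is an assumption made here for reproducibility; the text only says the pos reviews were pruned down to the neg count.

```python
import random


def balance(pos_reviews, neg_reviews, seed=0):
    """Prune the larger category so both categories have the same number of reviews."""
    rng = random.Random(seed)  # fixed seed so repeated runs keep the same reviews
    size = min(len(pos_reviews), len(neg_reviews))
    return rng.sample(pos_reviews, size), rng.sample(neg_reviews, size)


pos_reviews = [f"positive review {i}" for i in range(32)]
neg_reviews = [f"negative review {i}" for i in range(7)]

balanced_pos, balanced_neg = balance(pos_reviews, neg_reviews)
print(len(balanced_pos), len(balanced_neg))  # 7 7
```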
Now that I had a polarized and balanced training corpus, it was trivial to train a classifier using a classifier training script from nltk-trainer.[19] nltk-trainer is an open source library of scripts I created for training and analyzing NLTK models. For text classification, the appropriate script is train_classifier.py. Just a few hours of experimentation led to a highly accurate classifier. Below is an example of how to use train_classifier.py, and the kind of stats I saw:
    nltk-trainer$ ./train_classifier.py review_sentiment --no-pickle --classifier MEGAM --ngrams 1 --ngrams 2 --instances paras --fraction 0.75
    loading review_sentiment
    2 labels: ['neg', 'pos']
    22500 training feats, 7500 testing feats
    [Found megam: /usr/local/bin/megam]
    training MEGAM classifier
    accuracy: 0.913465
    neg precision: 0.891415
    neg recall: 0.931725
    pos precision: 0.947058
    pos recall: 0.910265
With these arguments, I’m using the MEGAM algorithm for training a MaxentClassifier using each review paragraph as a single instance, looking at both single words (unigrams) and pairs of words (bigrams). The MaxentClassifier (or Logistic Regression) uses an iterative algorithm to determine weights for every feature. These weights can be positive or negative for a category, meaning that the presence of a word can imply that a feature set belongs to a category and/or that a feature set does not belong to a different category. So referring to the previous word tables, we can expect that “worst” will have a positive weight for the neg category, and a negative weight for the pos category. The MEGAM algorithm is just one of many available training algorithms, and I prefer it for its speed, memory efficiency, and slight accuracy advantage over the other available algorithms.
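To show what looking at both unigrams and bigrams means in practice, here is a small plain-Python sketch of such a feature set. The function name and the use of space-joined strings as feature keys are assumptions for illustration; the training script may represent its features differently.

```python
def ngram_features(text, sizes=(1, 2)):
    """Build a feature set containing every n-gram of the given sizes."""
    words = text.lower().split()
    features = {}
    for n in sizes:
        # Slide a window of n words across the text.
        for start in range(len(words) - n + 1):
            features[" ".join(words[start:start + n])] = True
    return features


features = ngram_features("The food was not great")
print(sorted(features))
```

With bigrams included, a phrase such as "not great" becomes a feature of its own, so it can receive a weight that differs from the weight of "great" alone.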
The other options used above are --no-pickle, which means to not save the trained classifier to disk, and --fraction, which specifies how much of the corpus is used for training, with the remaining fraction used for testing. train_classifier.py has many other options, which you can see by using the --help option. These include various algorithm-specific training options, what constitutes an instance, which ngrams to use, and many more.
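For readers who want to see what the fraction split and the reported precision and recall numbers amount to, here is a rough plain-Python sketch. The function names are hypothetical, and the simple head/tail split is an assumption; the script itself may split each label separately or shuffle first.

```python
def split_instances(instances, fraction=0.75):
    """Use the first `fraction` of the instances for training and the rest for testing."""
    cutoff = int(len(instances) * fraction)
    return instances[:cutoff], instances[cutoff:]


def precision_recall(gold, predicted, label):
    """Precision and recall for one label, given parallel lists of labels."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    predicted_count = sum(1 for p in predicted if p == label)
    gold_count = sum(1 for g in gold if g == label)
    precision = true_pos / predicted_count if predicted_count else None
    recall = true_pos / gold_count if gold_count else None
    return precision, recall


train, test = split_instances(list(range(30000)), fraction=0.75)
print(len(train), len(test))  # 22500 7500

gold = ["pos", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg"]
print(precision_recall(gold, predicted, "pos"))  # (1.0, 0.5)
```

Precision for pos answers "of the reviews labeled pos, how many really were?", which is the number that matters most when false positives are costly.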
If you’re familiar with classification algorithms, you may be wondering why I didn’t use Naive Bayes. This is because my tests showed that Naive Bayes was much less accurate than Maxent, and that even combining the two algorithms did not beat Maxent by itself. Naive Bayes does not weight its features, and therefore tends to be susceptible to noisy data, which I believe is the reason it did not perform too well in this case. But your data is probably different, and you may find opposite results when you conduct your experiments.
I actually wrote the original code behind train_classifier.py for this project so that I could design and modify classifier training experiments very quickly. Instead of copy and paste coding and endless script modifications, I was able to simply tweak command line arguments to try out a different set of training parameters. I encourage you to do the same, and to perform many training experiments in order to arrive at the best possible set of options.
After I’d created this script for text classification, I added training scripts for part-of-speech tagging[20] and chunking,[21] leading to the creation of the whole nltk-trainer project and its suite of training and analysis scripts. I highly recommend trying these out before attempting to create a custom NLTK based classifier, or any NLTK model, unless you really want to know how the code works, and/or have custom feature extraction methods you want to use.
But back to the sentiment classifier: no matter what the statistics say, over the years I’ve learned to not fully trust trained models and their input data. Unless every training instance has been hand-verified by three professional reviewers, you can assume there’s some noise and/or inaccuracy in your training data. So once I had trained what appeared to be a highly accurate sentiment classifier, I ran it over my training corpus in order to see if I could find reviews that were misclassified by the classifier. My goal was to figure out where the classifier went wrong, and perhaps get some insight into how to tweak the training parameters for better results. To my surprise, I found reviews like this in the pos/5-star category, which the classifier was classifying as neg:
It was loud and the wine by the glass is soo expensive. Thats the only negative because it was good.
And in the neg/1-star category, there were pos reviews like this:
One of the best places in New York for a romantic evening. Great food and service at fair prices.
We loved it! The waiters were great and the food came quickly and it was delicious. 5 star for us!
The classifier actually turned out to be more accurate at detecting sentiment than the ratings used to create the training corpus! Apparently, one of the many bizarre things people do online is write reviews that completely contradict their rating. While trying to create a sentiment classifier, I had accidentally created a way to identify both liars and the confused. Here are some 1-star reviews by blatant liars:
This is the best BBQ ever! I’m not just saying that to keep you fools from congesting my favorite places.
Quit coming to my favorite Karaoke spot. I found it first.
While these at least have some logic behind them, I completely disagree with the motivation. By giving 1 star with their review, these reviewers are actively harming the reputation of their favorite businesses for their own selfish short-term gain.
On the other side of the sentiment divide, here’s a mixed sentiment comment from a negative 5-star review:
This place sucks, do not come here, dirty, unfriendly staff and bad workout equipment. MY club, do you hear me, MY, MY, MY club. STAY AWAY! One of the best clubs in the bay area. All jokes aside, this place is da bomb.
This kind of negative 5-star review could also be harming the business’s reputation. The first few sentences may be a joke, but those are also the sentences people are more likely to read, and this review is saying some pretty negative things, albeit jokingly. And then there are people who really shouldn’t be writing reviews:
I like to give A’s. I dont want to hurt anyones feelings. A- is the lowest I like to give. A- is the new F.
The whole point of reviews and ratings is to express your opinion, and yet this reviewer seems afraid to do just that. And here’s an actual negative opinion from a 5-star review:
My steak was way over-cooked. The menu is very limited. Too few choices
If the above review came with a 1 or 2 star rating, that’d make sense. But a 5-star rating for a limited menu and overcooked steak? I’m not the only one who’s confused.
Finally, just to show that my sentiment classifier isn’t perfect, here’s a 5-star review that’s actually positive, but the reviewer uses a double negative, which causes the classifier to give it a negative sentiment:
Never had a disappointing meal there.
Double negatives, negations such as “not good,” sarcasm, and other language idioms can often confuse sentiment analysis systems and are an area of ongoing research. Because of this, I believe that it’s best to exclude such reviews from any metrics about a business. If you want a clear signal, you often have to ignore small bits of contradictory information.
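The validation pass that surfaced these reviews can be sketched as a simple comparison between the label implied by the rating and the label the classifier predicts. Everything named here is hypothetical: `toy_classify` is a keyword-based stand-in for the trained classifier, used only so the example runs on its own.

```python
def find_discrepancies(labeled_reviews, classify):
    """Return (text, rating_label, predicted_label) for every disagreement."""
    mismatches = []
    for text, rating_label in labeled_reviews:
        predicted = classify(text)
        if predicted != rating_label:
            mismatches.append((text, rating_label, predicted))
    return mismatches


def toy_classify(text):
    """Hypothetical stand-in for a trained sentiment classifier."""
    lowered = text.lower()
    return "pos" if "best" in lowered or "great" in lowered else "neg"


# Labels come from the star rating: 5 stars is pos, 1 star is neg.
labeled_reviews = [
    ("One of the best places in New York for a romantic evening.", "neg"),
    ("My steak was way over-cooked. The menu is very limited.", "pos"),
    ("Great food and service at fair prices.", "pos"),
]

for text, rating_label, predicted in find_discrepancies(labeled_reviews, toy_classify):
    print(f"rated {rating_label}, reads as {predicted}: {text}")
```

The flagged reviews still need a human look, since some disagreements are classifier mistakes (such as the double negative above) rather than contradictory ratings.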
We use the sentiment classifier in another way, too. As I mentioned earlier, we show reviews of places to provide our users with additional context and confirmation. And because we try to show only the best places, the reviews we show should reflect that. This means that every review we show needs to have a strong positive signal. And if there’s a rating included with the review, it needs to be high too, because we don’t want to show any confused high-rated reviews or duplicitous low-rated reviews. Otherwise, we’d just confuse our own users.
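One way to express that selection rule in code is sketched below. The thresholds, the dictionary layout of a review, and the `pos_probability` callable are all assumptions for illustration, not a description of the production system.

```python
def select_reviews(reviews, pos_probability, min_prob=0.9, min_rating=4):
    """Keep reviews with a strong positive signal and, if rated, a high rating."""
    selected = []
    for review in reviews:
        if pos_probability(review["text"]) < min_prob:
            continue  # sentiment signal is not strongly positive
        rating = review.get("rating")
        if rating is not None and rating < min_rating:
            continue  # positive text but a low rating: confusing to show
        selected.append(review)
    return selected


# Hypothetical scores standing in for a trained classifier's pos probability.
scores = {
    "We loved it!": 0.97,
    "This is the best BBQ ever!": 0.95,
    "My steak was way over-cooked.": 0.08,
    "Great food and service.": 0.93,
}

reviews = [
    {"text": "We loved it!", "rating": 5},
    {"text": "This is the best BBQ ever!", "rating": 1},
    {"text": "My steak was way over-cooked.", "rating": 5},
    {"text": "Great food and service."},  # no rating available
]

for review in select_reviews(reviews, scores.get):
    print(review["text"])
```

Unrated reviews pass on sentiment alone, which is what makes the classifier useful for sources that do not include ratings.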
Before adding the sentiment classifier as a critical component of our review selection method, we were simply choosing reviews based on rating. And when we didn’t have a rating, we were choosing the most recent reviews. Neither of these methods was satisfactory. As I’ve shown above, you cannot always trust ratings to accurately reflect the sentiment of a review. And for reviews without a rating, anyone could say anything and we had no signal to use for filtering out the negative reviews. Now some might think we should be showing negative reviews to provide a balanced view of a business. But our goal in this case is not to create a research tool—there’s plenty of other sites and apps that are already great for that. Our goal is to show you the best, most relevant places for your occasion. If every other signal is mostly positive, then showing negative reviews is a disservice to our users and results in a poor experience. By choosing to show only positive reviews, the data, design, and user experience are all congruent, helping our users choose from the best options available based on their own preferences, without having to do any mental filtering of negative opinions.
One important lesson for machine learning and statistical natural language processing enthusiasts: it’s very important to train your own models on your own data. If I had used classifiers trained on the standard movie_reviews corpus, I would never have gotten these results. Movie reviews are simply different from local business reviews. In fact, it might be the case that you’d get even better results by segmenting businesses by type, and creating classifiers for each type of business. I haven’t run this experiment yet, but it might lead to interesting research. The point is, your models should be trained on the same kind of data they need to analyze if you want high accuracy results. And when it comes to text classification and sentiment analysis in particular, the domain really matters. That requires creating a custom corpus[22] and spending at least a few hours on experiments and research to really learn about your data in order to produce good models.
You must then take a critical look at your training data, and validate your training models against it. This is the only way to know what your model is actually learning, and if your training data is any good. If I hadn’t done any model validation, I would never have discovered these bad reviews, nor realized that my sentiment classifier could detect inconsistent opinions and outright lying. In a sense, these bad reviews are a form of noise that has been maliciously injected into the data. So ask yourself, what forms of bad data might be lurking in your data stream?
The process I went through can be summarized as:
1. Get relevant data.
2. Create a custom training corpus.
3. Train a model.
4. Validate that model against the training corpus.
5. Discover something interesting.
At steps 3-5, you may find that your training corpus is not good enough. It could mean you need to get more relevant data. Or that the data you have is too noisy. In my case, I found that 2 and 4 star reviews were not polarizing enough, and that there was an imbalance between the number of 5-star reviews and the number of 1-star reviews.
It’s also possible that your expectations for machine learning are too high, and you need to simplify the problem. Natural language processing and machine learning are imperfect methods that rely on statistical pattern matching. You cannot expect 100% accuracy, and the noisier the data is, the more likely you are to have lower accuracy. This is why you should always aim for more distinct categories, polarizing language, and simple classification decisions.
All of my examples have used NLTK, Python’s Natural Language ToolKit, which you can find at http://nltk.org/. I also train all my models using the scripts I created in nltk-trainer at https://github.com/japerk/nltk-trainer. To learn how to do text classification and sentiment analysis with NLTK yourself, I wrote a series of posts on my blog, starting with http://bit.ly/X9sqWR. And for those who want to go beyond basic text classification, take a look at scikit-learn, which is implementing all the latest and greatest machine learning algorithms in Python: http://scikit-learn.org/stable/. For Java people, there is Apache’s OpenNLP project at http://opennlp.apache.org/, and a commercial library called LingPipe, available at http://alias-i.com/lingpipe/.