In this chapter, we will set out to solve a common problem: determining whether customers are happy or not. We’ll approach this by understanding that happy customers generally say nice things while unhappy ones don’t. This is their sentiment.
There are many possible solutions to this problem, but this chapter will focus on just one that works well: support vector machines (SVMs). This algorithm uses decision boundaries to split data into multiple parts and operates well in higher dimensions because it relies on feature transformation rather than on distances between all data points. We will discuss the normal testing methods we have laid out before, such as:
Cross-validation
Confusion matrix
Precision and recall
But we will also delve into a new way of improving models, known as feature transformation. In addition, we will discuss the possibilities of the following phenomena happening in a problem of sentiment analysis:
Entanglement
Unstable data
Correction cascade
Configuration debt
Our online store has two sets of customers, happy and unhappy. The happy customers return to the site consistently and buy from the company, while the unhappy customers are either window shoppers or spendthrifts who don’t care about the company or who are spending their money elsewhere. Our goals are to determine whether customer happiness correlates with our bottom line, and, down the line, to monitor their happiness.
But here exists a problem. How do we numerically say that a customer is happy or not? Unfortunately, there isn’t a field in our database explaining how happy our customers are. We know intuitively that happy customers are usually more likely to stay customers, but how can we test that?
There are two tiers to this problem:
We need to figure out whether customers are happy or not, or whether their sentiment is positive or negative in what they say.
Does overall customer sentiment correlate with our bottom line?
We also assume that a happy customer means more money, but is that actually true? How can we even build an algorithm to test something like that?
To start solving this two-tiered problem, we will figure out a way to map customers to sentiment. There are many ways to approach this problem, such as clustering customers into two groups or using KNN to find the closest neighbors to people we know are unhappy or happy. Or we could use SVMs.
To be able to map overall customer sentiment, we first need data to use. For our purposes we have a support system that allows us to export data that was written by our customers.
Thinking about our customers who have written to us many times in our support system, how would we determine whether they are happy or not? Ways to approach this include:
Have support agents tag each individual ticket with a sentiment (positive or negative).
Have support agents tag a subset of tickets (X% of all tickets).
Use an existing tagged database (such as movie reviews or some academic data set).
Now that we have data to classify, we can determine what algorithm to use. Since this chapter is about SVMs, that is what we will use, although many other algorithms would work just as well. SVMs are a good fit here because they have the following properties:
They avoid the curse of dimensionality, meaning we can use lots of dimensions (features).
They have been shown to work well with sentiment analysis, which is pertinent to the issues discussed next.1
Let’s imagine we have data from our customers, in this case support tickets. In this example let’s say the customer is either happy or unhappy with the ticket (Figure 7-1).
Conceptually, if we were to build a model of what makes a customer happy or unhappy, we could take our inputs (in this case features from the text) and determine customer groupings. This would be very similar to KNN and would yield something like Figure 7-2.
This is a great idea, but it has a downside: textual features generally are high in number, which as we’ve discussed can incur the curse of dimensionality. For instance, a set of support tickets might have 4,000 dimensions, each indicating whether a ticket contains a particular word from the corpus. So instead of relying on KNN, we should approach this model via a decision boundary.
If you were to look at these data points as a graphic, you might also think about splitting the data into two pieces by drawing a line down the middle. It’s obvious to us humans that this might yield a good solution. It also means that anything on one side of the line is unhappy while anything on the other is happy.
This idea is called a decision boundary method and there are many different algorithms in this category, including rules-based algorithms and decision trees.
Decision trees and random forests are types of decision boundary methods. If we were to plot the mushrooms in an n-dimensional space, we could construct a boundary that splits the data points into their various classes.
But for sentiment analysis with 4,000 dimensions, given what we see here, how can we find the best boundary that splits the data into two parts?
To find the best line between the two sets of data, imagine that we instead draw a margin between them (Figure 7-3). If you could find the widest margin between the two data sets, then you would have solved the problem optimally and also found the solution that SVMs find.
This is what SVMs do: maximize the breadth of the margin between two (or more) classifications. The beauty of this algorithm is that it is computationally tractable because it maps to a quadratic program (a convex optimization problem), for which any optimum found is the global one.
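To make the margin idea concrete, here is a minimal sketch in plain Python. The toy points and the two candidate boundaries are made up for illustration, and the helper name `geometric_margin` is hypothetical; a real SVM solver searches for the best boundary instead of comparing hand-picked ones.

```python
import math

# Hypothetical toy data: (x1, x2) points labeled -1 (unhappy) or 1 (happy).
points = [((1.0, 1.0), -1), ((2.0, 1.0), -1), ((4.0, 3.0), 1), ((5.0, 3.0), 1)]


def geometric_margin(w, b, data):
    """Smallest signed distance from any point to the line w.x + b = 0.

    A positive result means every point lies on its correct side.
    """
    norm = math.hypot(w[0], w[1])
    return min(label * (w[0] * x[0] + w[1] * x[1] + b) / norm
               for x, label in data)


# Two candidate boundaries that both separate the classes.
vertical = ((1.0, 0.0), -3.0)   # the line x1 = 3
diagonal = ((1.0, 1.0), -5.0)   # the line x1 + x2 = 5

margin_vertical = geometric_margin(*vertical, points)
margin_diagonal = geometric_margin(*diagonal, points)

print("vertical margin:", margin_vertical)
print("diagonal margin:", margin_diagonal)
```

Both lines classify every point correctly, but the diagonal one keeps the closest points farther away, so of these two it is the boundary an SVM would prefer.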
But as you might notice I’m cheating by showing data that can be separated by a line. What about data where things aren’t so pretty?
What if our data isn’t linear? This is where a fundamental concept of improving and testing models comes into play. Instead of being forced to live in a coordinate system such as $\langle x_0, \ldots, x_n \rangle$, we can instead transform our data into a new coordinate system in which the problem is easier to solve. There are lots of ways of transforming features (which we will cover in later chapters), but one of them is called the kernel trick.
To understand it, here’s a riddle for you: in Figure 7-4, draw a straight line that separates the two circles.
Well, you can’t. That is, unless you think outside of the box, so to speak.
These look like regular circles, so there doesn’t appear to be a line that could separate them. This is true in a 2D Cartesian coordinate system, but if you project the points into a 3D Cartesian coordinate system, $\langle x, y \rangle \rightarrow \langle x^2, \sqrt{2}xy, y^2 \rangle$, you will find that the data does in fact become linearly separable (Figure 7-5).
Now you can see that these two circles are separate and you can easily draw a plane between the two. If you mapped that plane back to the original 2D space, it would appear as a third circle lying between the other two.
Next time you need a bar trick, try that out on someone.
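The projection can be checked numerically with a short sketch. It assumes the two circles are concentric, centered at the origin, with radii 1 and 2; those radii, the separating plane at 2.5, and the helper names are illustrative choices and not taken from the figures.

```python
import math


def feature_map(x, y):
    """Project a 2D point into 3D: <x, y> -> <x^2, sqrt(2)*x*y, y^2>."""
    return (x * x, math.sqrt(2) * x * y, y * y)


def circle(radius, n=12):
    """n evenly spaced points on a circle centered at the origin."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]


inner = [feature_map(x, y) for x, y in circle(1.0)]
outer = [feature_map(x, y) for x, y in circle(2.0)]

# In the new coordinates z1 + z3 = x^2 + y^2 = radius^2, so the plane
# z1 + z3 = 2.5 has every inner point below it and every outer point above it.
inner_side = [z1 + z3 - 2.5 for z1, _, z3 in inner]
outer_side = [z1 + z3 - 2.5 for z1, _, z3 in outer]

print(all(side < 0 for side in inner_side))
print(all(side > 0 for side in outer_side))
```

Mapping the plane z1 + z3 = 2.5 back to two dimensions gives the circle x² + y² = 2.5, which sits between the two original circles.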
This doesn’t just work for circles, but unfortunately the visualizations of four or more dimensions are confusing so I left them out. There are many different types of projections (or kernels) such as:
Polynomial kernel (inhomogeneous and homogeneous)
Radial basis functions
Gaussian kernels
I do encourage you to read up more on kernels, although they will most likely distract us from the original intent of this section!
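As a small illustration, the kernels listed above can be written as plain functions. This is a sketch using the textbook definitions; the function names and the default values for `degree`, `c`, and `gamma` are assumptions made for the example. It also shows why this is called a trick: the homogeneous degree-2 polynomial kernel gives the same number as a dot product taken after the 3D projection used for the circles, without ever building the projected vectors.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def polynomial_kernel(a, b, degree=2, c=0.0):
    """(a.b + c)^degree; c = 0 gives the homogeneous form."""
    return (dot(a, b) + c) ** degree


def gaussian_kernel(a, b, gamma=0.5):
    """exp(-gamma * ||a - b||^2), the usual radial basis function."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def explicit_map(p):
    """The projection <x, y> -> <x^2, sqrt(2)*x*y, y^2>."""
    x, y = p
    return (x * x, math.sqrt(2) * x * y, y * y)


a, b = (1.0, 2.0), (3.0, -1.0)
via_kernel = polynomial_kernel(a, b)
via_projection = dot(explicit_map(a), explicit_map(b))

print(via_kernel, via_projection)
print(gaussian_kernel(a, a))
```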
One downside to using kernels, though, is that you can easily overfit data. In a lot of ways they operate like splines. But one way to avoid overfitting is to introduce slack.
What if our data isn’t separable by a line? Luckily mathematicians have thought about this, and in mathematical optimization there’s a concept called “slack.” The idea is to introduce an extra variable for each data point that measures how far the point is allowed to violate the margin. The total slack is minimized along with the rest of the objective, which reduces the risk of overfitting. In practice, with SVMs the amount of slack is controlled by a free parameter C, which tells the algorithm how heavily to penalize slack (Figure 7-6).
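The effect of C can be sketched with the standard soft-margin objective, 0.5·‖w‖² + C·Σ slack, where each point's slack is max(0, 1 − y(w·x + b)). Treating this as the formulation the text has in mind is an assumption, and the one-dimensional data and the two candidate boundaries below are invented for illustration.

```python
def soft_margin_objective(w, b, data, C):
    """0.5 * ||w||^2 + C * total slack, with slack = max(0, 1 - y * (w.x + b))."""
    total_slack = sum(
        max(0.0, 1.0 - label * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, label in data
    )
    return 0.5 * sum(wi * wi for wi in w) + C * total_slack


# Hypothetical 1D data; the point at 2.5 is labeled -1 but sits near the positives.
data = [((0.0,), -1), ((1.0,), -1), ((2.5,), -1), ((3.0,), 1), ((4.0,), 1)]

wide = ((1.0,), -2.0)      # wide margin that tolerates the odd point
narrow = ((4.0,), -11.0)   # narrow margin squeezed between 2.5 and 3.0

costs = {}
for C in (1.0, 100.0):
    costs[C] = (soft_margin_objective(*wide, data, C),
                soft_margin_objective(*narrow, data, C))
    preferred = "wide" if costs[C][0] < costs[C][1] else "narrow"
    print(f"C={C}: wide={costs[C][0]}, narrow={costs[C][1]}, preferred={preferred}")
```

With a small C the wide boundary wins even though it leaves one point inside the margin. With a large C the slack becomes expensive and the narrow boundary that fits every point wins, which is the behavior that tends toward overfitting.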
As I discussed in Chapter 5, overfitting is a downfall of machine learning and inductive biases, so avoiding it is a good thing to do.
Okay, enough theory—let’s build a sentiment analyzer.
In this section, we’ll build a sentiment analyzer that determines the sentiment of movie reviews. The example we’ll use also applies to working with support tickets. We’ll first talk about what this tool will look like conceptually in a class diagram. Then, after identifying the pieces of the tool, we will build a Corpus class, a CorpusSet class, and a SentimentClassifier class. The Corpus and CorpusSet classes involve transforming the text into numerical information. SentimentClassifier is where we will then use the SVM algorithm to build this sentiment analyzer.
All of the code we are using for this example can be found on the thoughtfulml repository on GitHub.
Python is constantly changing, so the README file is the best place to get up to speed on running the examples.
Beyond a running Python version, this example depends on the NumPy, SciPy, and scikit-learn packages, which the code below imports.
In this section we will be building three classes to support classifying incoming text as either positive or negative sentiment (Figure 7-7).
Corpus
This class will parse sentiment text and store it as a corpus with frequencies in it.
CorpusSet
This is a collection of multiple corpora, each of which has a sentiment attached to it.
SentimentClassifier
Utilizes a CorpusSet to train and classify sentiment.
Corpus, like corpse, means a body, but in this case it’s a body of writings. This word is used heavily in the natural-language processing community to signal a big group of previous writings that can be used to infer knowledge. In our example, we are using corpus to refer to a body of writings around a certain sentiment. Corpora is the plural of corpus.
Testing in SVMs primarily deals with setting an acceptance threshold for accuracy and then tweaking the model until it works well enough. That is the concept we will apply in this chapter.
Besides the normal TDD affair of writing unit tests for our seams and building a solid code base, there are additional testing considerations for SVMs:
Speed of training the model before and after configuration changes
Confusion matrix and precision and recall
Sensitivity analysis (correction cascades, configuration debt)
I will talk about these concerns throughout this section.
Our Corpus class will handle the following:
Tokenizing text
Sentiment leaning, whether 'negative' or 'positive'
Mapping from sentiment leaning to a numerical value
Returning a unique set of words from the corpus
When we write the seam test for this, we end up with the following:

    import unittest
    from io import StringIO

    from support_vector_machines.corpus import Corpus


    class TestCorpus(unittest.TestCase):
        def setUp(self):
            self.negative = StringIO('I hated that so much')
            self.negative_corpus = Corpus(self.negative, 'negative')
            self.positive = StringIO('loved movie!! loved')
            self.positive_corpus = Corpus(self.positive, 'positive')

        def test_trivial(self):
            """consumes a negative training set"""
            self.assertEqual('negative', self.negative_corpus.sentiment)

        def test_tokenize1(self):
            """downcases all the word tokens"""
            self.assertListEqual(['quick', 'brown', 'fox'],
                                 Corpus.tokenize('Quick Brown Fox'))

        def test_tokenize2(self):
            """ignores all stop symbols"""
            self.assertListEqual(['hello'],
                                 Corpus.tokenize('"\'hello!?!?!.\'" '))

        def test_tokenize3(self):
            """ignores the unicode space"""
            self.assertListEqual(['hello', 'bob'],
                                 Corpus.tokenize(u'hello\u00A0bob'))

        def test_positive(self):
            """consumes a positive training set"""
            self.assertEqual('positive', self.positive_corpus.sentiment)

        def test_words(self):
            """consumes a positive training set and unique set of words"""
            self.assertEqual({'loved', 'movie'},
                             self.positive_corpus.get_words())

        def test_sentiment_code_1(self):
            """defines a sentiment_code of 1 for positive"""
            self.assertEqual(1, Corpus(StringIO(''), 'positive').sentiment_code)

        def test_sentiment_code_minus1(self):
            """defines a sentiment_code of -1 for negative"""
            self.assertEqual(-1, Corpus(StringIO(''), 'negative').sentiment_code)

StringIO makes strings look like IO objects, which makes it easy to test file IO–type operations on strings.
As you learned in Chapter 4, there are many different ways of tokenizing text, such as extracting out stems, frequency of letters, emoticons, and words. For our purposes, we will just tokenize words. These are defined as strings between nonalpha characters. So out of a string like “The quick brown fox” we would extract the, quick, brown, fox (Figure 7-8). We don’t care about punctuation and we want to be able to skip Unicode spaces and nonwords.
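As a rough illustration of "strings between nonalpha characters," here is a self-contained sketch. It is a simplification and differs from the class that follows: it only recognizes the ASCII letters a to z, and the optional `stop_words` argument is a hypothetical stand-in for the stop-word file the chapter's code reads.

```python
import re

# Runs of anything that is not a lowercase ASCII letter act as separators.
NON_ALPHA = re.compile(r"[^a-z]+")


def tokenize(text, stop_words=frozenset()):
    """Lowercase the text, split on non-letters, drop empties and stop words."""
    parts = NON_ALPHA.split(text.lower())
    return [part for part in parts if part and part not in stop_words]


print(tokenize("The quick brown fox"))
print(tokenize("The quick brown fox", stop_words={"the"}))
print(tokenize("hello\u00a0bob"))
```

Because punctuation and Unicode spaces are not letters, they are skipped without any special handling. The trade-off is that contractions such as "don't" are split into two tokens.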
Writing the Corpus class, we end up with:

    import io
    import re


    class Corpus(object):
        # Punctuation that is stripped before the text is split into words.
        skip_regex = re.compile(r'[\'".?!]+')
        # Any whitespace, including Unicode spaces such as U+00A0.
        space_regex = re.compile(r'\s', re.UNICODE)
        stop_words = [x.strip() for x in
                      io.open('data/stopwords.txt', errors='ignore').readlines()]
        sentiment_to_number = {'positive': 1, 'negative': -1}

        @classmethod
        def tokenize(cls, text):
            cleared_text = cls.skip_regex.sub('', text)
            parts = cls.space_regex.split(cleared_text)
            parts = [part.lower() for part in parts]
            return [part for part in parts
                    if len(part) > 0 and part not in cls.stop_words]

        def __init__(self, io, sentiment):
            self._io = io
            self._sentiment = sentiment
            self._words = None

        @property
        def sentiment(self):
            return self._sentiment

        @property
        def sentiment_code(self):
            return self.sentiment_to_number[self._sentiment]

        def get_words(self):
            # Build the word set once, then rewind so the text can be read again.
            if self._words is None:
                self._words = set()
                for line in self._io:
                    for word in Corpus.tokenize(line):
                        self._words.add(word)
                self._io.seek(0)
            return self._words

        def get_sentences(self):
            for line in self._io:
                yield line

Now to create our next class, CorpusSet.
The CorpusSet class brings multiple corpora together and gives us a good basis to use SVMs:

    import unittest
    from io import StringIO

    from numpy import array
    from scipy.sparse import csr_matrix

    from support_vector_machines.corpus import Corpus
    from support_vector_machines.corpus_set import CorpusSet


    class TestCorpusSet(unittest.TestCase):
        def setUp(self):
            self.positive = StringIO('I love this country')
            self.negative = StringIO('I hate this man')
            self.positive_corp = Corpus(self.positive, 'positive')
            self.negative_corp = Corpus(self.negative, 'negative')
            self.corpus_set = CorpusSet([self.positive_corp, self.negative_corp])

        def test_compose(self):
            """composes two corpuses together"""
            self.assertEqual({'love', 'country', 'hate', 'man'},
                             self.corpus_set.words)

        def test_spars(self):
            """returns a set of sparse vectors to train on"""
            expected_ys = [1, -1]
            expected_xes = csr_matrix(array([[1, 1, 0, 0], [0, 0, 1, 1]]))
            self.corpus_set.calculate_sparse_vectors()
            ys = self.corpus_set.yes
            xes = self.corpus_set.xes
            self.assertListEqual(expected_ys, ys)
            self.assertListEqual(list(expected_xes.data), list(xes.data))
            self.assertListEqual(list(expected_xes.indices), list(xes.indices))
            self.assertListEqual(list(expected_xes.indptr), list(xes.indptr))

To make these tests pass, we need to build a CorpusSet class that takes in multiple corpora, transforms all of that into a matrix of features, and has the properties words, xes, and yes (the latter two for x’s and y’s).
Let’s start by building a CorpusSet class:

    import numpy as np
    from scipy.sparse import csr_matrix, vstack

    from .corpus import Corpus


    class CorpusSet(object):
        def __init__(self, corpora):
            self._yes = None
            self._xes = None
            self._corpora = corpora
            self._words = set()
            for corpus in self._corpora:
                self._words.update(corpus.get_words())

        @property
        def words(self):
            return self._words

        @property
        def xes(self):
            return self._xes

        @property
        def yes(self):
            return self._yes

This doesn’t do much except store all of the words in a set for later use. It does that by iterating the corpora and storing all the unique words. From here we need to calculate the sparse vectors we will use in the SVM, which depends on building a feature matrix composed of feature vectors:

    class CorpusSet(object):
        # __init__
        # words
        # xes
        # yes

        def calculate_sparse_vectors(self):
            self._yes = []
            self._xes = None
            for corpus in self._corpora:
                vectors = self.feature_matrix(corpus)
                if self._xes is None:
                    self._xes = vectors
                else:
                    self._xes = vstack((self._xes, vectors))
                # One label per sentence (row) in this corpus.
                self._yes.extend([corpus.sentiment_code] * vectors.shape[0])

        def feature_matrix(self, corpus):
            # Build the CSR arrays row by row: one row per sentence,
            # one column per word in the combined vocabulary.
            data = []
            indices = []
            indptr = [0]
            for sentence in corpus.get_sentences():
                sentence_indices = self._get_indices(sentence)
                indices.extend(sentence_indices)
                data.extend([1] * len(sentence_indices))
                indptr.append(len(indices))
            feature_matrix = csr_matrix((data, indices, indptr),
                                        shape=(len(indptr) - 1, len(self._words)),
                                        dtype=np.float64)
            feature_matrix.sort_indices()
            return feature_matrix

        def feature_vector(self, sentence):
            # A single-row version of feature_matrix, used when classifying.
            indices = self._get_indices(sentence)
            data = [1] * len(indices)
            indptr = [0, len(indices)]
            vector = csr_matrix((data, indices, indptr),
                                shape=(1, len(self._words)),
                                dtype=np.float64)
            return vector

        def _get_indices(self, sentence):
            # Column numbers follow the iteration order of the word set.
            word_list = list(self._words)
            indices = []
            for token in Corpus.tokenize(sentence):
                if token in self._words:
                    index = word_list.index(token)
                    indices.append(index)
            return indices

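To see what the three arrays passed to csr_matrix mean, here is a plain-Python sketch that builds them and expands them back into dense rows. It does not use SciPy; the helper names and the vocabulary order (love=0, country=1, hate=2, man=3) are assumptions made for the example.

```python
def to_csr_triplet(rows):
    """Build (data, indices, indptr) from a list of rows of column indices."""
    data, indices, indptr = [], [], [0]
    for columns in rows:
        indices.extend(columns)            # which columns are nonzero
        data.extend([1.0] * len(columns))  # the value stored in each of them
        indptr.append(len(indices))        # where the next row starts
    return data, indices, indptr


def to_dense(data, indices, indptr, n_columns):
    """Expand the triplet back into a list of dense rows."""
    dense = []
    for row in range(len(indptr) - 1):
        values = [0.0] * n_columns
        for k in range(indptr[row], indptr[row + 1]):
            values[indices[k]] += data[k]
        dense.append(values)
    return dense


# Hypothetical vocabulary order: love=0, country=1, hate=2, man=3.
rows = [[0, 1], [2, 3]]   # "love country" and "hate man"
data, indices, indptr = to_csr_triplet(rows)
dense = to_dense(data, indices, indptr, n_columns=4)

print(data, indices, indptr)
print(dense)
```

Only the nonzero entries are stored, which matters when each sentence touches a handful of words out of a vocabulary of thousands.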
At this point we should have enough to validate our model using cross-validation. For that we will get into building the actual sentiment classifier as well as model validation.
Now we can get to writing the cross-validation unit test, which will determine how well our classification works. We do this by having two different tests. The first checks that the error rate on held-out validation data stays below 35%. The second ensures that when the model trains and validates on the same data, there is zero error:

    from fractions import Fraction
    import unittest
    import io
    import os

    from support_vector_machines.sentiment_classifier import SentimentClassifier


    class TestSentimentClassifier(unittest.TestCase):
        def setUp(self):
            pass

        def test_validate(self):
            """cross validates with an error of 35% or less"""
            neg = self.split_file('data/rt-polaritydata/rt-polarity.neg')
            pos = self.split_file('data/rt-polaritydata/rt-polarity.pos')

            classifier = SentimentClassifier.build([
                neg['training'],
                pos['training']
            ])

            c = 2 ** 7
            classifier.c = c
            classifier.reset_model()

            n_er = self.validate(classifier, neg['validation'], 'negative')
            p_er = self.validate(classifier, pos['validation'], 'positive')
            # Average the two per-class error rates. Adding numerators and
            # denominators is unreliable here because Fraction reduces each
            # rate to lowest terms.
            total = (n_er + p_er) / 2
            print(total)
            self.assertLess(total, 0.35)

        def test_validate_itself(self):
            """yields a zero error when it uses itself"""
            classifier = SentimentClassifier.build([
                'data/rt-polaritydata/rt-polarity.neg',
                'data/rt-polaritydata/rt-polarity.pos'
            ])

            c = 2 ** 7
            classifier.c = c
            classifier.reset_model()

            n_er = self.validate(classifier,
                                 'data/rt-polaritydata/rt-polarity.neg',
                                 'negative')
            p_er = self.validate(classifier,
                                 'data/rt-polaritydata/rt-polarity.pos',
                                 'positive')
            total = (n_er + p_er) / 2
            print(total)
            self.assertEqual(total, 0)

Both tests rely on two utility functions, which could also be achieved using either scikit-learn or other packages:

    class TestSentimentClassifier(unittest.TestCase):
        # setUp
        # test_validate
        # test_validate_itself

        def validate(self, classifier, file, sentiment):
            # Error rate: the share of lines not classified as `sentiment`.
            total = 0
            misses = 0
            with open(file, errors='ignore') as f:
                for line in f:
                    if classifier.classify(line) != sentiment:
                        misses += 1
                    total += 1
            return Fraction(misses, total)

        def split_file(self, filepath):
            # Alternate lines between a validation file and a training file.
            ext = os.path.splitext(filepath)[1]
            counter = 0
            training_filename = 'tests/fixtures/training%s' % ext
            validation_filename = 'tests/fixtures/validation%s' % ext
            with io.open(filepath, errors='ignore') as input_file:
                with io.open(validation_filename, 'w') as val_file:
                    with io.open(training_filename, 'w') as train_file:
                        for line in input_file:
                            if counter % 2 == 0:
                                val_file.write(line)
                            else:
                                train_file.write(line)
                            counter += 1
            return {'training': training_filename,
                    'validation': validation_filename}

What this test does is validate that our model has a high enough accuracy to be useful.
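The same hold-out idea can be sketched without files or a trained model. Everything below is illustrative: `toy_classify` is a hypothetical stand-in for a trained classifier, the review lines are invented, and the helper names are not part of the chapter's code. Note that the sketch pools raw miss counts before building the Fraction so that the overall error rate stays exact.

```python
from fractions import Fraction


def split_alternating(lines):
    """Even-indexed lines go to validation, odd-indexed lines to training."""
    return {"training": lines[1::2], "validation": lines[0::2]}


def count_misses(classify, lines, expected):
    """Return (misses, total) for one class of labeled lines."""
    misses = sum(1 for line in lines if classify(line) != expected)
    return misses, len(lines)


def toy_classify(line):
    # Hypothetical rule: anything mentioning "bad" is negative.
    return "negative" if "bad" in line else "positive"


negative_lines = ["bad plot", "bad acting", "dull", "really bad"]
positive_lines = ["great fun", "loved it", "not bad at all", "charming"]

neg_misses, neg_total = count_misses(toy_classify, negative_lines, "negative")
pos_misses, pos_total = count_misses(toy_classify, positive_lines, "positive")

error = Fraction(neg_misses + pos_misses, neg_total + pos_total)
print(error, "acceptable" if error < 0.35 else "too high")
```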
Now we need to write our SentimentClassifier, which involves building a class that will respond to:
build
This class method will build a SentimentClassifier from files instead of a CorpusSet.
present_answer
This method will take the numerical representation and output something useful to the end user.
c
This returns the C parameter, which controls how much slack the SVM allows.
reset_model
This resets the model.
words
This returns words.
fit_model
This does the heavy lifting and calls into scikit-learn’s SVM implementation.
classify
This method classifies whether the string is negative or positive sentiment.

    import io
    import os

    from numpy import ndarray
    from sklearn import svm

    from .corpus import Corpus
    from .corpus_set import CorpusSet


    class SentimentClassifier(object):
        ext_to_sentiment = {'.pos': 'positive', '.neg': 'negative'}
        number_to_sentiment = {-1: 'negative', 1: 'positive'}

        @classmethod
        def present_answer(cls, answer):
            # predict() returns an array, so unwrap the single prediction.
            if isinstance(answer, ndarray):
                answer = answer[0]
            return cls.number_to_sentiment[answer]

        @classmethod
        def build(cls, files):
            # The file extension (.pos or .neg) determines the sentiment.
            corpora = []
            for file in files:
                ext = os.path.splitext(file)[1]
                corpus = Corpus(io.open(file, errors='ignore'),
                                cls.ext_to_sentiment[ext])
                corpora.append(corpus)
            corpus_set = CorpusSet(corpora)
            return SentimentClassifier(corpus_set)

        def __init__(self, corpus_set):
            self._trained = False
            self._corpus_set = corpus_set
            self._c = 2 ** 7
            self._model = None

        @property
        def c(self):
            return self._c

        @c.setter
        def c(self, cc):
            self._c = cc

        def reset_model(self):
            self._model = None

        def words(self):
            return self._corpus_set.words

        def classify(self, string):
            # Train lazily on first use.
            if self._model is None:
                self._model = self.fit_model()
            prediction = self._model.predict(
                self._corpus_set.feature_vector(string))
            return self.present_answer(prediction)

        def fit_model(self):
            self._corpus_set.calculate_sparse_vectors()
            y_vec = self._corpus_set.yes
            x_mat = self._corpus_set.xes
            # gamma has no effect with the linear kernel.
            clf = svm.SVC(C=self.c,
                          cache_size=1000,
                          gamma=1.0 / len(y_vec),
                          kernel='linear',
                          tol=0.001)
            clf.fit(x_mat, y_vec)
            return clf

Up until this point we have discussed how to build the model but not about how to tune or verify the model. This is where a confusion matrix, precision, recall, and sensitivity analysis come into play.
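As a reminder of how those measures are computed for our two sentiment codes, here is a small sketch. The label lists are invented, and treating 1 (positive sentiment) as the "positive" class for precision and recall is a choice made for the example.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))


def precision_recall(matrix, positive=1, negative=-1):
    true_pos = matrix[(positive, positive)]
    false_pos = matrix[(negative, positive)]
    false_neg = matrix[(positive, negative)]
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical labels for eight validation sentences.
actual = [1, 1, 1, 1, -1, -1, -1, -1]
predicted = [1, 1, 1, -1, 1, -1, -1, -1]

matrix = confusion_matrix(actual, predicted)
precision, recall = precision_recall(matrix)
print(dict(matrix))
print(precision, recall)
```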
Now that we have a model that calculates sentiment from text, there’s an additional issue of how to take multiple tickets per customer and map them to one measure of sentiment. There are a few ways of doing this:
Mode
Average (which would yield a score between –1 and 1)
Exponential moving average
Each has benefits and downsides, so to explain the differences, let’s take an example of a few customers with different sentiments (Table 7-1).
Sequence number | Alice | Bob | Terry |
---|---|---|---|
1 | 1 | –1 | 1 |
2 | 1 | –1 | 1 |
3 | 1 | –1 | 1 |
4 | 1 | –1 | –1 |
5 | 1 | –1 | –1 |
6 | 1 | –1 | 1 |
7 | –1 | –1 | 1 |
8 | –1 | 1 | 1 |
9 | –1 | 1 | 1 |
10 | –1 | 1 | 1 |
In general you can expect customers to change their minds over time. Alice was positive to begin with but became negative in her sentiment. Bob was negative in the beginning but became positive towards the end, and Terry was mostly positive but had some negative sentiment in there.
This brings up an interesting implementation detail. If we map these data to either a mode or an average, then we will heavily weight old sentiment that is no longer relevant. Alice is unhappy right now, while Bob is happy right now.
Mode and average are both fast to implement, but there is another method called the exponentially weighted moving average, or EWMA for short.
Exponential moving averages are used heavily in finance since they weight recent data much more heavily than old data. Things change quickly in finance, and people can change as well. Unlike a normal average, which gives every observation the same weight of 1⁄N, this approach bases the weights on a free parameter α, which tunes how much weight the most recent observation gets relative to the past.
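So instead of the formula for a simple average being:

$$\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i$$

we would use the formula:

$$\bar{Y}_t = \alpha \left( Y_t + (1 - \alpha) Y_{t-1} + (1 - \alpha)^2 Y_{t-2} + \cdots \right)$$

where $Y_t$ is the most recent observation. This can be transformed into a recursive formula:

$$\bar{Y}_t = \alpha Y_t + (1 - \alpha) \bar{Y}_{t-1}$$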
Getting back to our original question of how to implement this, let’s look at the mode, average, and EWMA together (Table 7-2).
Name | Mode | Average | EWMA (α = 0.94) |
---|---|---|---|
Alice | 1 | 0.2 | –0.99997408 |
Bob | –1 | –0.4 | 0.999568 |
Terry | 1 | 0.6 | 0.99999845 |
As you can see EWMA maps our customers much better than a plain average or mode does. Alice is negative right now, Bob is positive now, and Terry has always been mostly positive.
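The three summaries can be computed in a few lines. This sketch assumes the recursive form of the EWMA, in which α multiplies the newest value, with the first observation used as the starting point; the function name `ewma` is a label chosen for the example. The sequences are the ones from Table 7-1.

```python
from statistics import mean, mode


def ewma(values, alpha=0.94):
    """Start at the first value, then blend in each newer value with weight alpha."""
    smoothed = values[0]
    for value in values[1:]:
        smoothed = alpha * value + (1 - alpha) * smoothed
    return smoothed


# Sentiment sequences from Table 7-1, oldest first.
customers = {
    "Alice": [1, 1, 1, 1, 1, 1, -1, -1, -1, -1],
    "Bob": [-1, -1, -1, -1, -1, -1, -1, 1, 1, 1],
    "Terry": [1, 1, 1, -1, -1, 1, 1, 1, 1, 1],
}

summary = {name: (mode(seq), mean(seq), ewma(seq))
           for name, seq in customers.items()}

for name, (m, avg, smoothed) in summary.items():
    print(f"{name}: mode={m}, average={avg}, ewma={smoothed:.8f}")
```

Lowering alpha makes the summary react more slowly, so a single bad ticket moves the score less.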
We’ve been able to build a model that takes textual data and splits it into two sentiment categories, either positive or negative. This is great! But it doesn’t quite solve our problem, which originally was determining whether our customers were unhappy or not.
There is a certain amount of bias that one needs to avoid here: just because we have been able to assign a sentiment to a given piece of text doesn’t mean that we can tell whether the customer is happy or not. Correlation isn’t causation, as they say.
But what can we do instead?
We can learn from this and understand our customers better, and also feed this data into other important algorithms, such as whether sentiment of text is correlated with more value from the customer or not (e.g., Figure 7-9).
This information is useful to running a business and improves our understanding of the data.
The SVM algorithm is very well suited to classifying two separable classes. It can be modified to separate more than two classes and doesn’t suffer from the curse of dimensionality that KNN does. This chapter taught you how SVM can be used to separate happy and unhappy customers, as well as how to assign sentiment to movie data.
But more importantly, we’ve thought about how to go about testing our intuition of whether happy customers yield more money for our business.
1 Gaurangi Patil et al., “Sentiment Analysis Using Support Vector Machine,” International Journal of Innovative Research in Computer and Communication Engineering 2, no. 1 (2014), http://ijircce.com/upload/2014/january/16K_Sentiment.pdf.