Taking the word types into account

So far, we have used each word independently of the others, hoping that a bag-of-words approach would suffice. Intuitively, however, neutral tweets probably contain a higher fraction of nouns, while positive or negative tweets are more colorful, requiring more adjectives and verbs. What if we could use this linguistic information of the tweets as well? If we could find out how many words in a tweet were nouns, verbs, adjectives, and so on, the classifier could maybe take that into account as well.

Determining the word types

Determining the word types is what part-of-speech (POS) tagging is all about. A POS tagger parses a full sentence with the goal of arranging it into a dependency tree, where each node corresponds to a word and the parent-child relationship determines which word it depends on. With this tree, it can then make more informed decisions; for example, whether the word "book" is a noun ("This is a good book.") or a verb ("Could you please book the flight?").

You might have already guessed that NLTK will play a role in this area as well. And indeed, it comes readily packaged with all sorts of parsers and taggers. The POS tagger we will use, nltk.pos_tag(), is actually a full-blown classifier trained on manually annotated sentences from the Penn Treebank Project (http://www.cis.upenn.edu/~treebank). It takes a list of word tokens as input and outputs a list of tuples, each containing a token of the original sentence and its part-of-speech tag:

>>> import nltk
>>> nltk.pos_tag(nltk.word_tokenize("This is a good book."))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('book', 'NN'), ('.', '.')]
>>> nltk.pos_tag(nltk.word_tokenize("Could you please book the flight?"))
[('Could', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('book', 'NN'), ('the', 'DT'), ('flight', 'NN'), ('?', '.')]

The POS tag abbreviations are taken from the Penn Treebank Project (adapted from http://americannationalcorpus.org/OANC/penn.html):

POS tag   Description                                    Example
CC        coordinating conjunction                       or
CD        cardinal number                                2, second
DT        determiner                                     the
EX        existential there                              there are
FW        foreign word                                   kindergarten
IN        preposition/subordinating conjunction          on, of, like
JJ        adjective                                      cool
JJR       adjective, comparative                         cooler
JJS       adjective, superlative                         coolest
LS        list marker                                    1)
MD        modal                                          could, will
NN        noun, singular or mass                         book
NNS       noun, plural                                   books
NNP       proper noun, singular                          Sean
NNPS      proper noun, plural                            Vikings
PDT       predeterminer                                  both the boys
POS       possessive ending                              friend's
PRP       personal pronoun                               I, he, it
PRP$      possessive pronoun                             my, his
RB        adverb                                         however, usually, naturally, here, good
RBR       adverb, comparative                            better
RBS       adverb, superlative                            best
RP        particle                                       give up
TO        to                                             to go, to him
UH        interjection                                   uhhuhhuhh
VB        verb, base form                                take
VBD       verb, past tense                               took
VBG       verb, gerund/present participle                taking
VBN       verb, past participle                          taken
VBP       verb, present, non-third person singular       take
VBZ       verb, present, third person singular           takes
WDT       wh-determiner                                  which
WP        wh-pronoun                                     who, what
WP$       possessive wh-pronoun                          whose
WRB       wh-adverb                                      where, when

With these tags, it is pretty easy to filter the desired word types from the output of pos_tag(). We simply have to count all the words whose tags start with NN for nouns, VB for verbs, JJ for adjectives, and RB for adverbs, as in the following sketch.
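
For illustration, here is a minimal sketch of that counting; count_word_types() is a hypothetical helper, not part of the chapter's code:

import nltk

def count_word_types(text):
    # count tokens per word type, based on the tag prefix
    counts = {"nouns": 0, "verbs": 0, "adjectives": 0, "adverbs": 0}
    for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
        if tag.startswith("NN"):
            counts["nouns"] += 1
        elif tag.startswith("VB"):
            counts["verbs"] += 1
        elif tag.startswith("JJ"):
            counts["adjectives"] += 1
        elif tag.startswith("RB"):
            counts["adverbs"] += 1
    return counts

>>> count_word_types("This is a good book.")
{'nouns': 1, 'verbs': 1, 'adjectives': 1, 'adverbs': 0}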

Successfully cheating using SentiWordNet

While the linguistic information that we discussed earlier will most likely help us, there is something better we can add on top of it: SentiWordNet (http://sentiwordnet.isti.cnr.it). Simply put, it is a 13 MB file that assigns a positive and a negative value to most English words. More precisely, for every synonym set (synset), it records both a positive and a negative sentiment score. Some examples are as follows:

POS   ID         PosScore   NegScore   SynsetTerms             Description
a     00311354   0.25       0.125      studious#1              Marked by care and effort; "made a studious attempt to fix the television set"
a     00311663   0          0.5        careless#1              Marked by lack of attention or consideration or forethought or thoroughness; not careful
n     03563710   0          0          implant#1               A prosthesis placed permanently in tissue
v     00362128   0          0          kink#2 curve#5 curl#1   Form a curl, curve, or kink; "the cigar smoke curled up at the ceiling"

With the information in the POS column, we will be able to distinguish between the noun "book" and the verb "book". PosScore and NegScore together determine the neutrality of the word, which is 1-PosScore-NegScore; for the "studious" entry above, that is 1 - 0.25 - 0.125 = 0.625. SynsetTerms lists all words in the set that are synonyms. The ID and Description columns can be safely ignored for our purpose.

The synset terms have a number appended, because some occur multiple times in different synsets. For example, "fantasize" conveys two quite different meanings, also leading to different scores:

POS   ID         PosScore   NegScore   SynsetTerms                         Description
v     01636859   0.375      0          fantasize#2 fantasise#2             Portray in the mind; "he is fantasizing the ideal wife"
v     01637368   0          0.125      fantasy#1 fantasize#1 fantasise#1   Indulge in fantasies; "he is fantasizing when he says that he plans to start his own company"

To find out which of the synsets to take, we would have to really understand the meaning of the tweets, which is beyond the scope of this chapter. The field of research that focuses on this challenge is called word sense disambiguation. For our task, we take the easy route and simply average the scores over all the synsets in which a term is found. For "fantasize", PosScore would be 0.1875 and NegScore would be 0.0625.
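
As a quick check of that arithmetic, the average is simply a column-wise mean over the two synset rows shown above:

>>> import numpy as np
>>> np.mean([[0.375, 0], [0, 0.125]], axis=0)  # (PosScore, NegScore)
array([0.1875, 0.0625])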

The following function, load_sent_word_net(), does all that for us, and returns a dictionary where the keys are strings of the form "word type/word", for example "n/implant", and the values are the positive and negative scores:

import os, csv, collections
import numpy as np

def load_sent_word_net():
    # collect all (PosScore, NegScore) tuples per "POS/term" key
    sent_scores = collections.defaultdict(list)

    with open(os.path.join(DATA_DIR,
              "SentiWordNet_3.0.0_20130122.txt"), "r") as csvfile:
        reader = csv.reader(csvfile, delimiter='\t', quotechar='"')
        for line in reader:
            if line[0].startswith("#"):
                continue
            if len(line) == 1:
                continue

            POS, ID, PosScore, NegScore, SynsetTerms, Gloss = line
            if len(POS) == 0 or len(ID) == 0:
                continue

            for term in SynsetTerms.split(" "):
                # drop the number at the end of every term
                term = term.split("#")[0]
                term = term.replace("-", " ").replace("_", " ")
                key = "%s/%s" % (POS, term)
                sent_scores[key].append((float(PosScore),
                                         float(NegScore)))

    # average the scores over all synsets a term occurs in
    for key, value in sent_scores.items():
        sent_scores[key] = np.mean(value, axis=0)

    return sent_scores
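
Assuming the SentiWordNet file is in place under DATA_DIR, a lookup then works as follows (the exact numbers depend on the file version; these match the two "fantasize" synsets shown above):

>>> sent_scores = load_sent_word_net()
>>> sent_scores["v/fantasize"]
array([0.1875, 0.0625])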

Our first estimator

Now we have everything in place to create our first vectorizer. The most convenient way to do so is to inherit from BaseEstimator. It requires us to implement the following three methods:

  • get_feature_names(): This returns a list of strings of the features that we will return in transform().
  • fit(documents, y=None): As we are not implementing a classifier, we can ignore this one and simply return self.
  • transform(documents): This returns a numpy.array() of shape (len(documents), len(get_feature_names())). That is, for every document in documents, it has to return a value for every feature name in get_feature_names().

Let us now implement these methods:

import nltk
from sklearn.base import BaseEstimator

sent_word_net = load_sent_word_net()

class LinguisticVectorizer(BaseEstimator):
    def get_feature_names(self):
        return np.array(['sent_neut', 'sent_pos', 'sent_neg',
         'nouns', 'adjectives', 'verbs', 'adverbs',
         'allcaps', 'exclamation', 'question', 'hashtag','mentioning'])

    # we don't fit here but need to return the reference
    # so that it can be used like fit(d).transform(d)
    def fit(self, documents, y=None):
        return self

    def _get_sentiments(self, d):
        # tokenize the document and POS-tag the tokens
        sent = tuple(d.split())
        tagged = nltk.pos_tag(sent)

        pos_vals = []
        neg_vals = []

        nouns = 0.
        adjectives = 0.
        verbs = 0.
        adverbs = 0.

        # map the Penn Treebank tags to SentiWordNet's POS letters
        # (n, a, v, r) and count the word types along the way
        for w, t in tagged:
            p, n = 0, 0
            sent_pos_type = None
            if t.startswith("NN"):
                sent_pos_type = "n"
                nouns += 1
            elif t.startswith("JJ"):
                sent_pos_type = "a"
                adjectives += 1
            elif t.startswith("VB"):
                sent_pos_type = "v"
                verbs += 1
            elif t.startswith("RB"):
                sent_pos_type = "r"
                adverbs += 1

            if sent_pos_type is not None:
                sent_word = "%s/%s"%(sent_pos_type, w)

                if sent_word in sent_word_net:
                    p,n = sent_word_net[sent_word]

            pos_vals.append(p)
            neg_vals.append(n)

        l = len(sent)
        avg_pos_val = np.mean(pos_vals)
        avg_neg_val = np.mean(neg_vals)

        # neutrality, positivity, negativity, and the word type
        # fractions relative to the tweet length
        return [1 - avg_pos_val - avg_neg_val,
                avg_pos_val, avg_neg_val,
                nouns/l, adjectives/l, verbs/l, adverbs/l]



    def transform(self, documents):
        obj_val, pos_val, neg_val, nouns, adjectives, \
            verbs, adverbs = np.array([self._get_sentiments(d)
                                       for d in documents]).T

        allcaps = []
        exclamation = []
        question = []
        hashtag = []
        mentioning = []

        for d in documents:
            # count fully capitalized words longer than two characters
            allcaps.append(np.sum([t.isupper()
                                   for t in d.split() if len(t) > 2]))

            exclamation.append(d.count("!"))
            question.append(d.count("?"))
            hashtag.append(d.count("#"))
            mentioning.append(d.count("@"))

        result = np.array([obj_val, pos_val, neg_val, 
                           nouns, adjectives, verbs, adverbs, 
                           allcaps, exclamation, question, 
                           hashtag, mentioning]).T

        return result
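
A quick sanity check on two made-up tweets (a sketch, not part of the chapter's code) confirms that we get one row per document and one column per feature name:

>>> tweets = ["This is a good book!", "Could you please book the flight?"]
>>> vect = LinguisticVectorizer()
>>> vect.fit(tweets).transform(tweets).shape
(2, 12)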

Putting everything together

Nevertheless, using these linguistic features in isolation, without the words themselves, will not take us very far. Therefore, we have to combine TfidfVectorizer with the linguistic features. This can be done with scikit-learn's FeatureUnion class. It is initialized in the same way as Pipeline, but instead of evaluating the estimators in sequence, with each passing the output of the previous one to the next, FeatureUnion evaluates them in parallel and joins the output vectors afterwards:

import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, FeatureUnion

def create_union_model(params=None):
    def preprocessor(tweet):
        tweet = tweet.lower()

        # emo_repl_order, emo_repl, and re_repl are the emoticon
        # and regex replacement maps defined earlier in this chapter
        for k in emo_repl_order:
            tweet = tweet.replace(k, emo_repl[k])
        for r, repl in re_repl.items():
            tweet = re.sub(r, repl, tweet)

        return tweet.replace("-", " ").replace("_", " ")

    tfidf_ngrams = TfidfVectorizer(preprocessor=preprocessor,
                                   analyzer="word")
    ling_stats = LinguisticVectorizer()
    all_features = FeatureUnion([('ling', ling_stats), ('tfidf', tfidf_ngrams)])
    clf = MultinomialNB()
    pipeline = Pipeline([('all', all_features), ('clf', clf)])

    if params:
        pipeline.set_params(**params)

    return pipeline
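
A usage sketch, assuming X and Y hold the raw tweet strings and their labels as prepared in the earlier sections:

pipeline = create_union_model()
pipeline.fit(X, Y)
# predict on new, unseen tweets
pipeline.predict(["I love this book!"])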

Training and testing on the combined featurizers gives another 0.6 percent improvement on positive versus negative:

== Pos vs. neg ==
0.808    0.016    0.892    0.010    
== Pos/neg vs. irrelevant/neutral ==
0.794    0.009    0.707    0.033    
== Pos vs. rest ==
0.886    0.006    0.533    0.026    
== Neg vs. rest ==
0.881    0.012    0.629    0.037

With these results, we probably do not want to use the positive versus rest and negative versus rest classifiers. Instead, we would first use a classifier to determine whether the tweet contains sentiment at all ("pos/neg versus irrelevant/neutral") and then, if it does, use the positive versus negative classifier to determine the actual sentiment, as sketched below.
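
Such a two-stage scheme could look as follows; the two fitted pipelines and the label encoding (1 meaning "contains sentiment" and "positive", respectively) are assumptions, not the chapter's code:

def predict_sentiment(tweet, sent_vs_rest, pos_vs_neg):
    # stage 1: does the tweet carry any sentiment at all?
    if sent_vs_rest.predict([tweet])[0] != 1:
        return "irrelevant/neutral"
    # stage 2: if so, decide on the polarity
    return "positive" if pos_vs_neg.predict([tweet])[0] == 1 else "negative"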
