Cleaning tweets

New constraints lead to new forms. Twitter is no exception in this regard. Because the text has to fit into 140 characters, people naturally develop new language shortcuts to say the same thing in fewer characters. So far, we have ignored the diverse emoticons and abbreviations. Let's see how much we can improve by taking them into account. For this endeavor, we will have to provide our own preprocessor() to TfidfVectorizer.

First, we define a range of frequent emoticons and their replacements in a dictionary. Although we could find more distinct replacements, we go with obvious positive or negative words to help the classifier:

emo_repl = {
    # positive emoticons
    "<3": " good ",
    ":d": " good ",   # :D in lower case
    ":dd": " good ",  # :DD in lower case
    "8)": " good ",
    ":-)": " good ",
    ":)": " good ",
    ";)": " good ",
    "(-:": " good ",
    "(:": " good ",

    # negative emoticons
    ":/": " bad ",
    ":>": " sad ",
    ":')": " sad ",
    ":-(": " bad ",
    ":(": " bad ",
    ":s": " bad ",    # :S in lower case, since the tweet is lowercased first
    ":-s": " bad ",   # :-S in lower case
    }

# make sure that e.g. :dd is replaced before :d
emo_repl_order = sorted(emo_repl, key=len, reverse=True)
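
As a quick sanity check (a hypothetical snippet, not part of the pipeline), we can confirm that the longer emoticon wins:

# hypothetical example: the longest match, :dd, is applied first
tweet = "that was fun :dd"
for k in emo_repl_order:
    tweet = tweet.replace(k, emo_repl[k])
print(tweet)  # 'that was fun  good '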

Then, we define abbreviations as regular expressions together with their expansions (\b marks the word boundary):

re_repl = {
    r"\br\b": "are",
    r"\bu\b": "you",
    r"\bhaha\b": "ha",
    r"\bhahaha\b": "ha",
    r"\bdon't\b": "do not",
    r"\bdoesn't\b": "does not",
    r"\bdidn't\b": "did not",
    r"\bhasn't\b": "has not",
    r"\bhaven't\b": "have not",
    r"\bhadn't\b": "had not",
    r"\bwon't\b": "will not",
    r"\bwouldn't\b": "would not",
    r"\bcan't\b": "can not",
    r"\bcannot\b": "can not",
    }
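
The word boundaries are what keep the single-letter patterns from firing inside ordinary words. A small illustrative check (the sample tweet is made up):

import re

tweet = "u r right, your turn"
for r, repl in re_repl.items():
    tweet = re.sub(r, repl, tweet)
print(tweet)  # 'you are right, your turn' -- "your" and "turn" are untouched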

import re

from sklearn.feature_extraction.text import TfidfVectorizer

def create_ngram_model(params=None):
    def preprocessor(tweet):
        tweet = tweet.lower()

        # replace emoticons first, longest patterns before shorter ones
        for k in emo_repl_order:
            tweet = tweet.replace(k, emo_repl[k])
        # then expand the abbreviations
        for r, repl in re_repl.items():
            tweet = re.sub(r, repl, tweet)

        return tweet

    tfidf_ngrams = TfidfVectorizer(preprocessor=preprocessor,
                                   analyzer="word")
    # ...
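
To see the whole chain in action, here is a standalone sketch of the same cleaning steps applied to a made-up tweet, reusing the definitions from above (the expected output is shown in the comment; the doubled spaces left by the emoticon replacements are harmless for a word-level tokenizer):

tweet = "U r gonna love it :D don't miss it <3".lower()
for k in emo_repl_order:
    tweet = tweet.replace(k, emo_repl[k])
for r, repl in re_repl.items():
    tweet = re.sub(r, repl, tweet)
print(tweet)
# 'you are gonna love it  good  do not miss it  good '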

Certainly, there are many more abbreviations that could be used here. But even with this limited set, we get an improvement of half a point for sentiment versus non-sentiment, which now comes to 70.7 percent (each line below lists the mean and standard deviation of the accuracy, followed by the mean and standard deviation of the precision/recall AUC):

== Pos vs. neg ==
0.804    0.022    0.886    0.011    
== Pos/neg vs. irrelevant/neutral ==
0.797    0.009    0.707    0.029    
== Pos vs. rest ==
0.884    0.005    0.527    0.025    
== Neg vs. rest ==
0.886    0.011    0.640    0.032    