Cleaning tweets

New constraints lead to new forms. Twitter is no exception in this regard. Because the text has to fit into 140 characters, people naturally develop new language shortcuts to say the same thing in fewer characters. So far, we have ignored the diverse emoticons and abbreviations. Let's see how much we can improve by taking them into account. For this endeavor, we will have to provide our own preprocessor() to TfidfVectorizer.

First, we define a range of frequent emoticons and their replacements in a dictionary. Although we could find more distinct replacements, we go with obvious positive or negative words to help the classifier:

emo_repl = {
    # positive emoticons
    "<3": " good ",
    ":d": " good ",   # :D in lower case
    ":dd": " good ",  # :DD in lower case
    "8)": " good ",
    ":-)": " good ",
    ":)": " good ",
    ";)": " good ",
    "(-:": " good ",
    "(:": " good ",

    # negative emoticons
    ":/": " bad ",
    ":>": " sad ",
    ":')": " sad ",
    ":-(": " bad ",
    ":(": " bad ",
    ":s": " bad ",    # :S in lower case, since the tweet is lowercased first
    ":-s": " bad ",   # :-S in lower case
    }

# make sure that e.g. :dd is replaced before :d
emo_repl_order = sorted(emo_repl, key=len, reverse=True)
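
As a quick sanity check (a hypothetical snippet, not part of the pipeline), we can confirm that the longer emoticon wins:

# hypothetical example: the longest match, :dd, is applied first
tweet = "that was fun :dd"
for k in emo_repl_order:
    tweet = tweet.replace(k, emo_repl[k])
print(tweet)  # 'that was fun  good '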

Then, we define abbreviations as regular expressions together with their expansions (\b marks the word boundary):

re_repl = {
    r"\br\b": "are",
    r"\bu\b": "you",
    r"\bhaha\b": "ha",
    r"\bhahaha\b": "ha",
    r"\bdon't\b": "do not",
    r"\bdoesn't\b": "does not",
    r"\bdidn't\b": "did not",
    r"\bhasn't\b": "has not",
    r"\bhaven't\b": "have not",
    r"\bhadn't\b": "had not",
    r"\bwon't\b": "will not",
    r"\bwouldn't\b": "would not",
    r"\bcan't\b": "can not",
    r"\bcannot\b": "can not",
    }
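
The word boundaries are what keep the single-letter patterns from firing inside ordinary words. A small illustrative check (the sample tweet is made up):

import re

tweet = "u r right, your turn"
for r, repl in re_repl.items():
    tweet = re.sub(r, repl, tweet)
print(tweet)  # 'you are right, your turn' -- "your" and "turn" are untouched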

import re

from sklearn.feature_extraction.text import TfidfVectorizer

def create_ngram_model(params=None):
    def preprocessor(tweet):
        tweet = tweet.lower()

        # replace emoticons first, longest patterns before shorter ones
        for k in emo_repl_order:
            tweet = tweet.replace(k, emo_repl[k])
        # then expand the abbreviations
        for r, repl in re_repl.items():
            tweet = re.sub(r, repl, tweet)

        return tweet

    tfidf_ngrams = TfidfVectorizer(preprocessor=preprocessor,
                                   analyzer="word")
    # ...
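
To see the whole chain in action, here is a standalone sketch of the same cleaning steps applied to a made-up tweet, reusing the definitions from above (the expected output is shown in the comment; the doubled spaces left by the emoticon replacements are harmless for a word-level tokenizer):

tweet = "U r gonna love it :D don't miss it <3".lower()
for k in emo_repl_order:
    tweet = tweet.replace(k, emo_repl[k])
for r, repl in re_repl.items():
    tweet = re.sub(r, repl, tweet)
print(tweet)
# 'you are gonna love it  good  do not miss it  good '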

Certainly, there are many more abbreviations that could be used here. But even with this limited set, we get an improvement of half a point for sentiment versus non-sentiment, which now comes to 70.7 percent (each line below lists the mean and standard deviation of the accuracy, followed by the mean and standard deviation of the precision/recall AUC):

== Pos vs. neg ==
0.804    0.022    0.886    0.011    
== Pos/neg vs. irrelevant/neutral ==
0.797    0.009    0.707    0.029    
== Pos vs. rest ==
0.884    0.005    0.527    0.025    
== Neg vs. rest ==
0.886    0.011    0.640    0.032    