Tweaking the preprocessing step 

One thing you may note is that the preprocessing of the tweets is fairly minimal, and some of the rules are odd. For example, all hashtags are treated as one token, as are all links and mentions. When this project started, that seemed like a good idea. There is no justification beyond that; one always needs a springboard from which to jump off in any project, and a flimsy excuse at that point is as good as any other.

Nonetheless, I have since tweaked my preprocessing steps. These are the functions that I finally settled on. Do observe the differences between these and the originals listed in the previous sections:

 var nl = regexp.MustCompile("\n+")
 var ht = regexp.MustCompile("&.+?;")

 func (p *processor) single(word string) (wordID int, ok bool) {
     if _, ok = stopwords[word]; ok {
         return -1, false
     }
     switch {
     case strings.HasPrefix(word, "#"):
         word = strings.TrimPrefix(word, "#")
     case word == "@":
         return -1, false // at is a stop word!
     case strings.HasPrefix(word, "http"):
         return -1, false
     }
     if word == "rt" {
         return -1, false
     }
     return p.corpus.Add(word), true
 }

 func (p *processor) process(a []*processedTweet) []*processedTweet {
     // remove things from consideration
     i := 0
     for _, tt := range a {
         if tt.Lang == "en" {
             a[i] = tt
             i++
         }
     }
     a = a[:i]

     var err error
     for _, tt := range a {
         if tt.RetweetedStatus != nil {
             tt.Tweet = *tt.RetweetedStatus
         }
         tt.clean, _, err = transform.String(p.transformer, tt.FullText)
         dieIfErr(err)
         tt.clean = strings.ToLower(tt.clean)
         tt.clean = nl.ReplaceAllString(tt.clean, " ")
         tt.clean = ht.ReplaceAllString(tt.clean, "")
         tt.clean = stripPunct(tt.clean)
         log.Printf("%v", tt.clean)
         for _, word := range strings.Fields(tt.clean) {
             // word = corpus.Singularize(word)
             wordID, ok := p.single(word)
             if ok {
                 tt.ids = append(tt.ids, wordID)
                 tt.clean2 += " "
                 tt.clean2 += word
             }
             if word == "rt" {
                 tt.isRT = true
             }
         }
         p.tfidf.Add(tt)
         log.Printf("%v", tt.clean2)
     }
     p.tfidf.CalculateIDF()

     // calculate scores
     for _, tt := range a {
         tt.textVec = p.tfidf.Score(tt)
     }

     // normalize text vector
     size := p.corpus.Size()
     for _, tt := range a {
         tt.normTextVec = make([]float64, size)
         for i := range tt.ids {
             tt.normTextVec[tt.ids[i]] = tt.textVec[i]
         }
     }
     return a
 }

 func stripPunct(a string) string {
     const punct = `,.?;:'"!’*-“`
     return strings.Map(func(r rune) rune {
         if strings.IndexRune(punct, r) < 0 {
             return r
         }
         return -1
     }, a)
 }

The most notable thing that I have changed is that I now consider a hashtag a word, while mentions are removed entirely. As for URLs, in one of my attempts at clustering, I realized that the clustering algorithms were lumping all the tweets that contained a URL into the same cluster. That realization made me remove mentions and URLs from consideration. Hashtags, by contrast, merely have the # stripped and are then treated as if they were normal words.
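To make these rules concrete, here is a minimal standalone sketch of the per-word filtering. It assumes a toy `stopwords` set, and it broadens the mention check to a prefix test for illustration (the real `single` method works against a corpus and only special-cases a lone `@`):

```go
package main

import (
	"fmt"
	"strings"
)

// stopwords is a toy stand-in for the real stop word set.
var stopwords = map[string]struct{}{"the": {}, "a": {}, "at": {}}

// keep mirrors the spirit of single: hashtags lose their "#", while
// mentions, URLs, stop words, and "rt" are dropped entirely.
func keep(word string) (string, bool) {
	if _, ok := stopwords[word]; ok {
		return "", false
	}
	switch {
	case strings.HasPrefix(word, "#"):
		word = strings.TrimPrefix(word, "#")
	case strings.HasPrefix(word, "@"): // simplified mention check
		return "", false
	case strings.HasPrefix(word, "http"):
		return "", false
	}
	if word == "rt" {
		return "", false
	}
	return word, true
}

func main() {
	words := []string{"#golang", "@someone", "https://example.com", "rt", "clustering"}
	for _, w := range words {
		kept, ok := keep(w)
		fmt.Printf("%-20s kept=%v word=%q\n", w, ok, kept)
	}
}
```

Running this shows that only `#golang` (as `golang`) and `clustering` survive the filter.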

Furthermore, you may note that I added some quick and dirty ways to clean certain things:

 tt.clean = strings.ToLower(tt.clean)
 tt.clean = nl.ReplaceAllString(tt.clean, " ")
 tt.clean = ht.ReplaceAllString(tt.clean, "")
 tt.clean = stripPunct(tt.clean)

Here, I used regular expressions to replace runs of newlines with a single space, and to remove HTML-encoded entities (such as `&amp;`) entirely. Lastly, I stripped all punctuation.

In a more formal setting, I would use a proper lexer to handle my text. The lexer I'd use would come from Lingo (github.com/chewxy/lingo). But given that Twitter is a low-value environment, there wasn't much point in doing so. A proper lexer like the one in Lingo tags text as belonging to multiple categories, allowing for easy removal.

Another thing you might notice is that I changed the definition of what a tweet is mid-flight:

 if tt.RetweetedStatus != nil {
     tt.Tweet = *tt.RetweetedStatus
 }

This block of code says that if a tweet is a retweet, we replace the tweet with the tweet it retweets. This works for me, but it may not work for you. I personally consider any retweet to be a repetition of the original tweet, so I do not see why they should be treated separately. Additionally, Twitter allows users to comment on a retweet; if you want to include those comments, you'd have to change the logic a little bit more. Either way, I arrived at this by manually inspecting the JSON file I had saved.
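The substitution can be sketched with minimal stand-in types. The field names below mimic the Twitter API shape used in the listing, but the types themselves are simplified for illustration:

```go
package main

import "fmt"

// Tweet is a toy stand-in for the full Twitter API tweet type.
// RetweetedStatus is non-nil when this tweet is a retweet.
type Tweet struct {
	FullText        string
	RetweetedStatus *Tweet
}

func main() {
	original := Tweet{FullText: "the original tweet"}
	rt := Tweet{
		FullText:        "RT @someone: the original tweet",
		RetweetedStatus: &original,
	}

	// If the tweet is a retweet, replace it wholesale with the tweet
	// it retweets, so both count as repetitions of the same text.
	if rt.RetweetedStatus != nil {
		rt = *rt.RetweetedStatus
	}
	fmt.Println(rt.FullText) // the original tweet
}
```

Note that any quote-tweet commentary in `FullText` is discarded by this substitution, which is exactly the trade-off discussed above.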

Asking these questions and then making judgment calls is what matters in doing data science, whether in Go or any other language. It's not about blindly applying algorithms; rather, it's always driven by what the data tells you.

One last thing that you may note is this curious block of code:

 // remove things from consideration
 i := 0
 for _, tt := range a {
     if tt.Lang == "en" {
         a[i] = tt
         i++
     }
 }
 a = a[:i]

Here, I only consider English tweets. I follow many people who tweet in a variety of languages; at any given time, about 15% of the tweets on my home timeline are in French, Chinese, Japanese, or German. Clustering tweets in different languages is a whole different ball game, so I chose to omit them.
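The filtering itself uses the standard in-place slice filter idiom: it reuses the backing array of the slice rather than allocating a new one. A minimal sketch with a stand-in tweet type:

```go
package main

import "fmt"

// tweet is a toy stand-in carrying only the fields the filter needs.
type tweet struct {
	Lang string
	Text string
}

// filterEnglish keeps only tweets whose Lang is "en", compacting the
// survivors to the front of the slice and truncating, with no allocation.
func filterEnglish(a []tweet) []tweet {
	i := 0
	for _, tt := range a {
		if tt.Lang == "en" {
			a[i] = tt // overwrite in place; survivors pack to the front
			i++
		}
	}
	return a[:i]
}

func main() {
	a := []tweet{
		{"en", "hello"},
		{"fr", "bonjour"},
		{"en", "world"},
		{"de", "hallo"},
	}
	a = filterEnglish(a)
	fmt.Println(len(a)) // 2
}
```

Because the filtered slice aliases the original backing array, this idiom should only be used when the original ordering of survivors is acceptable and the input slice won't be reused afterward.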
