Preprocessing a single word 

The p.single method processes a single word. It returns the ID of the word and a flag indicating whether the word should be added to the list of words that make up the tweet. It is defined as follows:

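func (p *processor) single(a string) (wordID int, ok bool) {
	word := strings.ToLower(a)

	// Stop words are dropped: no ID is assigned and the word is skipped.
	if _, ok = stopwords[word]; ok {
		return -1, false
	}

	// Hashtags, mentions and URLs are each replaced by a placeholder token.
	if strings.HasPrefix(word, "#") {
		return p.corpus.Add(hashtag), true
	}
	if strings.HasPrefix(word, "@") {
		return p.corpus.Add(mention), true
	}
	// Note: only links that start with http:// are matched here.
	if strings.HasPrefix(word, "http://") {
		return p.corpus.Add(url), true
	}

	// The retweet token is added to the corpus but left out of the tweet's word list.
	if isRT(word) {
		return p.corpus.Add(retweet), false
	}

	return p.corpus.Add(word), true
}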
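
To see how the two return values are meant to be used, here is a minimal, self-contained sketch. Everything around single is a stand-in invented for illustration: the corpus type, the three-word stopwords set, the placeholder token values, the isRT helper and the tweetIDs method are all assumptions, and the real definitions in the project may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical placeholder tokens; the project's real values may differ.
const (
	hashtag = "<HASHTAG>"
	mention = "<MENTION>"
	url     = "<URL>"
	retweet = "<RETWEET>"
)

// stopwords is a tiny illustrative set, not the project's actual list.
var stopwords = map[string]struct{}{"the": {}, "a": {}, "is": {}}

// corpus is a stand-in that maps each distinct word to an integer ID.
type corpus struct {
	ids map[string]int
}

// Add returns the ID of word, assigning the next free ID the first time it is seen.
func (c *corpus) Add(word string) int {
	if id, ok := c.ids[word]; ok {
		return id
	}
	id := len(c.ids)
	c.ids[word] = id
	return id
}

type processor struct{ corpus *corpus }

// isRT is an assumed helper: it treats the bare token "rt" as a retweet marker.
func isRT(word string) bool { return word == "rt" }

// single is the method from the text, unchanged in behavior.
func (p *processor) single(a string) (wordID int, ok bool) {
	word := strings.ToLower(a)
	if _, ok = stopwords[word]; ok {
		return -1, false
	}
	if strings.HasPrefix(word, "#") {
		return p.corpus.Add(hashtag), true
	}
	if strings.HasPrefix(word, "@") {
		return p.corpus.Add(mention), true
	}
	if strings.HasPrefix(word, "http://") {
		return p.corpus.Add(url), true
	}
	if isRT(word) {
		return p.corpus.Add(retweet), false
	}
	return p.corpus.Add(word), true
}

// tweetIDs (hypothetical) splits a tweet on whitespace and keeps only
// the IDs that single flags with ok == true.
func (p *processor) tweetIDs(tweet string) []int {
	var ids []int
	for _, w := range strings.Fields(tweet) {
		if id, ok := p.single(w); ok {
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	p := &processor{corpus: &corpus{ids: make(map[string]int)}}
	fmt.Println(p.tweetIDs("RT @gopher the #golang release is great"))
	// prints [1 2 3 4]
}
```

With these stand-ins, the retweet marker receives ID 0 in the corpus but is left out of the result, and the stop words "the" and "is" never receive an ID at all.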

We start by making the word lowercase. This makes words such as café and Café equivalent.

Speaking of café, what happens if two tweets mention a café, but one user writes café and the other writes cafe? Assuming they both refer to the same thing, we'd need some form of normalization to tell us that they're the same.
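
As a rough illustration of what such a normalization step could look like, the sketch below folds a small, hand-picked set of accented letters to their unaccented forms using only the standard library. The normalize function and the accentFolder table are hypothetical and deliberately incomplete. Proper Unicode normalization is not part of Go's standard library, so a fuller solution would likely rely on a dedicated package instead of a hand-written table.

```go
package main

import (
	"fmt"
	"strings"
)

// accentFolder maps a few precomposed accented letters to unaccented ones
// and drops a few combining marks. The table is illustrative, not complete.
var accentFolder = strings.NewReplacer(
	"á", "a", "à", "a", "â", "a", "ä", "a",
	"é", "e", "è", "e", "ê", "e", "ë", "e",
	"í", "i", "ì", "i", "î", "i", "ï", "i",
	"ó", "o", "ò", "o", "ô", "o", "ö", "o",
	"ú", "u", "ù", "u", "û", "u", "ü", "u",
	"\u0300", "", // combining grave accent
	"\u0301", "", // combining acute accent
	"\u0302", "", // combining circumflex accent
	"\u0308", "", // combining diaeresis
)

// normalize lowercases a word and then folds the accents listed above.
func normalize(word string) string {
	return accentFolder.Replace(strings.ToLower(word))
}

func main() {
	fmt.Println(normalize("Café") == normalize("cafe")) // prints true
}
```

Lowercasing first keeps the table small, since only the lowercase forms need an entry.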
