Preprocessing stopwords

Enough about normalization. We now turn our focus to stopwords.

Recall from Chapter 2, Linear Regression–House Price Prediction, that stopwords are words such as the, there, from, and so on. They're connective words, useful in understanding the specific context of sentences, but for a naive statistical analysis, they often add nothing more than noise. So, we have to remove them.

A check for stopwords is simple. If a word is a stopword, we return false to signal that its word ID should not be added to the sentence:

if _, ok = stopwords[word]; ok {
	return -1, false
}
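
To see the check in context, here is a minimal sketch of a lookup function built around it. The function name wordID, the tiny stopwords set, and the corpus map are all hypothetical stand-ins, not the chapter's actual code. The sketch also assumes that ok is already declared (here, as a named result), which would explain why the snippet assigns with = rather than :=.

```go
package main

import "fmt"

// stopwords is a tiny illustrative set, not the full list used in the chapter.
var stopwords = map[string]struct{}{
	"a": {}, "the": {}, "from": {},
}

// corpus is a hypothetical word-to-ID table standing in for the real one.
var corpus = map[string]int{"apple": 0, "doctor": 1}

// wordID returns the ID of word and whether it should be added to the sentence.
// ok is a named result, so the stopword check can assign to it with =.
func wordID(word string) (id int, ok bool) {
	if _, ok = stopwords[word]; ok {
		return -1, false // stopword: skip it
	}
	id, ok = corpus[word]
	if !ok {
		return -1, false // unknown word in this sketch
	}
	return id, true
}

func main() {
	for _, w := range []string{"the", "apple", "doctor"} {
		id, ok := wordID(w)
		fmt.Println(w, id, ok)
	}
}
```

Because the map's values are empty structs, the set costs nothing beyond its keys, and the comma-ok form tells us about membership without needing the value at all.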

Where does the list of stopwords come from? It's simple enough that I just wrote this in stopwords.go:

const sw = `a about above across after afterwards again against all almost alone along already also although always am among amongst amoungst amount an and another any anyhow anyone anything anyway anywhere are around as at back be became because become becomes becoming been before beforehand behind being below beside besides between beyond bill both bottom but by call can cannot can't cant co co. computer con could couldnt couldn't cry de describe detail did didn didn't didnt do does doesn doesn't doesnt doing don done down due during each eg e.g eight either eleven else elsewhere empty enough etc even ever every everyone everything everywhere except few fifteen fify fill find fire first five for former formerly forty found four from front full further get give go had has hasnt hasn't hasn have he hence her here hereafter hereby herein hereupon hers herself him himself his how however hundred i ie i.e. if in inc indeed interest into is it its itself just keep kg km last latter latterly least less ltd made make many may me meanwhile might mill mine more moreover most mostly move much must my myself name namely neither never nevertheless next nine no nobody none noone nor not nothing now nowhere of off often on once one only onto or other others otherwise our ours ourselves out over own part per perhaps please put quite rather re really regarding same say see seem seemed seeming seems serious several she should show side since sincere six sixty so some somehow someone something sometime sometimes somewhere still such system take ten than that the their them themselves then thence there thereafter thereby therefore therein thereupon these they thick thin third this those though three through throughout thru thus to together too top toward towards twelve twenty two un under unless until up upon us used using various very via was we well were what whatever when whence whenever where whereafter whereas whereby wherein whereupon wherever whether which while whither who 
whoever whole whom whose why will with within without would yet you your yours yourself yourselves`
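// stopwords is the set of words listed in sw. This file needs the "strings" import.
var stopwords = make(map[string]struct{})

func init() {
	// strings.Fields splits on any run of whitespace, so a line break inside sw is harmless.
	for _, s := range strings.Fields(sw) {
		stopwords[s] = struct{}{}
	}
}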

And that's it! A tweet whose content is "an apple a day keeps the doctor away" would have the IDs for apple, day, keeps, doctor, and away. Note that keeps survives: only keep is on the list, and we're not lemmatizing.
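
The whole filtering step can be sketched end to end. This is an illustration under assumptions, not the chapter's code: the list in sw is shortened to six words, keep is a hypothetical helper name, and the input is lowercased and split on whitespace only, which is far cruder than real tweet tokenization.

```go
package main

import (
	"fmt"
	"strings"
)

// sw is a shortened, illustrative stopword list, not the chapter's full list.
const sw = `a an the is in of`

var stopwords = make(map[string]struct{})

func init() {
	for _, s := range strings.Fields(sw) {
		stopwords[s] = struct{}{}
	}
}

// keep returns the words of text that are not stopwords, in their original order.
func keep(text string) []string {
	var out []string
	for _, w := range strings.Fields(strings.ToLower(text)) {
		if _, ok := stopwords[w]; ok {
			continue // drop stopwords
		}
		out = append(out, w)
	}
	return out
}

func main() {
	fmt.Println(keep("The cat is in the hat")) // prints [cat hat]
}
```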

The list of stopwords is adapted from the one used in the lingo package, which is meant to be used on lemmatized words. Because we're not lemmatizing, some word forms were added manually. It's not perfect, but it works well enough for our purpose.
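
A short sketch shows why the extra forms matter when matching is exact. Both sets below are hypothetical: lemmaOnly imitates a list written for lemmatized input, and extended adds inflected forms by hand (the chapter's list does contain become, becomes, became, and becoming). The helper name isStopword is also an assumption.

```go
package main

import "fmt"

// lemmaOnly imitates a list written for lemmatized input (hypothetical contents).
var lemmaOnly = map[string]struct{}{"become": {}}

// extended adds inflected forms by hand so that unlemmatized text still matches.
var extended = map[string]struct{}{
	"become": {}, "becomes": {}, "became": {}, "becoming": {},
}

// isStopword reports whether word is in list by exact match.
func isStopword(list map[string]struct{}, word string) bool {
	_, ok := list[word]
	return ok
}

func main() {
	fmt.Println(isStopword(lemmaOnly, "becomes")) // prints false
	fmt.Println(isStopword(extended, "becomes"))  // prints true
}
```

Any inflected form that is missing from the list slips through the check, which is the sense in which the list is not perfect.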
