Dropping stop words

We haven't yet talked about stop_words, an important parameter of CountVectorizer. Stop words are common words that provide little value in helping differentiate documents from one another. In general, stop words add noise to the BoW model and can be removed.

There's no universal list of stop words. Hence, depending on the tools or packages you are using, you will remove different sets of stop words. Take scikit-learn as an example—you can check the list as follows:

>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
>>> print(ENGLISH_STOP_WORDS)
frozenset({'most', 'three', 'between', 'anyway', 'made', 'mine', 'none', 'could', 'last', 'whenever', 'cant', 'more', 'where', 'becomes', 'its', 'this', 'front', 'interest', 'least', 're', 'it', 'every', 'four', 'else', 'over', 'any', 'very', 'well', 'never', 'keep', 'no', 'anything', 'itself', 'alone', 'anyhow', 'until', 'therefore', 'only', 'the', 'even', 'so', 'latterly', 'above', 'hereafter', 'hereby', 'may', 'myself', 'all', 'those', 'down',
...
'him', 'somehow', 'or', 'per', 'nowhere', 'fifteen', 'via', 'must', 'someone', 'from', 'full', 'that', 'beyond', 'still', 'to', 'get', 'himself', 'however', 'as', 'forty', 'whatever', 'his', 'nothing', 'though', 'almost', 'become', 'call', 'empty', 'herein', 'than', 'while', 'bill', 'thru', 'mostly', 'yourself', 'up', 'former', 'each', 'anyone', 'hundred', 'several', 'others', 'along', 'bottom', 'one', 'five', 'therein', 'was', 'ever', 'beside', 'everyone'})

To drop stop words from the newsgroups data, we simply need to specify the stop_words parameter:

>>> count_vector_sw = CountVectorizer(stop_words="english", max_features=500)
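To see the effect of stop_words="english" in isolation, here is a minimal sketch on two toy sentences (the documents and variable names are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents, invented for this illustration
docs = ["the cat sat on the mat", "a dog and a cat"]

# Vectorizer without stop word removal vs. one with the built-in English list
cv = CountVectorizer()
cv_sw = CountVectorizer(stop_words="english")
cv.fit(docs)
cv_sw.fit(docs)

print(sorted(cv.vocabulary_))     # still contains 'the', 'on', 'and', ...
print(sorted(cv_sw.vocabulary_))  # only content words such as 'cat' survive
```

Words such as "the", "on", "a", and "and" disappear from the vocabulary once the English stop word list is applied, while content-bearing words such as "cat" and "dog" remain.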

Besides stop words, you may notice that names, such as andrew, appear among the top features. We can filter out names with the Names corpus from NLTK that we worked with earlier.
