Stopwords

Since you are reading this, I assume you are familiar with English. You may have noticed that some words are used more often than others: words such as the, there, from, and so on. The task of classifying whether an email is spam or ham is inherently statistical in nature. When certain words are used often in a document (such as an email), they carry more weight in indicating what that document is about. For example, I received an email today about cats (I am a patron of the Cat Protection Society). The word cat or cats occurred eleven times out of the 120 or so words. It would not be difficult to conclude that the email is about cats.

However, the word the showed up 19 times. If we were to classify the topic of the email purely by a count of words, the email would be classified under the topic the. Common function words such as these are useful in understanding the specific context of the sentences, but for a naive statistical analysis, they often add nothing more than noise. So, we have to remove them.
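To make the effect concrete, here is a minimal sketch in Python of counting words with and without a stopword filter. The stopword list, the sample email text, and the helper names tokenize and top_words are all made up for illustration; they are not taken from the LingSpam corpus or from any particular library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
# Real projects usually need a longer, project-specific list.
STOPWORDS = frozenset({
    "the", "a", "an", "of", "to", "is", "are", "and",
    "in", "on", "for", "from", "there",
})


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(text, n=3, stopwords=frozenset()):
    """Return the n most frequent tokens, skipping any token in stopwords."""
    counts = Counter(tok for tok in tokenize(text) if tok not in stopwords)
    return counts.most_common(n)


# Made-up sample email, standing in for the cat email described above.
email = (
    "The cats at the shelter need the help of the community. "
    "The cats are fed in the morning and the cats sleep in the sun."
)

print(top_words(email, n=1))                       # raw counts
print(top_words(email, n=1, stopwords=STOPWORDS))  # stopwords removed
```

On this sample, the raw count is dominated by the, while the filtered count puts cats on top. The same idea applies to a real email, although the choice of stopword list remains a per-project decision.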

Stopwords are often specific to projects, and I'm not a particular fan of removing them outright. However, the LingSpam corpus has two variants, stop and lemm_stop, in which a stopword list has already been applied and the stopwords removed.
