The project 

What we want to do is simple: given an email, is it kosher (which we call ham), or is it spam? We will be using the LingSpam corpus. The emails in that corpus are a little dated, as spammers update their techniques and vocabulary all the time. However, I chose LingSpam for a good reason: it is already nicely preprocessed. The original scope of this chapter was to introduce the preprocessing of emails; however, preprocessing natural language is a subject that deserves an entire book, so we will use a dataset that has already been preprocessed. This allows us to focus more on the mechanics of a very elegant algorithm.

Fear not, though, as I will walk through the basics of preprocessing. Be warned, however, that the complexity climbs steeply, so be prepared to be sucked into a black hole of many hours spent preprocessing natural language. At the end of this chapter, I will also recommend some libraries that will be useful for preprocessing.
