Classifying emails using the Naive Bayes classifier

The final task of this chapter will be to apply our newly gained skills to build a real spam filter! This task is a binary (spam/ham) classification problem, which we will solve using the Naive Bayes algorithm.

Naive Bayes classifiers are a very popular model for email filtering. Their naive assumption that features are independent of one another lends itself nicely to the analysis of text data, where each feature is a word (or a bag of words), and it would not be feasible to model the dependence of every word on every other word.
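To make the independence idea concrete, here is a minimal pure-Python sketch of a word-count Naive Bayes classifier. It is not the implementation used later in this chapter: the four training messages are made up for illustration, the helper names `train` and `predict` are hypothetical, and add-one (Laplace) smoothing is an assumption chosen so that unseen words do not produce zero probabilities. Each word contributes its own log-probability to the score, with no modeling of how words depend on each other.

```python
import math
from collections import Counter

# Toy, made-up training messages (not taken from any real dataset).
TRAIN = [
    ("win cash prize now", "spam"),
    ("cheap prize click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]


def train(messages):
    """Count word occurrences per class and messages per class."""
    word_counts = {}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab


def predict(text, word_counts, class_counts, vocab):
    """Return the class with the highest log-posterior score."""
    n_messages = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: fraction of training messages in this class.
        score = math.log(class_counts[label] / n_messages)
        n_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # skip words never seen during training
            # Naive assumption: each word adds its own term independently.
            # Add-one smoothing keeps the probability above zero.
            smoothed = (word_counts[label][word] + 1) / (n_words + len(vocab))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
print(predict("cash prize now", *model))
print(predict("agenda for lunch", *model))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together, which matters once messages get longer than a few words.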

There are a bunch of good email datasets out there, such as the following:

The Hewlett-Packard spam database: https://archive.ics.uci.edu/ml/machine-learning-databases/spambase
The Enron-Spam dataset: http://www.aueb.gr/users/ion/data/enron-spam

In this section, we will be using the Enron-Spam dataset, which can be downloaded for free from the given website. However, if you followed the installation instructions at the beginning of this book and have downloaded the latest code from GitHub, you are already good to go!
