Email spam dataset

In 2000, Androutsopoulos et al. collected one of the first email spam datasets to benchmark spam-filtering algorithms. They studied how the Naive Bayes classifier can be used to detect spam, and whether additional pipes such as a stop list, a stemmer, and a lemmatizer contribute to better performance. The dataset was reorganized by Andrew Ng for OpenClassroom's machine-learning class and is available for download at http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html.

Select and download the second option, ex6DataEmails.zip, as shown in the following screenshot:

The ZIP contains the following folders:

  • The nonspam-train and spam-train folders contain the pre-processed emails that you will use for training. They have 350 emails each.
  • The nonspam-test and spam-test folders constitute the test set, containing 130 spam and 130 nonspam emails. These are the documents that you will make predictions on. Notice that, even though separate folders tell you the correct labeling, you should make your predictions on all of the test documents without this knowledge. After you make your predictions, you can use the correct labeling to check whether your classifications were correct.
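The check described for the test set, comparing your predictions against the labels implied by the folder names, amounts to computing accuracy. The following is a minimal sketch of that comparison only; the class name AccuracyCheck, the method accuracy, and the label strings are hypothetical and are not part of the dataset or of Mallet:

```java
public class AccuracyCheck {

    // Returns the fraction of positions where the predicted label equals
    // the true label. Both arrays must be non-empty and of equal length.
    public static double accuracy(String[] predicted, String[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException(
                "Label arrays must be non-empty and of equal length");
        }
        int correct = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i].equals(actual[i])) {
                correct++;
            }
        }
        return (double) correct / predicted.length;
    }

    public static void main(String[] args) {
        // Hypothetical labels for four test emails.
        String[] actual = {"spam", "spam", "nonspam", "nonspam"};
        String[] predicted = {"spam", "nonspam", "nonspam", "nonspam"};
        System.out.println("Accuracy: " + accuracy(predicted, actual));
    }
}
```

The true labels are consulted only after all predictions have been made, which mirrors the procedure described above.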

To leverage Mallet's folder iterator, let's reorganize the folder structure as follows. We will create two folders, train and test, and move the spam and nonspam folders under the corresponding folder. The initial folder structure is as shown in the following screenshot:

The final folder structure will be as shown in the following screenshot:
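You can move the folders by hand, or script the step. The following is a minimal sketch using only the standard java.nio.file API. It assumes that the extracted archive sits in a folder named ex6DataEmails and that the target subfolders are named spam and nonspam (for example, train/spam); both names are assumptions, so adjust them to match your own layout:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReorganizeEmails {

    // Moves <root>/<label>-<split> to <root>/<split>/<label>,
    // for example spam-train becomes train/spam.
    public static void reorganize(Path root) throws IOException {
        String[] splits = {"train", "test"};
        String[] labels = {"spam", "nonspam"};
        for (String split : splits) {
            // Create train/ or test/ if it does not exist yet.
            Path splitDir = Files.createDirectories(root.resolve(split));
            for (String label : labels) {
                Path source = root.resolve(label + "-" + split);
                Path target = splitDir.resolve(label);
                // Skip folders that are missing or were already moved.
                if (Files.isDirectory(source) && Files.notExists(target)) {
                    Files.move(source, target);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed location of the extracted ZIP; pass another path as an argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "ex6DataEmails");
        reorganize(root);
    }
}
```

Because the source and target folders share the same parent, each move is a simple rename, and running the program a second time leaves the structure unchanged.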

The next step is to transform email messages to feature vectors.
