E-mail spam detection

Spam, or electronic spam, refers to unsolicited messages, typically carrying advertising content, infected attachments, links to phishing or malware sites, and so on. While the most widely recognized form of spam is e-mail spam, spam also appears in other media: website comments, instant messaging, Internet forums, blogs, online ads, and so on.

In this chapter, we will discuss how to build a naive Bayes spam filter that uses the bag-of-words representation to identify spam e-mails. Naive Bayes spam filtering is one of the basic techniques implemented in the first commercial spam filters; for instance, the Mozilla Thunderbird mail client uses a native implementation of such filtering. While the example in this chapter uses e-mail spam, the underlying methodology can be applied to other types of text-based spam as well.
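Before turning to the dataset, here is a minimal, library-free sketch of the idea behind bag-of-words naive Bayes scoring: a class score is the log prior plus the sum of log word likelihoods, with add-one (Laplace) smoothing for unseen words. The word counts and the test message below are invented purely for illustration:

```java
import java.util.*;

// Toy sketch of bag-of-words naive Bayes:
// score(class | words) = log P(class) + sum over words of log P(word | class).
// All counts below are invented for illustration.
public class NaiveBayesSketch {

    // Log-posterior of a class given a bag of words, with Laplace (add-one) smoothing
    static double score(List<String> words, Map<String, Integer> wordCounts,
                        int totalWords, double prior, int vocabularySize) {
        double logProb = Math.log(prior);
        for (String word : words) {
            int count = wordCounts.getOrDefault(word, 0);
            logProb += Math.log((count + 1.0) / (totalWords + vocabularySize));
        }
        return logProb;
    }

    public static void main(String[] args) {
        // Invented training counts: how often each word appeared in each class
        Map<String, Integer> spamCounts = Map.of("free", 3, "winner", 2);
        Map<String, Integer> hamCounts = Map.of("free", 1, "meeting", 4);
        int vocabularySize = 3; // {free, winner, meeting}

        List<String> message = List.of("free", "winner");
        double spamScore = score(message, spamCounts, 5, 0.5, vocabularySize);
        double hamScore = score(message, hamCounts, 5, 0.5, vocabularySize);
        System.out.println(spamScore > hamScore ? "spam" : "nonspam");
    }
}
```

The classifier we will train with Mallet below follows the same principle, only over the full e-mail vocabulary.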

E-mail spam dataset

Androutsopoulos et al. (2000) collected one of the first e-mail spam datasets to benchmark spam-filtering algorithms. They studied how the naive Bayes classifier can be used to detect spam and whether additional preprocessing steps, such as a stop list, stemming, and lemmatization, contribute to better performance. The dataset was reorganized by Andrew Ng for OpenClassroom's machine learning class and is available for download at http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html.

Select and download the second option, ex6DataEmails.zip, as shown in the following image:

E-mail spam dataset

The ZIP contains four folders (Ng, 2015):

  • The nonspam-train and spam-train folders contain the pre-processed e-mails that you will use for training. They have 350 e-mails each.
  • The nonspam-test and spam-test folders constitute the test set, containing 130 spam and 130 nonspam e-mails. These are the documents that you will make predictions on. Notice that even though separate folders tell you the correct labeling, you should make your predictions on all the test documents without this knowledge. After you make your predictions, you can use the correct labeling to check whether your classifications were correct.

To leverage Mallet's folder iterator, let's reorganize the folder structure as follows. Create two folders, train and test, and put the spam/nonspam folders under the corresponding parent folders. The initial folder structure is shown in the following image:

E-mail spam dataset

The final folder structure will be as shown in the following image:

E-mail spam dataset

The next step is to transform e-mail messages to feature vectors.

Feature generation

Create a default pipeline as described previously:

ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
pipeList.add(new Input2CharSequence("UTF-8"));
Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
pipeList.add(new CharSequence2TokenSequence(tokenPattern));
pipeList.add(new TokenSequenceLowercase());
pipeList.add(new TokenSequenceRemoveStopwords(new File(stopListFilePath), "utf-8", false, false, false));
pipeList.add(new TokenSequence2FeatureSequence());
pipeList.add(new FeatureSequence2FeatureVector());
pipeList.add(new Target2Label());
SerialPipes pipeline = new SerialPipes(pipeList);

Note that we added an additional FeatureSequence2FeatureVector pipe that transforms a feature sequence into a feature vector. When we have data in a feature vector, we can use any classification algorithm as we saw in the previous chapters. We'll continue our example in Mallet to demonstrate how to build a classification model.
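To make this transformation concrete, here is a minimal, library-free sketch of what the bag-of-words step produces: each distinct token is assigned an index in a shared alphabet, and each document becomes a sparse vector of token counts. The class and method names here are illustrative, not Mallet's:

```java
import java.util.*;

// Minimal bag-of-words vectorizer: tokens are mapped to indices in a growing
// alphabet, and each document becomes a sparse vector of index -> count.
// Illustrative only; Mallet's pipes perform this transformation internally.
public class BagOfWords {
    private final Map<String, Integer> alphabet = new LinkedHashMap<>();

    // Returns the feature index of a token, assigning a fresh index on first sight
    private int indexOf(String token) {
        return alphabet.computeIfAbsent(token, t -> alphabet.size());
    }

    // Converts a tokenized document into a sparse count vector
    public Map<Integer, Integer> vectorize(List<String> tokens) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            counts.merge(indexOf(token.toLowerCase()), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        BagOfWords bow = new BagOfWords();
        // "free" -> index 0, "offer" -> index 1; "free" occurs twice
        System.out.println(bow.vectorize(List.of("Free", "offer", "free")));
    }
}
```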

Next, initialize a folder iterator to load the examples from the train folder; the names of its spam and nonspam subfolders will be used as the example labels:

FileIterator folderIterator = new FileIterator(
    new File[] {new File(dataFolderPath)},
    new TxtFilter(),
    FileIterator.LAST_DIRECTORY);

Construct a new instance list with the pipeline that we want to use to process the text:

InstanceList instances = new InstanceList(pipeline);

Finally, process each instance provided by the iterator:

instances.addThruPipe(folderIterator);

We have now loaded the data and transformed it into feature vectors. Let's train our model on the training set and predict the spam/nonspam classification on the test set.

Training and testing

Mallet implements a set of classifiers in the cc.mallet.classify package, including decision trees, naive Bayes, AdaBoost, bagging, boosting, and many others. We'll start with a basic classifier, that is, a naive Bayes classifier. A classifier is trained by a ClassifierTrainer, which returns a classifier when we invoke its train(InstanceList) method:

ClassifierTrainer classifierTrainer = new NaiveBayesTrainer();
Classifier classifier = classifierTrainer.train(instances);

Now let's see how this classifier works and evaluate its performance on a separate dataset.

Model performance

To evaluate the classifier on a separate dataset, let's start by importing the e-mails located in our test folder:

InstanceList testInstances = new InstanceList(classifier.getInstancePipe());
folderIterator = new FileIterator(
    new File[] {new File(testFolderPath)},
    new TxtFilter(),
    FileIterator.LAST_DIRECTORY);

We will pass the data through the same pipeline that we initialized during training:

testInstances.addThruPipe(folderIterator);

To evaluate classifier performance, we'll use the cc.mallet.classify.Trial class, which is initialized with a classifier and set of test instances:

Trial trial = new Trial(classifier, testInstances);

The evaluation is performed immediately at initialization. We can then simply take out the measures that we care about. In our example, we'd like to check the precision and recall for classifying spam e-mail messages, as well as the F-measure, which is the harmonic mean of the two, as follows:

System.out.println(
  "F1 for class 'spam': " + trial.getF1("spam"));
System.out.println(
  "Precision: " + trial.getPrecision(1));
System.out.println(
  "Recall: " + trial.getRecall(1));

The evaluation object outputs the following results:

F1 for class 'spam': 0.9731800766283524
Precision: 0.9694656488549618
Recall: 0.9769230769230769

The results show that the model correctly discovers 97.69% of spam messages (recall), and when it marks an e-mail as spam, it is correct in 96.94% of cases (precision). In other words, it misses roughly 2 out of every 100 spam messages, and roughly 3 out of every 100 messages it flags as spam are actually legitimate. Not really perfect, but it is more than a good start!
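As a quick sanity check on the numbers above, the F-measure is just the harmonic mean of the reported precision and recall, which we can verify directly:

```java
// Verifies that the reported F1 is the harmonic mean of precision and recall
public class F1Check {
    public static void main(String[] args) {
        double precision = 0.9694656488549618;
        double recall = 0.9769230769230769;
        // Harmonic mean: F1 = 2 * P * R / (P + R)
        double f1 = 2 * precision * recall / (precision + recall);
        System.out.printf("%.6f%n", f1); // agrees with Trial.getF1("spam") up to rounding
    }
}
```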
