Spam filtering

Let's put all we've learned to good use and do some data exploration for our spam filter. We will use the Ling-Spam email dataset: http://csmining.org/index.php/ling-spam-datasets.html. The dataset contains 2412 ham emails and 481 spam emails, all of which were received by a mailing list on linguistics. We will extract the words that are most informative of whether an email is spam or ham.

The first steps in any natural language processing workflow are stop word removal and lemmatization. Removing stop words involves filtering very common words such as the, this and so on. Lemmatization involves replacing different forms of the same word with a canonical form: both colors and color would be mapped to color, and organize, organizing and organizes would be mapped to organize. Doing both of these well is challenging and beyond the scope of this book (if you do need to remove stop words and lemmatize a dataset, your go-to tool should be the Stanford NLP toolkit: http://nlp.stanford.edu/software/corenlp.shtml). Fortunately, the Ling-Spam email dataset has been cleaned and lemmatized already (which is why the text in the emails looks strange).

When we do build the spam filter, we will use the presence of particular words in an email as the features for our model. We will use a bag-of-words approach: we consider which words appear in an email, but not the order in which they appear.
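For instance, a made-up email body reduces to its set of distinct words; repetition and order are discarded:

// A made-up email body, reduced to its bag of words.
val mailBody = "buy cheap watches cheap cheap watches"
val bagOfWords = mailBody.split("\\s+").toSet
// bagOfWords: Set[String] = Set(buy, cheap, watches)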

Intuitively, some words will be more important than others when deciding whether an email is spam. For instance, an email that contains language is likely to be ham, since the mailing list was for linguistics discussions, and language is a word unlikely to be used by spammers. Conversely, words that are common to both message types, for instance hello, are unlikely to be of much use.

One way of quantifying the importance of a word in determining whether a message is spam is through the Mutual Information (MI). The mutual information is the gain in information about whether a message is ham or spam if we know that it contains a particular word. For instance, the presence of language in a particular email is very informative as to whether that email is spam or ham. Similarly, the presence of the word dollar is informative since it appears often in spam messages and only infrequently in ham messages. By contrast, the presence of the word morning is uninformative, since it is approximately equally common in both spam and ham messages. The formula for the mutual information between the presence of a particular word in an email and whether that email is spam or ham is:

$$ \mathrm{MI} = \sum_{\text{word} \in \{\text{present},\,\text{absent}\}} \; \sum_{\text{class} \in \{\text{spam},\,\text{ham}\}} P(\text{word}, \text{class}) \, \log \frac{P(\text{word}, \text{class})}{P(\text{word})\, P(\text{class})} $$

where $P(\text{word}, \text{class})$ is the joint probability of an email containing a particular word and being of that class (either ham or spam), $P(\text{word})$ is the probability that a particular word is present in an email, and $P(\text{class})$ is the probability that any email is of that class. The MI is commonly used as a splitting criterion when training decision trees.

Note

The derivation of the expression for the mutual information is beyond the scope of this book. The interested reader is directed to David MacKay's excellent Information Theory, Inference, and Learning Algorithms, especially the chapter Dependent Random Variables.

A key component of our MI calculation is evaluating the probability that a word occurs in spam or ham messages. The best approximation to this probability, given our dataset, is the fraction of messages a word appears in. Thus, for instance, if language appears in 40% of messages, we will assume that the probability $P(\text{language})$ of language being present in any message is 0.4. Similarly, if 40% of the messages are ham, and language appears in 50% of those, we will assume that the probability of language being present in an email and that email being ham is $P(\text{language}, \text{ham}) = 0.5 \times 0.4 = 0.2$.
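Putting the formula and these empirical estimates together, here is a small, self-contained sketch of the MI calculation for a single word. It is independent of Spark, and the counts in the usage example are made up:

// Mutual information between the presence of a word and the
// message class, computed from raw counts.
// nSpamWith / nHamWith: spam / ham messages containing the word;
// nSpam / nHam: total number of spam / ham messages.
def wordMI(nSpamWith: Long, nHamWith: Long, nSpam: Long, nHam: Long): Double = {
  val n = (nSpam + nHam).toDouble
  val pSpam = nSpam / n
  val pHam = nHam / n
  val pPresent = (nSpamWith + nHamWith) / n
  val pAbsent = 1.0 - pPresent
  // The four joint probabilities P(word, class), each paired with
  // the matching marginals P(word) and P(class).
  val terms = List(
    (nSpamWith / n, pPresent, pSpam),          // present, spam
    (nHamWith / n, pPresent, pHam),            // present, ham
    ((nSpam - nSpamWith) / n, pAbsent, pSpam), // absent, spam
    ((nHam - nHamWith) / n, pAbsent, pHam)     // absent, ham
  )
  // By convention, a zero joint probability contributes nothing.
  terms.map { case (pXY, pX, pY) =>
    if (pXY == 0.0) 0.0 else pXY * math.log(pXY / (pX * pY))
  }.sum
}

// A word present in 400 of 481 spam messages but in only 10 of
// 2412 ham messages is highly informative:
// wordMI(400, 10, 481, 2412)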

Let's write a wordFractionInFiles function to calculate the fraction of messages in which each word appears, for all the words in a given corpus. Our function will take, as argument, a path with a shell wildcard identifying a set of files, such as ham/*, and it will return a key-value RDD, where the keys are words and the values are the probability that that word occurs in any of those files. We will put the function in an object called MutualInformation.

We first give the entire code listing for this function. Don't worry if this doesn't all make sense straight away: we explain the tricky parts in more detail just after the code. You may find it useful to type some of these commands in the shell, replacing fileGlob with, for instance, "ham/*":

// MutualInformation.scala
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object MutualInformation extends App {

  def wordFractionInFiles(sc:SparkContext)(fileGlob:String)
  :(RDD[(String, Double)], Long) = {

    // A set of punctuation words that need to be filtered out.
    val wordsToOmit = Set[String](
      "", ".", ",", ":", "-", "\"", "'", ")",
      "(", "@", "/", "Subject:"
    )

    val messages = sc.wholeTextFiles(fileGlob)
    // wholeTextFiles generates a key-value RDD of 
    // file name -> file content

    val nMessages = messages.count()

    // Split the content of each message into a Set of unique
    // words in that message, and generate a new RDD mapping:
    // message -> word
    val message2Word = messages.flatMapValues {
      mailBody => mailBody.split("\\s").toSet
    }


    val message2FilteredWords = message2Word.filter { 
      case(email, word) => ! wordsToOmit(word) 
    }

    val word2Message = message2FilteredWords.map { _.swap }

    // word -> number of messages it appears in.
    val word2NumberMessages = word2Message.mapValues { 
      _ => 1 
    }.reduceByKey { _ + _ }

    // word -> fraction of messages it appears in
    val pPresent = word2NumberMessages.mapValues { 
      _ / nMessages.toDouble 
    }
  
    (pPresent, nMessages)
  }
}

Let's play with this function in the Spark shell. To be able to access this function from the shell, we need to create a jar with the MutualInformation object. Write a build.sbt file similar to the one presented in the previous section and package the code into a jar using sbt package. If you no longer have that build file to hand, a minimal one along the following lines should work (the Scala and Spark patch versions here are assumptions; match them to your installation):
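// build.sbt
name := "spam_mi"

version := "0.1-SNAPSHOT"

scalaVersion := "2.10.5"

// The Spark version is an assumption; match your installation.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1" % "provided"

Then, open a Spark shell with: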

$ spark-shell --jars=target/scala-2.10/spam_mi_2.10-0.1-SNAPSHOT.jar

This will open a Spark shell with our newly created jar on the classpath. Let's run our wordFractionInFiles method on the ham emails:

scala> import MutualInformation._
import MutualInformation._

scala> val (fractions, nMessages) = wordFractionInFiles(sc)("ham/*")
fractions: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[13] at mapValues
nMessages: Long = 2412

Let's get a snapshot of the fractions RDD:

scala> fractions.take(5)
Array[(String, Double)] = Array((rule-base,0.002902155887230514), (reunion,4.1459369817578774E-4), (embarrasingly,4.1459369817578774E-4), (mller,8.291873963515755E-4), (sapore,4.1459369817578774E-4))

It would be nice to see the words that come up most often in ham messages. We can use the .takeOrdered action to take the top values of an RDD, with a custom ordering. .takeOrdered expects, as its second argument, an instance of the type class Ordering[T], where T is the type parameter of our RDD: (String, Double) in this case. Ordering[T] is a trait with a single compare(a:T, b:T) method describing how to compare a and b. The easiest way of creating an Ordering[T] is through the companion object's by method, which defines a key by which to compare the elements of our RDD.
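As a quick illustration outside Spark, with a made-up list of word scores:

// An Ordering that compares (word, score) pairs by their score.
val byScore = Ordering.by[(String, Double), Double](_._2)

// Sorting in descending order using the reversed Ordering:
List(("a", 3.0), ("b", 1.0), ("c", 2.0)).sorted(byScore.reverse)
// List((a,3.0), (c,2.0), (b,1.0))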

We want to order the elements in our key-value RDD by the value and, since we want the most common words, rather than the least, we need to reverse that ordering:

scala> fractions.takeOrdered(5)(Ordering.by { - _._2 })
res0: Array[(String, Double)] = Array((language,0.6737147595356551), (university,0.6048922056384743), (linguistic,0.5149253731343284), (information,0.45480928689883915), ('s,0.4369817578772803))

Unsurprisingly, language is present in 67% of ham emails, university in 60% of ham emails and so on. A similar investigation on spam messages reveals that the exclamation mark character ! is present in 83% of spam emails, our is present in 61% and free in 57%.
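To reproduce this, point the same two commands at the spam directory:

scala> val (spamFractions, nSpam) = wordFractionInFiles(sc)("spam/*")

scala> spamFractions.takeOrdered(5)(Ordering.by { - _._2 })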

We are now in a position to start writing the body of our application to calculate the mutual information between each word and whether a message is spam or ham. We will put the body of the code in the MutualInformation object, which already contains the wordFractionInFiles method.

The first step is to create a Spark context:

// MutualInformation.scala
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object MutualInformation extends App {
  
  def wordFractionInFiles(sc:SparkContext)(fileGlob:String)
  :(RDD[(String, Double)], Long) = {
    ...
  }
  
  val conf = new SparkConf().setAppName("lingSpam")
  val sc = new SparkContext(conf)

Note that we did not need to do this when we were using the Spark shell because the shell comes with a pre-built context bound to the variable sc.

We can now calculate the conditional probabilities of a message containing a particular word given that it is spam, $P(\text{present} \mid \text{spam})$. This is just the fraction of messages containing that word in the spam corpus. This, in turn, lets us infer the joint probability of a message containing a certain word and being spam, $P(\text{present}, \text{spam})$. We will do this for all four combinations of classes: whether any given word is present or absent in a message, and whether that message is spam or ham:

    /* Conditional probabilities RDD:
       word -> P(present | spam) 
    */
    val (pPresentGivenSpam, nSpam) = wordFractionInFiles(sc)("spam/*")
    val pAbsentGivenSpam = pPresentGivenSpam.mapValues { 1.0 - _ }
    val (pPresentGivenHam, nHam) = wordFractionInFiles(sc)("ham/*")
    val pAbsentGivenHam = pPresentGivenHam.mapValues { 1.0 - _ }

    // pSpam is the fraction of spam messages
    val nMessages = nSpam + nHam
    val pSpam = nSpam / nMessages.toDouble
    
    // pHam is the fraction of ham messages
    val pHam = 1.0 - pSpam 

    /* pPresentAndSpam is a key-value RDD of joint probabilities
       word -> P(word present, spam) 
    */
    val pPresentAndSpam = pPresentGivenSpam.mapValues { 
      _ * pSpam 
    }
    val pPresentAndHam = pPresentGivenHam.mapValues { _ * pHam }
    val pAbsentAndSpam = pAbsentGivenSpam.mapValues { _ * pSpam }
    val pAbsentAndHam = pAbsentGivenHam.mapValues { _ * pHam }

We will re-use these RDDs in several places in the calculation, so let's tell Spark to keep them in memory to avoid having to re-calculate them:

    pPresentAndSpam.persist
    pPresentAndHam.persist
    pAbsentAndSpam.persist
    pAbsentAndHam.persist

We now need to calculate the probabilities of words being present, $P(\text{present})$. This is just the sum of pPresentAndSpam and pPresentAndHam, for each word. The tricky part is that not all words are present in both the ham and spam messages. We must therefore do a full outer join of those RDDs. This will give an RDD mapping each word to a pair of Option[Double] values. For words absent in either the ham or spam messages, we must use a default value. A sensible default is $0.5/N_{\text{spam}} \times P(\text{spam})$ for spam messages and $0.5/N_{\text{ham}} \times P(\text{ham})$ for ham messages (a more rigorous approach would be to use additive smoothing). This implies that the word would appear once if the corpus were twice as large.

    val pJoined = pPresentAndSpam.fullOuterJoin(pPresentAndHam)
    val pJoinedDefault = pJoined.mapValues {
      case (presentAndSpam, presentAndHam) => 
        (presentAndSpam.getOrElse(0.5/nSpam * pSpam), 
        presentAndHam.getOrElse(0.5/nHam * pHam))
    }

Note that we could also have chosen 0 as the default value, but this would complicate the information gain calculation, since we cannot take the logarithm of zero; besides, it seems unlikely that a particular word has exactly zero probability of occurring in an email.
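For reference, the additive smoothing mentioned earlier would replace the raw fraction count/N with a smoothed estimate that can never be exactly 0 or 1. A minimal sketch; the smoothing parameter alpha is our choice, with alpha = 1 giving Laplace smoothing:

// Additive smoothing of the probability that a word is present:
// add alpha phantom occurrences to each of the two outcomes
// (present and absent).
def smoothedPresence(count: Long, nMessages: Long, alpha: Double = 1.0): Double =
  (count + alpha) / (nMessages + 2 * alpha)

// A word never seen in the 481 spam messages still gets a small,
// non-zero probability:
// smoothedPresence(0, 481) // ≈ 0.00207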

We can now construct an RDD mapping words to $P(\text{present})$, the probability that a word exists in either a spam or a ham message:

    val pPresent = pJoinedDefault.mapValues {
      case (presentAndSpam, presentAndHam) =>
        presentAndSpam + presentAndHam
    }
    pPresent.persist

    val pAbsent = pPresent.mapValues { 1.0 - _ }
    pAbsent.persist

We now have all the RDDs that we need to calculate the mutual information between the presence of a word in a message and whether it is ham or spam. We need to bring them all together using the equation for the mutual information outlined earlier.

We will start by defining a helper method that, given an RDD of joint probabilities P(X, Y) and marginal probabilities P(X) and P(Y), calculates $P(X, Y) \log \frac{P(X, Y)}{P(X)\,P(Y)}$. Here, P(X) could, for instance, be the probability of a word being present in a message, $P(\text{present})$, and P(Y) would be the probability that that message is spam, $P(\text{spam})$:

    def miTerm(
      pXYs:RDD[(String, Double)], 
      pXs:RDD[(String, Double)], 
      pY: Double,
      default: Double // for words absent in pXYs
    ):RDD[(String, Double)] = 
      pXs.leftOuterJoin(pXYs).mapValues { 
        case (pX, Some(pXY)) => pXY * math.log(pXY/(pX*pY)) 
        case (pX, None) => default * math.log(default/(pX*pY))
    }

We can use our function to calculate the four terms in the mutual information sum:

    val miTerms = List(
      miTerm(pPresentAndSpam, pPresent, pSpam, 0.5/nSpam * pSpam),
      miTerm(pPresentAndHam, pPresent, pHam, 0.5/nHam * pHam),
      miTerm(pAbsentAndSpam, pAbsent, pSpam, 0.5/nSpam * pSpam),
      miTerm(pAbsentAndHam, pAbsent, pHam, 0.5/nHam * pHam)
    )

Finally, we just need to sum those four terms together:

    val mutualInformation = miTerms.reduce { 
      (term1, term2) => term1.join(term2).mapValues { 
         case (l, r) => l + r 
      } 
    }

The RDD mutualInformation is a key-value RDD mapping each word to a measure of how informative the presence of that word is in discerning whether a message is spam or ham. Let's print out the twenty words that are most informative of whether a message is ham or spam:

    mutualInformation.takeOrdered(20)(Ordering.by { - _._2 })
      .foreach { println }

Let's run this using spark-submit:

$ sbt package
$ spark-submit target/scala-2.10/spam_mi_2.10-0.1-SNAPSHOT.jar
(!,0.1479941771292119)
(language,0.14574624861510874)
(remove,0.11380645864246142)
(free,0.1073496947123657)
(university,0.10695975885487692)
(money,0.07531772498093084)
(click,0.06887598051593441)
(our,0.058950906866052394)
(today,0.05485248095680509)
(sell,0.05385519653184113)
(english,0.053509319455430575)
(business,0.05299311289740539)
(market,0.05248394151802276)
(product,0.05096229706182162)
(million,0.050233193237964546)
(linguistics,0.04990172586630499)
(internet,0.04974101556655623)
(company,0.04941817269989519)
(%,0.04890193809823071)
(save,0.04861393414892205)

Thus, we find that the presence of words like language, free, or ! carries the most information, because these words are almost exclusively present in either just spam messages or just ham messages. A very simple classification algorithm could just take the top 10 (by mutual information) spam words and the top 10 ham words, and check whether a message contains more spam words or ham words, as sketched below.
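Here is a minimal, Spark-free sketch of that toy classifier; the spamWords and hamWords sets are hypothetical and would be drawn from the mutual information ranking:

// A toy classifier: given hypothetical sets of the most
// spam-indicative and ham-indicative words (chosen by mutual
// information), label a message by whichever set it overlaps more.
def classify(
  mailBody: String,
  spamWords: Set[String], // e.g. the top 10 words most common in spam
  hamWords: Set[String]   // e.g. the top 10 words most common in ham
): String = {
  val wordsInMessage = mailBody.split("\\s+").toSet
  val spamScore = wordsInMessage.count(spamWords.contains)
  val hamScore = wordsInMessage.count(hamWords.contains)
  if (spamScore > hamScore) "spam" else "ham"
}

We will explore machine learning algorithms for classification in more depth in Chapter 12, Distributed Machine Learning with MLlib.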
