Exploratory data analysis 

Let's jump into the data. The LingSpam corpus comes in four variants: bare, lemm, lemm_stop, and stop. In each variant, there are ten parts, and each part contains multiple files. Each file represents an email. Files with a spmsg prefix in their names are spam, while the rest are ham. An example email looks as follows (from the bare variant):

Subject: re : 2 . 882 s - > np np
> date : sun , 15 dec 91 02 : 25 : 02 est > from : michael < mmorse @ vm1 . yorku . ca > > subject : re : 2 . 864 queries > > wlodek zadrozny asks if there is " anything interesting " to be said > about the construction " s > np np " . . . second , > and very much related : might we consider the construction to be a form > of what has been discussed on this list of late as reduplication ? the > logical sense of " john mcnamara the name " is tautologous and thus , at > that level , indistinguishable from " well , well now , what have we here ? " . to say that ' john mcnamara the name ' is tautologous is to give support to those who say that a logic-based semantics is irrelevant to natural language . in what sense is it tautologous ? it supplies the value of an attribute followed by the attribute of which it is the value . if in fact the value of the name-attribute for the relevant entity were ' chaim shmendrik ' , ' john mcnamara the name ' would be false . no tautology , this . ( and no reduplication , either . )

Here are some things to note about this particular email:

  • This is an email about linguistics, specifically about the parsing of a natural-language sentence into multiple noun phrases (np). This is largely irrelevant to the project at hand. I do, however, think it's a good idea to go through the topics, if only to provide an occasional manual sanity check.
  • There is an email address and a person's name attached to this email; the dataset is not particularly anonymized. This has some implications for the future of machine learning, which I will explore in the final chapter of this book.
  • The email is very nicely split into fields (that is, every word and punctuation mark is separated by spaces).
  • The email has a Subject line.
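
The file-naming convention and the two structural observations above (space-separated fields and a leading Subject line) are enough to load the corpus. What follows is a minimal sketch in Python, not the project's own loader. The function names `label_for`, `parse_email`, and `load_variant` are hypothetical, and the sketch assumes that the Subject line is always the first line of a file and that the files can be read as UTF-8 with undecodable bytes replaced.

```python
from pathlib import Path


def label_for(filename):
    """Return 'spam' if the file name starts with 'spmsg', otherwise 'ham'."""
    return "spam" if filename.startswith("spmsg") else "ham"


def parse_email(raw):
    """Split raw email text into (subject_tokens, body_tokens).

    The corpus is already space separated, so str.split() is enough.
    Assumption: the Subject line, when present, is the first line.
    """
    first_line, _, rest = raw.partition("\n")
    if first_line.startswith("Subject:"):
        return first_line[len("Subject:"):].split(), rest.split()
    # No Subject line found: treat the whole text as the body.
    return [], raw.split()


def load_variant(variant_dir):
    """Yield (path, label, subject_tokens, body_tokens) for every file
    found anywhere under one variant directory (for example, bare)."""
    for path in sorted(Path(variant_dir).rglob("*")):
        if path.is_file():
            # Assumption: UTF-8 with replacement is good enough for a first look.
            raw = path.read_text(encoding="utf-8", errors="replace")
            subject, body = parse_email(raw)
            yield path, label_for(path.name), subject, body
```

Walking the variant directory recursively avoids hard-coding the names of the ten part directories, which the text above does not spell out.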

The first two points are particularly noteworthy. Sometimes the subject matter actually matters in machine learning. In our case, we can build our algorithms to be blind to it, so they can be used generically across all emails. But there are times when being context-sensitive will take your machine-learning algorithms to new heights. The second point is anonymity. We live in an age where software flaws are often the downfall of companies. Doing machine learning on non-anonymous datasets is often fraught with biases. We should try to anonymize data as much as possible.
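
As one small illustration of what anonymizing could look like here, the sketch below masks email addresses written in the corpus's space-separated style, such as the one in the example email. It is a rough sketch under stated assumptions rather than a complete anonymizer: the name `redact_addresses` and the `<EMAIL>` placeholder are hypothetical, personal names are left untouched, and the pattern can over-match when an address is directly followed by a sentence-ending period and more text.

```python
import re

# Matches an address that has been tokenized with spaces, for example
# "mmorse @ vm1 . yorku . ca": a local part, " @ ", a first domain label,
# then one or more " . label" groups.
TOKENIZED_EMAIL = re.compile(r"\S+ @ \S+(?: \. \S+)+")


def redact_addresses(text, placeholder="<EMAIL>"):
    """Replace every space-tokenized email address in text with placeholder."""
    return TOKENIZED_EMAIL.sub(placeholder, text)
```

Applied to the header of the example email, this replaces only the address and leaves the surrounding tokens, including the sender's first name, in place.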
