Putting it all together

Now we have all the pieces. Let's look at how to put them together.

We first ingest the dataset and then split it into training and cross-validation sets. The dataset is split into ten parts for k-fold cross-validation, but we won't use that. Instead, we'll do a single-fold cross-validation: we shuffle the examples and hold out a third of them (len(examples)/3) for cross-validation:

  typ := "bare"
  examples, err := ingest(typ)
  if err != nil {
    log.Fatal(err) // stop early: nothing to train on if ingestion failed
  }
  log.Printf("Examples loaded: %d", len(examples))
  shuffle(examples)
  cvStart := len(examples) - len(examples)/3
  cv := examples[cvStart:]
  examples = examples[:cvStart]

We then train the classifier and check how well it predicts its own training data:

  c := New()
  c.Train(examples)

  var corrects, totals float64
  for _, ex := range examples {
    // log.Printf("%v", c.Score(ex.Document))
    class := c.Predict(ex.Document)
    if class == ex.Class {
      corrects++
    }
    totals++
  }
  log.Printf("Corrects: %v, Totals: %v. Accuracy %v", corrects, totals, corrects/totals)
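The counting loop above is used again for the held-out set, so it could be factored into a helper. The following is a sketch under assumptions: `Predictor`, `accuracy`, and the toy `keywordClassifier` are hypothetical names introduced for illustration, and `Class` and `Example` are simplified stand-ins for the chapter's types.

```go
package main

import "fmt"

// Class and Example are simplified stand-ins for the chapter's types.
type Class int

const (
	Ham Class = iota
	Spam
)

type Example struct {
	Document []string
	Class    Class
}

// Predictor is a hypothetical interface covering anything with a Predict method.
type Predictor interface {
	Predict(doc []string) Class
}

// accuracy returns the fraction of examples whose predicted class matches the
// labelled class. It returns 0 for an empty slice to avoid dividing by zero.
func accuracy(p Predictor, examples []Example) float64 {
	if len(examples) == 0 {
		return 0
	}
	corrects := 0
	for _, ex := range examples {
		if p.Predict(ex.Document) == ex.Class {
			corrects++
		}
	}
	return float64(corrects) / float64(len(examples))
}

// keywordClassifier is a toy predictor used only to exercise accuracy:
// it answers Spam whenever the document contains its keyword.
type keywordClassifier struct{ keyword string }

func (k keywordClassifier) Predict(doc []string) Class {
	for _, w := range doc {
		if w == k.keyword {
			return Spam
		}
	}
	return Ham
}

func main() {
	examples := []Example{
		{Document: []string{"free", "offer"}, Class: Spam},
		{Document: []string{"lunch", "tomorrow"}, Class: Ham},
		{Document: []string{"limited", "offer"}, Class: Spam},
		{Document: []string{"free", "lunch"}, Class: Ham},
	}
	fmt.Println(accuracy(keywordClassifier{keyword: "free"}, examples)) // prints "0.5"
}
```

Because the helper depends only on the `Predict` method, the same function could score both the trained classifier and a trivial baseline.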

After training the classifier, we perform cross-validation on the held-out data:

  log.Printf("Start Cross Validation (this classifier)")
  corrects, totals = 0, 0
  hams, spams := 0.0, 0.0
  var unseen, totalWords int
  for _, ex := range cv {
    totalWords += len(ex.Document)
    unseen += c.unseens(ex.Document)
    class := c.Predict(ex.Document)
    if class == ex.Class {
      corrects++
    }
    switch ex.Class {
    case Ham:
      hams++
    case Spam:
      spams++
    }
    totals++
  }

Here, I also added an unseen and totalWords count, as a simple statistic to see how well the classifier can generalize when encountering previously unseen words.
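The body of `unseens` is not shown in this section. As a rough sketch of the idea only, assuming the classifier keeps some set of words it saw during training, the count could be computed as below. `countUnseen` and the `vocab` map are hypothetical names, and this version counts every occurrence of an unknown word rather than each distinct word once.

```go
package main

import "fmt"

// countUnseen returns how many tokens in doc are absent from vocab.
// vocab is an assumed stand-in for whatever word index the trained
// classifier maintains.
func countUnseen(vocab map[string]struct{}, doc []string) int {
	n := 0
	for _, w := range doc {
		if _, ok := vocab[w]; !ok {
			n++
		}
	}
	return n
}

func main() {
	vocab := map[string]struct{}{"free": {}, "meeting": {}, "offer": {}}
	doc := []string{"free", "crypto", "offer", "crypto"}
	fmt.Printf("unseen %d of %d tokens\n", countUnseen(vocab, doc), len(doc))
	// prints "unseen 2 of 4 tokens"
}
```

Dividing the unseen count by the total word count gives the share of held-out tokens the classifier had no evidence for during training.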

Additionally, because we know ahead of time that the dataset comprises roughly 80% Ham and 20% Spam, we have a baseline to beat. Simply put, we could write a classifier that does this:

type Classifier struct{}
func (c Classifier) Predict(sentence []string) Class { return Ham }

Imagine we have such a classifier. It would be right about 80% of the time without having learned anything! For us to know that our classifier is good, it has to beat this baseline. For the purposes of this chapter, we simply print out the statistics and tweak accordingly:

  fmt.Printf("Dataset: %q. Corrects: %v, Totals: %v. Accuracy %v\n", typ, corrects, totals, corrects/totals)
  fmt.Printf("Hams: %v, Spams: %v. Ratio to beat: %v\n", hams, spams, hams/(hams+spams))
  fmt.Printf("Previously unseen %d. Total Words %d\n", unseen, totalWords)

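The "ratio to beat" is the accuracy of a classifier that always answers with the more common class. The following sketch computes it from a list of labels. `majorityBaseline` is a hypothetical helper, and `Class` is a simplified stand-in for the chapter's type.

```go
package main

import "fmt"

// Class is a simplified stand-in for the chapter's type.
type Class int

const (
	Ham Class = iota
	Spam
)

// majorityBaseline returns the more frequent class in labels and the accuracy
// obtained by always predicting that class. Ties go to Ham.
func majorityBaseline(labels []Class) (Class, float64) {
	if len(labels) == 0 {
		return Ham, 0
	}
	hams := 0
	for _, l := range labels {
		if l == Ham {
			hams++
		}
	}
	spams := len(labels) - hams
	if spams > hams {
		return Spam, float64(spams) / float64(len(labels))
	}
	return Ham, float64(hams) / float64(len(labels))
}

func main() {
	// Illustrative class counts: 810 Ham and 154 Spam labels.
	labels := make([]Class, 0, 964)
	for i := 0; i < 810; i++ {
		labels = append(labels, Ham)
	}
	for i := 0; i < 154; i++ {
		labels = append(labels, Spam)
	}
	class, acc := majorityBaseline(labels)
	fmt.Printf("always predict class %d: accuracy %.4f\n", class, acc)
	// prints "always predict class 0: accuracy 0.8402"
}
```

Any accuracy at or below this figure would mean the classifier does no better than ignoring the message entirely.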
So, the final main function looks as follows:

func main() {
  typ := "bare"
  examples, err := ingest(typ)
  if err != nil {
    log.Fatal(err)
  }

  fmt.Printf("Examples loaded: %d\n", len(examples))
  shuffle(examples)
  cvStart := len(examples) - len(examples)/3
  cv := examples[cvStart:]
  examples = examples[:cvStart]

  c := New()
  c.Train(examples)

  var corrects, totals float64
  for _, ex := range examples {
    // fmt.Printf("%v", c.Score(ex.Document))
    class := c.Predict(ex.Document)
    if class == ex.Class {
      corrects++
    }
    totals++
  }
  fmt.Printf("Dataset: %q. Corrects: %v, Totals: %v. Accuracy %v\n", typ, corrects, totals, corrects/totals)

  fmt.Println("Start Cross Validation (this classifier)")
  corrects, totals = 0, 0
  hams, spams := 0.0, 0.0
  var unseen, totalWords int
  for _, ex := range cv {
    totalWords += len(ex.Document)
    unseen += c.unseens(ex.Document)
    class := c.Predict(ex.Document)
    if class == ex.Class {
      corrects++
    }
    switch ex.Class {
    case Ham:
      hams++
    case Spam:
      spams++
    }
    totals++
  }

  fmt.Printf("Dataset: %q. Corrects: %v, Totals: %v. Accuracy %v\n", typ, corrects, totals, corrects/totals)
  fmt.Printf("Hams: %v, Spams: %v. Ratio to beat: %v\n", hams, spams, hams/(hams+spams))
  fmt.Printf("Previously unseen %d. Total Words %d\n", unseen, totalWords)
}

Running it on the bare dataset, I get the following result:

Examples loaded: 2893
Dataset: "bare". Corrects: 1917, Totals: 1929. Accuracy 0.9937791601866252
Start Cross Validation (this classifier)
Dataset: "bare". Corrects: 946, Totals: 964. Accuracy 0.9813278008298755
Hams: 810, Spams: 154. Ratio to beat: 0.8402489626556017
Previously unseen 17593. Total Words 658105

To see the effects of stopword removal and lemmatization, we simply switch to the lemm_stop dataset. This is the result I get:

Dataset: "lemm_stop". Corrects: 1920, Totals: 1929. Accuracy 0.995334370139969
Start Cross Validation (this classifier)
Dataset: "lemm_stop". Corrects: 948, Totals: 964. Accuracy 0.983402489626556
Hams: 810, Spams: 154. Ratio to beat: 0.8402489626556017
Previously unseen 16361. Total Words 489255

Either way, the classifier is brutally effective.
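To put the two runs side by side, the following sketch recomputes a few derived figures from the cross-validation numbers printed above. The `run` type and its methods are hypothetical helpers introduced for this comparison; the numbers themselves are copied from the two outputs.

```go
package main

import "fmt"

// run holds the cross-validation figures printed for one dataset variant.
type run struct {
	name       string
	corrects   int
	totals     int
	unseen     int
	totalWords int
}

// errors is the number of misclassified held-out examples.
func (r run) errors() int { return r.totals - r.corrects }

// accuracy is the fraction of held-out examples classified correctly.
func (r run) accuracy() float64 { return float64(r.corrects) / float64(r.totals) }

// unseenRate is the fraction of held-out words not seen during training.
func (r run) unseenRate() float64 { return float64(r.unseen) / float64(r.totalWords) }

func main() {
	runs := []run{
		{name: "bare", corrects: 946, totals: 964, unseen: 17593, totalWords: 658105},
		{name: "lemm_stop", corrects: 948, totals: 964, unseen: 16361, totalWords: 489255},
	}
	for _, r := range runs {
		fmt.Printf("%-10s errors=%d accuracy=%.4f unseen=%.2f%%\n",
			r.name, r.errors(), r.accuracy(), 100*r.unseenRate())
	}
}
```

The two variants differ by two misclassified messages out of 964 (18 versus 16), a small gap for a single shuffled split, and both sit well above the 0.84 baseline.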
