Now we have all the pieces. Let's look at how to put it all together:
- We first ingest the dataset and then split it into training and cross-validation sets. The dataset is split into ten parts for a k-fold cross-validation. We won't do that. Instead, we'll do a single-fold cross-validation by holding out roughly a third of the data (len(examples)/3) for cross-validation:
typ := "bare"
examples, err := ingest(typ)
log.Printf("errs %v", err)
log.Printf("Examples loaded: %d", len(examples))
shuffle(examples)
cvStart := len(examples) - len(examples)/3
cv := examples[cvStart:]
examples = examples[:cvStart]
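The hold-out split above can be tried in isolation. The following is a minimal sketch, not the chapter's own code: the function name holdOut and its seed parameter are hypothetical, and plain strings stand in for the examples, since ingest and shuffle are defined elsewhere. It uses the same cvStart arithmetic as the snippet above.

```go
package main

import (
	"fmt"
	"math/rand"
)

// holdOut is a hypothetical helper: it shuffles a copy of data using the
// given seed and reserves the last third for cross-validation, mirroring
// the cvStart arithmetic used in the text.
func holdOut(data []string, seed int64) (train, cv []string) {
	shuffled := make([]string, len(data))
	copy(shuffled, data)
	r := rand.New(rand.NewSource(seed))
	r.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	cvStart := len(shuffled) - len(shuffled)/3 // integer division
	return shuffled[:cvStart], shuffled[cvStart:]
}

func main() {
	// 2893 placeholder items, the number of examples reported for this dataset.
	data := make([]string, 2893)
	for i := range data {
		data[i] = fmt.Sprintf("example-%d", i)
	}
	train, cv := holdOut(data, 1)
	fmt.Println(len(train), len(cv)) // prints "1929 964"
}
```

Because of integer division, 2893 examples yield 964 held-out examples and 1929 training examples, which matches the totals printed later in this section.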
- We then train the classifier and check whether it can predict its own training data well:
c := New()
c.Train(examples)
var corrects, totals float64
for _, ex := range examples {
	// log.Printf("%v", c.Score(ex.Document))
	class := c.Predict(ex.Document)
	if class == ex.Class {
		corrects++
	}
	totals++
}
log.Printf("Corrects: %v, Totals: %v. Accuracy %v", corrects, totals, corrects/totals)
- After training the classifier, we perform a cross-validation on the held-out data:
log.Printf("Start Cross Validation (this classifier)")
corrects, totals = 0, 0
hams, spams := 0.0, 0.0
var unseen, totalWords int
for _, ex := range cv {
	totalWords += len(ex.Document)
	unseen += c.unseens(ex.Document)
	class := c.Predict(ex.Document)
	if class == ex.Class {
		corrects++
	}
	switch ex.Class {
	case Ham:
		hams++
	case Spam:
		spams++
	}
	totals++
}
- Here, I also added unseen and totalWords counters. Together they give a simple statistic for how often the classifier encounters previously unseen words, which hints at how well it can generalize.
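The unseens method itself is not shown in this section. As a rough illustration of what such a counter might do, here is a minimal sketch under my own assumptions: the name countUnseen and the map-based vocabulary are hypothetical, and it counts tokens (not distinct words) so that the result is comparable with a total based on len(ex.Document).

```go
package main

import "fmt"

// countUnseen is a hypothetical stand-in for the classifier's unseens method.
// It returns how many tokens in doc do not appear in vocab, the set of words
// observed during training. Repeated unseen tokens are counted each time.
func countUnseen(vocab map[string]struct{}, doc []string) int {
	n := 0
	for _, w := range doc {
		if _, ok := vocab[w]; !ok {
			n++
		}
	}
	return n
}

func main() {
	vocab := map[string]struct{}{"free": {}, "meeting": {}, "offer": {}}
	doc := []string{"free", "offer", "expires", "tomorrow"}
	fmt.Println(countUnseen(vocab, doc), len(doc)) // prints "2 4"
}
```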
Additionally, because we know ahead of time that the dataset comprises roughly 80% Ham and 20% Spam, we have a baseline to beat. Simply put, we could write a classifier that does this:
type Classifier struct{}
func (c Classifier) Predict(sentence []string) Class { return Ham }
Imagine we have such a classifier. It would be right about 80% of the time! For us to know that our classifier is any good, it has to beat this baseline. For the purposes of this chapter, we simply print out the statistics and tweak accordingly:
fmt.Printf("Dataset: %q. Corrects: %v, Totals: %v. Accuracy %v\n", typ, corrects, totals, corrects/totals)
fmt.Printf("Hams: %v, Spams: %v. Ratio to beat: %v\n", hams, spams, hams/(hams+spams))
fmt.Printf("Previously unseen %d. Total Words %d\n", unseen, totalWords)
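The "ratio to beat" is just the share of the most common class. The following sketch shows one way to compute it from a list of labels. The function majorityBaseline is a hypothetical helper, and the Class type is redeclared here only so the example is self-contained; the chapter's own definition may differ.

```go
package main

import "fmt"

// Class is redeclared here for illustration only.
type Class int

const (
	Ham Class = iota
	Spam
)

func (c Class) String() string {
	if c == Spam {
		return "Spam"
	}
	return "Ham"
}

// majorityBaseline is a hypothetical helper. It returns the most frequent
// class in labels and the accuracy obtained by always predicting that class.
// Ties are resolved in favor of Ham.
func majorityBaseline(labels []Class) (Class, float64) {
	if len(labels) == 0 {
		return Ham, 0
	}
	var hams, spams int
	for _, l := range labels {
		if l == Spam {
			spams++
		} else {
			hams++
		}
	}
	if spams > hams {
		return Spam, float64(spams) / float64(len(labels))
	}
	return Ham, float64(hams) / float64(len(labels))
}

func main() {
	// 810 hams and 154 spams: the held-out counts reported in this section.
	labels := make([]Class, 0, 964)
	for i := 0; i < 810; i++ {
		labels = append(labels, Ham)
	}
	for i := 0; i < 154; i++ {
		labels = append(labels, Spam)
	}
	class, share := majorityBaseline(labels)
	fmt.Printf("always predict %v: accuracy %.4f\n", class, share)
}
```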
So, the final main function looks as follows:
func main() {
	typ := "bare"
	examples, err := ingest(typ)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Examples loaded: %d\n", len(examples))

	// Hold out the last third of the shuffled examples for cross-validation.
	shuffle(examples)
	cvStart := len(examples) - len(examples)/3
	cv := examples[cvStart:]
	examples = examples[:cvStart]

	c := New()
	c.Train(examples)

	// Check how well the classifier predicts its own training data.
	var corrects, totals float64
	for _, ex := range examples {
		// fmt.Printf("%v", c.Score(ex.Document))
		class := c.Predict(ex.Document)
		if class == ex.Class {
			corrects++
		}
		totals++
	}
	fmt.Printf("Dataset: %q. Corrects: %v, Totals: %v. Accuracy %v\n", typ, corrects, totals, corrects/totals)

	// Evaluate on the held-out examples.
	fmt.Println("Start Cross Validation (this classifier)")
	corrects, totals = 0, 0
	hams, spams := 0.0, 0.0
	var unseen, totalWords int
	for _, ex := range cv {
		totalWords += len(ex.Document)
		unseen += c.unseens(ex.Document)
		class := c.Predict(ex.Document)
		if class == ex.Class {
			corrects++
		}
		switch ex.Class {
		case Ham:
			hams++
		case Spam:
			spams++
		}
		totals++
	}
	fmt.Printf("Dataset: %q. Corrects: %v, Totals: %v. Accuracy %v\n", typ, corrects, totals, corrects/totals)
	fmt.Printf("Hams: %v, Spams: %v. Ratio to beat: %v\n", hams, spams, hams/(hams+spams))
	fmt.Printf("Previously unseen %d. Total Words %d\n", unseen, totalWords)
}
Running it on bare, this is the result I get:
Examples loaded: 2893
Dataset: "bare". Corrects: 1917, Totals: 1929. Accuracy 0.9937791601866252
Start Cross Validation (this classifier)
Dataset: "bare". Corrects: 946, Totals: 964. Accuracy 0.9813278008298755
Hams: 810, Spams: 154. Ratio to beat: 0.8402489626556017
Previously unseen 17593. Total Words 658105
To see the effects of removing stopwords and lemmatization, we simply switch to using the lemm_stop dataset. This is the result I get:
Dataset: "lemm_stop". Corrects: 1920, Totals: 1929. Accuracy 0.995334370139969
Start Cross Validation (this classifier)
Dataset: "lemm_stop". Corrects: 948, Totals: 964. Accuracy 0.983402489626556
Hams: 810, Spams: 154. Ratio to beat: 0.8402489626556017
Previously unseen 16361. Total Words 489255
Either way, the classifier is brutally effective: it scores about 98% on the held-out data in both runs, against a baseline of about 84%.
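To compare the two runs side by side, the printed figures can be reduced to a few ratios. The sketch below hard-codes the numbers reported above; the run type and its methods are hypothetical names introduced only for this illustration.

```go
package main

import "fmt"

// run holds the cross-validation figures printed for one dataset variant.
// The type and its methods are illustrative, not part of the chapter's code.
type run struct {
	name               string
	corrects, totals   int
	hams, spams        int
	unseen, totalWords int
}

// accuracy is the fraction of held-out examples classified correctly.
func (r run) accuracy() float64 { return float64(r.corrects) / float64(r.totals) }

// baseline is the accuracy of always predicting Ham.
func (r run) baseline() float64 { return float64(r.hams) / float64(r.hams+r.spams) }

// unseenRate is the fraction of held-out tokens not seen during training.
func (r run) unseenRate() float64 { return float64(r.unseen) / float64(r.totalWords) }

func main() {
	runs := []run{
		{name: "bare", corrects: 946, totals: 964, hams: 810, spams: 154, unseen: 17593, totalWords: 658105},
		{name: "lemm_stop", corrects: 948, totals: 964, hams: 810, spams: 154, unseen: 16361, totalWords: 489255},
	}
	for _, r := range runs {
		fmt.Printf("%-10s errors %d, accuracy %.4f, margin over baseline %.4f, unseen rate %.4f\n",
			r.name, r.totals-r.corrects, r.accuracy(), r.accuracy()-r.baseline(), r.unseenRate())
	}
}
```

By this arithmetic, lemm_stop makes 16 errors against 18 for bare. Its unseen rate comes out slightly higher, because the total word count shrinks faster than the unseen count once stopwords are removed.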