Classifier part II

One of the main considerations is that a Naive Bayes classifier is a very simple program, and very difficult to get wrong. The entire program is in fact fewer than 100 lines. Let's look at it further.

We have sketched out so far the method Train, which will train the classifier on a given set of inputs. Here's how it looks:

func (c *Classifier) Train(examples []Example) {
	for _, ex := range examples {
		c.trainOne(ex)
	}
}

func (c *Classifier) trainOne(example Example) {
	d := make(doc, len(example.Document))
	for i, word := range example.Document {
		id := c.corpus.Add(word)
		d[i] = id
	}
	c.tfidfs[example.Class].Add(d)
	c.totals[example.Class]++
}

So here it's very clear that Train is a sequential operation over the examples. But the function is structured in such a way that it would be trivial to parallelize the calls to c.trainOne. Within the context of this project, this wasn't necessary because the program was able to complete in under a second. However, if you are adapting this program for larger and more varied datasets, it may be worthwhile to parallelize the calls. The Classifier and tfidf.TFIDF structs have mutexes in them to allow for these sorts of extensions, as sketched below.
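A minimal sketch of what such a parallel version might look like follows. The method name trainParallel is made up for illustration; it assumes the sync package is imported and that trainOne (including the corpus it touches) really is safe to call concurrently:

// A sketch of a parallelized Train. It assumes c.trainOne is safe to call
// concurrently because Classifier and tfidf.TFIDF guard their state with
// mutexes. Note that c.corpus.Add would also need to be safe for
// concurrent use.
func (c *Classifier) trainParallel(examples []Example) {
	var wg sync.WaitGroup
	for _, ex := range examples {
		wg.Add(1)
		go func(ex Example) {
			defer wg.Done()
			c.trainOne(ex)
		}(ex)
	}
	wg.Wait()
}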

But what's more interesting is the trainOne method. Looking at it, all it seems to do is add each word to the corpus, get its ID, and then put the ID into the doc type. doc, incidentally, is defined as follows:

type doc []int

func (d doc) IDs() []int { return []int(d) }

This definition is done to fit into the interface that tfidf.TFIDF.Add accepts.
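The exact interface lives in the tfidf package and isn't reproduced here, but judging from the IDs method above, it presumably looks something like this:

// Assumed shape of the interface that tfidf.TFIDF.Add accepts; consult
// the tfidf package for the authoritative definition.
type Document interface {
	IDs() []int
}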

Let's look closer at the trainOne method. After making the doc, the words from the example are added to the corpus, and their IDs are put into the doc. The doc is then added to the tfidf.TFIDF of the relevant class.

At first glance, there isn't much training here; we're just adding to the TF statistic.

The real magic happens in the Predict and Score methods.

Score is defined as follows:

func (c *Classifier) Score(sentence []string) (scores [MAXCLASS]float64) {
	if !c.ready {
		c.Postprocess()
	}

	d := make(doc, len(sentence))
	for i, word := range sentence {
		id := c.corpus.Add(word)
		d[i] = id
	}

	priors := c.priors()

	// score per class
	for i := range c.tfidfs {
		score := math.Log(priors[i])
		// likelihood
		for _, word := range sentence {
			prob := c.prob(word, Class(i))
			score += math.Log(prob)
		}

		scores[i] = score
	}
	return
}

Given a tokenized sentence, we want to return the scores for each class. The idea is that we can then look through the scores and find the class with the highest score:

func (c *Classifier) Predict(sentence []string) Class {
	scores := c.Score(sentence)
	return argmax(scores)
}
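argmax itself isn't shown here; a minimal sketch of it would simply walk the scores and return the class with the largest value (assuming the math package is imported):

// A minimal sketch of argmax: return the Class whose score is largest.
func argmax(scores [MAXCLASS]float64) Class {
	maxScore := math.Inf(-1)
	var best Class
	for i, score := range scores {
		if score > maxScore {
			maxScore = score
			best = Class(i)
		}
	}
	return best
}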

The Score function is worth a deeper look because that's where all the magic happens. First, we check whether the classifier is ready to score. An online machine learning system learns as new data comes in; this design means that the classifier cannot be used in an online fashion. All the training needs to be done up front. Once that training is done, the classifier is locked and won't train any further. Any new data will have to be part of a different run.

The Postprocess method is quite simple. Having recorded all the TF statistics, we now want to calculate the relative importance of each term to the documents. The tfidf package comes with a simple log-based calculation of the IDF, but you can use any other IDF-calculating function, as follows:

func (c *Classifier) Postprocess() {
	c.Lock()
	if c.ready {
		c.Unlock()
		return
	}

	var docs int
	for _, t := range c.tfidfs {
		docs += t.Docs
	}
	for _, t := range c.tfidfs {
		t.Docs = docs
		// t.CalculateIDF()
		for k, v := range t.TF {
			t.IDF[k] = math.Log1p(float64(t.Docs) / v)
		}
	}
	c.ready = true
	c.Unlock()
}

It is important to note that the document count of each class is updated: t.Docs is set to the sum of all the documents seen across every class. This is because, as we were adding to the term frequency of each class, each class's tfidf.TFIDF struct wasn't aware of the documents in the other classes.

The reason we want to calculate the IDF at all is to gain a bit more control over the values: it prevents very common words from dominating the scores.
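To get a feel for the numbers (these figures are hypothetical, not taken from the dataset), suppose there are 1,000 documents in total. A very common word with a term frequency of 900 in a class gets an IDF of Log1p(1000/900) ≈ 0.75, while a rare word with a term frequency of 2 gets Log1p(1000/2) ≈ 6.2. The rare word's contribution is therefore weighted up relative to the common one.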

Recall that the conditional probability can be written in the Bayes' theorem form:

P(Class|Document) = P(Document|Class) × P(Class) / P(Document)

Let's familiarize ourselves with the formula, once again restating it in English, starting with the terms:

  • P(Class): This is the prior probability of a class. If we have a pool of email messages and we randomly pick one out, what is the probability that the email is Ham or Spam? This largely corresponds to the dataset that we have. From the exploratory analysis, we know that the ratio between Ham and Spam is around 80:20.
  • P(Document|Class): This is the likelihood that any random document belongs to a class. Because a document is comprised of individual words, we simply make a Naïve assumption that these words are independent of one another. So we want the probability P(word1, word2, ..., wordN|Class). Assuming the words are independent gives us the ability to simply multiply the individual word probabilities.

So, to put it in English:

The conditional probability of a class being Ham, given a document, is proportional to the prior probability of a document being Ham multiplied by the likelihood of that document under the Ham class.

The observant reader may note that I have elided an explanation of P(Document), the denominator. The reason is simple. Consider what the probability of the document is: it's simply the multiplication of the probabilities of all its words in the corpus. It doesn't in any way interact with the Class, so as far as comparing classes goes, it might as well be a constant.

Furthermore, we run into another problem if we do multiply the probabilities directly. Multiplying probabilities tends to yield smaller and smaller numbers. Computers do not have true real numbers; float64 is a neat trick that masks a computer's fundamental limitation of finite precision. When working on machine learning problems, you will frequently run into edge cases where the numbers become too small or too big.

Fortunately, for this case we have an elegant solution: we can elect to work in the log domain. Instead of considering the likelihood, we consider the log likelihood. Upon taking logs, multiplication becomes addition, which keeps the problem out of sight and out of mind. For most cases, this project included, this is a fine choice. There may be cases where you wish to normalize the probabilities; then, ignoring the denominator wouldn't work well.
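As a toy illustration of why this matters (the function below is not part of the classifier; its name and values are made up for demonstration, and it assumes the fmt and math packages are imported):

// Toy illustration: multiplying many small probabilities underflows to
// zero, while summing their logs stays at a comfortable magnitude.
func demoLogDomain() {
	p := 1e-5
	product, logSum := 1.0, 0.0
	for i := 0; i < 100; i++ {
		product *= p          // underflows to 0 well before 100 iterations
		logSum += math.Log(p) // no underflow
	}
	fmt.Println(product) // 0
	fmt.Println(logSum)  // approximately -1151.29
}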

Let's look at the code for computing the priors:

func (c *Classifier) priors() (priors []float64) {
	priors = make([]float64, MAXCLASS)
	var sum float64
	for i, total := range c.totals {
		priors[i] = total
		sum += total
	}
	for i := Ham; i < MAXCLASS; i++ {
		priors[int(i)] /= sum
	}
	return
}
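As a quick sanity check, with the roughly 80:20 Ham-to-Spam split noted during the exploratory analysis, priors would return something close to [0.8, 0.2].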

The priors are essentially the proportion of Ham or Spam to the sum of all documents. This is fairly simple. To compute the likelihood, let's look at the loop in Score:

// likelihood
for _, word := range sentence {
	prob := c.prob(word, Class(i))
	score += math.Log(prob)
}

We incorporate the likelihood computation into the scoring function simply for ease of understanding. The important takeaway is that we're summing the log probabilities of each word given the class. How do we calculate P(word|Class)? With the prob method, as follows:

func (c *Classifier) prob(word string, class Class) float64 {
	id, ok := c.corpus.Id(word)
	if !ok {
		return tiny
	}

	freq := c.tfidfs[class].TF[id]
	idf := c.tfidfs[class].IDF[id]
	// idf := 1.0

	// a word may not appear at all in a class.
	if freq == 0 {
		return tiny
	}

	return freq * idf / c.totals[class]
}

First, we check whether the word has been seen. If the word hasn't been seen before, then we return a default value tiny, a small non-zero value that keeps the subsequent math.Log call from blowing up to negative infinity.

The probability of a word occurring in a class is essentially its frequency divided by the class's total (which, per trainOne, counts the documents seen for that class). But we want to go a bit further; we want to stop frequent words from being too important a factor in deciding the probability of the class, so we multiply the frequency by the IDF that we calculated earlier. And that's how you get the probability of a word given a class.

After we have the probability, we take the log of it, and then add it to the score.
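To tie everything together, an end-to-end run looks roughly like the following sketch. The constructor New, the trainingExamples variable, and the sample sentence are assumed purely for illustration; only Train, Postprocess, and Predict come from the code shown above.

// A sketch of end-to-end usage. New(), trainingExamples, and the sample
// sentence are assumed for illustration.
c := New()
c.Train(trainingExamples) // accumulate per-class TF statistics
c.Postprocess()           // compute IDFs and lock the classifier

class := c.Predict([]string{"click", "here", "to", "claim", "your", "prize"})
if class == Spam {
	fmt.Println("this looks like spam")
}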
