Introducing the Naive Bayes classifier

Naive Bayes is probably one of the most elegant machine learning algorithms out there that is of practical use. Despite its name, it is not that naive when you look at its classification performance. It proves to be quite robust to irrelevant features, which it kindly ignores. It learns fast and predicts equally fast. It does not require lots of storage. So, why is it called naive?

The "naive" was added to account for one assumption that is required for Bayes to work optimally: all features must be independent of each other. This, however, is rarely the case in real-world applications. Nevertheless, Naive Bayes still returns very good accuracy in practice, even when the independence assumption does not hold.

Getting to know the Bayes theorem

At its core, Naive Bayes classification is nothing more than keeping track of which feature gives evidence to which class. To ease our understanding, let us assume the following meanings for the variables that we will use to explain Naive Bayes:

Variable   Possible values         Meaning
C          "pos", "neg"            Class of a tweet (positive or negative)
F1         Non-negative integers   Counting the occurrences of awesome in the tweet
F2         Non-negative integers   Counting the occurrences of crazy in the tweet

During training, we learn the Naive Bayes model, which is the probability for a class C when we already know the features F1 and F2. This probability is written as P(C|F1, F2).

Since we cannot estimate this directly, we apply a trick that was discovered by Bayes:

P(A) \cdot P(B \mid A) = P(B) \cdot P(A \mid B)

If we substitute A with the probability of both features F1 and F2 occurring and think of B as being our class C, we arrive at the relationship that helps us to later retrieve the probability for the data instance belonging to the specified class:

P(F_1, F_2) \cdot P(C \mid F_1, F_2) = P(C) \cdot P(F_1, F_2 \mid C)

This allows us to express P(C|F1, F2) by means of the other probabilities:

P(C \mid F_1, F_2) = \frac{P(C) \cdot P(F_1, F_2 \mid C)}{P(F_1, F_2)}

We could also say that:

\text{posterior} = \frac{\text{prior} \cdot \text{likelihood}}{\text{evidence}}

The prior and evidence values are easily determined:

  • P(C) is the prior probability of class C, that is, the probability of the class before knowing anything about the data. This quantity can be obtained by simply calculating the fraction of all training data instances belonging to that particular class.
  • P(F1, F2) is the evidence, or the probability of the features F1 and F2. This can be retrieved by calculating the fraction of all training data instances having that particular feature value combination.
  • The tricky part is the calculation of the likelihood P(F1, F2|C). It is the value describing how likely it is to see the feature values F1 and F2 if we know that the class of the data instance is C. To estimate this, we need a bit more thinking.

Being naive

From probability theory, we also know the following relationship:

P(F_1, F_2 \mid C) = P(F_1 \mid C) \cdot P(F_2 \mid C, F_1)

This alone, however, does not help much, since we have only traded one difficult problem (estimating P(F1, F2|C)) for another one (estimating P(F2|C, F1)).

However, if we naively assume that F1 and F2 are independent of each other given C, P(F2|C, F1) simplifies to P(F2|C), and we can write it as follows:

P(F_1, F_2 \mid C) = P(F_1 \mid C) \cdot P(F_2 \mid C)

Putting everything together, we get this quite manageable formula:

P(C \mid F_1, F_2) = \frac{P(C) \cdot P(F_1 \mid C) \cdot P(F_2 \mid C)}{P(F_1, F_2)}

The interesting thing is that although it is not theoretically correct to simply tweak our assumptions when we are in the mood to do so, in this case it proves to work astonishingly well in real-world applications.

Using Naive Bayes to classify

Given a new tweet, the only part left is to simply calculate the probabilities:

P(C = \text{"pos"} \mid F_1, F_2) = \frac{P(C = \text{"pos"}) \cdot P(F_1 \mid C = \text{"pos"}) \cdot P(F_2 \mid C = \text{"pos"})}{P(F_1, F_2)}

P(C = \text{"neg"} \mid F_1, F_2) = \frac{P(C = \text{"neg"}) \cdot P(F_1 \mid C = \text{"neg"}) \cdot P(F_2 \mid C = \text{"neg"})}{P(F_1, F_2)}

We then choose the class having the higher probability. As the denominator, P(F1, F2), is the same for both classes, we can simply ignore it without changing the winning class.

Note, however, that we don't calculate any real probabilities any more. Instead, we are estimating which class is more likely given the evidence. This is another reason why Naive Bayes is so robust: it is not so much interested in the real probabilities, but only in the information of which class is more likely. In short, we can write it as follows:

c_{best} = \arg\max_{c \in C} P(C = c) \cdot P(F_1 \mid C = c) \cdot P(F_2 \mid C = c)

Here we are calculating the part after argmax for all classes of C ("pos" and "neg" in our case) and returning the class that results in the highest value.
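Expressed as code, this decision rule could look like the following minimal sketch. The function and table names are ours, purely for illustration; the probability tables are assumed to have been estimated beforehand by counting, as described above.

def classify(prior, p_f1, p_f2, f1, f2):
    # prior: {"pos": P(C="pos"), "neg": P(C="neg")}
    # p_f1, p_f2: conditional probability tables, for example
    #             p_f1["pos"][1] = P(F1=1 | C="pos")
    # The evidence P(F1, F2) is the same for every class, so it is dropped.
    scores = {c: prior[c] * p_f1[c][f1] * p_f2[c][f2] for c in prior}
    return max(scores, key=scores.get)   # argmax over the classes

With the toy corpus introduced next, this sketch returns "pos" for the tweet awesome (F1=1, F2=0) and "neg" for crazy (F1=0, F2=1).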

But for the following example, let us stick to real probabilities and do some calculations to see how Naive Bayes works. For the sake of simplicity, we will assume that Twitter allows only the two words mentioned earlier, awesome and crazy, and that we have already manually classified a handful of tweets:

Tweet            Class
awesome          Positive
awesome          Positive
awesome crazy    Positive
crazy            Positive
crazy            Negative
crazy            Negative

In this case, we have six total tweets, out of which four are positive and two negative, which results in the following priors:

P(C = \text{"pos"}) = \frac{4}{6} \approx 0.67 \qquad P(C = \text{"neg"}) = \frac{2}{6} \approx 0.33

This means that, without knowing anything about the tweet itself, we would be wise to assume that the tweet is positive.

The piece that is still missing is the calculation of P(F1|C) and P(F2|C), which are the probabilities for the two features F1 and F2 conditioned on class C.

This is calculated as the number of tweets in which we have seen the concrete feature value, divided by the number of tweets that have been labeled with the class c. Let's say we want to know the probability of seeing awesome occurring once in a tweet, knowing that its class is "positive"; we would have the following:

P(F_1 = 1 \mid C = \text{"pos"}) = \frac{\text{number of positive tweets containing awesome}}{\text{number of positive tweets}} = \frac{3}{4} = 0.75

Since three out of the four positive tweets contained the word awesome, the probability for not having awesome in a positive tweet is obviously its complement (as we have seen only tweets with counts of 0 or 1):

P(F_1 = 0 \mid C = \text{"pos"}) = \frac{1}{4} = 0.25

Similarly for the rest (omitting the case that a word is not occurring in a tweet):

P(F_2 = 1 \mid C = \text{"pos"}) = \frac{2}{4} = 0.5

P(F_1 = 1 \mid C = \text{"neg"}) = \frac{0}{2} = 0

P(F_2 = 1 \mid C = \text{"neg"}) = \frac{2}{2} = 1

For the sake of completeness, we will also compute the evidence so that we can see real probabilities in the following example tweets. For two concrete values of F1 and F2, we can calculate the evidence as follows:

P(F_1, F_2) = P(F_1, F_2 \mid C = \text{"pos"}) \cdot P(C = \text{"pos"}) + P(F_1, F_2 \mid C = \text{"neg"}) \cdot P(C = \text{"neg"})

This leads to the following values:

P(F_1 = 1, F_2 = 0) = 0.75 \cdot 0.5 \cdot 0.67 + 0 \cdot 0 \cdot 0.33 = 0.25

P(F_1 = 0, F_2 = 1) = 0.25 \cdot 0.5 \cdot 0.67 + 1 \cdot 1 \cdot 0.33 \approx 0.42

P(F_1 = 1, F_2 = 1) = 0.75 \cdot 0.5 \cdot 0.67 + 0 \cdot 1 \cdot 0.33 = 0.25
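To double-check these hand calculations, here is a small sketch (our own code, not the implementation we will use later) that recomputes the priors, the conditional probabilities, and the evidence values directly from the six training tweets:

# Recompute the toy numbers from the six training tweets.
tweets = [("awesome", "pos"), ("awesome", "pos"), ("awesome crazy", "pos"),
          ("crazy", "pos"), ("crazy", "neg"), ("crazy", "neg")]

def features(text):
    words = text.split()
    return words.count("awesome"), words.count("crazy")   # (F1, F2)

pos = [features(t) for t, c in tweets if c == "pos"]
neg = [features(t) for t, c in tweets if c == "neg"]

p_pos, p_neg = len(pos) / 6.0, len(neg) / 6.0              # priors: 0.67 and 0.33
p_f1_1_pos = sum(f1 for f1, f2 in pos) / float(len(pos))   # 3/4 = 0.75
p_f2_1_pos = sum(f2 for f1, f2 in pos) / float(len(pos))   # 2/4 = 0.5
p_f1_1_neg = sum(f1 for f1, f2 in neg) / float(len(neg))   # 0/2 = 0
p_f2_1_neg = sum(f2 for f1, f2 in neg) / float(len(neg))   # 2/2 = 1

def evidence(f1, f2):
    # P(F1, F2) via the law of total probability and the naive assumption
    def cond(p1, p2):
        return (p1 if f1 else 1 - p1) * (p2 if f2 else 1 - p2)
    return cond(p_f1_1_pos, p_f2_1_pos) * p_pos + cond(p_f1_1_neg, p_f2_1_neg) * p_neg

print(evidence(1, 0), evidence(0, 1), evidence(1, 1))       # roughly 0.25, 0.42, 0.25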

Now we have all the data to classify new tweets. The only work left is to parse the tweet and give features to it.

Tweet           F1, F2   Class probabilities                                     Classification
awesome         1, 0     P(C="pos"|F1=1, F2=0) = 0.75 · 0.5 · 0.67 / 0.25 = 1    Positive
                         P(C="neg"|F1=1, F2=0) = 0 · 0 · 0.33 / 0.25 = 0
crazy           0, 1     P(C="pos"|F1=0, F2=1) = 0.25 · 0.5 · 0.67 / 0.42 ≈ 0.2  Negative
                         P(C="neg"|F1=0, F2=1) = 1 · 1 · 0.33 / 0.42 ≈ 0.8
awesome crazy   1, 1     P(C="pos"|F1=1, F2=1) = 0.75 · 0.5 · 0.67 / 0.25 = 1    Positive
                         P(C="neg"|F1=1, F2=1) = 0 · 1 · 0.33 / 0.25 = 0
text            0, 0     P(C="pos"|F1=0, F2=0) = 0.25 · 0.5 · 0.67 / 0           Undefined, because we have never
                         P(C="neg"|F1=0, F2=0) = 1 · 0 · 0.33 / 0                seen this feature combination in
                                                                                 the training data before

So far, so good. The classification of trivial tweets makes sense except for the last one, which results in a division by zero. How can we handle that?

Accounting for unseen words and other oddities

When we calculated the preceding probabilities, we actually cheated ourselves. We were not calculating the real probabilities, but only rough approximations by means of the fractions. We assumed that the training corpus would tell us the whole truth about the real probabilities. It did not. A corpus of only six tweets obviously cannot give us all the information about every tweet that has ever been written. For example, there certainly are tweets containing the word "text", it is just that we have never seen them. Apparently, our approximation is very rough, so we should account for that. This is often done in practice with "add-one smoothing".

Tip

Add-one smoothing is sometimes also referred to as additive smoothing or Laplace smoothing. Note that Laplace smoothing has nothing to do with Laplacian smoothing, which is related to smoothing of polygon meshes. If we do not smooth by one but by an adjustable parameter alpha greater than zero, it is called Lidstone smoothing.

It is a very simple technique: we add one to all counts. Its underlying assumption is that even if we have not seen a given word in the whole corpus, there is still a chance that our sample of tweets just happened to not include it. So, with add-one smoothing we pretend that we have seen every occurrence once more than we actually did. That means that instead of calculating the following:

P(F_1 = 1 \mid C = \text{"pos"}) = \frac{3}{4} = 0.75

We now calculate:

P(F_1 = 1 \mid C = \text{"pos"}) = \frac{3 + 1}{4 + 2} \approx 0.67

Why do we add 2 in the denominator? We have to make sure that the end result is again a probability, so we have to normalize the counts so that all probabilities sum up to one. As in our current dataset awesome can occur either zero times or once, there are two cases over which to normalize. And indeed, we get 1 as the total probability:

P(F_1 = 1 \mid C = \text{"pos"}) + P(F_1 = 0 \mid C = \text{"pos"}) = \frac{3 + 1}{4 + 2} + \frac{1 + 1}{4 + 2} = 1

Similarly, we do this for the prior probabilities:

P(C = \text{"pos"}) = \frac{4 + 1}{6 + 2} = 0.625 \qquad P(C = \text{"neg"}) = \frac{2 + 1}{6 + 2} = 0.375
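As a quick sanity check, the smoothed estimates can be computed directly from the counts. The helper below is our own illustration; its alpha parameter corresponds to the Lidstone smoothing mentioned in the preceding tip:

def smoothed(count, total, n_values, alpha=1.0):
    # Additive (Lidstone) smoothing; alpha=1 gives add-one/Laplace smoothing.
    # n_values is the number of distinct values the variable can take
    # (2 for our 0/1 word counts, 2 for the classes "pos"/"neg").
    return (count + alpha) / (total + alpha * n_values)

print(smoothed(3, 4, 2))   # P(F1=1 | C="pos") = (3+1)/(4+2) ~ 0.67
print(smoothed(1, 4, 2))   # P(F1=0 | C="pos") = (1+1)/(4+2) ~ 0.33
print(smoothed(4, 6, 2))   # P(C="pos")        = (4+1)/(6+2) = 0.625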

Accounting for arithmetic underflows

There is yet another roadblock. In reality, we work with probabilities much smaller than the ones we have dealt with in the toy example. Typically, we also have many more than two features, which we multiply with each other. This will quickly lead to the point where the precision provided by NumPy does not suffice any more:

>>> import numpy as np
>>> np.set_printoptions(precision=20) # tell numpy to print out more digits (default is 8)
>>> np.array([2.48E-324])
array([ 4.94065645841246544177e-324]) 
>>> np.array([2.47E-324])
array([ 0.])

So, how probable is it that we will ever hit a number like 2.47E-324? To answer this, we just have to imagine a likelihood for the conditional probabilities of 0.00001 and then multiply 65 of them together (meaning that we have 65 features with low probability values), and we are hit by an arithmetic underflow:

>>> x=0.00001
>>> x**64 # still fine
1e-320
>>> x**65 # ouch
0.0

A float in Python is typically implemented using a double in C. To find out whether this is the case for your platform, you can check it as follows:

>>> import sys
>>> sys.float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

To mitigate this, you could switch to math libraries such as mpmath (http://code.google.com/p/mpmath/), which allow arbitrary precision. However, they are not fast enough to work as a NumPy replacement.
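As a quick illustration (assuming mpmath is installed), the power that underflowed to 0.0 in the earlier example stays representable, because mpmath is not limited to the exponent range of a C double:

from mpmath import mpf

x = mpf("0.00001")
print(x ** 65)   # on the order of 1e-325 instead of underflowing to 0.0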

Fortunately, there is a better way to take care of this, and it has to do with a nice relationship that we might still remember from school:

\log(x \cdot y) = \log(x) + \log(y)

If we apply it to our case, we get the following:

\log\big(P(C) \cdot P(F_1 \mid C) \cdot P(F_2 \mid C)\big) = \log P(C) + \log P(F_1 \mid C) + \log P(F_2 \mid C)

As the probabilities are in the interval between 0 and 1, the logarithms of the probabilities lie in the interval between -∞ and 0. Don't be irritated by that: higher values are still a stronger indicator for the correct class; it is only that they are negative now.

[Figure: graph of the logarithm function, which rises monotonically from very negative values near 0 toward 0 at an input of 1.]

There is one caveat though: we don't actually have a logarithm in the formula's numerator (the part above the fraction bar); we only have the product of the probabilities. Luckily, in our case we are not interested in the actual values of the probabilities. We simply want to know which class has the highest posterior probability. We are lucky because if we find this:

P(C = \text{"pos"} \mid F_1, F_2) > P(C = \text{"neg"} \mid F_1, F_2)

Then we also have the following:

\log P(C = \text{"pos"} \mid F_1, F_2) > \log P(C = \text{"neg"} \mid F_1, F_2)

A quick look at the previous graph shows that the curve never goes down when we go from left to right. In short, applying the logarithm does not change which value is the highest. So, let us stick this into the formula we used earlier:

c_{best} = \arg\max_{c \in C} \log\big(P(C = c) \cdot P(F_1 \mid C = c) \cdot P(F_2 \mid C = c)\big)

We will use this to retrieve the formula for two features that will give us the best class for real-world data that we will see in practice:

c_{best} = \arg\max_{c \in C} \big(\log P(C = c) + \log P(F_1 \mid C = c) + \log P(F_2 \mid C = c)\big)

Of course, we will not be very successful with only two features, so let us rewrite it to allow an arbitrary number of features:

c_{best} = \arg\max_{c \in C} \Big(\log P(C = c) + \sum_{k} \log P(F_k \mid C = c)\Big)
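In code, the log-space version of the decision rule could look like the following sketch (again with names of our own choosing; the conditional probabilities are assumed to be the smoothed estimates from the previous section, already looked up for the observed feature values):

import numpy as np

def classify_log(prior, cond_probs):
    # prior:      dict mapping class -> P(C=c)
    # cond_probs: dict mapping class -> list of P(F_k = observed value | C=c),
    #             one entry per feature
    # Computes argmax_c [log P(C=c) + sum_k log P(F_k|C=c)]; summing logarithms
    # avoids the arithmetic underflow of multiplying many small probabilities.
    scores = {c: np.log(prior[c]) + sum(np.log(p) for p in cond_probs[c])
              for c in prior}
    return max(scores, key=scores.get)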

There we are, ready to use our first classifier from the scikit-learn toolkit.
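As a small teaser (our own sketch, not yet the code we will develop), scikit-learn's MultinomialNB implements this kind of Naive Bayes with additive smoothing controlled by its alpha parameter:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

tweets = ["awesome", "awesome", "awesome crazy", "crazy", "crazy", "crazy"]
labels = ["pos", "pos", "pos", "pos", "neg", "neg"]

# Turn the tweets into word-count features and train with add-one smoothing
# (alpha=1.0 is the default).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)
clf = MultinomialNB(alpha=1.0)
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["awesome crazy"])))   # ['pos']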
