The mechanics of Naïve Bayes

Let's start by understanding the magic behind the algorithm: how Naïve Bayes works. Given a data sample x with n features, x1, x2, …, xn (x represents a feature vector and x = (x1, x2, …, xn)), the goal of Naïve Bayes is to determine the probabilities that this sample belongs to each of K possible classes y1, y2, …, yK, that is, P(yk | x) or P(yk | x1, x2, …, xn), where k = 1, 2, …, K. This looks no different from what we have just dealt with: x, or x1, x2, …, xn, is the joint event that the sample has features with the values x1, x2, …, xn respectively, and yk is the event that the sample belongs to class k. We can apply Bayes' theorem right away:
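In the notation just introduced, the theorem reads:

$$ P(y_k \mid x) = \frac{P(x \mid y_k)\, P(y_k)}{P(x)} $$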

Let's look at each component in detail:

  • P(yk) portrays how the classes are distributed, given no further knowledge of the observation's features. It is therefore also called the prior in Bayesian probability terminology. The prior can either be predetermined (usually in a uniform manner, where each class has an equal chance of occurrence) or learned from a set of training samples.
  • P(yk | x), in contrast to the prior P(yk), is the posterior, which incorporates the extra knowledge of the observation.
  • P(x | yk), or P(x1, x2, …, xn | yk), is the joint distribution of the n features given that the sample belongs to class yk, that is, how likely it is that features with these values co-occur. This is named the likelihood in Bayesian terminology. Obviously, the likelihood becomes difficult to compute as the number of features increases. In Naïve Bayes, this is solved thanks to the feature independence assumption: the joint conditional distribution of the n features can be expressed as the product of the individual feature conditional distributions:
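Written out, with the product running over all n features:

$$ P(x_1, x_2, \ldots, x_n \mid y_k) = P(x_1 \mid y_k)\, P(x_2 \mid y_k) \cdots P(x_n \mid y_k) = \prod_{i=1}^{n} P(x_i \mid y_k) $$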

Each conditional distribution can be efficiently learned from a set of training samples.

  • P(x), also called the evidence, depends solely on the overall distribution of the features, which is not specific to any particular class and is therefore a constant. As a result, the posterior is proportional to the prior times the likelihood:
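In symbols:

$$ P(y_k \mid x) \propto P(x \mid y_k)\, P(y_k) = P(y_k) \prod_{i=1}^{n} P(x_i \mid y_k) $$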

The following diagram summarizes how a Naïve Bayes classification model is trained and applied to new data:
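As a quick preview of that train-and-predict workflow, here is a minimal sketch using scikit-learn's MultinomialNB on a hypothetical term-count matrix; the counts below are made up purely for illustration and are not the table from the upcoming example (setting alpha=1.0 gives the Laplace smoothing discussed shortly):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical term-count matrix: each row is a document, each column a vocabulary term
X_train = np.array([
    [2, 1, 0, 0],
    [0, 0, 1, 2],
    [3, 0, 1, 0],
    [0, 1, 2, 1],
])
y_train = np.array([1, 0, 1, 0])    # 1 = spam, 0 = not spam

clf = MultinomialNB(alpha=1.0)      # alpha=1.0 corresponds to Laplace smoothing
clf.fit(X_train, y_train)           # training: estimate priors and per-term likelihoods

X_new = np.array([[1, 0, 0, 1]])    # term counts of a new, unseen document
print(clf.predict_proba(X_new))     # posterior probabilities, columns ordered by clf.classes_
print(clf.predict(X_new))           # predicted class label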

Let's see a Naïve Bayes classifier in action through an example before we jump to its implementations. Given four (pseudo) emails shown in the following table, we are asked to predict how likely it is that a new email is spam:

First, define the S and NS events as an email being spam or not spam, respectively. From the training set, we can easily get the following:
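Assuming that three of the four training emails are spam, as the per-class term totals used later suggest:

$$ P(S) = \frac{3}{4} = 75\%, \qquad P(NS) = \frac{1}{4} = 25\% $$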

Alternatively, we can impose a prior assumption such as P(S) = 1%.

To calculate P(S | x) where x = (free, setup, meeting, free), the first step is to compute P(free | S), P(setup | S), and P(meeting | S) based on the training set, that is, the ratio of the number of occurrences of a term to the total number of term occurrences in the S class. However, since the term free was not seen in the NS class training set, P(free | NS) will become 0, and so will P(x | NS) and P(NS | x). The new email will then be falsely predicted as spam. To eliminate the zero multiplication factor caused by the unseen term, we usually assign each term an initial count of 1, that is, we start counting term occurrences from one. This technique is also called Laplace smoothing. With this amendment, we now have the following:
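Here, free is assumed to appear twice among the spam-class terms, while setup and meeting are assumed not to appear there at all (counts consistent with the totals explained next and with the final result):

$$ P(\text{free} \mid S) = \frac{2+1}{9+6} = \frac{3}{15}, \quad P(\text{setup} \mid S) = \frac{0+1}{9+6} = \frac{1}{15}, \quad P(\text{meeting} \mid S) = \frac{0+1}{9+6} = \frac{1}{15} $$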

Here, 9 is the total number of term occurrences in the S class (3 + 3 + 3), 4 is the total number of term occurrences in the NS class, and 6 comes from the 1 additional count per vocabulary term (click, win, prize, meeting, setup, free). Similarly, we can compute the following:
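Assuming likewise that setup appears once and meeting twice in the non-spam email, while free does not appear at all:

$$ P(\text{free} \mid NS) = \frac{0+1}{4+6} = \frac{1}{10}, \quad P(\text{setup} \mid NS) = \frac{1+1}{4+6} = \frac{2}{10}, \quad P(\text{meeting} \mid NS) = \frac{2+1}{4+6} = \frac{3}{10} $$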

Hence we have the following formula:
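Because the features are assumed independent given the class, P(x | S) and P(x | NS) factor into the per-term likelihoods above, so the ratio of the two posteriors is:

$$ \frac{P(S \mid x)}{P(NS \mid x)} = \frac{P(S)\, P(x \mid S)}{P(NS)\, P(x \mid NS)} = \frac{\frac{3}{4} \cdot \frac{3}{15} \cdot \frac{1}{15} \cdot \frac{1}{15} \cdot \frac{3}{15}}{\frac{1}{4} \cdot \frac{1}{10} \cdot \frac{2}{10} \cdot \frac{3}{10} \cdot \frac{1}{10}} = \frac{8}{9} $$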

Also, remember this:
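The two posteriors cover all the possibilities and therefore sum to 1:

$$ P(S \mid x) + P(NS \mid x) = 1 $$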

So, finally, we have the following:
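Combining the ratio with this normalization constraint:

$$ P(S \mid x) = \frac{8/9}{1 + 8/9} = \frac{8}{17} \approx 47.1\% $$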

There is a 47.1% chance that the new email is spam.
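As a sanity check, the hand calculation can be reproduced in a few lines of Python; this is only a sketch built on the per-class term counts assumed above, not the implementations we will get to shortly:

# Laplace-smoothed Naive Bayes by hand for the new email x = (free, setup, meeting, free)
# Per-class term counts and totals are the ones assumed in the worked example
counts = {
    'S':  {'free': 2, 'setup': 0, 'meeting': 0, 'total': 9},
    'NS': {'free': 0, 'setup': 1, 'meeting': 2, 'total': 4},
}
priors = {'S': 3 / 4, 'NS': 1 / 4}
vocab_size = 6                      # click, win, prize, meeting, setup, free
new_email = ['free', 'setup', 'meeting', 'free']

# Unnormalized posteriors: prior * product of Laplace-smoothed likelihoods
scores = {}
for label, c in counts.items():
    likelihood = 1.0
    for term in new_email:
        likelihood *= (c[term] + 1) / (c['total'] + vocab_size)
    scores[label] = priors[label] * likelihood

# Normalize so the two posteriors sum to 1
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
print(posteriors)                   # {'S': 0.4705..., 'NS': 0.5294...}, i.e. about a 47.1% chance of spam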
