The conditional independence assumption

The assumption that makes the model tractable, and that justifies calling it Naive, is that the features are independent conditional on the outcome. To illustrate, let's classify an email containing the three words Send money now, so that Bayes' theorem becomes the following:

P(spam | send, money, now) = P(send, money, now | spam) × P(spam) / P(send, money, now)

Formally, the assumption that the three words are conditionally independent means that the probability of observing send is not affected by the presence of the other terms, given that the email is spam; in other words, P(send | money, now, spam) = P(send | spam). As a result, we can simplify the likelihood function:

P(send, money, now | spam) = P(send | spam) × P(money | spam) × P(now | spam)

Using the naive conditional independence assumption, each term in the numerator is straightforward to compute as a relative frequency from the training data. The denominator is constant across classes and can be ignored when posterior probabilities only need to be compared rather than calibrated. The prior probability becomes less relevant as the number of factors (that is, features) increases.
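
As a concrete illustration, the sketch below computes these relative frequencies and the resulting unnormalized posteriors for the toy spam example; the word and class counts are hypothetical and serve only to show the mechanics of the calculation.

```python
# Minimal sketch of Naive Bayes scoring with relative-frequency estimates.
# All counts below are hypothetical toy values.
n_spam, n_ham = 40, 60  # hypothetical number of training emails per class

# Number of training emails that contain each word, per class (hypothetical).
word_counts = {
    'send':  {'spam': 30, 'ham': 12},
    'money': {'spam': 24, 'ham': 9},
    'now':   {'spam': 20, 'ham': 15},
}

priors = {'spam': n_spam / (n_spam + n_ham), 'ham': n_ham / (n_spam + n_ham)}
class_sizes = {'spam': n_spam, 'ham': n_ham}

def unnormalized_posterior(label):
    """Prior times the product of per-word conditional relative frequencies."""
    likelihood = 1.0
    for word in ('send', 'money', 'now'):
        likelihood *= word_counts[word][label] / class_sizes[label]
    return priors[label] * likelihood

# The shared denominator P(send, money, now) cancels when comparing classes,
# so the unnormalized scores suffice to pick the more probable class.
scores = {label: unnormalized_posterior(label) for label in ('spam', 'ham')}
print(scores, max(scores, key=scores.get))
```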

In summary, the advantages of the Naive Bayes model are fast training and prediction because the number of parameters is linear in the number of features, and their estimation has a closed-form solution (based on training data frequencies) rather than expensive iterative optimization. It is also intuitive and somewhat interpretable, does not require hyperparameter tuning, and is relatively robust to irrelevant features given a sufficient signal.
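
This closed-form, frequency-based estimation is what makes fitting the model so fast in practice. The snippet below is a minimal sketch using scikit-learn (assumed here as the implementation; the corpus and labels are hypothetical): it fits a multinomial Naive Bayes classifier on term counts and scores the example email.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; 1 = spam, 0 = not spam.
docs = ['send money now', 'send the report now', 'money for nothing',
        'meeting notes attached', 'send money', 'see you at the meeting']
labels = [1, 0, 1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)                 # term-count features
clf = MultinomialNB().fit(X, labels)        # frequency-based, closed-form fit
print(clf.predict(vec.transform(['send money now'])))  # -> spam for this toy data
```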

However, when the independence assumption does not hold, because classification depends on combinations of features or because features are correlated, the model will perform poorly.
