Chapter 6. Bayesian Classification Models

We introduced the classification machine learning task in Chapter 4, Machine Learning Using Bayesian Inference, where we said that the objective of classification is to assign a data record to one of a set of predetermined classes. Classification is one of the most studied machine learning tasks, and there are several well-established, state-of-the-art methods for it. These include logistic regression models, support vector machines, random forest models, and neural network models. With sufficient labeled training data, these models can achieve accuracies above 95% in many practical problems.

The obvious question, then, is why you would need Bayesian methods for classification. There are two answers. The first is that it is often difficult to obtain a large amount of labeled data for training. When a problem has hundreds or thousands of features, these supervised methods typically need a large training set to avoid overfitting. Bayesian methods can overcome this problem through Bayesian averaging and hence require only a small to medium-sized training set. The second is that most of these methods, such as SVMs or neural networks, behave like black boxes: they give very accurate results but little insight into which variables are important for the prediction. In many practical problems, for example, in the diagnosis of a disease, it is important to identify the leading causes, so a black-box approach is not sufficient. Bayesian methods have an inherent feature called Automatic Relevance Determination (ARD) by which the important variables in a problem can be identified.

In this chapter, two Bayesian classification models will be discussed. The first one is the popular Naïve Bayes method for text classification. The second is the Bayesian logistic regression model. Before we discuss each of these models, let's review some of the performance metrics that are commonly used in the classification task.

Performance metrics for classification

To understand the concepts easily, let's take the case of binary classification, where the task is to classify an input feature vector into one of two classes: -1 or 1. Assume that 1 is the positive class and -1 is the negative class. The predicted output contains only -1 or 1, but there can be two types of errors. Some records whose actual class is -1 could be predicted as 1; this is called a false positive or type I error. Similarly, some records whose actual class is 1 could be predicted as -1; this is called a false negative or type II error. For binary classification, these two types of errors can be represented in a confusion matrix, as shown below.

Confusion Matrix

                            Predicted Class
                            Positive    Negative
Actual Class    Positive    TP          FN
                Negative    FP          TN

From the confusion matrix, we can derive the following performance metrics (a short code sketch after the list shows how they can be computed):

  • Precision: TP / (TP + FP). This gives the fraction of records predicted as positive that are actually positive.
  • Recall: TP / (TP + FN). This gives the fraction of positives in the test data set that have been correctly predicted.
  • F-score: 2 × Precision × Recall / (Precision + Recall). This is the harmonic mean of precision and recall.
  • True positive rate (TPR): TP / (TP + FN). This is the same as recall.
  • False positive rate (FPR): FP / (FP + TN). This gives the fraction of negatives in the test data set that are incorrectly classified as positive.
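As a concrete illustration, here is a minimal Python sketch that computes these metrics from a pair of label vectors. The vectors actual and predicted are invented toy data, not output from a real classifier:

def confusion_counts(actual, predicted):
    # Count TP, FP, FN, and TN, treating +1 as the positive class
    # and -1 as the negative class.
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == -1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == -1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == -1 and p == -1)
    return tp, fp, fn, tn

actual    = [1, 1, 1, -1, -1, -1, 1, -1]   # true classes (toy data)
predicted = [1, -1, 1, -1, 1, -1, 1, -1]   # classifier output (toy data)

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)                 # 0.75 for this toy data
recall    = tp / (tp + fn)                 # 0.75; also the TPR
f_score   = 2 * precision * recall / (precision + recall)
fpr       = fp / (fp + tn)                 # 0.25
print(precision, recall, f_score, fpr)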

The TPR is also called the sensitivity of the classifier, and 1 - FPR is called its specificity. A plot of TPR versus FPR (sensitivity versus 1 - specificity) is called a receiver operating characteristic (ROC) curve. It is used to find the best threshold (the operating point of the classifier) for deciding whether a predicted output (usually a score or probability) belongs to class 1 or -1.

Usually, the threshold is taken at the knee of the ROC curve (the point closest to the top-left corner), which gives the best performance with the fewest false predictions. The area under the ROC curve, or AUC, is another measure of classifier performance. For a purely random model, the ROC curve is a straight line along the diagonal, and the corresponding AUC is 0.5. Classifiers with an AUC above 0.8 are generally considered good, though this depends very much on the problem being solved.
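The following sketch shows one way an ROC curve and its AUC can be computed by sweeping a threshold down the sorted scores. The scores and labels here are again invented for illustration; in practice they would come from a trained classifier:

import numpy as np

labels = np.array([1, 1, -1, 1, -1, -1, 1, -1])                 # toy classes
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.1])   # toy scores

# Sort examples by decreasing score and trace the ROC curve one example
# at a time: each positive moves the curve up (TPR), each negative
# moves it to the right (FPR).
order = np.argsort(-scores)
sorted_labels = labels[order]

n_pos = np.sum(labels == 1)
n_neg = np.sum(labels == -1)
tpr = np.concatenate(([0.0], np.cumsum(sorted_labels == 1) / n_pos))
fpr = np.concatenate(([0.0], np.cumsum(sorted_labels == -1) / n_neg))

# Area under the curve via the trapezoidal rule.
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)
print("AUC =", auc)   # 0.75 for this toy data

Each point on the (FPR, TPR) trace corresponds to one candidate threshold, so the operating point closest to the top-left corner can be read off directly from the curve.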
