Classification using the One-Vs-The-Rest approach

In this subsection, we will describe an example of multiclass classification using the OVTR algorithm, which converts the problem into an equivalent set of multiple binary classification problems. The OVTR strategy breaks the problem down by fitting one binary classifier per class: for each classifier, all the samples of the current class are treated as positive samples, and the samples of all the other classes are treated as negative samples.
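To see the mechanics of this decomposition, here is a minimal Scala sketch; the labels, class count, and variable names are made-up values for illustration only:

    // Minimal sketch of the OVTR decomposition, assuming integer class labels 0..K-1
    val labels = Seq(0, 2, 1, 0, 2)   // hypothetical multiclass labels for five samples
    val numClasses = 3

    // For class k, samples of class k become 1.0 (positive), all others 0.0 (negative)
    val binaryTargets: Seq[Seq[Double]] =
      (0 until numClasses).map { k =>
        labels.map(l => if (l == k) 1.0 else 0.0)
      }
    // binaryTargets(k) is the target vector used to train the k-th binary classifier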

This is undoubtedly a modular machine learning technique. On the downside, however, the strategy requires the base classifiers to produce real-valued confidence scores rather than predictions of the actual labels. A second disadvantage is that if the classifiers output only discrete class labels, the predictions can become ambiguous: multiple classes may be predicted for a single sample, with no principled way to break the tie.
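As a toy illustration of this ambiguity (all the votes and scores below are made-up numbers), compare hard 0/1 decisions with real-valued confidence scores:

    // With only discrete labels, two classifiers can both claim the same sample:
    val hardVotes = Seq(1, 0, 1)          // classifiers 1 and 3 both say "positive"
    // there is no principled way to break this tie

    // Real-valued confidence scores resolve it by taking the argmax:
    val scores = Seq(0.72, 0.15, 0.64)    // one positive-class score per classifier
    val predictedClass = scores.indexOf(scores.max)   // class 0 wins here

To make the preceding discussion clearer, let's now walk through an example.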

Suppose that we have a set of 50 observations divided into three classes, and that we use the same logic as before for selecting the negative examples. For the training phase, we then have the following setting:

  • Classifier 1 has 30 positive examples and 20 negative examples
  • Classifier 2 has 36 positive examples and 14 negative examples
  • Classifier 3 has 14 positive examples and 36 negative examples

On the other hand, for the testing phase, suppose we have a new instance that needs to be classified into one of the preceding classes. Each of the three classifiers produces a probability for this instance: an estimate of how likely it is to belong to that classifier's positive class. Since the setup is one-vs-the-rest, we always compare the probabilities of the positive class across the classifiers. Note that for N classes we obtain N probability estimates of the positive class for one test sample; we compare them, and the test sample is assigned to the class whose classifier yields the maximum of the N probabilities. Spark provides this multiclass-to-binary reduction through the OVTR algorithm, where the Logistic Regression algorithm is used as the base classifier.
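In Spark ML, this reduction is available as the OneVsRest estimator. The following is a minimal sketch; the train and test DataFrames, with the conventional label and features columns, are assumptions here and would come from your own data-preparation step:

    import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

    // Base binary classifier: plain logistic regression
    val classifier = new LogisticRegression()
      .setMaxIter(100)
      .setTol(1e-6)

    // OneVsRest fits one logistic regression model per class
    val ovr = new OneVsRest().setClassifier(classifier)

    val ovrModel = ovr.fit(train)               // training phase: N binary models
    val predictions = ovrModel.transform(test)  // testing phase: argmax over N scores

Internally, transform assigns each row to the class whose binary model produces the highest confidence, which is exactly the argmax logic described previously.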

Now let's see another example, on a real dataset, to demonstrate how Spark classifies all the features using the OVTR algorithm. The OVTR classifier will eventually predict handwritten characters from the Optical Character Recognition (OCR) dataset. However, before diving into the demonstration, let's first explore the OCR dataset to get a feel for the nature of the data. Note that when OCR software first processes a document, it divides the paper (or any other object) into a matrix such that each cell in the grid contains a single glyph (that is, a distinct graphical shape), which is just an elaborate way of referring to a letter, symbol, number, or any other contextual information from the paper or the object.

To demonstrate the OCR pipeline, let's assume that the document contains only alphabetic characters in English, so that each glyph matches one of the 26 capital letters, A to Z. We will use the OCR letter dataset from the UCI Machine Learning Repository, donated by P. W. Frey and D. J. Slate. While exploring the dataset, you should observe 20,000 examples of the 26 English capital letters, printed using 20 different, randomly reshaped and distorted black-and-white fonts, so the glyphs appear in many different shapes. In short, predicting any of the 26 letters makes the problem a multiclass classification problem with 26 classes; consequently, a binary classifier alone cannot serve our purpose.

Figure 1: Some of the printed glyphs (Source: Letter recognition using Holland-style adaptive classifiers, Machine Learning, Vol. 6, pp. 161-182, by P. W. Frey and D. J. Slate, 1991)

The preceding figure shows the images I explained earlier. The dataset provides examples of printed glyphs distorted in this way; the letters are therefore computationally challenging for a computer to identify, yet they are easily recognized by a human being. The following figure shows the statistical attributes of the top 20 rows:

Figure 2: The snapshot of the dataset shown as the data frame
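A view like the one in the preceding figure can be produced with a few lines of Spark code. The following is a minimal sketch, assuming the dataset has been downloaded from the UCI repository as letter-recognition.data (the file path and the generated column names are assumptions):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("OCRLetterExploration")
      .master("local[*]")
      .getOrCreate()

    // The file is comma-separated with no header: the first column is the letter,
    // followed by 16 integer attributes describing the glyph
    val colNames = "letter" +: (1 to 16).map(i => s"f$i")
    val ocrDF = spark.read
      .option("inferSchema", "true")
      .csv("data/letter-recognition.data")
      .toDF(colNames: _*)

    ocrDF.show(20)            // the top 20 rows, as in the preceding figure
    ocrDF.describe().show()   // basic summary statistics of the attributes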