Supervised learning

An important subfield of machine learning is supervised learning. In supervised learning, we try to learn from a set of labeled training data; that is, every data sample has a desired target value or true output value. These target values could correspond to the continuous output of a function (such as y in y = sin(x)), or to more abstract and discrete categories (such as cat or dog). If we are dealing with continuous output, the process is called regression, and if we are dealing with discrete output, the process is called classification. Predicting housing prices from the sizes of houses is an example of regression; predicting the species of a fish from its color is an example of classification. In this chapter, we will focus on classification using SVMs.
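To make the distinction concrete, here is a minimal sketch in NumPy of what labeled data looks like in both cases (the classification feature values are made up purely for illustration):

import numpy as np

# Regression: the target is continuous, here y = sin(x).
X_reg = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)  # 100 samples, 1 feature
y_reg = np.sin(X_reg).ravel()                          # one real value per sample

# Classification: the target is a discrete category, here 0 = cat, 1 = dog.
X_clf = np.array([[0.2, 0.7],
                  [0.9, 0.1]])  # two samples with two (made-up) feature values
y_clf = np.array([0, 1])        # one class label per sample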

The training procedure

As an example, we may want to learn what cats and dogs look like. To make this a supervised learning task, we will have to create a database of pictures of both cats and dogs (also called a training set), and annotate each picture in the database with its corresponding label: cat or dog. The task of the program (in the literature, it is often referred to as the learner) is then to infer the correct label for each of these pictures (that is, for each picture, predict whether it is a picture of a cat or a dog). Based on these predictions, we derive a score of how well the learner performed. The score is then used to change the parameters of the learner in order to improve the score over time.
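In code, this loop could look something like the following sketch; learner and its predict and update methods are hypothetical placeholders rather than a real library API:

# Hypothetical sketch of the generic training loop described above. The
# 'learner' object is assumed to expose predict() and update() methods,
# which are placeholders, not part of any real library.
def train(learner, pictures, labels, n_epochs=10):
    for epoch in range(n_epochs):
        # Infer a label for every picture in the training set.
        predictions = [learner.predict(pic) for pic in pictures]
        # Score the learner: here, the fraction of correct predictions.
        score = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
        # Adjust the learner's parameters so the score improves over time.
        learner.update(pictures, labels, score)
    return learner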

This procedure is outlined in the following figure:

[Figure: The training procedure]

Training data is represented by a set of features. For real-life classification tasks, these features are rarely the raw pixel values of an image, since these tend not to represent the data well. Often, the process of finding the features that best describe the data is an essential part of the entire learning task (also referred to as feature selection or feature engineering). That is why it is always a good idea to deeply study the statistics and appearances of the training set that you are working with before even thinking about setting up a classifier.
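As a simple example of such a feature, we could describe each image by a color histogram rather than by its raw pixels. The following is only a sketch of one common choice, not the feature set we will use later in the chapter:

import cv2

def extract_features(img_bgr):
    """Describe an image by an 8x8x8 color histogram instead of raw pixels."""
    hist = cv2.calcHist([img_bgr], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256])
    # Normalize so that images of different sizes yield comparable vectors.
    return cv2.normalize(hist, hist).flatten()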

As you are probably aware, there is an entire zoo of learners, cost functions, and learning algorithms out there. These make up the core of the learning procedure. The learner (for example, a linear classifier, a support vector machine, or a decision tree) defines how input features are converted into predictions; the cost function (for example, mean-squared error, hinge loss, or entropy) scores how far those predictions are from the true target values; and the learning algorithm (for example, gradient descent with backpropagation for neural networks) defines how the parameters of the learner are changed over time to reduce that cost.
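To see how these three ingredients fit together, consider the following sketch, which pairs a linear learner with the mean-squared error cost and plain batch gradient descent (a deliberately minimal combination chosen for illustration):

import numpy as np

# Learner: a linear model, prediction = X @ w.
# Cost function: mean-squared error between predictions and targets.
# Learning algorithm: batch gradient descent on the weights w.
def gradient_descent(X, y, lr=0.1, n_steps=100):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        residual = X @ w - y                  # error of the current predictions
        grad = 2.0 * X.T @ residual / len(y)  # gradient of the MSE w.r.t. w
        w -= lr * grad                        # step against the gradient
    return w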

The training procedure in a classification task can also be thought of as finding an appropriate decision boundary, which is a line that best partitions the training set into two subsets, one for each class. For example, consider training samples with only two features (x and y values) and a corresponding class label (positive, +, or negative, −). At the beginning of the training procedure, the classifier tries to draw a line to separate all positives from all negatives. As the training progresses, the classifier sees more and more data samples. These are used to update the decision boundary, as illustrated in the following figure:

[Figure: The training procedure]

Compared to this simple illustration, an SVM tries to find the optimal decision boundary (the one with the largest margin to the training samples), possibly in a high-dimensional space, so the decision boundary can be more complex than a straight line.
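For the two-feature example above, a linear decision boundary boils down to checking which side of a line a sample falls on. The following sketch uses made-up weight values purely to illustrate the idea; training would adjust them to separate the two classes:

import numpy as np

# A linear decision boundary in two dimensions: predict + if w . x + b > 0,
# and - otherwise. The weights and bias below are made-up values.
w = np.array([1.0, -1.0])
b = 0.5

def predict(x):
    return '+' if np.dot(w, x) + b > 0 else '-'

print(predict(np.array([2.0, 0.0])))  # lands on the + side of the boundary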

The testing procedure

In order for a trained classifier to be of any practical value, we need to know how it performs when applied to data samples it has never seen before (this ability is called generalization). To stick to our example shown earlier, we want to know which class the classifier predicts when we present it with a previously unseen picture of a cat or a dog.

More generally speaking, we want to know which class the ? sign in the following figure corresponds to, based on the decision boundary we learned during the training phase:

[Figure: The testing procedure]

You can see why this is a tricky problem. If the location of the question mark were more to the left, we would be certain that the corresponding class label is +. However, in this case, there are several ways to draw the decision boundary such that all the + signs are to the left of it and all the − signs are to the right of it, as illustrated in this figure:

[Figure: The testing procedure]

The label of ? thus depends on the exact decision boundary that was derived during training. If the ? sign in the preceding figure is actually a −, then only one decision boundary (the leftmost) would get the correct answer. A common problem is that training results in a decision boundary that works "too well" on the training set (also known as overfitting), but makes a lot of mistakes when applied to unseen data. In that case, it is likely that the learner imprinted details specific to the training set onto the decision boundary, instead of capturing general properties of the data that also hold for unseen samples.

Note

A common technique for reducing the effect of overfitting is called regularization.
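For example, with the mean-squared error cost from earlier, L2 regularization simply adds a penalty on large weights to the cost function, as in the following sketch:

import numpy as np

# A sketch of L2 regularization: a penalty on large weights is added to the
# cost so the decision boundary cannot contort itself around every quirk of
# the training set. 'lam' trades off data fit against model simplicity.
def regularized_mse(w, X, y, lam=0.1):
    data_term = np.mean((X @ w - y) ** 2)  # how well we fit the training data
    penalty = lam * np.sum(w ** 2)         # how complex the model is
    return data_term + penalty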

Long story short, the problem always comes back to finding the boundary that best splits not only the training set but also the test set. That is why the most important metric for a classifier is its generalization performance (that is, how well it classifies data not seen during the training phase).
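A standard way to estimate generalization performance is to hold out part of the labeled data for testing. The following sketch shows one simple way to do such a split (libraries such as scikit-learn provide ready-made versions of this):

import numpy as np

# Estimate generalization performance by holding out part of the labeled data.
# X and y are assumed to be NumPy arrays of samples and labels.
def train_test_split(X, y, test_fraction=0.2, seed=42):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle the sample indices
    n_test = int(len(X) * test_fraction)   # size of the held-out test set
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]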

A classifier base class

From the insights gained in the preceding content, you are now able to write a simple base class suitable for all possible classifiers. You can think of this class as a blueprint or recipe that will apply to all classifiers that we are yet to design (we did this with the BaseLayout class in Chapter 1, Fun with Filters). In order to create an abstract base class (ABC) in Python, we need to import ABCMeta and abstractmethod from the built-in abc module:

from abc import ABCMeta, abstractmethod

This allows us to declare ABCMeta as the metaclass of our class:

class Classifier(metaclass=ABCMeta):
    """Abstract base class for all classifiers"""

Recall that an abstract class has at least one abstract method. An abstract method is akin to specifying that a certain method must exist, but we are not yet sure what it should look like. We now know that a classifier in its most generic form should contain a method for training, wherein a model is fitted to the training data, and a method for testing, wherein the trained model is evaluated on the test data:

    @abstractmethod
    def fit(self, X_train, y_train):
        """Fit the model to the training data and labels."""
        pass

    @abstractmethod
    def evaluate(self, X_test, y_test, visualize=False):
        """Apply the trained model to the test data and score it."""
        pass

Here, X_train and X_test correspond to the training and test data, respectively, where each row represents a sample, and each column is a feature value of that sample. The training and test labels are passed as y_train and y_test vectors, respectively.
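To see the blueprint in action, here is a hypothetical toy subclass (NearestMeanClassifier is our own made-up example, not code from this book) that implements both abstract methods:

import numpy as np

class NearestMeanClassifier(Classifier):
    """A toy subclass for illustration: assign each test sample the label of
    the class whose training mean it is closest to."""

    def fit(self, X_train, y_train):
        # Remember the mean feature vector of every class.
        self.means_ = {c: X_train[y_train == c].mean(axis=0)
                       for c in np.unique(y_train)}

    def evaluate(self, X_test, y_test, visualize=False):
        # Predict the class with the nearest mean, then report the accuracy.
        preds = [min(self.means_,
                     key=lambda c: np.linalg.norm(x - self.means_[c]))
                 for x in X_test]
        return np.mean(np.array(preds) == y_test)

Because both abstract methods are implemented, this subclass can be instantiated, whereas trying to instantiate Classifier directly would raise a TypeError.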
