Choosing a classifier

These are just four of the most popular classifiers out there, but there are many more to choose from. Although some classification mechanisms perform better on some types of datasets than others, it can be hard to develop an intuition for exactly which kinds of data each is suited to. To help with this, we will examine the efficacy of our four classifiers on four different two-dimensional made-up datasets, each with a vastly different optimal decision boundary. In doing so, we will learn more about the characteristics of each classifier and get a better sense of the kinds of data each might be better suited for.

The four datasets are depicted in Figure 9.11:

Figure 9.11: A plot depicting the class patterns of our four illustrative and contrived datasets
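As a point of reference, here is a minimal sketch of how a contrived dataset like the first one (the vertical-boundary panel) might be generated. The column names (ind.var1, ind.var2, and dep.var) and the data frame name (this) match the ones used in the glm call later in this section, but the exact recipe behind the figures is an assumption.

    # a hypothetical recipe for a two-dimensional dataset whose optimal
    # decision boundary is a vertical line (like the top-left panel)
    set.seed(1)
    n <- 200
    this <- data.frame(ind.var1 = runif(n, -1, 1),
                       ind.var2 = runif(n, -1, 1))
    # the class label depends only on which side of zero ind.var1 falls
    this$dep.var <- ifelse(this$ind.var1 > 0, 1, 0)

    # quick visual check of the class pattern
    plot(this$ind.var1, this$ind.var2, col = this$dep.var + 1,
         xlab = "ind.var1", ylab = "ind.var2")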

The vertical decision boundary

The first contrived dataset we will be looking at is the one in the top-left panel of Figure 9.11. This is a relatively simple classification problem because, just by visual inspection, you can tell that the optimal decision boundary is a vertical line. Let's see how each of our classifiers fares on this dataset:

Figure 9.12: A plot of the decision boundaries of our four classifiers on our first contrived dataset

As you can see, all of our classifiers performed well on this simple dataset; all of the methods found an appropriate straight vertical line that best represents the class division. In general, logistic regression is great for linear decision boundaries. Decision trees also work well for straight decision boundaries, as long as the boundaries are orthogonal to the axes! Observe the next dataset.
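First, though, to make this concrete, here is a minimal sketch of fitting a logistic regression and a single decision tree to the vertical-boundary dataset sketched above; the rpart package (one common CART implementation in R) is assumed here, and this is not the exact code behind the figures.

    library(rpart)

    # logistic regression on the two predictors
    log.fit <- glm(factor(dep.var) ~ ind.var1 + ind.var2,
                   data = this, family = binomial(logit))

    # a single classification tree on the same data
    tree.fit <- rpart(factor(dep.var) ~ ind.var1 + ind.var2, data = this)

    # the tree's only meaningful split is a threshold on ind.var1,
    # which is exactly a vertical decision boundary
    print(tree.fit)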

The diagonal decision boundary

The second dataset sports an optimal decision boundary that is a diagonal line—one that is not orthogonal to the axes. Here, we start to see some cool behavior from certain classifiers.

Figure 9.13: A plot of the decision boundaries of our four classifiers on our second contrived dataset

Though all four classifiers classified this dataset reasonably effectively, we start to see each classifier's personality come out. First, k-NN creates a boundary that closely approximates the optimal one. The logistic regression, amazingly, throws a perfect linear boundary at exactly the right spot.

The decision tree's boundary is curious; it is made up of perpendicular zig-zags. Though the optimal decision boundary is linear in the input space, the decision tree can't capture its essence. This is because decision trees split on only one variable at a time. Thus, datasets with complex interactions may not be the best ones to attack with a decision tree.
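You can see this limitation directly by inspecting a fitted tree's splits. The following sketch uses a hypothetical dataset whose optimal boundary is the diagonal line ind.var1 = ind.var2; every rpart split is still a threshold on a single variable, which is why stacking them produces a staircase rather than a diagonal line.

    library(rpart)

    # a hypothetical dataset with a diagonal optimal boundary
    set.seed(2)
    diag.data <- data.frame(ind.var1 = runif(200, -1, 1),
                            ind.var2 = runif(200, -1, 1))
    diag.data$dep.var <- ifelse(diag.data$ind.var1 > diag.data$ind.var2, 1, 0)

    diag.tree <- rpart(factor(dep.var) ~ ind.var1 + ind.var2,
                       data = diag.data)
    # each printed split has the form "ind.varX < threshold"; the tree can
    # only approximate the diagonal by piling up these axis-aligned cuts
    print(diag.tree)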

Finally, the random forest, being composed of sufficiently varied decision trees, is able to capture the spirit of the optimal boundary.

The crescent decision boundary

This third dataset, depicted in the bottom-left panel of Figure 9.11, exhibits a very non-linear classification pattern:

Figure 9.14: A plot of the decision boundaries of our four classifiers on our third contrived dataset

In the preceding figure, our top performers are k-NN—which is highly effective with non-linear boundaries—and random forest—which is similarly effective. The decision tree is a little too jagged to compete at the top level. But the real loser here is logistic regression. Because logistic regression returns linear decision boundaries, it is ineffective at classifying these data.

To be fair, with a little finesse, logistic regression can handle these boundaries, too, as we'll see in the last example. However, in highly non-linear situations, where the nature of the non-linear boundary is unknown—or unknowable—logistic regression is often outperformed by other classifiers that natively handle these situations with ease.
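Incidentally, the decision boundaries in these figures can be visualized by predicting the class of every point on a dense grid. Here is a minimal sketch of that approach with k-NN, using the knn function from the class package and the same hypothetical column names as before.

    library(class)

    # a dense grid covering the predictor space
    grid <- expand.grid(ind.var1 = seq(-1, 1, length.out = 100),
                        ind.var2 = seq(-1, 1, length.out = 100))

    # predict the class of every grid point from its 5 nearest neighbors
    knn.pred <- knn(train = this[, c("ind.var1", "ind.var2")],
                    test  = grid,
                    cl    = factor(this$dep.var),
                    k     = 5)

    # coloring the grid by predicted class traces out the (possibly very
    # non-linear) decision boundary
    plot(grid$ind.var1, grid$ind.var2, col = knn.pred, pch = 15, cex = 0.4)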

The circular decision boundary

The last dataset we will be looking at, like the previous one, contains a non-linear classification pattern.

Figure 9.15: A plot of the decision boundaries of our four classifiers on our fourth contrived dataset

Again, just like in the last case, the winners are k-NN and random forest, followed by the decision tree with its jagged edges. And, again, the logistic regression unproductively throws a linear boundary at a distinctly non-linear pattern. However, stating that logistic regression is unsuitable for problems of this type is both negligent and dead wrong.

With a slight change in the incantation of the logistic regression, the whole game is changed, and logistic regression becomes the clear winner:

  > model <- glm(factor(dep.var) ~ ind.var1 +
  +               I(ind.var1^2) + ind.var2 + I(ind.var2^2), 
  +               data=this, family=binomial(logit))

Figure 9.16: A second-order (quadratic) logistic regression decision boundary

In the preceding figure, instead of modeling the binary dependent variable (dep.var) as a linear combination of solely the two independent variables (ind.var1 and ind.var2), we model it as a function of those two variables and those two variables squared. The result is still a linear combination of the inputs (before the inverse link function), but now the inputs include non-linear transformations of the original variables. This general technique is called polynomial regression and can be used to create a wide variety of non-linear boundaries. In this example, just squaring the inputs (resulting in a quadratic polynomial) yields a circular decision boundary that exactly matches the optimal one, as you can see in Figure 9.16. Cubing the original inputs (creating a cubic polynomial) suffices to describe the boundary in the previous example.
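For reference, a sketch of that cubic version might look like the following (an assumed illustration, not the exact model behind the figures); the poly function is a more compact way to request the same kind of expansion.

    # cubic polynomial terms for both predictors, written out explicitly
    model.cubic <- glm(factor(dep.var) ~ ind.var1 + I(ind.var1^2) +
                         I(ind.var1^3) + ind.var2 + I(ind.var2^2) +
                         I(ind.var2^3),
                       data = this, family = binomial(logit))

    # the same kind of expansion, requested more compactly via poly()
    model.cubic2 <- glm(factor(dep.var) ~ poly(ind.var1, 3) + poly(ind.var2, 3),
                        data = this, family = binomial(logit))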

In fact, a logistic regression containing polynomial terms of arbitrarily large order can fit virtually any decision boundary, no matter how non-linear and complicated. Careful, though! Using high-order polynomials is a great way to make sure you overfit your data.

My general advice is to use polynomial regression only for cases where you know a priori what polynomial form your boundaries take on (like an ellipse!). If you must experiment, keep a close eye on your cross-validated error rate to make sure you are not fooling yourself into thinking that you are doing the right thing by taking on more and more polynomial terms.
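One way to keep that eye on the cross-validated error rate is the cv.glm function from the boot package. The sketch below assumes the quadratic model object fit earlier and a 0/1-coded dep.var.

    library(boot)

    # misclassification rate: a prediction counts as wrong when the
    # predicted probability lands on the wrong side of 0.5
    cost <- function(obs, pred.prob) mean(abs(obs - pred.prob) > 0.5)

    # 10-fold cross-validated error rate for the quadratic model
    cv.err <- cv.glm(this, model, cost = cost, K = 10)$delta[1]
    cv.err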
