Example of voting classifiers with Scikit-Learn

In this example, we are going to employ the MNIST-like handwritten digits dataset bundled with scikit-learn. As the concept is very simple, our goal is to show how to combine two completely different estimators to improve the overall cross-validation accuracy. For this reason, we have selected a Logistic Regression and a Decision Tree, which are structurally different. In particular, while the former is a linear model that works with the whole feature vectors, the latter splits on individual features and can support the decision only in particular cases (images are not made up of semantically consistent features, but the extra complexity of a Decision Tree can help with particular samples that are very close to the separating hyperplane and are, therefore, more difficult to classify with a linear method). The first step is loading and normalizing the dataset (this operation is not important for a Decision Tree, but has a strong impact on the performance of a Logistic Regression):

import numpy as np

from sklearn.datasets import load_digits

# Load the digits dataset and rescale the pixel values to the range [0, 1]
X, Y = load_digits(return_X_y=True)
X /= np.max(X)

At this point, we need to evaluate the performance of both estimators individually:

import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation accuracy of the Decision Tree
dt = DecisionTreeClassifier(criterion='entropy', random_state=1000)
print(np.mean(cross_val_score(dt, X, Y, cv=10)))
0.830880960443

# 10-fold cross-validation accuracy of the Logistic Regression
lr = LogisticRegression(C=2.0, random_state=1000)
print(np.mean(cross_val_score(lr, X, Y, cv=10)))
0.937021649942

As expected, the Logistic Regression (∼94% accuracy) outperforms the Decision Tree (∼83% accuracy); therefore, a hard-voting strategy is not the best choice. As we trust the Logistic Regression more, we can employ soft voting with the weight vector (0.9, 0.1). The VotingClassifier class accepts a list of tuples (name of the estimator, instance) that must be supplied through the estimators parameter. The strategy can be specified using the voting parameter (either "soft" or "hard"), and the optional weights can be supplied through the parameter with the same name:

import numpy as np

from sklearn.ensemble import VotingClassifier

# Soft-voting ensemble weighted in favor of the Logistic Regression
vc = VotingClassifier(estimators=[
        ('LR', LogisticRegression(C=2.0, random_state=1000)),
        ('DT', DecisionTreeClassifier(criterion='entropy', random_state=1000))],
    voting='soft', weights=(0.9, 0.1))

print(np.mean(cross_val_score(vc, X, Y, cv=10)))
0.944835154373
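
To make the mechanism explicit, here is a minimal sketch (not part of the original example) that reproduces soft voting by hand on a single train/test split, reusing the X and Y arrays defined above: the class probabilities output by the two estimators are averaged with the same weights (0.9, 0.1), and the class with the highest combined probability is selected. The split ratio and random seed are arbitrary choices:

import numpy as np

from sklearn.model_selection import train_test_split

# Hold out 20% of the samples for a quick check of the combined prediction
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=1000)

lr = LogisticRegression(C=2.0, random_state=1000).fit(X_train, Y_train)
dt = DecisionTreeClassifier(criterion='entropy',
                            random_state=1000).fit(X_train, Y_train)

# Soft voting: weighted average of the class probabilities, then argmax
p = 0.9 * lr.predict_proba(X_test) + 0.1 * dt.predict_proba(X_test)
y_pred = lr.classes_[np.argmax(p, axis=1)]

print(np.mean(y_pred == Y_test))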

Using a soft-voting strategy, the ensemble is able to outperform the Logistic Regression alone by reducing the global uncertainty. I invite the reader to test this approach with other datasets and with more estimators, trying to find the optimal combination with both hard- and soft-voting strategies; a simple scan over candidate weights, such as the one sketched below, is a reasonable starting point.
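As a starting point for that exploration, the following is a minimal sketch (my own assumption, not the author's code) that scans a few weight combinations with both voting strategies on the same dataset and prints the mean 10-fold cross-validation accuracy of each configuration; the candidate weights are arbitrary:

import numpy as np

from sklearn.ensemble import VotingClassifier

# Compare hard and soft voting for a few candidate weight pairs
for voting in ('hard', 'soft'):
    for w_lr in (0.5, 0.7, 0.9):
        vc = VotingClassifier(estimators=[
                ('LR', LogisticRegression(C=2.0, random_state=1000)),
                ('DT', DecisionTreeClassifier(criterion='entropy',
                                              random_state=1000))],
            voting=voting, weights=(w_lr, 1.0 - w_lr))
        score = np.mean(cross_val_score(vc, X, Y, cv=10))
        print('{} voting, weights=({:.1f}, {:.1f}): {:.3f}'.format(
            voting, w_lr, 1.0 - w_lr, score))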
