Leveraging weak learners via adaptive boosting

In this section about ensemble methods, we will discuss boosting with a special focus on its most common implementation, AdaBoost (short for Adaptive Boosting).

Note

The original idea behind AdaBoost was formulated by Robert Schapire in 1990 (R. E. Schapire. The Strength of Weak Learnability. Machine Learning, 5(2):197–227, 1990). After Robert Schapire and Yoav Freund presented the AdaBoost algorithm in the Proceedings of the Thirteenth International Conference on Machine Learning (ICML 1996), AdaBoost became one of the most widely used ensemble methods in the years that followed (Y. Freund, R. E. Schapire, et al. Experiments with a New Boosting Algorithm. In ICML, volume 96, pages 148–156, 1996). In 2003, Freund and Schapire received the Gödel Prize for their groundbreaking work, a prestigious award for the most outstanding publications in the field of computer science.

In boosting, the ensemble consists of very simple base classifiers, also often referred to as weak learners, that have only a slight performance advantage over random guessing; a typical example of a weak learner is a decision tree stump. The key concept behind boosting is to focus on training samples that are hard to classify, that is, to let the weak learners subsequently learn from misclassified training samples to improve the performance of the ensemble. In contrast to bagging, the initial formulation of the boosting algorithm uses random subsets of training samples drawn from the training dataset without replacement. The original boosting procedure can be summarized in the following four key steps; a short code sketch follows the list:

  1. Draw a random subset of training samples $d_1$ without replacement from the training set $D$ to train a weak learner $C_1$.
  2. Draw a second random training subset $d_2$ without replacement from the training set and add 50 percent of the samples that were previously misclassified to train a weak learner $C_2$.
  3. Find the training samples $d_3$ in the training set $D$ on which $C_1$ and $C_2$ disagree to train a third weak learner $C_3$.
  4. Combine the weak learners $C_1$, $C_2$, and $C_3$ via majority voting.
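
Under the hood, these four steps could look roughly like the following sketch. This is only an illustration, not the exact historical algorithm: it assumes the NumPy arrays X_train and y_train from the previous section's Wine example (with class labels encoded as 0 and 1), uses decision tree stumps as weak learners, draws subsets of half the training set size, and assumes that C1 misclassifies at least one sample and that C1 and C2 disagree on at least one sample; all variable names are illustrative.

>>> import numpy as np
>>> from sklearn.tree import DecisionTreeClassifier
>>> rng = np.random.RandomState(1)
>>> n = X_train.shape[0]
>>> # step 1: train C1 on a random subset d1 drawn without replacement
>>> d1 = rng.choice(n, size=n // 2, replace=False)
>>> c1 = DecisionTreeClassifier(max_depth=1).fit(X_train[d1], y_train[d1])
>>> # step 2: second random subset plus 50 percent of the samples
>>> # that C1 misclassified
>>> d2 = rng.choice(n, size=n // 2, replace=False)
>>> miss = np.flatnonzero(c1.predict(X_train) != y_train)
>>> extra = rng.choice(miss, size=miss.size // 2, replace=False)
>>> d2 = np.unique(np.concatenate((d2, extra)))
>>> c2 = DecisionTreeClassifier(max_depth=1).fit(X_train[d2], y_train[d2])
>>> # step 3: train C3 on the samples where C1 and C2 disagree
>>> d3 = np.flatnonzero(c1.predict(X_train) != c2.predict(X_train))
>>> c3 = DecisionTreeClassifier(max_depth=1).fit(X_train[d3], y_train[d3])
>>> # step 4: combine C1, C2, and C3 via majority voting (labels are 0 or 1)
>>> votes = np.stack([c.predict(X_train) for c in (c1, c2, c3)])
>>> y_majority = (votes.sum(axis=0) >= 2).astype(int)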

As discussed by Leo Breiman (L. Breiman. Bias, Variance, and Arcing Classifiers. 1996), boosting can lead to a decrease in bias as well as variance compared to bagging models. In practice, however, boosting algorithms such as AdaBoost are also known for their high variance, that is, the tendency to overfit the training data (G. Rätsch, T. Onoda, and K. R. Müller. An Improvement of AdaBoost to Avoid Overfitting. In Proc. of the Int. Conf. on Neural Information Processing. Citeseer, 1998).

In contrast to the original boosting procedure as described here, AdaBoost uses the complete training set to train the weak learners where the training samples are reweighted in each iteration to build a strong classifier that learns from the mistakes of the previous weak learners in the ensemble. Before we dive deeper into the specific details of the AdaBoost algorithm, let's take a look at the following figure to get a better grasp of the basic concept behind AdaBoost:

[Figure: The basic concept behind AdaBoost illustrated in four subfigures: a decision stump trained on equally weighted training samples (1), reweighting of the misclassified samples over two further boosting rounds (2 and 3), and the combination of the three weak learners via a weighted majority vote (4).]

To walk through the AdaBoost illustration step by step, we start with subfigure 1, which represents a training set for binary classification where all training samples are assigned equal weights. Based on this training set, we train a decision stump (shown as a dashed line) that tries to classify the samples of the two classes (triangles and circles) as well as possible by minimizing the cost function (or the impurity score in the special case of decision tree ensembles). For the next round (subfigure 2), we assign a larger weight to the two previously misclassified samples (circles). Furthermore, we lower the weight of the correctly classified samples. The next decision stump will now be more focused on the training samples that have the largest weights, that is, the training samples that are supposedly hard to classify. The weak learner shown in subfigure 2 misclassifies three different samples from the circle-class, which are then assigned a larger weight as shown in subfigure 3. Assuming that our AdaBoost ensemble only consists of three rounds of boosting, we would then combine the three weak learners trained on different reweighted training subsets by a weighted majority vote, as shown in subfigure 4.

Now that we have a better understanding of the basic concept of AdaBoost, let's take a more detailed look at the algorithm using pseudocode. For clarity, we will denote element-wise multiplication by the cross symbol ($\times$) and the dot product between two vectors by a dot symbol ($\cdot$), respectively. The steps are as follows (a short NumPy sketch of these steps follows below):

  1. Set weight vector $\mathbf{w}$ to uniform weights, where $\sum_i w_i = 1$.
  2. For $j$ in $m$ boosting rounds, do the following:
  3. Train a weighted weak learner: $C_j = \text{train}(\mathbf{X}, \mathbf{y}, \mathbf{w})$.
  4. Predict class labels: $\hat{\mathbf{y}} = \text{predict}(C_j, \mathbf{X})$.
  5. Compute weighted error rate: $\varepsilon = \mathbf{w} \cdot (\hat{\mathbf{y}} \neq \mathbf{y})$.
  6. Compute coefficient: $\alpha_j = 0.5 \log \frac{1-\varepsilon}{\varepsilon}$.
  7. Update weights: $\mathbf{w} := \mathbf{w} \times \exp\left(-\alpha_j \times \hat{\mathbf{y}} \times \mathbf{y}\right)$.
  8. Normalize weights to sum to 1: $\mathbf{w} := \mathbf{w} \, / \, \sum_i w_i$.
  9. Compute final prediction: $\hat{\mathbf{y}} = \left(\sum_{j=1}^{m} \left(\alpha_j \times \text{predict}(C_j, \mathbf{X})\right) > 0\right)$.

Note that the expression $(\hat{\mathbf{y}} \neq \mathbf{y})$ in step 5 refers to a vector of 1s and 0s, where a 1 is assigned if the prediction is incorrect and a 0 is assigned otherwise.
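
To make these steps more concrete, here is a minimal NumPy sketch of the pseudocode, again assuming the X_train and y_train arrays from the previous section. Because the update rule in step 7 relies on class labels being -1 and 1, the sketch first re-encodes the 0/1 labels; it also assumes that each stump's weighted error rate stays strictly between 0 and 1, so that the logarithm in step 6 is defined. The variable names are illustrative only; scikit-learn's AdaBoostClassifier, which we will use shortly, provides a production-ready implementation.

>>> import numpy as np
>>> from sklearn.tree import DecisionTreeClassifier
>>> y_signed = np.where(y_train == 0, -1, 1)    # encode class labels as -1/1
>>> w = np.full(y_signed.shape[0], 1.0 / y_signed.shape[0])    # step 1
>>> learners, alphas = [], []
>>> for j in range(10):                                        # step 2: 10 rounds
...     stump = DecisionTreeClassifier(max_depth=1)
...     stump.fit(X_train, y_signed, sample_weight=w)          # step 3
...     y_pred = stump.predict(X_train)                        # step 4
...     error = np.dot(w, (y_pred != y_signed).astype(float))  # step 5
...     alpha = 0.5 * np.log((1.0 - error) / error)            # step 6
...     w = w * np.exp(-alpha * y_pred * y_signed)             # step 7
...     w = w / w.sum()                                        # step 8
...     learners.append(stump)
...     alphas.append(alpha)
>>> # step 9: weighted majority vote of all weak learners
>>> scores = sum(a * clf.predict(X_train) for a, clf in zip(alphas, learners))
>>> y_ensemble = np.where(scores > 0, 1, -1)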

Although the AdaBoost algorithm seems to be pretty straightforward, let's walk through a more concrete example using a training set consisting of 10 training samples as illustrated in the following table:

Sample index    x       y     Weights   ŷ (x ≤ 3.0)   Correct?   Updated weights
1               1.0     1     0.1        1             Yes        0.072
2               2.0     1     0.1        1             Yes        0.072
3               3.0     1     0.1        1             Yes        0.072
4               4.0    -1     0.1       -1             Yes        0.072
5               5.0    -1     0.1       -1             Yes        0.072
6               6.0    -1     0.1       -1             Yes        0.072
7               7.0     1     0.1       -1             No         0.167
8               8.0     1     0.1       -1             No         0.167
9               9.0     1     0.1       -1             No         0.167
10              10.0   -1     0.1       -1             Yes        0.072

The first column of the table depicts the sample indices of training samples 1 to 10. The second column shows the feature values $x$ of the individual samples, assuming this is a one-dimensional dataset. The third column shows the true class label $y_i$ for each training sample $x_i$, where $y_i \in \{1, -1\}$. The initial weights are shown in the fourth column; we initialize them to uniform values and normalize them to sum to one, so in the case of this 10-sample training set we assign 0.1 to each weight $w_i$ in the weight vector $\mathbf{w}$. The predicted class labels $\hat{y}_i$ are shown in the fifth column, assuming that our splitting criterion is $x \le 3.0$. The last column of the table then shows the updated weights based on the update rules that we defined in the pseudocode.

Since the computation of the weight updates may look a little bit complicated at first, we will now follow the calculation step by step. We start by computing the weighted error rate $\varepsilon$ as described in step 5:

$\varepsilon = 0.1 \times 0 + 0.1 \times 0 + 0.1 \times 0 + 0.1 \times 0 + 0.1 \times 0 + 0.1 \times 0 + 0.1 \times 1 + 0.1 \times 1 + 0.1 \times 1 + 0.1 \times 0 = \frac{3}{10} = 0.3$

Next, we compute the coefficient $\alpha_j$ (shown in step 6), which is later used in step 7 to update the weights, as well as for the weights in the majority vote prediction (step 9):

$\alpha_j = 0.5 \log\left(\frac{1 - 0.3}{0.3}\right) \approx 0.424$

After we have computed the coefficient $\alpha_j$, we can now update the weight vector using the following equation:

$\mathbf{w} := \mathbf{w} \times \exp\left(-\alpha_j \times \hat{\mathbf{y}} \times \mathbf{y}\right)$

Here, $\hat{\mathbf{y}} \times \mathbf{y}$ is an element-wise multiplication between the vectors of the predicted and true class labels, respectively. Thus, if a prediction $\hat{y}_i$ is correct, $\hat{y}_i \times y_i$ will have a positive sign so that we decrease the ith weight, since $\alpha_j$ is a positive number as well:

$0.1 \times \exp(-0.424 \times 1 \times 1) \approx 0.065$

Similarly, we will increase the ith weight if $C_j$ predicted the label incorrectly, like this:

$0.1 \times \exp(-0.424 \times 1 \times (-1)) \approx 0.153$

Or like this:

$0.1 \times \exp(-0.424 \times (-1) \times 1) \approx 0.153$

After we update each weight in the weight vector, we normalize the weights so that they sum up to 1 (step 8):

$\mathbf{w} := \dfrac{\mathbf{w}}{\sum_i w_i}$

Here, $\sum_i w_i = 7 \times 0.065 + 3 \times 0.153 = 0.914$.

Thus, each weight that corresponds to a correctly classified sample will be reduced from the initial value of 0.1 to $0.065 / 0.914 \approx 0.071$ for the next round of boosting. Similarly, the weight of each incorrectly classified sample will increase from 0.1 to $0.153 / 0.914 \approx 0.167$.
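
We can double-check this arithmetic with a few lines of NumPy, using the class labels and the $x \le 3.0$ stump predictions from the table above (the array values below simply restate that table; they are not produced by any library routine):

>>> import numpy as np
>>> y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])          # true class labels
>>> y_pred = np.array([1, 1, 1, -1, -1, -1, -1, -1, -1, -1])  # stump: 1 if x <= 3.0
>>> w = np.full(10, 0.1)                                      # uniform initial weights
>>> error = np.dot(w, (y_pred != y).astype(float))            # weighted error: 0.3
>>> alpha = 0.5 * np.log((1 - error) / error)                 # coefficient: ~0.424
>>> w_new = w * np.exp(-alpha * y_pred * y)                   # ~0.065 / ~0.153
>>> w_new = w_new / w_new.sum()                               # ~0.071 / ~0.167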

This was AdaBoost in a nutshell. Skipping to the more practical part, let's now train an AdaBoost ensemble classifier via scikit-learn. We will use the same Wine subset that we used in the previous section to train the bagging meta-classifier. Via the base_estimator parameter, we will train the AdaBoostClassifier on 500 decision tree stumps:

>>> from sklearn.ensemble import AdaBoostClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.metrics import accuracy_score
>>> tree = DecisionTreeClassifier(criterion='entropy',
...                               max_depth=1,
...                               random_state=0)
>>> ada = AdaBoostClassifier(base_estimator=tree,
...                          n_estimators=500, 
...                          learning_rate=0.1,
...                          random_state=0)
>>> tree = tree.fit(X_train, y_train)
>>> y_train_pred = tree.predict(X_train)
>>> y_test_pred = tree.predict(X_test)
>>> tree_train = accuracy_score(y_train, y_train_pred)
>>> tree_test = accuracy_score(y_test, y_test_pred)
>>> print('Decision tree train/test accuracies %.3f/%.3f'
...       % (tree_train, tree_test))
Decision tree train/test accuracies 0.845/0.854

As we can see, the decision tree stump tends to underfit the training data in contrast with the unpruned decision tree that we saw in the previous section:

>>> ada = ada.fit(X_train, y_train)
>>> y_train_pred = ada.predict(X_train)
>>> y_test_pred = ada.predict(X_test)
>>> ada_train = accuracy_score(y_train, y_train_pred) 
>>> ada_test = accuracy_score(y_test, y_test_pred) 
>>> print('AdaBoost train/test accuracies %.3f/%.3f'
...       % (ada_train, ada_test))
AdaBoost train/test accuracies 1.000/0.875

As we can see, the AdaBoost model predicts all class labels of the training set correctly and also shows a slightly improved test set performance compared to the decision tree stump. However, we also see that we introduced additional variance by our attempt to reduce the model bias.

Although we used another simple example for demonstration purposes, we can see that the performance of the AdaBoost classifier is slightly improved compared to the decision stump, and that it achieves accuracy scores very similar to those of the bagging classifier that we trained in the previous section. However, we should note that it is considered bad practice to select a model based on repeated usage of the test set; the estimate of the generalization performance may be too optimistic, which we discussed in more detail in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning.

Finally, let's check what the decision regions look like:

>>> x_min = X_train[:, 0].min() - 1
>>> x_max = X_train[:, 0].max() + 1
>>> y_min = X_train[:, 1].min() - 1
>>> y_max = X_train[:, 1].max() + 1
>>> xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
...                      np.arange(y_min, y_max, 0.1))
>>> f, axarr = plt.subplots(1, 2, 
...                         sharex='col', 
...                         sharey='row', 
...                         figsize=(8, 3))
>>> for idx, clf, tt in zip([0, 1],
...                         [tree, ada],
...                         ['Decision Tree', 'AdaBoost']):
...     clf.fit(X_train, y_train)   
...     Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...     Z = Z.reshape(xx.shape)
...     axarr[idx].contourf(xx, yy, Z, alpha=0.3)
...     axarr[idx].scatter(X_train[y_train==0, 0], 
...                        X_train[y_train==0, 1], 
...                        c='blue', 
...                        marker='^')
...     axarr[idx].scatter(X_train[y_train==1, 0], 
...                        X_train[y_train==1, 1], 
...                        c='red',
...                        marker='o')
...     axarr[idx].set_title(tt)
...     axarr[0].set_ylabel('Alcohol', fontsize=12)
>>> plt.text(10.2, -1.2,
...          s='Hue',
...          ha='center',
...          va='center',
...          fontsize=12)
>>> plt.show()

By looking at the decision regions, we can see that the decision boundary of the AdaBoost model is substantially more complex than the decision boundary of the decision stump. In addition, we note that the AdaBoost model separates the feature space very similarly to the bagging classifier that we trained in the previous section.

[Figure: Decision regions of the decision tree stump (left) and the AdaBoost ensemble (right) on the Wine training data, with Hue on the x axis and Alcohol on the y axis.]

As concluding remarks about ensemble techniques, it is worth noting that ensemble learning increases the computational complexity compared to individual classifiers. In practice, we need to think carefully whether we want to pay the price of increased computational costs for an often relatively modest improvement of predictive performance.

An often-cited example of this trade-off is the famous $1 million Netflix Prize, which was won using ensemble techniques. The details about the algorithm were published in A. Töscher, M. Jahrer, and R. M. Bell. The BigChaos Solution to the Netflix Grand Prize. Netflix Prize documentation, 2009 (which is available at http://www.stat.osu.edu/~dmsl/GrandPrize2009_BPC_BigChaos.pdf). Although the winning team received the $1 million prize money, Netflix never implemented their model due to its complexity, which made it infeasible for a real-world application. To quote their exact words (http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html):

"[…] additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment."
