Decision tree learning

Decision tree classifiers are attractive models if we care about interpretability. As the name decision tree suggests, we can think of this model as breaking down our data by asking a series of questions and making decisions based on the answers.

Let's consider the following example where we use a decision tree to decide upon an activity on a particular day:

[Figure: an example decision tree for deciding upon an activity on a particular day]

Based on the features in our training set, the decision tree model learns a series of questions to infer the class labels of the samples. Although the preceding figure illustrates the concept of a decision tree based on categorical variables, the same concept applies if our features are real numbers, as in the Iris dataset. For example, we could simply define a cut-off value along the sepal width feature axis and ask the binary question "Is the sepal width greater than or equal to this cut-off (in cm)?"

Using the decision algorithm, we start at the tree root and split the data on the feature that results in the largest information gain (IG), which will be explained in more detail in the following section. In an iterative process, we can then repeat this splitting procedure at each child node until the leaves are pure. This means that the samples at each node all belong to the same class. In practice, this can result in a very deep tree with many nodes, which can easily lead to overfitting. Thus, we typically want to prune the tree by setting a limit for the maximal depth of the tree.

Maximizing information gain – getting the most bang for the buck

In order to split the nodes at the most informative features, we need to define an objective function that we want to optimize via the tree learning algorithm. Here, our objective function is to maximize the information gain at each split, which we define as follows:

$$IG(D_p, f) = I(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p} I(D_j)$$

Here, f is the feature to perform the split, $D_p$ and $D_j$ are the datasets of the parent and jth child node, I is our impurity measure, $N_p$ is the total number of samples at the parent node, and $N_j$ is the number of samples in the jth child node. As we can see, the information gain is simply the difference between the impurity of the parent node and the sum of the child node impurities: the lower the impurity of the child nodes, the larger the information gain. However, for simplicity and to reduce the combinatorial search space, most libraries (including scikit-learn) implement binary decision trees. This means that each parent node is split into two child nodes, $D_{left}$ and $D_{right}$:

$$IG(D_p, f) = I(D_p) - \frac{N_{left}}{N_p} I(D_{left}) - \frac{N_{right}}{N_p} I(D_{right})$$
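Before moving on to the impurity measures themselves, it may help to see this binary objective written out as code. The following is only a minimal sketch, not scikit-learn's internal implementation: the function name information_gain and the representation of nodes as sequences of class labels are assumptions for illustration, and impurity stands for any of the measures defined next.

>>> def information_gain(y_parent, y_left, y_right, impurity):
...     # IG = I(D_p) - N_left/N_p * I(D_left) - N_right/N_p * I(D_right)
...     n_p = float(len(y_parent))
...     return (impurity(y_parent)
...             - len(y_left) / n_p * impurity(y_left)
...             - len(y_right) / n_p * impurity(y_right))

A tree learner would evaluate this quantity for every candidate split and keep the split with the largest value.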

Now, the three impurity measures or splitting criteria that are commonly used in binary decision trees are Gini impurity ($I_G$), entropy ($I_H$), and the classification error ($I_E$). Let's start with the definition of entropy for all non-empty classes ($p(i \mid t) \neq 0$):

$$I_H(t) = -\sum_{i=1}^{c} p(i \mid t) \log_2 p(i \mid t)$$

Here, $p(i \mid t)$ is the proportion of the samples that belong to class i for a particular node t. The entropy is therefore 0 if all samples at a node belong to the same class, and the entropy is maximal if we have a uniform class distribution. For example, in a binary class setting, the entropy is 0 if $p(i=1 \mid t) = 1$ or $p(i=0 \mid t) = 1$. If the classes are distributed uniformly with $p(i=1 \mid t) = 0.5$ and $p(i=0 \mid t) = 0.5$, the entropy is 1. Therefore, we can say that the entropy criterion attempts to maximize the mutual information in the tree.

Intuitively, the Gini impurity can be understood as a criterion to minimize the probability of misclassification:

$$I_G(t) = \sum_{i=1}^{c} p(i \mid t)\bigl(1 - p(i \mid t)\bigr) = 1 - \sum_{i=1}^{c} p(i \mid t)^2$$

Similar to entropy, the Gini impurity is maximal if the classes are perfectly mixed, for example, in a binary class setting ($c = 2$):

$$I_G(t) = 1 - \sum_{i=1}^{c} 0.5^2 = 0.5$$

However, in practice, both the Gini impurity and entropy typically yield very similar results, so it is often not worth spending much time evaluating trees with different impurity criteria; experimenting with different pruning cut-offs is usually time better spent.

Another impurity measure is the classification error:

$$I_E(t) = 1 - \max\{p(i \mid t)\}$$

This is a useful criterion for pruning but not recommended for growing a decision tree, since it is less sensitive to changes in the class probabilities of the nodes. We can illustrate this by looking at the two possible splitting scenarios shown in the following figure:

[Figure: two possible splits of a parent node containing 40 samples from each class; scenario A yields child nodes with class counts (30, 10) and (10, 30), scenario B yields (20, 40) and (20, 0)]

We start with a dataset $D_p$ at the parent node that consists of 40 samples from class 1 and 40 samples from class 2, which we split into two datasets, $D_{left}$ and $D_{right}$. The information gain using the classification error as a splitting criterion would be the same ($IG_E = 0.25$) in both scenario A and scenario B:

$$I_E(D_p) = 1 - 0.5 = 0.5$$
$$\text{A:}\quad I_E(D_{left}) = 1 - \tfrac{3}{4} = 0.25$$
$$\text{A:}\quad I_E(D_{right}) = 1 - \tfrac{3}{4} = 0.25$$
$$\text{A:}\quad IG_E = 0.5 - \tfrac{4}{8}(0.25) - \tfrac{4}{8}(0.25) = 0.25$$
$$\text{B:}\quad I_E(D_{left}) = 1 - \tfrac{4}{6} = \tfrac{1}{3}$$
$$\text{B:}\quad I_E(D_{right}) = 1 - 1 = 0$$
$$\text{B:}\quad IG_E = 0.5 - \tfrac{6}{8} \times \tfrac{1}{3} - 0 = 0.25$$

However, the Gini impurity would favor the split in scenario B ($IG_G \approx 0.1\overline{6}$) over scenario A ($IG_G = 0.125$), which is indeed purer:

$$I_G(D_p) = 1 - \left(0.5^2 + 0.5^2\right) = 0.5$$
$$\text{A:}\quad I_G(D_{left}) = 1 - \left(\left(\tfrac{3}{4}\right)^2 + \left(\tfrac{1}{4}\right)^2\right) = \tfrac{3}{8} = 0.375$$
$$\text{A:}\quad I_G(D_{right}) = 1 - \left(\left(\tfrac{1}{4}\right)^2 + \left(\tfrac{3}{4}\right)^2\right) = \tfrac{3}{8} = 0.375$$
$$\text{A:}\quad IG_G = 0.5 - \tfrac{4}{8}(0.375) - \tfrac{4}{8}(0.375) = 0.125$$
$$\text{B:}\quad I_G(D_{left}) = 1 - \left(\left(\tfrac{2}{6}\right)^2 + \left(\tfrac{4}{6}\right)^2\right) = \tfrac{4}{9} \approx 0.4\overline{4}$$
$$\text{B:}\quad I_G(D_{right}) = 1 - \left(1^2 + 0^2\right) = 0$$
$$\text{B:}\quad IG_G = 0.5 - \tfrac{6}{8} \times \tfrac{4}{9} - 0 = 0.1\overline{6}$$

Similarly, the entropy criterion would favor scenario B ($IG_H \approx 0.31$) over scenario A ($IG_H \approx 0.19$):

$$I_H(D_p) = -\left(0.5 \log_2 0.5 + 0.5 \log_2 0.5\right) = 1$$
$$\text{A:}\quad I_H(D_{left}) = -\left(\tfrac{3}{4} \log_2 \tfrac{3}{4} + \tfrac{1}{4} \log_2 \tfrac{1}{4}\right) \approx 0.81$$
$$\text{A:}\quad I_H(D_{right}) = -\left(\tfrac{1}{4} \log_2 \tfrac{1}{4} + \tfrac{3}{4} \log_2 \tfrac{3}{4}\right) \approx 0.81$$
$$\text{A:}\quad IG_H = 1 - \tfrac{4}{8}(0.81) - \tfrac{4}{8}(0.81) \approx 0.19$$
$$\text{B:}\quad I_H(D_{left}) = -\left(\tfrac{2}{6} \log_2 \tfrac{2}{6} + \tfrac{4}{6} \log_2 \tfrac{4}{6}\right) \approx 0.92$$
$$\text{B:}\quad I_H(D_{right}) = 0$$
$$\text{B:}\quad IG_H = 1 - \tfrac{6}{8}(0.92) - 0 \approx 0.31$$
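We can double-check these hand calculations with a few lines of NumPy. The following is a standalone sketch in which each node is summarized only by its class counts; the helper names impurity and info_gain are made up for this illustration and are not part of scikit-learn:

>>> import numpy as np
>>> def impurity(counts, measure):
...     # class proportions p(i|t) at a node, given its class counts
...     p = np.asarray(counts, dtype=float) / np.sum(counts)
...     if measure == 'error':
...         return 1.0 - np.max(p)
...     if measure == 'gini':
...         return 1.0 - np.sum(p**2)
...     p = p[p > 0]  # entropy: skip empty classes
...     return -np.sum(p * np.log2(p))
>>> def info_gain(parent, children, measure):
...     n_p = float(np.sum(parent))
...     return (impurity(parent, measure)
...             - sum(np.sum(c) / n_p * impurity(c, measure)
...                   for c in children))
>>> parent = [40, 40]
>>> scenario_a = [[30, 10], [10, 30]]
>>> scenario_b = [[20, 40], [20, 0]]
>>> for m in ('error', 'gini', 'entropy'):
...     print('%s: A=%.3f, B=%.3f' % (m,
...           info_gain(parent, scenario_a, m),
...           info_gain(parent, scenario_b, m)))
error: A=0.250, B=0.250
gini: A=0.125, B=0.167
entropy: A=0.189, B=0.311

The numbers agree with the calculations above: the classification error cannot distinguish between the two splits, whereas both the Gini impurity and entropy prefer scenario B.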

For a more visual comparison of the three impurity criteria discussed previously, let's plot the impurity indices over the probability range [0, 1] for class 1. Note that we will also add a scaled version of the entropy (entropy / 2) to observe that the Gini impurity is an intermediate measure between entropy and the classification error. The code is as follows:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> # Impurity measures as functions of p = p(i=1|t) for a binary problem
>>> def gini(p):
...     return p*(1 - p) + (1 - p)*(1 - (1 - p))
>>> def entropy(p):
...     return -p*np.log2(p) - (1 - p)*np.log2(1 - p)
>>> def error(p):
...     return 1 - np.max([p, 1 - p])
>>> x = np.arange(0.0, 1.0, 0.01)
>>> # Entropy is undefined at p=0, so we skip that point
>>> ent = [entropy(p) if p != 0 else None for p in x]
>>> sc_ent = [e*0.5 if e is not None else None for e in ent]
>>> err = [error(i) for i in x]
>>> fig = plt.figure()
>>> ax = plt.subplot(111)
>>> for i, lab, ls, c in zip([ent, sc_ent, gini(x), err],
...                          ['Entropy', 'Entropy (scaled)',
...                           'Gini Impurity',
...                           'Misclassification Error'],
...                          ['-', '-', '--', '-.'],
...                          ['black', 'lightgray', 'red', 'green']):
...     line = ax.plot(x, i, label=lab,
...                    linestyle=ls, lw=2, color=c)
>>> ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15),
...           ncol=3, fancybox=True, shadow=False)
>>> ax.axhline(y=0.5, linewidth=1, color='k', linestyle='--')
>>> ax.axhline(y=1.0, linewidth=1, color='k', linestyle='--')
>>> plt.ylim([0, 1.1])
>>> plt.xlabel('p(i=1)')
>>> plt.ylabel('Impurity Index')
>>> plt.show()

The plot produced by the preceding code example is as follows:

[Figure: entropy, scaled entropy, Gini impurity, and misclassification error plotted over the class-membership probability p(i=1)]

Building a decision tree

Decision trees can build complex decision boundaries by dividing the feature space into rectangles. However, we have to be careful since the deeper the decision tree, the more complex the decision boundary becomes, which can easily result in overfitting. Using scikit-learn, we will now train a decision tree with a maximum depth of 3 using entropy as a criterion for impurity. Although feature scaling may be desired for visualization purposes, note that feature scaling is not a requirement for decision tree algorithms. The code is as follows:

>>> from sklearn.tree import DecisionTreeClassifier
>>> tree = DecisionTreeClassifier(criterion='entropy', 
...                               max_depth=3, random_state=0)
>>> tree.fit(X_train, y_train)
>>> X_combined = np.vstack((X_train, X_test))
>>> y_combined = np.hstack((y_train, y_test))
>>> plot_decision_regions(X_combined, y_combined, 
...                    classifier=tree, test_idx=range(105,150))
>>> plt.xlabel('petal length [cm]')
>>> plt.ylabel('petal width [cm]')
>>> plt.legend(loc='upper left')
>>> plt.show()

After executing the preceding code example, we get the typical axis-parallel decision boundaries of the decision tree:

[Figure: axis-parallel decision regions of the depth-3 decision tree on the combined training and test samples of the Iris data]
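Besides the visual inspection of the decision regions, we may also want a quick quantitative check of the fitted tree. A minimal example is to use the standard score method of scikit-learn estimators, assuming the X_test and y_test arrays from the earlier train/test split of the Iris data are still available:

>>> print('Training accuracy: %.3f' % tree.score(X_train, y_train))
>>> print('Test accuracy: %.3f' % tree.score(X_test, y_test))

Since score reports plain accuracy, a large gap between the two numbers would hint at overfitting, which we could counter by lowering max_depth.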

A nice feature of scikit-learn is that it allows us to export the decision tree as a .dot file after training, which we can visualize using the GraphViz program. This program is freely available at http://www.graphviz.org and runs on Linux, Windows, and Mac OS X.

First, we create the .dot file via scikit-learn using the export_graphviz function from the tree submodule, as follows:

>>> from sklearn.tree import export_graphviz
>>> export_graphviz(tree, 
...                 out_file='tree.dot',
...                 feature_names=['petal length', 'petal width'])

After we have installed GraphViz on our computer, we can convert the tree.dot file into a PNG file by executing the following command from the command line in the location where we saved the tree.dot file:

> dot -Tpng tree.dot -o tree.png

[Figure: the resulting decision tree rendered by GraphViz (tree.png)]

Looking at the decision tree figure that we created via GraphViz, we can now nicely trace back the splits that the decision tree determined from our training dataset. We started with 105 samples at the root and split them into two child nodes with 34 and 71 samples, using the petal width cut-off ≤ 0.75 cm. After the first split, we can see that the left child node is already pure and only contains samples from the Iris-Setosa class (entropy = 0). The further splits on the right are then used to separate the samples from the Iris-Versicolor and Iris-Virginica classes.
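As a side note, if GraphViz is not at hand, newer versions of scikit-learn (0.21 and later) also offer a plain-text alternative, export_text in the tree submodule, which prints the learned splits directly to the console. This is optional and assumes such a version is installed:

>>> from sklearn.tree import export_text
>>> print(export_text(tree,
...       feature_names=['petal length', 'petal width']))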

Combining weak to strong learners via random forests

Random forests have gained huge popularity in applications of machine learning during the last decade due to their good classification performance, scalability, and ease of use. Intuitively, a random forest can be considered as an ensemble of decision trees. The idea behind ensemble learning is to combine weak learners to build a more robust model, a strong learner, that has a better generalization performance and is less susceptible to overfitting. The random forest algorithm can be summarized in four simple steps:

  1. Draw a random bootstrap sample of size n (randomly choose n samples from the training set with replacement).
  2. Grow a decision tree from the bootstrap sample. At each node:
    1. Randomly select d features without replacement.
    2. Split the node using the feature that provides the best split according to the objective function, for instance, by maximizing the information gain.
  3. Repeat steps 1 and 2 k times.
  4. Aggregate the predictions of each tree to assign the class label by majority vote. Majority voting will be discussed in more detail in Chapter 7, Combining Different Models for Ensemble Learning.

There is a slight modification in step 2 when we are training the individual decision trees: instead of evaluating all features to determine the best split at each node, we only consider a random subset of those.
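To make these steps more concrete, here is a deliberately simplified sketch of the procedure built on top of scikit-learn's DecisionTreeClassifier. It is not how RandomForestClassifier is implemented internally; the helper names are made up, bootstrapping is done with NumPy, and the per-node random feature selection of step 2.1 is delegated to the tree's max_features parameter:

>>> import numpy as np
>>> from sklearn.tree import DecisionTreeClassifier
>>> def fit_simple_forest(X, y, k=10, d='sqrt', random_state=1):
...     rng = np.random.RandomState(random_state)
...     trees = []
...     for _ in range(k):
...         # Step 1: draw a bootstrap sample of size n (with replacement)
...         idx = rng.choice(len(X), size=len(X), replace=True)
...         # Step 2: grow a tree; max_features=d restricts each node
...         # to a random subset of d features (steps 2.1 and 2.2)
...         tree = DecisionTreeClassifier(criterion='entropy',
...                                       max_features=d,
...                                       random_state=rng)
...         trees.append(tree.fit(X[idx], y[idx]))
...     # Step 3: steps 1-2 were repeated k times
...     return trees
>>> def predict_simple_forest(trees, X):
...     # Step 4: majority vote over the predictions of the k trees
...     preds = np.array([tree.predict(X) for tree in trees])
...     return np.apply_along_axis(
...         lambda votes: np.bincount(votes.astype(int)).argmax(),
...         axis=0, arr=preds)

Calling predict_simple_forest(fit_simple_forest(X_train, y_train), X_test) on the Iris data from earlier would yield majority-vote predictions in the spirit of the RandomForestClassifier that we will use below, just without its additional optimizations.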

Although random forests don't offer the same level of interpretability as decision trees, a big advantage of random forests is that we don't have to worry so much about choosing good hyperparameter values. We typically don't need to prune the random forest since the ensemble model is quite robust to noise from the individual decision trees. The only parameter that we really need to care about in practice is the number of trees k (step 3) that we choose for the random forest. Typically, the larger the number of trees, the better the performance of the random forest classifier at the expense of an increased computational cost.
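If we want to observe this trade-off ourselves, one simple experiment is to fit forests with an increasing number of trees and compare the test accuracy and the training time. The following sketch assumes the X_train, y_train, X_test, and y_test arrays from the Iris example are still in memory:

>>> import time
>>> from sklearn.ensemble import RandomForestClassifier
>>> for k in [1, 10, 100, 1000]:
...     forest = RandomForestClassifier(criterion='entropy',
...                                     n_estimators=k,
...                                     random_state=1)
...     start = time.time()
...     forest.fit(X_train, y_train)
...     print('k=%4d  test accuracy: %.3f  fit time: %.2f s'
...           % (k, forest.score(X_test, y_test), time.time() - start))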

Although it is less common in practice, other hyperparameters of the random forest classifier that can be optimized (using techniques we will discuss in Chapter 5, Compressing Data via Dimensionality Reduction) are the size n of the bootstrap sample (step 1) and the number of features d that is randomly chosen for each split (step 2.1). Via the sample size n of the bootstrap sample, we control the bias-variance tradeoff of the random forest. By choosing a larger value for n, we decrease the randomness and thus the forest is more likely to overfit. On the other hand, we can reduce the degree of overfitting by choosing smaller values for n at the expense of the model performance. In most implementations, including the RandomForestClassifier implementation in scikit-learn, the sample size of the bootstrap sample is chosen to be equal to the number of samples in the original training set, which usually provides a good bias-variance tradeoff. For the number of features d at each split, we want to choose a value that is smaller than the total number of features in the training set. A reasonable default that is used in scikit-learn and other implementations is $d = \sqrt{m}$, where m is the number of features in the training set.

Conveniently, we don't have to construct the random forest classifier from individual decision trees by ourselves; there is already an implementation in scikit-learn that we can use:

>>> from sklearn.ensemble import RandomForestClassifier
>>> forest = RandomForestClassifier(criterion='entropy',
...                                 n_estimators=10, 
...                                 random_state=1,
...                                 n_jobs=2)
>>> forest.fit(X_train, y_train)
>>> plot_decision_regions(X_combined, y_combined, 
...                classifier=forest, test_idx=range(105,150))
>>> plt.xlabel('petal length')
>>> plt.ylabel('petal width')
>>> plt.legend(loc='upper left')
>>> plt.show()

After executing the preceding code, we should see the decision regions formed by the ensemble of trees in the random forest, as shown in the following figure:

[Figure: decision regions of the 10-tree random forest on the combined training and test samples of the Iris data]

Using the preceding code, we trained a random forest from 10 decision trees via the n_estimators parameter and used the entropy criterion as an impurity measure to split the nodes. Although we are growing a very small random forest from a very small training dataset, we used the n_jobs parameter for demonstration purposes, which allows us to parallelize the model training using multiple cores of our computer (here, two).
