Decision trees

Finally, we arrive at our last model in this chapter: decision trees. Like linear models, decision trees are widely used and, although not quite as easy to digest, good for interpretation. The core idea behind trees is very different from linear models but is easy to comprehend. To estimate the outcome, the model generates a binary tree: a tree-like diagram where each node (intersection) represents a single question, based on the known features, with a yes/no answer, usually something like is the number of casualties smaller than 1,000? At the end of each branch, a corresponding estimate is attached. The tree is generated so that the overall accuracy of predictions is maximized.

As the depth of the tree can vary, decision trees can, in theory, achieve 100% accuracy on the training set, simply by asking questions until there is only one record at each end. This will, however, decrease accuracy on any external data. This phenomenon is called overfitting. We will talk about how to mitigate it in the next chapter.
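To make the overfitting effect concrete, here is a minimal sketch (using a synthetic dataset generated with sklearn, rather than our battles data) comparing an unconstrained tree with one whose depth is capped:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# synthetic data, purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=2019)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2019)

# an unconstrained tree keeps splitting until every leaf is pure,
# so it typically scores perfectly on the data it was trained on
deep = DecisionTreeClassifier(random_state=2019).fit(X_tr, y_tr)
print(accuracy_score(y_tr, deep.predict(X_tr)))  # usually 1.0
print(accuracy_score(y_te, deep.predict(X_te)))  # usually noticeably lower

# capping the depth trades training accuracy for better generalization
shallow = DecisionTreeClassifier(max_depth=3, random_state=2019).fit(X_tr, y_tr)
print(accuracy_score(y_te, shallow.predict(X_te)))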

Another weak spot of decision trees is that they split on each feature separately; there is none of the interplay between features that we enjoyed with KNN. Hence, it is much harder for them to detect any interaction between multiple features. Creating a smart set of features by hand, for example, a ratio of axis troops to allied troops, might therefore lead to a significant gain in performance. For the same reason, decision trees do not care about scaling, as they never compare features to each other.
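As a rough sketch of what such feature engineering could look like (the column names axis_infantry and allied_infantry here are hypothetical, and we assume Xtrain and Xtest are pandas DataFrames; substitute the actual column names from the dataset):

# hypothetical column names; adjust to the actual dataset
# the ratio encodes an interaction between two features in a single column,
# which a tree can then split on directly
Xtrain['troop_ratio'] = Xtrain['axis_infantry'] / Xtrain['allied_infantry']
Xtest['troop_ratio'] = Xtest['axis_infantry'] / Xtest['allied_infantry']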

Let's try building a decision tree on our dataset, the same data we used for KNN:

>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.metrics import accuracy_score
>>> tree_model = DecisionTreeClassifier(random_state=2019)
>>> tree_model.fit(Xtrain, ytrain)
>>> accuracy_score(ytest, tree_model.predict(Xtest))
0.5

In this case, the decision tree performed at the same level as KNN did; perhaps we need to work on our features. But how can we diagnose the model? The sklearn implementation of decision trees can generate a diagram, defined in the dot language of the Graphviz software. This definition can then be rendered, in our case, straight in the notebook.

Following is the code for the diagram generation:

  1. First, we need to import the corresponding function from sklearn, export_graphviz, together with the pydotplus package for rendering, StringIO for in-memory, file-like objects, and IPython's Image object for visualization within the notebook:
from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image
import pydotplus
  2. Now, we need to create a file-like object (we could write to a real file on disk instead, if we wanted).
  3. After that, we run the export_graphviz command, passing the tree to be written to the file as a diagram:
dot_data = StringIO()

export_graphviz(tree_model, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=cols)
  4. Finally, we ask pydotplus to render the chart from our pseudo-file and use Image to show the resulting image within the notebook:
graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) 
Image(graph.create_png())
  5. Here is the outcome:

As you can see, according to the model (and our training set), of all features available, the axis infantry is the most significant predictor, followed by the number of guns on the axis side.
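We can also check this numerically: sklearn trees expose a feature_importances_ attribute, which scores how much each feature contributed to the splits. Here is a quick sketch, reusing the cols list of feature names we passed to export_graphviz:

import pandas as pd

# pair each feature name with its importance score and sort, largest first
importances = pd.Series(tree_model.feature_importances_, index=cols)
print(importances.sort_values(ascending=False))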

In this section, we reviewed unsupervised and supervised machine learning models, which help us understand the data and its internal relationships, and attempt to predict values. These models can also be useful beyond prediction itself: they highlight relationships within the dataset, quirks in the data, and the role of different properties of each record.
