Summary

In this chapter, we learned how to build decision trees for regression and classification tasks. We saw that although the idea is simple, constructing a tree model involves several decisions, such as which splitting criterion to use and when and how to prune the final tree.

In each case, we considered a number of viable options, and it turns out that several different algorithms are used to build decision tree models. Among the best qualities of decision trees are that they are typically easy to implement and very easy to interpret, while making no assumptions about the underlying model of the data. Decision trees also offer native support for feature selection and for handling missing data, and they can accommodate a wide range of feature types.

Having said that, we saw that from a computational perspective, finding a split for a categorical variable is quite expensive, because the number of possible splits grows exponentially with the number of levels. In addition, we saw that this large number of potential splits means categorical features can introduce selection bias into splitting criteria such as information gain.
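As a rough illustration of that growth, a categorical feature with k distinct levels admits 2^(k-1) - 1 ways of partitioning its levels into two non-empty groups. The following minimal sketch (in Python, not taken from the chapter) simply tabulates this count:

# Number of candidate binary splits for a categorical feature with k levels:
# each split partitions the k levels into two non-empty groups, giving
# 2^(k-1) - 1 possibilities, which grows exponentially with k.
def num_binary_splits(k):
    return 2 ** (k - 1) - 1

for k in (2, 4, 8, 16, 24):
    print(f"{k:>2} levels -> {num_binary_splits(k):,} candidate splits")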

Another drawback of decision tree models is that they can be unstable: small changes in the data can alter a splitting decision high up in the tree, and we can consequently end up with a very different tree. Additionally, and this is particularly relevant to regression problems, the finite number of leaf nodes means that our model may not be sufficiently granular in its output. Finally, although there are several different approaches to pruning, we should note that decision trees can be vulnerable to overfitting.
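To make the instability point concrete, here is a minimal sketch (assuming scikit-learn, which the chapter does not necessarily use) that fits the same tree to a few bootstrap resamples of one dataset and reports the split chosen at the root; small perturbations of the data can be enough to change that top-level decision and, with it, the whole tree below it:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

for i in range(3):
    # Draw a bootstrap resample: a small perturbation of the training data.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    # tree_.feature[0] and tree_.threshold[0] describe the root split.
    print(f"resample {i}: root split on feature {tree.tree_.feature[0]} "
          f"at threshold {tree.tree_.threshold[0]:.3f}")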

In the next chapter, we are not going to focus on a new type of model. Instead, we are going to look at different techniques for combining multiple models, such as bagging and boosting; collectively, these are known as ensemble methods. These methods have been shown to be quite effective at improving the performance of simpler models and at overcoming some of the limitations just discussed for tree-based models, such as model instability and susceptibility to overfitting.

We'll present a well-known algorithm, AdaBoost, which can be used with a number of the models that we've seen so far. We will also introduce random forests, a special type of ensemble model specifically designed for decision trees. Ensemble methods in general are typically not easy to interpret, but for random forests we can still use the notion of variable importance that we saw in this chapter to get an overall idea of which features our model relies upon the most.
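As a preview of that last point, the following minimal sketch (again assuming scikit-learn, purely for illustration) fits a random forest and reads off its variable importance scores to see which features the ensemble relies on most:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# Rank features by their impurity-based importance scores.
ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")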
