Starting with decision trees

Much like when we learn anything new, we will begin with a careless, somewhat reckless, flawed (at least on some level) approach. I encourage you to look for ways of improving our models and code as we go. Later, we may end up with very similar or quite different solutions. Even if you can't think of an alternative at the time, or your alternative doesn't work out, I guarantee that paying that much attention will make you learn more from the reading experience.

A great way to start is by discussing which R packages we could use to grow trees and which features we can expect from each of them. Table 6.2 briefly introduces some packages used to estimate tree models, along with popular features of each:

Package name | Title | Popular features
tree | Classification and regression trees | Pruning can be easily implemented with the tree package
rpart | Recursive partitioning and regression trees | Widely popular package; pruning is also implemented by rpart
ipred | Improved predictors | Easily implements bagging for regression, classification, and survival trees
Cubist | Regression modeling using rules with added instance-based corrections | Trains tree-like models that use committees; very similar to boosting techniques
gbm | Generalized boosted regression models | Implements several models using boosting

Table 6.2: Tree-related packages

Bagging, boosting, and committees are more closely related to random forests (spoiler alert), as they combine several tree models. For this reason, I won't compare ipred, Cubist, and gbm with rpart and tree; those will only be compared with the randomForest package. In this section, we will be comparing rpart and tree.
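Before any fitting, the two packages compared in this section need to be installed and loaded. Here is a minimal sketch; the package names come straight from Table 6.2, and install.packages() only needs to run once:

install.packages(c("tree", "rpart"))  # run once if not yet installed
library(tree)
library(rpart)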

Let's get started by splitting the car::Chile data frame into train, validation, and test sets:

library(car)  # provides the Chile data frame
dt_Chile <- Chile[complete.cases(Chile), ]  # keep only rows with no missing values
set.seed(50)
i_out <- sample(dim(dt_Chile)[1],
                size = round(dim(dt_Chile)[1] * .3))

By calling complete.cases(), we make sure to eliminate any row that displays an NA. Although the original dataset had 2,700 observations, only 2,431 of them carried complete information for all columns. This complete dataset is now stored in the object called dt_Chile.
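If you want to confirm those figures, a quick check along these lines should do; the counts are the ones quoted above:

nrow(Chile)     # should return 2700, the original number of rows
nrow(dt_Chile)  # should return 2431, the number of complete rows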

Fancier sampling techniques are available, especially for voting-intention data. For the time being, this simple random approach using sample() will be enough. Approximately 30% of the row indexes from dt_Chile were stored in the i_out object. There is yet more to do.
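A quick sanity check on the size of the holdout might look like the following; round(2431 * .3) gives 729, so that is the length to expect:

length(i_out)                   # should return 729
length(i_out) / nrow(dt_Chile)  # should be close to 0.3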

Previously, in the Linear regression with R section, we used a two-part sample approach (train and test). Now we will be using three parts: train, validation, and test. We have already ruled out what won't be in the estimation dataset (i_out); it's time to split those indexes into validation and test sets. Here is a way to do it:

val <- i_out[1:(length(i_out) %/% 2)]                   # first half of the holdout
test <- i_out[(length(i_out) %/% 2 + 1):length(i_out)]  # second half of the holdout

The %/% operator simply returns the integer quotient of a division. As the preceding code block shows, the val and test objects are subsets of i_out, each getting around half of the original set. The validation data will be useful later, while pruning.
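If integer division is new to you, a quick example clarifies it, together with a check that the two halves cover the whole holdout:

7 %/% 2                     # integer quotient: returns 3
length(val) + length(test)  # should equal length(i_out)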

Do not forget: the i_out, val, and test objects carry indexes pointing into the real data (dt_Chile), not actual data. Those indexes will be used for subsetting later.
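To make that point concrete, here is a sketch of how the indexes could later be turned into actual data frames; the names train_set, val_set, and test_set are my own picks for this illustration:

train_set <- dt_Chile[-i_out, ]  # everything that was not held out
val_set   <- dt_Chile[val, ]     # rows used while pruning
test_set  <- dt_Chile[test, ]    # rows kept for the final evaluation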

Let's start to grow trees with the tree and rpart packages.
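As a rough preview of what growing a tree might look like, here is a minimal sketch; I am assuming that vote is the response variable and that a train_set data frame like the one sketched above is available:

library(tree)
library(rpart)
tree_fit  <- tree(vote ~ ., data = train_set)                    # tree package
rpart_fit <- rpart(vote ~ ., data = train_set, method = "class")  # rpart package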
