Preprocessing

The caret package allows us to perform a variety of preprocessing tasks on our data, such as scaling, centering, removing variables with very low variability (near-zero variance), and projecting the data onto its principal components. The main workhorse for this is the preProcess() function.
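As a minimal sketch of how preProcess() is typically used (the data frame df and its columns are made up purely for illustration), we first estimate the transformations from the data and then apply them with predict():

library(caret)

# Hypothetical numeric data frame standing in for a real dataset
df <- data.frame(x1 = rnorm(100, mean = 50, sd = 10),
                 x2 = runif(100),
                 x3 = c(rep(0, 98), 0.001, 0.002))  # near-zero variance column

# Estimate centering, scaling, a near-zero-variance filter, and PCA
pp <- preProcess(df, method = c("center", "scale", "nzv", "pca"))

# Apply the learned transformations to the same (or to new) data
df_transformed <- predict(pp, newdata = df)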

In this recipe, we will explore how to carry out several data transformation steps before modeling, using the Boston dataset (included in the MASS package). This is a famous dataset containing the median value of owner-occupied homes (the medv variable) for several areas of Boston, along with a number of metrics describing each area. The objective is to use those metrics to predict the median house value for each area. We will explain how to do this using random forests.
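As a hedged sketch of the setup assumed throughout this recipe, we can load Boston from MASS and hold out a test set with caret's createDataPartition(); the 80/20 split proportion and the seed are arbitrary choices, not something prescribed by the recipe:

library(MASS)    # provides the Boston dataset
library(caret)

data(Boston)
set.seed(100)    # arbitrary seed, only for reproducibility

# medv (median home value) is the response; everything else is a predictor
train_idx <- createDataPartition(Boston$medv, p = 0.8, list = FALSE)
training  <- Boston[train_idx, ]
testing   <- Boston[-train_idx, ]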

There are essentially two ways of doing this in caret:

  • By passing the preProcess= argument to the train() function (this is less flexible, but can be used with cross-validation)
  • By calling the preProcess() function before calling train() (this is more flexible, but requires separate training and testing datasets); both approaches are sketched after this list
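The following sketch contrasts the two approaches, reusing the training/testing split from the earlier snippet; the choice of method = "rf" (which requires the randomForest package) and of centering/scaling as the preprocessing steps is purely illustrative:

library(caret)

# Approach 1: declare preprocessing inside train(); caret re-estimates the
# centering/scaling parameters within each resample
fit1 <- train(medv ~ ., data = training,
              method     = "rf",
              preProcess = c("center", "scale"))

# Approach 2: preprocess explicitly before train(); the transformation is
# learned on the training predictors and applied to both datasets
predictors <- setdiff(names(training), "medv")
pp <- preProcess(training[, predictors], method = c("center", "scale"))

training_pp      <- predict(pp, training[, predictors])
training_pp$medv <- training$medv
testing_pp       <- predict(pp, testing[, predictors])
testing_pp$medv  <- testing$medv

fit2 <- train(medv ~ ., data = training_pp, method = "rf")

Note that with the second approach we must remember to transform any new data (including the test set) with the same pp object before generating predictions.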

The first approach is usually chosen when we want to find the best hyperparameters in conjunction with a tuneGrid. The second approach gives us more control, but it does not combine well with cross-validation: suppose we take our full dataset and apply a transformation (such as imputing missing values) that has been estimated from all of the data; that transformation then affects both the training and testing portions of every fold, so the held-out data is contaminated by information that the preprocessing learned from it. The idea behind k-fold cross-validation is to split the data into k parts and use k-1 of them for training and 1 for evaluation, ensuring that no part is ever used for both purposes. This is the usual approach for finding the best hyperparameters for any model.
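As a hedged sketch of the first approach in a tuning context (the 10 folds and the mtry values are illustrative assumptions), the preprocessing is declared inside train(), so it is re-estimated on the k-1 training folds of every iteration rather than on the full dataset:

library(caret)

ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation
grid <- expand.grid(mtry = c(2, 4, 6, 8))          # hypothetical tuning grid

set.seed(100)
fit_cv <- train(medv ~ ., data = training,
                method     = "rf",
                trControl  = ctrl,
                tuneGrid   = grid,
                preProcess = c("center", "scale"))

fit_cv$bestTune   # mtry value selected by cross-validation

Because the centering and scaling parameters are learned only from the training folds in each iteration, the held-out fold is never contaminated.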
