Feature selection

Our CPU model only came with six features. Often, however, we encounter real-world data sets that have a very large number of features arising from a diverse array of measurements. Alternatively, we may have to come up with a large number of features when we aren't really sure which features will be important in influencing our output variable. Moreover, we may have categorical variables with many possible levels, from which we are forced to create a large number of new indicator variables, as we saw in Chapter 1, Gearing Up for Predictive Modeling. When our scenario involves a large number of features, we often find that our output only depends on a subset of these. Given k input features, there are 2^k distinct subsets that we can form, so for even a moderate number of features, the space of subsets is too large for us to explore fully by fitting a model on each one.

Tip

One easy way to understand why there are 2^k possible feature subsets is this: we can assign a unique identifying code to every subset as a string of binary digits of length k, where the digit at position i is 1 if we chose to include the ith feature (features can be ordered arbitrarily) in the subset. For example, if we have three features, the string 101 corresponds to the subset that includes only the first and third features. In this way, we form all possible binary strings from a string of k zeros to a string of k ones; these correspond to the numbers from 0 to 2^k - 1, giving 2^k subsets in total.
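To make this encoding concrete, here is a minimal R sketch (using placeholder feature names rather than columns from our CPU data set) that enumerates every subset of three features by stepping through the binary codes from 0 to 2^k - 1:

# Enumerate all 2^k subsets of k features via their binary codes
features <- c("f1", "f2", "f3")
k <- length(features)
subsets <- lapply(0:(2^k - 1), function(code) {
  bits <- as.integer(intToBits(code))[1:k]  # binary digits of this code
  features[bits == 1]                       # include the ith feature when digit i is 1
})
length(subsets)  # 8 subsets in total, that is, 2^3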

Feature selection refers to the process by which a subset of features in a model is chosen in order to form a new model with fewer features. This removes features that we deem unrelated to the output variable and consequently results in a simpler model, which is easier to train as well as interpret. There are a number of methods designed to do this, and they generally do not involve exhaustively searching the space of possible subsets but performing a guided search through this space instead.

One such method is forward selection, which is an example of stepwise regression, a family of methods that perform feature selection in a series of steps. With forward selection, the idea is to start out with an empty model that has no features selected. We then perform k simple linear regressions (one for every feature that we have) and pick the best one. Here, we are comparing models that have the same number of features, so we can use the R² statistic to guide our choice, although we can use metrics such as the AIC as well. Once we have chosen our first feature to add, we pick another feature to add from the remaining k-1 features. To do this, we run k-1 two-feature regressions, one for every possible pair in which one of the features is the feature we picked in the first step. We continue adding features like this until we have evaluated the model with all the features included, and then we stop. Note that at every step, we make a hard choice about which feature to include for all future steps. For example, models with more than one feature that do not include the feature we chose in the first step are never considered. Consequently, we do not exhaustively search our space. In fact, if we take into account that we also assess the null model, we can compute the total number of models on which we perform a linear regression as follows:

1 + k + (k - 1) + (k - 2) + ... + 1 = 1 + k(k + 1) / 2

The order of magnitude of this computation is on the scale of k², which even for moderately sized k is already considerably less than 2^k. At the end of the forward selection process, we have to choose between k+1 models, corresponding to the subsets we obtained at the end of every step of the process. As this final part of the process involves comparing models with different numbers of features, we usually use a criterion such as the AIC or the adjusted R² to make our final choice of model. We can demonstrate this process for our CPU data set by running the following command:

> machine_model3 <- step(machine_model_null, scope = list(lower = machine_model_null, upper = machine_model1), direction = "forward")

The step() function implements stepwise regression; with direction = "forward", it performs forward selection. We first provide it with the null model obtained by fitting a linear model with no features on our training data. For the scope parameter, we specify that we want the algorithm to step from the null model all the way up to our full model consisting of all six features. Issuing this command in R produces output that shows which feature subset is selected at every step of the iteration. To conserve space, we present the results in the following table, along with the value of the AIC for each model. Note that the lower the AIC value, the better the model.

Step   Features in subset                   AIC value
0      {}                                   1839.13
1      {MMAX}                               1583.38
2      {MMAX, CACH}                         1547.21
3      {MMAX, CACH, MMIN}                   1522.06
4      {MMAX, CACH, MMIN, CHMAX}            1484.14
5      {MMAX, CACH, MMIN, CHMAX, MYCT}      1478.36

The step() function uses an alternative stopping criterion for forward selection: it terminates when no remaining feature, if added to the current feature subset, would improve the score. For our data set, only one feature was left out of the final model, as adding it did not improve the overall score. It is interesting and somewhat reassuring that this feature was CHMIN, the only variable whose relatively high p-value indicated that we could not be confident that our output variable is related to it in the presence of the other features.
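If we want to verify this stopping criterion ourselves, base R's add1() function reports the AIC that would result from adding each candidate feature not yet present in a fitted model. The following line is a sketch using the model objects from above; it should confirm that adding CHMIN to the selected model does not lower the AIC:

> add1(machine_model3, scope = formula(machine_model1))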

One might wonder whether we could perform variable selection in the opposite direction by starting off with a full model and removing features one by one based on which feature, when removed, will make the biggest improvement in the model score. This is indeed possible, and the process is known either as backward selection or backward elimination. This can be done in R with the step() function by specifying backward as the direction and starting from the full model. We'll show this on our cars data set and save the result into a new cars model:

> cars_model_null <- lm(Price ~ 1, data = cars_train)
> cars_model3 <- step(cars_model2, scope = list(
    lower = cars_model_null, upper = cars_model2), direction = "backward")

The formula for the final linear regression model on the cars data set is:

Call:
lm(formula = Price ~ Mileage + Cylinder + Doors + Leather + Buick + Cadillac + Pontiac + Saab + convertible + hatchback + sedan,
    data = cars_train)

As we can see, the final model has thrown away the Cruise, Sound, and Chevy features. Looking at our previous model summary, we can see that these three features had high p-values. The previous two approaches are examples of greedy algorithms. That is to say, once a choice about whether to include a variable has been made, it becomes final and cannot be undone later. To remedy this, a third method of variable selection, known as mixed selection or bidirectional elimination, starts out like forward selection with forward steps that add variables, but also includes backward steps when these can improve the AIC. Predictably, the step() function does this when the direction is specified as both.
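As a sketch of how we might invoke mixed selection on the CPU data set, reusing the same model objects as before (machine_model4 is just an illustrative name for the result):

> machine_model4 <- step(machine_model_null, scope = list(
    lower = machine_model_null, upper = machine_model1), direction = "both")

With mixed selection, a feature added in an earlier step can later be dropped again if removing it lowers the AIC.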

Now that we have two new models, we can see how they perform on the test sets:

> machine_model3_predictions <- predict(machine_model3, machine_test)
> compute_mse(machine_model3_predictions, machine_test$PRP)
[1] 2805.762
> 
> cars_model3_predictions <- predict(cars_model3, cars_test)
> compute_mse(cars_model3_predictions, cars_test$Price)
[1] 7262383

For the CPU model, we perform marginally better on the test set than we did with our original model. A suitable next step might be to investigate whether this reduced set of features works better in combination with the removal of our outlier; this is left as an exercise for the reader. In contrast, for the cars model, we see that the test MSE has increased slightly as a result of removing all these features.
