Chapter 5. Random Forest

In this chapter we will look at the random forest machine learning algorithm. It is a wonderful algorithm: effective on a wide range of data sets, while having relatively few parameters to tune. It is a decision tree algorithm (as is GBM, which we look at it in the next chapter).

I start with a brief look at basic decision trees, then how random forest is different, and then go through the optional parameters that H2O’s implementation offers. Then I apply random forest to each of the three data sets: first out-of-the-box, with all defaults, then using a tuning process to find the best single model I can. Each of the subsequent three chapters will follow this same pattern. As the first of the four chapters, grids—a great tool to aid in tuning—are also introduced here. The results of all models are summarized at the end of the book, in Chapter 11.

The tuning process is to try and improve on the default settings. But the H2O implementations tend to have good defaults that adapt to characteristics of your data, so I quickly reach the point of diminishing returns. Have in your mind how much time and effort a certain increase in model accuracy is worth. Maybe your day is better spent on feature engineering than tuning? Maybe $1000 would be better spent on additional data (whether buying data sets, or running your own surveys) than buying 500 node-hours on EC2 to run grids?

Throughout this book, but particularly in the next few chapters, I’ve deliberately shown some of the bad hunches, the wrong turns, and the just plain (with hindsight) stupidity. I find these just as educational as seeing what worked in the end, and I hope you do too.

Decision Trees

Decision trees, in their simplest form, are perhaps the most easily understandable approach to machine learning. Unlike the black box of neural networks, or the mathematical equations of linear models, decision trees look just like a flow chart.

See Figure 5-1 for an example of a decision tree for classification (these are called classification trees).

pmlh 0501
Figure 5-1. A classification tree: deciding whether to walk or catch a taxi

You don’t need to know much about how decision trees are built to be able to use them. Basically, the variable at each level that will give the best split is chosen, and the most common definition of best is information gain.

They can also be used as regression trees, where the value to be learned is a continuous variable. Regression trees are really still just classifying: the value in the leaf nodes is the average of all the training data that matched that branch of the tree. They are built by choosing the variable that gives the greatest reduction in standard deviation at each node. See Figure 5-2.

Random Forest

Random forest is an ensemble algorithm, meaning more than one model is made, and their results used together, the aim being to cope better with unseen situations (i.e., to avoid overfitting). This is not the only time we will meet ensemble algorithms in this book.

pmlh 0502
Figure 5-2. A regression tree: estimating how long a car journey will take

If you train a decision tree on a fairly complex data set (and don’t take precautions against overfitting), you will find a very deep tree full of fragile rules. The idea behind random forest is to instead have lots of trees. Then, when you use it to predict on new data, you give the new data to each of those trees and ask each for their prediction. If it’s a classification you choose the most popular answer, and if it’s a regression you take the mean of each tree’s answer.

That is the “forest” half. The other half, the “random,” says that when training you don’t give each tree all the training data; you randomly hold back some rows, or hold back some columns. This makes each individual tree a bit dumber than if it had seen all the data. But when their results are averaged together the whole is more intelligent than any one part.

Parameters

Most of the parameters were introduced in Chapter 4, but there are some specific to random forest. For Python users, all of these are given when creating the object, not when calling train().

The two most important parameters are:

ntrees

How many trees in your forest.

max_depth

How deep a tree is allowed to grow. In other words, how complex each tree is allowed to be.

Together these two parameters control how big your random forest will be: acres of squat apple trees, or a small grove of giant oaks. The defaults are 50 trees, to a max depth of 20. The training time is going to be roughly proportional to the number of trees times the number of training rows. So you want ntrees to be as small as possible… but no smaller.

The control of the random part is done by parameters already introduced in “Sampling, Generalizing”. To remind you:

mtries

This is how many variables to randomly choose as candidates at each split. The default is –1, which means StartRoot p EndRoot for classification, or p/3 for regression (where p is the number of columns). Set it to the number of columns in train to have it use all variables.

col_sample_rate_change_per_level

Relative change of the column sampling rate for every level in each tree. The default is 1.0. If less than 1, then it will have fewer columns to choose from as it gets deeper in the tree. If greater than 1 (maximum 2.0) then it will have more columns to choose from.

col_sample_rate_per_tree

This can be from 0.0 to 1.0. It is at the tree level, rather than at the split level as with mtries and col_sample_rate.

sample_rate

The default is 0.632, which means each tree is trained on 63.2% of the training data.

sample_rate_per_class

Like sample_rate but you give the value for each class. See the description of class_sampling_factors under “Data Weighting”.

The next two parameters control if splitting is done:

min_rows

How many training data rows are needed to make a leaf node. The default is 1, meaning that you can have a path through the tree that represents something that was only seen once in the training data. That obviously encourages overfitting. But, if you know you have some cases only represented once in your data, then 1 is what you want.

min_split_improvement

Each time a split happens there is reduction in the inaccuracy, in the error. This controls how much that error reduction has to be to make splitting worthwhile. The default is zero.

The next set of parameters control how the splitting is done:

histogram_type

What type of histogram to use for finding optimal split points. Can be one of “AUTO,” “UniformAdaptive,” “Random,” “QuantilesGlobal,” or “RoundRobin.”

nbins

For numerical columns, build a histogram of (at least) this many bins, then split at the best point. The default is 20.

nbins_top_level

For numerical columns, build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level. The default is 1024.

nbins_cats

For categorical columns, build a histogram of (at most) this many bins, then split at the best point. Higher values can lead to more overfitting. The default is 1024.

The next one only applies to binary classification:

binomial_double_trees

Build one set of trees for each output class. Can give higher accuracy, and the trade-off is that you get twice as many trees. (ntrees * 2 will be built.)

Finally, when using random forest on a cluster there is a fair bit of network communication. Unless you are using the cluster because you have big data, I recommend you set this to true. With small data sets the communication overhead will destroy any benefit you were hoping to get from using those other nodes.

build_tree_one_node

Run on one node only. You will only be using the CPUs on that node; the rest of the cluster will be unused.

Building Energy Efficiency: Default Random Forest

This data set has to do with the heating costs of houses (see “Data Set: Building Energy Efficiency” if you skipped the earlier introduction to it), and it is a regression problem. If you are following along, run either Example 3-1 (for R) or Example 3-2 (for Python) from the earlier chapter, which sets up H2O, loads the data, and defines train, test, x, and y. (See “Jargon and Conventions” in Chapter 1 for a reminder of naming conventions.)

There is no valid (i.e., no validation data set); instead we will use k-fold cross-validation. It is a relatively small data set, and is quick for random forest to model, so I have used 10-fold. (Refer back to “Cross-Validation (aka k-folds)” if you need a reminder about cross-validation.)

With our train and test data sets prepared, training the random forest is a one-liner, which takes just a couple of seconds to run:

m <- h2o.randomForest(x, y, train, nfolds = 10, model_id = "RF_defaults")

In Python that looks like:

m = h2o.estimators.H2ORandomForestEstimator(model_id="RF_defaults", nfolds=10)
m.train(x, y, train)

Now, type m to see how the training went. I will show an extract from an IPython console here (trimmed to fit) but the R client shows all the same figures. By the way, in R, summary(m) shows more information than just printing m:

In [23]: m.train(x, y, train)
drf Model Build progress: |██████████████████████████████| 100%

In [24]: m
Out[24]: Model Details
=============
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  RF_defaults
Model Summary:
 num_of_trees model_size min_depth max_depth min_leaves max_leaves mean_leaves
 ------------ ---------- --------- --------- ---------- ---------- -----------
 50           133472     20        20        77         364        204.56

ModelMetricsRegression: drf
** Reported on train data. **

MSE: 3.29872615438
RMSE: 1.81623956415
MAE: 1.24760757203
RMSLE: 0.0567712695161
Mean Residual Deviance: 3.29872615438

ModelMetricsRegression: drf
** Reported on cross-validation data. **

MSE: 3.22444719482
RMSE: 1.79567457932
MAE: 1.23071814584
RMSLE: 0.0561318898512
Mean Residual Deviance: 3.22444719482

Firstly it says there are 50 trees, and in this case they all used the maximum allowed depth of 20. (You will see some random variation from run to run, unless you set a seed.1)

Under regression metrics (the ones “Reported on cross-validation data”) I see an MSE, mean squared error, of 3.224. Not zero, so our model is not perfect. See “Supported Metrics” in the previous chapter for more on the various metrics.

If I look over on Flow I will see 11 models, one for each of the 10 folds, and then one final model on the whole data. The preceding model summary gave me all those metrics on each fold. In this extract notice the wide range—the average MSE was 3.16, but ranged from 2.33 to 4.85:

Cross-Validation Metrics Summary:
                   mean       sd          cv_1_valid  ...  cv_9_valid cv_10_valid
-----------------  ---------  ----------  ----------  ...  ---------- -----------
mae                1.21774    0.144463    1.00348     ...  1.60152    1.07393
mse                3.16326    0.682396    2.23307     ...  4.85474    2.55511
r2                 0.965288   0.0048532  0.973659    ...  0.955043   0.971501
residual_deviance  3.16326    0.682396    2.23307     ...  4.85474    2.55511
rmse               1.75908    0.185602    1.49434     ...  2.20335    1.59847
rmsle              0.0553582  0.0044408  0.0466292   ...  0.0655638  0.0514496

How does it do on the unseen test data? In R you get that with h2o.performance(m, test), in Python m.model_performance(test), and it looks like this:

In [25]: m.model_performance(test)
Out[25]:
ModelMetricsRegression: drf
** Reported on test data. **

MSE: 3.62649127211
RMSE: 1.90433486344
MAE: 1.33001699261
RMSLE: 0.0582354639097
Mean Residual Deviance: 3.62649127211

A higher MSE than on either our training data, or our cross-validation mean. The square root of MSE (RMSE here) is in real-world units, kWh/(m²yr)—i.e., kilowatt-hours, per square meter of floorspace, per year—so just a reminder that we were trying to predict Y2, the required cooling load. The range on Y2 is from 10.90 to 48.03. The RMSE of 1.90 is not too bad—the guesses are in the right ballpark. The average error, that is. The mean can hide all kinds of sins, so let’s also look at the results graphically2; see Figure 5-3.

pmlh 0503
Figure 5-3. Default performance of random forest on test data

The black dots are the correct answers, the squares are relatively close predictions, and the up and down triangles are the worst predictions; for the sake of this plot a bad prediction was defined as 8% above or below the correct answer, and that represents 27 (14 too high, 13 too low) of the 143 test samples here.3

Grid Search

All the H2O machine learning algorithms have parameters: knobs, which you can tweak, that will often affect the performance of the model you build. But the interactions between the parameters can be complex.

The labor-intensive way is to try a model, evaluate it, then fiddle with one of the parameters, and repeat. If your intuition is good this may be the most efficient way.

A more systematic way would be to set up nested loops of all the values for each parameter that you think might be important. So, for deep learning you might try 100, 200, and 300 epochs, with three network toplogies (200x200, 64x64x64, 500x50), and L1 regularization of 0 or 0.0001. (The meaning of these parameters will be explained in Chapter 8.) That is 18 combinations, so it takes 18 times as long as making one model.

Alternatively, rather than comprehensively trying all 18 combinations, you might randomly choose 6 of them to try, which only takes one-third of the time, and the hope is that you still get to learn which are best and worst values for the parameters.

When we created the random forest, it used defaults for all the parameters, except for specifying the 10-fold cross-validation. But the results were not as good as they can be. So the question becomes, how can we make it better?

You, at the back, what was that? “Throw a load of trees at it.” Brute force, I like your style, sir. And what was that ma’am? Early stopping? (See “Early Stopping”.) Excellent. You are the yin to his yang. But, you lack depth. No, not you personally, sir, you have plenty of depth.

I mean you could also make deeper trees. H2O’s random forest defaults to ntrees=50 trees, with a max_depth of 20. Do we want to try 100 trees, keeping depth as 20? Or keep 50 trees, but allow them to grow to a depth of 40? Or both?

Those are the easy ones to tune: higher values are better (well, to the point of diminishing returns, at least). But then there are all the fiddly ones. For instance, is mtries better nudged a bit higher, or nudged a bit lower? Don’t look at me, I don’t know. I just work here.

Grids are the solution to this dilemma, and the H2O implementation currently comes in two forms:

  • Comprehensive ("Cartesian")

  • Random ("RandomDiscrete")

Cartesian

The first, the default, will try all combinations. Here is an example of that type of grid, first in R, then in Python:

g <- h2o.grid("randomForest",
  hyper_params = list(
    ntrees = c(50, 100, 120),
    max_depth = c(40, 60),
    min_rows = c(1, 2)
    ),
  x = x, y = y, training_frame = train, nfolds = 10
  )
Tip

If you ever get a 500 Server Error with h2o.grid(), in R, check that you’ve given the algorithm name correctly! It is case-sensitive.

import h2o.grid

g = h2o.grid.H2OGridSearch(
  h2o.estimators.H2ORandomForestEstimator(
    nfolds=10
    ),
  hyper_params={
    "ntrees": [50, 100, 120],
    "max_depth": [40, 60],
    "min_rows": [1, 2]
    }
  )
g.train(x, y, train)

In R, you call h2o.grid, telling it which function to run, the hyper-parameters, then the constant parameters. By constant parameter I mean a model parameter that you don’t want to experiment with in the grid, so it will have a fixed value in each model. In Python you make the H2OGridSearch object, giving it an instance of the function to use, with most of the constant parameters, then you give the hyper-parameters as a dictionary of arrays; next you call train() just as you do when calling train() on a model object.

hyper_params specifies the combinations we want it to try. So, here I have given three alternatives for ntrees, two for max_depth, and two for min_rows. 3 x 2 x 2 = 12, so 12 models will get made. Because of the combinatorial explosion, each additional hyper-parameter that gets added has a huge effect on the time taken to complete.

Type g (whether using R or Python) to get output like this:

 min_rows ntrees max_depth      model_ids      deviance
1     1    120        60 RF_structure1_model_10  3.2616
2     1    120        40  RF_structure1_model_4  3.2616
3     1    100        60  RF_structure1_model_8  3.2724
4     1    100        40  RF_structure1_model_2  3.2724
5     1     50        60  RF_structure1_model_6  3.3210
6     1     50        40  RF_structure1_model_0  3.3210
7     2    120        40  RF_structure1_model_5  3.3518
8     2    120        60 RF_structure1_model_11  3.3518
9     2    100        40  RF_structure1_model_3  3.3525
10    2    100        60  RF_structure1_model_9  3.3525
11    2     50        60  RF_structure1_model_7  3.3662
12    2     50        40  RF_structure1_model_1  3.3662

It has ordered from best to worst: lower residual deviance (equivalent to MSE here) is what we are after. Though the range of values looks narrow, we’ve actually learned a lot from this. First, min_rows of 1 is always better than 2. Second, max_depth of 40 and 60 gives exactly the same result.4 And, third, that more ntrees was always better. Such clear-cut results are unusual—normally you have to piece apart these conclusions.

To seed or not to seed? That is the question: whether ’tis nobler in the mind to suffer the slings and arrows of outrageous random variation, or to take arms against a sea of troubles by setting the same seed each time. Or—and I feel Hamlet overlooked this third choice—you could set seed as one of the grid hyper-parameters, and get a feel for how much random variation is disturbing your conclusions. I use this third approach regularly.

Including nfolds=10 makes the computation much slower (24 seconds instead of 5.5 seconds in this case, so 4x to 5x slower), but the estimate of model performance becomes more consistent. When I tried without nfolds, a couple of the min_rows=1 entries ended up down the bottom, instead of in the middle, and the range of deviance was bigger. Trying nfolds=5 took 17 seconds instead of 24 seconds, but all the deviances were higher; however, it gave the same ordering as for nfolds=10 and that is the important thing.

You can use a different metric to order your grid models; the preceding was using the default of deviance. If you were most interested in how they compare on the R² metric, the next code is what you want. Notice how R’s h2o.getGrid() takes a grid ID, rather than a grid object:

g_r2 <- h2o.getGrid(g@grid_id, sort_by = "r2", decreasing = TRUE)
Tip

To find out your sorting options, give sort_by = "xxx" and they are listed in the error message. Refer to “Supported Metrics” if some are unfamiliar.

In Python you get all models when you print g or g_r2. However, printing a grid in R currently shows just the best 6 and worst 6 models; if you have more than 12 models then you will need to download it with this rather clunky idiom:

as.data.frame( g_r2@summary_table )
Tip

Grids in H2O are basically just a set of nested loops (or random parameter selection for the other form), and you could hack your own solution in a few minutes. On the other hand, they get their own top-level menu in Flow, and the g object is a nice container for the models, and the API comes with tables to compare them.

Just bear in mind that if you start craving more flexibility, and don’t mind giving up those home comforts, you could write your own loop…

RandomDiscrete

The other mode for grids is “RandomDiscrete.” Use this when you have so many hyper-parameters that trying all combinations exhaustively would be, well, exhausting. RandomDiscrete will jump from one random combination to another. It needs some additional parameters to control when it should stop, and you should specify at least one of these:

max_models

Make this many models, then stop.

max_runtime_secs

Run for this long, then stop.

stopping_metric

AUTO, misclassification, etc.

stopping_tolerance

For example, 0.0001, to require at least 0.01% improvement in the given metric.

stopping_rounds

For example, 5. In combination with stopping_tolerance of 0.0001 it means: if none of our last five random models has managed to be 0.01% better than the best random model before that, then stop.

The three stopping choices work just like the ones we saw in “Early Stopping”, for stopping a model’s learning, but they are being applied at the grid level. Note that you can still have these three stopping parameters in your grid’s hyper-parameters, or constant parameters, and these will apply to each model that is built.

To see how this works, the next example uses early stopping on both the models and on the grid. The grid search will stop if the best MSE out of the last 10 models is not at least 0.1% better than the best MSE of a model the grid made before those 10. There is also an overall time limit of 120 seconds.

The hyper-parameters being tried are:

  • ntrees: from 50 to 250.

  • mtries: the building energy data set has eight predictor columns, so the default of 8÷3, rounded-down, is 2. That feels unreasonably low, so I also try 3, 4, and 5.

  • sample_rate: the default is 0.632, so a bit below, a bit above, and then quite a lot above (95% of samples).

  • col_sample_rate_per_tree: the default is 1.0, so a bit below that, then a lot below that (50%).

That is a total of 240 model combinations.

The max_depth is fixed at 40 for all models, and 5-fold cross-validation is used (instead of 10-fold, so as to speed things up). Then the per-model early stopping says that if we go four scoring rounds without any improvement at all (scoring_tolerance=0) in the deviance, then stop.

Tip

Normally random forest is scored after every tree is added, but score_tree_interval=3 (which is just a way of telling it to spend more time building trees relative to time spent scoring) combined with four scoring rounds actually means 12 trees have to be added, with zero improvement, before it will stop early.

g <- h2o.grid("randomForest",
  search_criteria = list(
   strategy = "RandomDiscrete",
   stopping_metric = "mse",
   stopping_tolerance = 0.001,
   stopping_rounds = 10,
   max_runtime_secs = 120
   ),
  hyper_params = list(
    ntrees = c(50, 100, 150, 200, 250),
    mtries = c(2, 3, 4, 5),
    sample_rate = c(0.5, 0.632, 0.8, 0.95),
    col_sample_rate_per_tree = c(0.5, 0.9, 1.0)
    ),
  x = x, y = y, training_frame = train,
  nfolds = 5, max_depth = 40,
  stopping_metric = "deviance",
  stopping_tolerance = 0,
  stopping_rounds = 4,
  score_tree_interval = 3
  )

For me, it ran for a bit under 90 seconds, and made 61 of the possible 240 models. The early stopping meant it never used all the tree allowance it was given; in fact, the biggest model used 70 trees. This, of course, means it was a waste to have ntrees as a hyper-parameter! On another run, I got just 42 models, with the largest forest having 51 trees.

For mtries the best model used 5, but 3 and 4 were also in the best 3. However, the default of 2 seems to have done relatively poorly. So, for our next grid consider dropping 2. (In another run, the top 10 were all 3 and 4.) For col_sample_rate all the 0.5 models were in the bottom-third, but 9 of the best 10 used 0.9, rather than the default of 1.0. How about 0.85, 0.90, and 0.95 for the next round? sample_rate is less clear-cut, but the top 9 are either 0.632 or 0.8, while 0.95 looks poor. Maybe 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, and 0.85 for the next round? In yet another run 3 of the top 6 were 0.95. So, maybe 0.55, 0.65, 0.75, 0.85, 0.95? Or, maybe this is a sign to stop trying to tune it? That is, your energy is better spent elsewhere.

Warning

Certain combinations of parameters can be illegal, and when this happens those models will just fail to build, but the rest of the grid will complete, and you may not realize there was a problem.

If you look on Flow you will see the error messages. To see them in Python, type g.failure_details (you get no output if there are no problems). You can see failures in R by just outputting the grid with g.

Currently H2O’s grid implementation is still a bit immature: there is no mode yet that guides its search by which parameters on previous models worked better. There is also no support yet for running models in parallel over a cluster. There is also no way to have dependencies between the parameters, or have one be a function of another. For instance, you might want to try various values of sample_rate, but increase ntrees when learn_rate is lower. If you need that level of flexibility, or any of the other missing features of h2o.grid, you’ll need to implement your own version. Alternatively, put a high-level loop on top of multiple calls to h2o.grid(), and combine the results. (You will see this used in Chapter 8, when trying to experiment with differing number of hidden layers, for instance.)

Tip

If you use the same grid_id on multiple grid requests, the results get merged! This can allow you to narrow in on your parameters, but still see all the models in one big table.

High-Level Strategy

What is the best way to use grids? I often start with a small exhaustive search, testing the most important few parameters, to get a feel for where it might be going. Then I do a few random searches across more of the parameters (this grid search tutorial shows the list of parameters that can be tuned, for each algorithm), with relatively short iterations (on the order of a few minutes to complete). After each iteration see if there are any obvious conclusions to use to guide the next iteration. (col_sample_rate=0.5 being bad was a good example.)

As you narrow in on what you think are the best iterations you may switch to cartesian (exhaustive search) mode. This is also a good time to try three or four random seeds, to test the sensitivity to random numbers. And if you reduced the number of k-folds, to speed things up, it is a good idea to increase it again, for a bit more precision.

Warning

Never use the test data set in the grid search phase. Rely on the validation score, or the cross-validation score. Only when you have selected a final model, “ready for production,” can you then evaluate it on the test set.

Building Energy Efficiency: Tuned Random Forest

The previous section, “Grid Search”, has made a good start on improving the parameters. There may not be much more to do.

I’ve tried using nbins, which is the minimum number of levels to divide a numeric predictor column into, when considering how to split on it. It defaults to 20, but most of our numeric fields don’t have even 20 distinct values. So I tried values of 8, 12, 16, 20, and 24. The conclusion? All five possible values were used in the top 8 models (out of the 49 that were made in that grid iteration). So that was a failed experiment; best to leave nbins as the default.

If I was trying to get two or three diverse models for an ensemble (see “Ensembles”) then I would perhaps choose (out of the best models) those with the biggest range of hyper-parameters. But, for the sake of choosing one best model, I did a few more iterations of grids and settled on:

  • max_depth = 40

  • ntrees = 200

  • sample_rate = 0.7

  • mtries = 4

  • col_sample_rate_per_tree = 0.9

  • nbins left as default (20)

The choice of random seed was a big influence, so I chose six seeds, and used the following listing to compare the default model with this tuned model:

seeds <- c(101, 109, 373, 571, 619, 999)

defaultModels <- lapply(seeds, function(seed){
  h2o.randomForest(x, y, train, nfolds = 10, seed = seed)
  })

tunedModels <- lapply(seeds, function(seed){
  h2o.randomForest(x, y, train, nfolds = 10, seed = seed,
    max_depth = 40, ntrees = 200, sample_rate = 0.7,
    mtries = 4, col_sample_rate_per_tree = 0.9,
    stopping_metric = "deviance",
    stopping_tolerance = 0,
    stopping_rounds = 5,
    score_tree_interval = 3)
  })

def <- sapply(defaultModels, h2o.rmse, xval = TRUE)
tuned <- sapply(tunedModels, h2o.rmse, xval = TRUE)

boxplot(c(def, tuned) ~ c((rep(1, 6),rep(2, 6)) )

Figure 5-4 shows how there does seem to be a distinct improvement, though the best of the default models, and the worst of the tuned models, stop it from being clean.

pmlh 0504
Figure 5-4. Box plot comparing random variation in default and tuned models

I chose RMSE, but MSE gives identical conclusions.5 I also chose to plot the cross-validation metric, rather than the self-train metric. Oh, and check the y-axis before you get too impressed—notice how far away 0.0 is.

Remaking the same chart from earlier,6 can you spot the difference? We went from 14 too high, 13 too low, to 8 too high and 17 too low! We come to the same conclusion as the previous box plots: tuning has given us a little improvement, but nothing earth-shattering.

The results of all models on this data set will be compared in “Building Energy Results” in the final chapter.

pmlh 0505
Figure 5-5. Tuned performance of random forest on test data
Note

As a follow-up, I tried repeating the grid experiments, but without early stopping: each model got to use all 200 trees. I settled on almost the same parameters, but a higher sample_rate of 0.9.

Naturally, training took more time. The cross-validation MSE was slightly better, with a narrower range, but the MSE was notably worse on the test data, suggesting that early stopping was preventing some overfitting?

MNIST: Default Random Forest

This is a pattern recognition problem (see “Data Set: Handwritten Digits” for a reminder), and because we are trying to assign one of ten values to each sample, it is a multinomial classification problem. If following along, run either Example 3-3 or Example 3-4 from the earlier chapter, which sets up H2O, loads the data, and defines train, valid, test, x, and y.

Unlike before, with the building energy data set, this time we have a valid set, and so won’t use cross-validation. Here is the Python code:

m = h2o.estimators.H2ORandomForestEstimator(model_id="RF_defaults")
m.train(x, y, train, validation_frame=valid)

And the R code:

m <- h2o.randomForest(
  x, y, train, model_id = "RF_defaults", validation_frame = valid
  )

That code takes about two minutes to run on my machine; all eight cores were used equally at almost 100%.

Note

If you look in the model information you might see it says 500 trees were made—not the 50 that the default settings requested! What is going on is that with a multinomial classification, for both random forest and GBM, one internal tree is built per output class. (Binomial and regression tree models have just the one internal tree per requested tree.) Each internal tree is predicting how likely a value is in that class, and we have 10 classes. Those 10 internal trees each produce a probability. The class with the highest probability is the one that is chosen as the prediction (but all the probabilities are returned if you want to do something more sophisticated with them).

If you look at the confusion matrix (over on Flow, or from R with h2o.confusionMatrix(m, valid = TRUE), or m.confusion_matrix(valid) on Python) you will see it has done rather well: it got only 370 out of the 10,000 validation samples wrong.

Tip

I use this little MNIST-specific helper in Python, to both quickly view the confusion matrix, and get rid of all the annoying “.0” on the end of the counts:

def cm(m, data=valid):
  d = m.confusion_matrix(data).as_data_frame()
  d[list("0123456789")] = d[list("0123456789")].astype(int)
  return(d)

0

1

2

3

4

5

6

7

8

9

Error

Rate

0

1004

0

0

1

1

2

3

1

3

2

0.012783

13 / 1,017

1

0

1083

7

1

2

0

2

3

1

2

0.016349

18 / 1,101

2

4

5

941

4

4

0

3

2

10

2

0.034872

34 / 975

3

0

0

16

976

4

7

0

13

12

5

0.055179

57 / 1,033

4

1

1

3

0

910

0

3

1

4

26

0.041096

39 / 949

5

2

2

3

14

1

897

12

1

5

3

0.045745

43 / 940

6

4

0

1

0

0

12

935

0

2

0

0.019916

19 / 954

7

1

0

13

2

8

1

0

1002

2

17

0.042065

44 / 1,046

8

9

3

7

11

4

7

3

1

931

10

0.055781

55 / 986

9

2

0

2

14

15

2

0

5

8

951

0.048048

48 / 999

10

1027

1094

993

1023

949

928

961

1029

978

1018

0.0370

370 / 10,000

When you look at h2o.hit_ratio_table(m, valid=TRUE) (in Python, m.hit_ratio_table(valid=True)), it is a bit less impressive:

    k hit_ratio
1   1  0.963000
2   2  0.988400
3   3  0.993100
4   4  0.996500
5   5  0.997500
6   6  0.998000
7   7  0.998500
8   8  0.998600
9   9  0.998600
10 10  1.000000

There were 10,000 samples. You can interpret the first value of 0.9630 as: “On the first guess it got 9630 correct and 370 wrong.” The second value is 0.9884, and (0.9884 - 0.9628) * 10000 = 256, meaning it got another 256 right on its second guess. And so on. The “0.9986” in the ninth row means there were (1 - 0.9986) * 10000 = 14 that it still hadn’t got after nine guesses.

h2o.performance(m, test) is how to evaluate it on the test data. Running this it told me 327 wrong, so in fact it did slightly better than the 370 score on the validation test data. And there were only 7, not 14, that it couldn’t get after nine guesses (the h2o.performance() function outputs all this information).

MNIST: Tuned Random Forest

In our default random forest the max_depth of each tree was 20, but almost every tree was banging into that limit. So increasing that parameter is an obvious tuning idea. And giving it more trees is another sensible idea. But both of those mean it will take longer to learn each model. So, I start by adding early stopping, trying to find a compromise between not killing a model before it gets chance to shine, and not taking too long to compute!

stopping_tolerance = 0.0001,
stopping_rounds = 3,
score_tree_interval = 3,

These say that if it hasn’t improved by at least 0.01% over the last 9 trees, then stop and call it a day. With that in place, we can increase ntrees from 50 to 500 (and hope it doesn’t use all 500 each time).

The other thing for the initial grid is seeing if min_rows is important. It wasn’t important with the building energy data, but that data set only had 768 rows; now we have 50,000 rows. So I tried min_rows of 1, 2, and 5. And max_depth of 20, 50, and 120 was tried. I also used two random seeds to get a feel for sensitivity to randomness.

It ran for over an hour before I stopped it early. max_depth of 50 and 20 were almost identical, with 50 just a fraction better each time, and the couple of max_depth = 120 models that completed were exactly identical to the depth 50 ones. The min_rows = 5 was distinctly worse, while min_rows = 1 was fractionally better than 2. Each model used between 66 and 132 trees.

The best model from the first grid is better than the default model, though not by anything amazing: 96.62% correct on the first guess, compared to 96.3% with default settings; in other words, a net improvement of 32 more samples correctly recognized. (Remember that random variations mean you are likely to see slightly different results.)

What else might we try? The 10 classes are not perfectly balanced, but are not far off, so there should be no need to weight any training rows. What about sampling? mtries defaults to the (rounded-down) square root of the number of columns, and the square root of 784 is 28. (With the enhanced data, the square root of 898 columns, rounded down, is 29—close enough not to make much difference.) We could try a higher number. We could try fiddling with col_sample_rate_per_tree and sample_rate too. And with those changes, maybe different values for min_rows and max_depth have become more important, so try a couple each for those:

g2 <- h2o.grid("randomForest", grid_id = "RF_2",
  search_criteria = list(
   strategy = "RandomDiscrete",
   max_models = 20  #Of the 108 possible
   ),

  hyper_params = list(
    min_rows = c(1, 2),
    mtries = c(28, 42, 56),
    col_sample_rate_per_tree = c(0.75, 0.9, 1.0),
    sample_rate = c(0.5, 0.7, 0.9),
    max_depth = c(40, 60)
    ),

  x = x, y = y, training_frame = train,
  validation_frame = valid,
  ntrees = 500,
  stopping_tolerance = 0.0001,
  stopping_rounds = 3,
  score_tree_interval = 3
  )

Altogether that gave 108 possible models, and I set it to stop after 20 models. However, after about 2.5 hours (!) I decided the 17 models were enough so I canceled it at that point. The results are shown next:

sample_rate min_rows max_depth mtries col...tree  logloss
        0.9        2        60     42        0.9  0.21656
        0.7        2        60     56        0.9  0.22380
        0.7        2        60     56          1  0.22463
        0.9        1        40     28       0.75  0.22666
        0.9        1        40     28        0.9  0.22854

        0.5        2        40     56          1  0.23525
        0.5        2        60     56        0.9   0.2388
        0.5        1        60     56        0.9  0.24018
        0.5        1        60     56       0.75  0.24054
        0.7        1        40     28          1  0.24193
        0.5        2        60     42          1  0.24331

        0.5        1        60     42       0.75  0.25012
        0.5        2        60     28          1  0.25705
        0.5        1        40     28          1  0.25745
        0.5        2        60     28       0.75  0.26080
        0.5        1        40     28       0.75  0.26168
        0.5        1        60     28       0.75  0.27034

This RandomDiscrete grid search has chosen 0.5 for the sample_rate parameter 11 times out of 17 (compared to only three times each for 0.7 and 0.9); these things happen with random processes. But, they have all done worse than than 0.7 and 0.9 so I think we can confidently say sample_rate=0.5 is a bad choice. If so, I will often mentally remove them, so the grid results now look like this:

sample_rate min_rows max_depth mtries col...tree  logloss
        0.9        2        60     42        0.9  0.21656
        0.7        2        60     56        0.9  0.22380
        0.7        2        60     56          1  0.22463
        0.9        1        40     28       0.75  0.22666
        0.9        1        40     28        0.9  0.22854
        0.7        1        40     28          1  0.24193

That looks clear-cut! min_rows of 2 is better than 1, max_depth of 60 is better than 40, and mtries of 28 is not good? Maybe, but the unfortunate random sampling hurts us here too: every time we had min_rows of 1 we also had max_depth of 40 and also had mtries of 28. Maybe only one of these is significant, and the other two are just along for the ride? Second and third best differ only by the col_sample_rate_per_tree value, as do fourth and fifth. So maybe 0.75 is superior to 0.9 is superior to 1.0? But the logloss is very close in each case.

For a third grid I varied mtries (42 or 56) and sample_rate (0.7 or 0.9) and then tried with two different random seeds, keeping min_rows constant at 2 and max_depth constant at 40. (The results given here also merge in the best five from the previous grid.)

sample_rate seed min_rows max_depth mtries logloss
        0.9  999      2        40     56   0.20486
        0.9  999      2        40     42   0.21214
        0.9  101      2        40     42   0.21454
        0.9  300      2        60     42   0.21656
        0.9  101      2        40     56   0.22024
        0.7  999      2        40     56   0.22317
        0.7  350      2        60     56   0.22380
        0.7  400      2        60     56   0.22463
        0.7  101      2        40     56   0.22358
        0.9  450      1        40     28   0.22666
        0.9  500      1        40     28   0.22854
        0.7  999      2        40     42   0.23153
        0.7  101      2        40     42   0.23511

It looks like a sample_rate of 0.9 is better than 0.7; the jury is still out on if a higher mtries has an effect, but I’m going to go with 56.

I will go with the first model in that grid as the Chosen One. The grid was called g3; I can fetch and evaluate that first model on the test data with:

bestModel <- h2o.getModel(g3@model_ids[[1]])
h2o.performance(bestModel, test)

It gets 305 wrong out of 10,000, which is 22 better than the default model. (Second guess, third guess, etc. are all slightly better too.)

Enhanced Data

If I repeat the default random forest model, but use the enhanced MNIST data (the extra 113 columns), it gets 355 wrong in the validation set and 326 wrong in the test set. So, only a smidgeon better than on the pixel-only data.

When I tried that final grid with the enhanced MNIST data I got these grid results:

  sample_rate seed min_rows max_depth mtries col_sample_rate_per_tree     logloss
1         0.9  999        2        40     56                      0.9  0.19946616
2         0.9  101        2        40     56                      0.9  0.20147704
3         0.9  101        2        40     42                      0.9  0.20494049
4         0.9  999        2        40     42                      0.9  0.20636335
5         0.7  101        2        40     56                      0.9  0.21045509
6         0.7  999        2        40     42                      0.9  0.21215725
7         0.7  101        2        40     42                      0.9  0.21510395
8         0.7  999        2        40     56                      0.9  0.21713063

We get similar conclusions (0.9 better than 0.7, 56 better than 42), and slightly better logloss results. However, the following confusion matrix on the best model shows it was one worse, with an error rate of 306 instead of the earlier 305 when using just the raw pixels. Bad luck?7

Confusion Matrix: vertical: actual; across: predicted
         0    1    2    3   4   5   6    7   8    9  Error           Rate
0      971    0    0    0   0   1   4    1   3    0 0.0092 =      9 / 980
1        0 1124    3    2   1   1   3    0   1    0 0.0097 =   11 / 1,135
2        7    0  996    7   1   0   1    7  13    0 0.0349 =   36 / 1,032
3        0    0    8  971   0   4   1   14   8    4 0.0386 =   39 / 1,010
4        0    0    3    0 948   0   4    0   6   21 0.0346 =     34 / 982
5        3    0    1   12   1 859   6    2   5    3 0.0370 =     33 / 892
6        4    3    1    0   3   6 937    0   4    0 0.0219 =     21 / 958
7        1    5   12    3   2   0   0  992   2   11 0.0350 =   36 / 1,028
8        6    1    3    7   3   6   4    2 938    4 0.0370 =     36 / 974
9        5    6    4   13  12   2   0    4   5  958 0.0505 =   51 / 1,009
Totals 997 1139 1031 1015 971 879 960 1022 985 1001 0.0306 = 306 / 10,000

Top-10 Hit Ratios:
    k hit_ratio
1   1  0.969400
2   2  0.990900
3   3  0.995200
4   4  0.998000
5   5  0.998700
6   6  0.999300
7   7  0.999400
8   8  0.999400
9   9  0.999400
10 10  1.000000

The results of all four learning algorithms will be compared in “MNIST Results” in the final chapter of this book.

Football: Default Random Forest

Our third data set is a time series, football results (see “Data Set: Football Scores”), and we have phrased it as a binomial classification: estimate if the home side will win or not. We have two alternatives for the fields to learn from: with or without the bookmaker odds, that is, with or without expert predictions. It is expected to be tougher without the odds.

If you are following along, run either Example 3-6 or Example 3-7 from the earlier chapter, which sets up H2O, loads the data, and has defined train, valid, test, x, xNoOdds, and y. This is the first time we have met a binomial model, and AUC will be our main metric (see “Binomial Classification” for a reminder).

Before we go any further I want to introduce a helper function. Because we are going to be making multiple models, I will often want to analyze them side by side. Example 5-1 shows how to compare metrics for multiple models, on each of our data sets: train, valid, and test. It returns a 3D array, which we can then slice up.

The function is a bit long, but worth studying as it shows both how to use H2O’s built-in functions, such as h2o.auc(), as well as how to hack out the information you want when there is no built-in function: I used str(m) to poke around in the objects, to find the variable names I needed.

The information it needs for the train and valid data is found inside of H2O’s model object; whereas it has to call h2o.performance() to get the same numbers on the test data. The default for labels uses the model IDs. You will see how to use this function later in this chapter. There is a similar function in Python in the online code.

Example 5-1. Comparing metrics for multiple models (in R)
compareModels <- function(models, test, labels = NULL){
#Use model IDs as default labels, if not given
if(is.null(labels)){
  labels <- lapply(models, function(m) m@model_id)
  }

res <- sapply(models, function (m){
  mcmsT <- m@model$training_metrics@metrics$max_criteria_and_metric_scores
  mcmsV <- m@model$validation_metrics@metrics$max_criteria_and_metric_scores
  maix <- which(mcmsT$metric=="max accuracy")  #4 (at the time of writing)
  th <- mean(mcmsT[maix, 'threshold'],  mcmsV[maix, 'threshold'] )

  pf <- h2o.performance(m, test)
  tms <- pf@metrics$thresholds_and_metric_scores
  ix <- apply(outer(th, tms$threshold, "<="), 1, sum)
  if(ix < 1)ix <- 1  #Use first entry if less than all of them

  matrix(c(
    h2o.auc(m, TRUE, TRUE), pf@metrics$AUC,
    mcmsT[maix, 'value'], mcmsV[maix, 'value'], tms[ix, 'accuracy'],
    h2o.logloss(m, TRUE, TRUE), pf@metrics$logloss,
    h2o.mse(m, TRUE, TRUE), pf@metrics$MSE
    ), ncol = 4)
  }, simplify = "array")

dimnames(res) <- list(
  c("train","valid","test"),
  c("AUC","Accuracy","logloss", "MSE"),
  labels
  )

res
}
Note

Close study of that code will show it gets its threshold for test accuracy by averaging the values for train and valid results, but that the train and valid results instead use maximum accuracy. This means the accuracy numbers for test and valid will be slightly overstated, compared to those for test. See Figure 4-2 in Chapter 4 to get a feel for the difference.

I will train two models, the first with all fields (x), the second excluding the odds data (xNoOdds). See Examples 5-2 and 5-3.

Example 5-2. Two default random forest models, in R
m1 <- h2o.randomForest(x, y, train,
  model_id = "RF_defaults_Odds",
  validation_frame = valid)

m2 <- h2o.randomForest(xNoOdds, y, train,
  model_id = "RF_defaults_NoOdds",
  validation_frame = valid)
Example 5-3. Two default random forest models (Python)
m1 = h2o.estimators.H2ORandomForestEstimator(model_id="RF_defaults_Odds")
m1.train(x, y, train, validation_frame=valid)

m2 = h2o.estimators.H2ORandomForestEstimator(model_id="RF_defaults_NoOdds")
m2.train(xNoOdds, y, train, validation_frame=valid)

It finished relatively quickly: about one-tenth of the time it took random forest to train on the MNIST data. Here is how compareModels() can be used to compare the AUC and accuracy scores of each of m1 and m2 on each of the three data sets:

res <- compareModels(c(m1, m2), test)
round(res[,"AUC",], 3)
round(res[,"Accuracy",], 3)

The results show AUC first, then accuracy underneath: they are easy to confuse on this particular data set, as the numbers are close:

      HomeWin HW-NoOdds
train   0.552     0.556
valid   0.637     0.601
test    0.604     0.581

      HomeWin HW-NoOdds
train   0.556     0.561
valid   0.634     0.609
test    0.609     0.599

Our benchmark numbers (the linear model, using just the average bookmaker odds) were an AUC of 0.650 and an accuracy of 0.634 on predicting home wins, and we are well below that here. A comparison of how all models did is in the final chapter; see “Football Data”.

Incidentally, at the top of the summary, it says it made 50 trees, with a max_depth of 20, but it also says the min_depth is 20. Perhaps we should try allowing deeper trees when we try tuning?

Football: Tuned Random Forest

Out of the box, random forest did not do too great at predicting football scores. For this section I will focus on the easiest problem (predicting a home win, when including the expert opinion fields as input fields), and then hope whatever we learned from that will apply to the other model.

As with the other data sets, the first thing we will do is use early stopping. The following parameters say that if there are four scoring rounds with zero improvement on the AUC metric, then stop:

stopping_metric = "AUC", stopping_tolerance = 0, stopping_rounds = 4

That gives us the freedom to request lots of trees, and means one less parameter to tune:

ntrees = 500

The hyper-parameters to try are:

  • max_depth: I will try 20, 40, and 60.

  • mtries: There are 58 columns, so the default (the square root, rounded down) is 7 columns. I will try 5, 7, and 10.

  • col_sample_rate_per_tree: 0.9 or 1.0.

  • sample_rate: 0.5, 0.75, and 0.95.

  • min_rows: 1, 2, and 5.

In search_criteria I set strategy = "RandomDiscrete", and then set max_models to be 54, which is one-third of the combinations, though I interrupted it after 38 models.

By default g1 is giving me logloss, but I want to see AUC, so I will use this code:

g1_auc <- h2o.getGrid(g1@grid_id, sort_by="auc", decreasing = TRUE)
range(g1_auc@summary_table$auc)

The AUC for the 38 models is quite narrow: from 0.644 to 0.668. However, our default model only managed 0.637. When all models in the first grid (which tends to involve quite a bit of guesswork) are better, I get suspicious. The explanation, in this case, is that we moved from the default 50 trees to a generous 500 trees. Even though early stopping means all 500 are never used, it does now get enough trees.

I won’t show the full grid results, and instead will pick out the highlights:

  • The best models use min_rows = 5; in contrast the min_rows = 1 entries are mostly at the very bottom.

  • sample_rate = 0.5 is always in the top half. 0.75 is also good, but 0.95 has consistently done badly.

  • There is no obvious pattern for any of max_depth (hinting that max_depth = 20 is sufficient), mtries, or col_sample_rate_per_tree.

Because our experimental high value for min_rows did best, my next grid will try even higher values (10 and 15). Similarly, a sample_rate below 0.5 will be tried and the 0.95 value dropped (0.45, 0.55, 0.65, 0.75). It took just over 20 minutes to make 24 models.

I’ll save you those results, too, because I found that four of the best five models used sample_rate = 0.45 and four of the best five models used min_rows = 15. So I made another grid, adding two lower sample_rate values (0.25, 0.35), and two higher min_rows values (20, 25). Again, the highest values of min_rows have come out on top, so repeat again! And again 30, 35, and 40 came out on top, so I repeated again. With each new grid I also removed some of the under-performing values, to keep the combinatorial explosion under control.

Even though the best three models come from the highest values of min_rows that I tried (50 and 60), so I could keep trying higher values, I decided to stop at this point because the AUC of the top 8 models are all within 0.001, and I’m getting diminishing returns:

    sample_rate min_rows mtries        auc
110        0.25       50      7  0.6777874
111        0.35       60      5  0.6776425
112        0.55       60      5  0.6776103
70         0.35       35      5  0.6774847
114        0.35       40      5  0.6774416
71         0.35       30      5  0.6771522
116        0.45       60      5  0.6771512
72         0.35       40      5  0.6768411

What about accuracy? Let’s try that compareModels() function; here is how to use it to compare just the top models from a grid:

d <- as.data.frame(g@summary_table)
topModels <- lapply(head(d$model_ids), h2o.getModel)
res <- compareModels(topModels, test)
round(res[,"AUC",], 3)
round(res[,"Accuracy",], 3)

For the best three models, that code gave these results, AUC, then accuracy (I’ve edited the output to show the hyper-parameters for each model):

    sample_rate:  0.25        0.25       0.35
       min_rows:    50          40         60
         mtries:     5           7          5

train            0.612       0.613      0.612
valid            0.678       0.678      0.678
test             0.646       0.646      0.646

train            0.591       0.591      0.591
valid            0.650       0.649      0.651
test             0.634       0.633      0.635

By the rules, we would choose the third model as it gave the best result on the validation data; luckily that also gave the best result on the test data. The AUC of 0.646 is worse than the benchmark linear model (0.650), but the accuracy of 0.635 just beats the benchmark of 0.634. Not enough to justify the extra effort and complexity of a random forest, but at least it shows it is competitive. And we have significantly improved on the default random forest model results (AUC was 0.604, accuracy was 0.609).

A reminder that “Football Data”, in the final chapter, will compare the results of all models.

Summary

Random forests models are generally quick to build, and give effective results on most problems. There are not too many tuning knobs and, looking back over this chapter, the most effective technique was to increase ntrees, in combination with early stopping. Increasing max_depth was also effective.

Random forest is not the only way that the basic decision tree idea has been improved, though, and the next chapter will look at an alternative approach.

1 I used a seed of 999, both here and for the data split. But you can still see different results if you are using a different version of H2O.

2 See code/makeplot.building_energy_results.R, in the online code repository, for the code used to make this chart.

3 To stop this chart from being hopelessly crowded, only the first 75 test samples are plotted.

4 Probably because I set a random seed: normally you’d expect a bit more variation even when a change of parameter has no effect at all.

5 I also tried MAE, which gives a more distinct gap between the boxes.

6 Arbitrarily using the first of our six models.

7 Yes, actually, it was. If I try with the second best model from the grid, I get an error rate of 288, and would have concluded enhanced data was worth an improvement of 17! The error rate, on test, for the other six models was 310, 292, 312, 308, 320, 313, respectively; so a different random seed can cause a range of at least 18.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.226.187.233