Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 8. Deep Learning (Neural Nets)

Deep learning is the new and trendy name for neural networks, and it sure is trendy! But deservedly so, as it is behind some of the most spectacular advances in AI and machine learning at the moment. Deep learning algorithms tend to be the best performers at problems that humans find easy yet (other) machine learning approaches find difficult, such as pattern recognition. Theoretically they can solve any problem, but they have their downsides too: they can be slow, they are black boxes and cannot explain their thinking, and they struggle a bit with categorical inputs. If your problem is to take a company’s annual transactions and calculate how much tax is owed, a neural net is not the right choice.

If you’ve used another library for neural nets or deep learning, one complaint you won’t have about H2O’s implementation is ease of use. As we saw back in Chapter 1, it takes care of most of the details for you, and you can get good results with a one-liner. Yes, there are still a huge number of parameters to tune but, as we will see in this chapter, the majority of them never need to be touched.

As in the other chapters, we will take look at how they work, but only the parts you need to understand to effectively tune them, then we will go through the parameters, and then dive into using deep learning on each of our data sets, first with defaults, then going through the tuning process.

However, a few special points to note. First, the use of deep learning as an auto-encoder, i.e., unsupervised deep learning, is instead in Chapter 9. Second, there are so many parameters that some of them (the ones I have never needed to touch) are in an appendix at the end of this chapter. Third, deep learning grids can be slow. If you have faithfully followed along with every code example to this point, you may need more patience or more hardware.

Tip

Even if you are sure deep learning is the model for you, it can be worth making some quick models with the other algorithms, if only to get an idea of what a good score on your data set will be. If deep learning is doing better than all the other algorithms, but you seem to have hit a wall, maybe that is just the best that can be done without overfitting. But if it is doing worse, chances are there is still some parameter you can tweak to improve performance.

What Are Neural Nets?

I am supposed to mention the human brain here, but let’s cut to the chase: a neuron is a function that takes multiple numeric inputs and gives out one numeric output. These neurons are organized into layers, and the outputs from all the neurons in one layer become the inputs for each neuron in the next layer.

Note

I am only describing the implementation in H2O, which is called a feed-forward neural network. The H2O implementation is designed to be run in parallel across a cluster on very large data. At the time of writing, H2O does not support GPUs.¹

As Figure 8-1 shows, the very first layer is your data: a training sample, or a test sample, or real inputs once you are in production. And the very last layer is your outputs: the answer. If you are doing a regression (learning a single value) then the output layer will have one neuron. If you are doing a classification then the output layer will have one neuron for each possible answer (and each output value will be a probability for that answer: the answer with the highest probability for your set of inputs is chosen). The layers between the input layer and the output layer are called the hidden layers.

Each neuron in each hidden layer has a weight for each of its inputs, and modifying those weights is how the network learns. (There is also a “bias” input to each neuron, which can be thought of as a weight connected to a constant input; it is also tuned during training.) The functions inside the neuron are discussed later, in “Activation Functions”.

The idea is you start with random weights, then you give it the first training sample and the correct answer (this is supervised learning, remember), calculate the error, and then use that to go back and tweak each of the weights, so that there is a bit less error.² Then you take the second training sample and repeat. Working your way through every piece of training data is called an epoch. You specify the number of epochs³ you want the algorithm to perform.

If you’ve used or studied other deep learning libraries you may have met mini batches. A mini batch of, say, size 32 means it will process 32 training samples, then go back and update the weights in one batch. H2O does not use mini batches, which means it works exactly as just described: the weights get updated after every single sample. However, when run across a multinode cluster, each node in the cluster works independently for one iteration, and at the end of each iteration the network weights are averaged with those on every other node. One iteration can be larger or smaller than an epoch; the parameters that control this are discussed later in this chapter.

Numbers Versus Categories

Everything is a number in a neural net, and they are happiest when all your predictor variables are numeric. For instance, when we try it on the MNIST digit recognition problem there will be one input neuron per pixel, and the value will be the intensity of that pixel. However, when you have a categorical input, each possible value of that category will become one input neuron; one of that set of input neurons will be set to 1 and all the others will be set to 0. This is called one-hot encoding.

Let’s say you have a gender field, and the possible values are “Male,” “Female,” “Unknown.” You also have an age field, which is a number. This means we will have four⁴ input neurons:

Male?
Female?
Gender-unknown?
Age

For a 21-year-old man, the inputs would be:

Male = 1
Female = 0
Gender-unknown = 0
Age = 21

For a 55-year-old, who didn’t answer the gender question, the inputs would be:

Male = 0
Female = 0
Gender-unknown = 1
Age = 55

Luckily the H2O implementation takes care of all this for you: give it a data set with a mix of numeric, integer, and enum variables, and it will do whatever data manipulation has to be done.

Numbers or Categories? (Ordered Factors)

By the way, what if your data had age as a categorical input? For example:

under 18
18 to 24
25 to 39
40 to 59
60+

In R terminology, this is an example of an ordered factor: “under 18” is less than “18 to 24” is less than “25 to 39” is less than “40 to 59” is less than “60+”. Should this be five input neurons to your network, or should you convert it to a single numeric input?

Tricky. In this case I’d leave it as five categories, because it is not obvious how to convert it to a number ([1, 18, 25, 40, 60]? [18, 25, 40, 60, 120]?). And, also, because the intent behind that data wasn’t to know their age, it was a proxy for their lifestyle. However, if the age field was made up of 20 categories, each a nice, clean, five-year range, I’d be much more tempted to convert it to a single value.

Network Layers

The main two things you need to concern yourself with, when using h2o.deeplearning(), are the number of epochs (how long you are willing to spend training) and the shape of the network (the number of layers, and the number of neurons in each of those layers). The more layers and neurons you have, the longer it will take to train (and the slower it will be to use), so you want as few as you can get away with. While theoretically⁵ one hidden layer is enough to represent all your data, whether involving nonlinear relationships or not, in practice it won’t be. Anyway, most people wouldn’t consider it deep learning if you only had one hidden layer!

Some hints for choosing the number of layers and neurons:

For nonlinear problems, start with two layers and see how it does.
The more nonlinear your problem, the more layers you need. If you feel it is just not getting it, try adding another layer. Or more epochs. Or more training data. But if you are up to five layers, a sixth is probably not going to help (and you are looking at a lot of training time).
The more neurons in a layer, the more clearly it will be able to understand the data. If you feel it has the general idea, but is a bit fuzzy, try adding more neurons. Or more epochs. Or more training data.
The more data inputs you have, the more neurons you are likely to need in the first hidden layer. Maybe.
The more output neurons you have, the more neurons you are likely to need in the final hidden layer. Maybe.
The more layers you have, the more likely you are to benefit from a dropout function (described in “Activation Functions”).

Tip

You’ll be able to see if more epochs is going to help or not by looking at the scoring history plot (on Flow, or plot it yourself with data from h2o.scoreHistory(m) in R, or m.scoring_history() in Python). If the line wobbles erratically, and the overall trend is sideways, not down, then more epochs is unlikely to help; try one of the other ideas. If it is merely getting rather flat, you have entered the realm of diminishing returns but, if you don’t mind waiting, then more epochs might give a small improvement.

The time spent training a deep learning model is primarily decided by the number of training samples times the number of epochs. Which is a shame, as more of both is better, though with diminishing returns. But how long you need to spend training (to reach the same point of diminishing returns) is related to the number of weights in your model: the more you have, the more there is to learn. The number of weights between two layers is simply the product of the number of neurons in those two layers. Don’t forget to count the input layer and the output layer.

Consider a 100x100 network, with 2 numeric inputs and 1 output. It has (2 * 100) + (100 * 100) + (100 * 1) = 10,300 weights. If you add a third layer, also with 100 neurons, it goes to (2 * 100) + (100 * 100) + (100 * 100) + (100 * 1) = 20,300 weights. It (hopefully) will understand your problem better, but might need more time to settle down and converge.

Now consider that 100x100 network, but with 10 numeric inputs, and classifying into 3 states. The number of weights is now (10 * 100) + (100 * 100) + (100 * 3) = 11,300. That is, there are 800 more between the input layer and the first hidden layer, and 200 more between the second hidden layer and the output layer, but the total is still dominated by the weights between the two hidden layers.

Now take the same-sized network, but this time with 5 numeric inputs and 5 enum inputs: gender (2 levels), favorite color (10 levels), Myers–Briggs personality type (16 levels), astrology sign (12 levels), and Chinese horoscope sign (12 levels). Yep, we’re making the world’s best online dating site. Now how many weights? Remember the earlier discussion: each enum level becomes an input neuron (plus an extra input neuron in each category to handle missing or unseen values). We now have 5 + 3 + 11 + 17 + 13 + 13 = 62 input neurons, so our total number of weights is (62 * 100) + (100 * 100) + (100 * 3) = 16,500. We still only have 10 input columns in our data, but suddenly we have 50% more weights.

Why should you care? The more weights, the slower training will be. And because if you naively use a factor with 50,000 levels (e.g., zip code), and 1000 neurons in your first hidden layer, you might have 50 million more weights than you were expecting .⁶

Activation Functions

If you remember the diagram of the neuron (Figure 8-1), near the start of this chapter, you know it has lots of (weighted) inputs coming in, and one output going out. The inputs are summed, and then it is the activation function that decides the value of the output.

H2O supports three activation functions:

Rectifier: The most common activation function, and the default. It outputs the sum of its weighted inputs, but clips all negative values to zero. See the upper line in Figure 8-2. This implies that it will be generating quite a few zeros (zeros are good for training deeper networks), but also means that positive values are unbounded.

Tanh: Short for hyperbolic tangent.⁷ This takes an input range of negative infinity to positive infinity and converts that to an output range of –1 to +1. But it varies most rapidly when the sum of the inputs is close to zero. See the lower line in Figure 8-2.
Maxout: This simply outputs the highest (the max) of the inputs, meaning the weighted inputs are used directly, not summed.

You can find claims of “the best” for each of these activation functions, but it does really seem to depend on your data, so wherever possible use a grid search to try all three options. However, Rectifier is generally quicker, so if in doubt go with that. And if you get complaints about numeric instability with Rectifier or Maxout, switch to Tanh.

H2O supports the preceding three activation functions, but there are six possible values for the activation parameter. This is because each of the above has a “WithDropout” variant, which allows using hidden_dropout_ratios to control the rate at which outputs are randomly set to zero, as a regularization technique (to avoid overfitting and give a more robust model).

Parameters

Of all the H2O algorithms, deep learning has the most parameters. You could spend the rest of your life trying to tune them all. Some were introduced in Chapter 4; a lot of the advanced ones have been moved to an appendix at the end of this chapter. From what is left most have been divided into either scoring-related or regularization-related. But the first one that you will often be setting is to describe how many hidden layers there will be and how big each should be:

hidden: Hidden layer sizes. The default is 200,200, meaning there will be two hidden layers, with 200 neurons in each layer. If you give 60,40,20 then you will have three hidden layers, with 60 in the first, 40 in the second, and 20 in the third.

The next parameter sets the mode of operation, deciding if you want an auto-encoder or a supervised network. Auto-encoding with deep learning is covered in “Deep Learning Auto-Encoder” in Chapter 9:

autoencoder: Defaults to false (meaning to do supervised learning). Set this to true if auto-encoding. Usually you should set activation to Tanh at the same time.

Deep Learning Regularization

The idea of regularization was introduced in “Sampling, Generalizing”. For deep learning, there are two main things you can try:

Drop connections when connecting one layer to the next.
Regularization.

The activation function is here, as the choice decides what other parameters you can use:

activation: This parameter has six possible values, because it is doing two things. First you can choose between three activation functions (introduced in “Activation Functions”). Then you choose whether to use dropout or not. In other words, if you want to use hidden_dropout_ratios you must specify one of “TanhWithDropout,” “RectifierWithDropout,” or “MaxoutWithDropout.” The choice of activation cannot be changed when using checkpointing, and also you can’t have a grid with some models using hidden ratios, some not. So, in those cases, I recommend you always use one of the “WithDropout” activation functions, and set hidden_dropout_ratios to 0.0 when you don’t want any dropout.
hidden_dropout_ratios: You specify one ratio per hidden layer. The default of 0.5 means for each training row that is processed through the network, there is a 50% chance a neuron will pass its value on to the next hidden layer, and a 50% chance it will pass on zero. It is ignored if not using an activation function that supports dropout. Hard to choose intuitively, so it is best experimented with in a grid. Try a higher dropout rate in networks with a higher number of layers.
input_dropout_ratio: This ratio says what percentage of the input neurons to feed into the first hidden layer. Unlike hidden_dropout_ratios it can be used with any activation setting. The default is 0.0 (meaning no input dropout). If it is 0.5 then it means for each row in the training data, there is a 50% chance of each feature being used. Rephrasing, it sets half your columns to zero. A different half on each training sample, and on each epoch. This can work well if there is a lot of noise in your data, though 0.5 is quite high. It is a good one to use in a grid, perhaps initially with a wide range, e.g., 0.0, 0.1, 0.2, 0.3, 0.4, and 0.5.
l1: L1 regularization. Also known as lasso regularization. Defaults to 0, and typical values to try are 0.0001 or smaller.
l2: L2 regularization. Also known as ridge regularization. Defaults to 0, and typical values to try are 0.0001 or smaller.
max_w2: An upper limit for the (squared) sum of the incoming weights to a neuron. The default is to have no limit. This is a very direct way to stop weights from growing too big.

Deep Learning Scoring

There are a large number of parameters to control the frequency of scoring. They allow you to control the conflict, or balance, between a few concepts:

You need accurate scores (to judge models, to know when to early-stop).
Time spent scoring is time not spent training.
Regular scoring means finer-grained choice for returning the best model.
Cluster considerations.
Training data size.

The following sidebar walks through how they work together in a realistic example. You may want to keep referring back to it as you read the parameter descriptions.

Deep Learning Scoring Example

I will assume a 3-node cluster, and I will assume 120,000 training rows, with 40,000 on each node. Early stopping defaults to stopping after five scoring rounds with no improvement.

Each node starts training on its 40,000 rows, from random weights, and effectively each is building its own deep learning model based only on the 40,000 rows. After every single row the weights/biases are updated,⁸ but there is no network communication yet.

It does this for one iteration, the length of which is controlled by train_samples_per_iteration. If you left that as the default of –2 it will be decided based on other factors (such as target_ratio_comm_to_comp), but typically it will be a fraction of an epoch for large data sets, and could be dozens of epochs for small data sets. A higher value for score_interval or a lower value for score_duty_cycle can also mean an iteration takes longer.

At the end of the iteration, each node will stop and share its weights/biases with every other node. They get averaged (so that each node now has identical weights). Then each node will score on 3333 of its 40,000 samples (because score_training_samples defaults to 10,000), and then score on all the validation data (because score_validation_samples defaults to zero).

Now a new entry is made in the scoring history of the model. If it is the best model so far, a snapshot of the model is saved. And then each node starts training again, still on just its own 40,000 rows. At the end of the second iteration, the weights/biases are shared and averaged again.

It will consider early stopping for the first time after 10 iterations. It will make an average of the scoring metric for rounds 1 to 5, and the average of the scoring metric for rounds 6 to 10. If the second number is equal to or lower than the first number, it stops. Otherwise it does another iteration (and will compare the average of rounds 2 to 6 against 7 to 11).

If train_samples_per_iteration was 0, each node would do 40,000 samples per iteration. If –1, each node would do exactly 120,000 samples per iteration. If I set it to 240,000 (twice the size of my training data), each node would do 80,000 samples per iteration, and scoring would be every 2 epochs. But, be cautious about setting this higher than say 100,000 (whatever the training data size), as you want your nodes to share their weights with each other fairly frequently. Leaving train_samples_per_iteration as –2 is usually best when using multiple nodes.

train_samples_per_iteration: This controls how many training rows (samples) to use per iteration; an iteration can be thought of as when the model is scored. It is an integer, and normally you will use one of the following three special values. –2 is the default, and lets H2O decide. 0 and –1 mean the same thing when on a single node: score every epoch. See “Deep Learning Scoring Example” for how they differ when using multiple nodes.
score_interval: The minimum time between scoring models, in seconds. The default is 5. If stopping_rounds is also 5, meaning there will be a minimum of 10 scoring rounds, then you know the model will build for at least 50 seconds. By the way, if other parameters (e.g., low score_duty_cycle, high train_samples_per_iteration) mean that scoring rounds are already, say, 30 seconds apart, then changing this from 5 to, e.g., 15 or 20, will have no effect.
score_duty_cycle: How much time to spend scoring, versus training. The range is 0.0 to 1.0, where lower values mean more training, while higher values mean more scoring. The default is 0.1 (10% of the time spend on scoring, 90% on training). If you set it to 0.01 then typically it will be 10 times longer between scoring events which, for instance, might mean you see a scoring history entry every 50 epochs instead of every 5 epochs.
target_ratio_comm_to_comp: Target ratio of communication overhead to computation. The default is 0.05, spending 5% on communication between nodes, and 95% of time on training each node. This only matters for multinode clusters, and also it is only used when train_samples_per_iteration = -2. Lowering it will either mean the scoring rounds are further apart (implying fewer of them), or have no effect at all.
replicate_training_data: Defaults to true. If true then it will replicate the entire training data set on every node in your cluster. For small data sets this can result in faster training.
shuffle_training_data: Defaults to false. If true then training data is randomly sorted. This is recommended if you have set balance_classes (see “Data Weighting” in Chapter 4), for instance.
score_validation_samples: How many of the validation data set rows to use when scoring. The default of 0 means to use them all. If your validation data is large, or you are scoring more frequently, you might want to choose a lower number to speed up scoring (at the expense of accuracy); personally I would instead try to score less frequently. When using cross-validation, the fold that is not used as training data is treated as the validation data, and this setting also applies to that too.
score_training_samples: Like score_validation_samples, but for when scoring on the training data instead of a validation test set. The default is 10,000, to make sure that very large data sets do not make scoring really slow. If scoring frequently, you might want to make this even lower. When using cross-validation, this is only used when making the final model.
score_validation_sampling: Only used when score_validation_samples has been changed from the default of 0. Defaults to “Uniform,” but can also be “Stratified” (which might give better results if doing a classification and the target class is unbalanced).

Tip

If you feel that scoring is happening more frequently than you need, lowering score_duty_cycle or increasing score_interval is often best. Explicitly setting train_samples_per_iteration would also do the job. If you feel early stopping is triggering too early, you could also do any of those, but in that case the best fix is often to simply increase stopping_rounds. For example, using checkpoint to restart a model, with stopping_rounds doubled, works well .

Building Energy Efficiency: Default Deep Learning

You know by now that this is a regression problem, unless you have jumped straight here, in which case you can learn about it at “Data Set: Building Energy Efficiency”. Run either Example 3-1 or Example 3-2 from the earlier chapter, which sets up H2O, loads the data, and defines train, test, x, and y. We are using 10-fold cross-validation, instead of a validation set. (See “Cross-Validation (aka k-folds)” for a reminder.) The cross-validation means it has 11 times the work to do (10 folds, plus making the final model), which is sometimes a problem as deep learning is already quite slow. But this data set is small enough for it to be manageable:

m <- h2o.deeplearning(x, y, train, nfolds = 10, model_id = "DL_defaults")

In Python use:

m = h2o.estimators.H2ODeepLearningEstimator(model_id="DL_defaults")
m.train(x, y, train, nfolds=10)

That took just over 10 seconds to run, with all 8 cores fully used. It gave an average MSE of 8.15 across the 10 folds (with a standard deviation of 1.40), but a lower 6.60 on the test data.⁹

As in previous chapters, Figure 8-3 shows the predictions on the test data. The black circles are the correct answers, the 13 up arrows indicate where it was over 8% too high, and the 34 down arrows indicate where it was over 8% too low. (The small squares are where it was within 8%.)

Notice that, out of the box, this was a worse result than either of random forest or GBM, and only just slightly better than GLM. The results of all models on this data set will be compared in “Building Energy Results” in the final chapter.

Building Energy Efficiency: Tuned Deep Learning

Deep breath. There is so much we can do here. So many parameters, so little time. There is time pressure for another reason: typically the deep learning models take more CPU time than the other algorithms we’ve looked at. That means we want to keep our grid searches as focused as possible. (See “Grid Search”, back in Chapter 5, if grids are new to you.)

As usual, the first thing I want to do is use early stopping (see “Early Stopping”) so that I can give each model lots of epochs, and not have to worry about it. For the first few grids, however, I will be more severe than I was with the other supervised learning algorithms: for deep learning, if it hasn’t improved at least 0.5% over the last three scoring rounds, it stops. My hope is that this severity affects all models in the grid equally.¹⁰ Here are the parameters:

stopping_metric = "MSE",
stopping_tolerance = 0.005,
stopping_rounds = 3,
epochs = 1000,
train_samples_per_iteration = 0,
score_interval = 3,

Note

The default stopping metric, for a regression problem, is “deviance,” which is short for “mean residual deviance,” which is identical to MSE when the distribution is gaussian (which is another default). I decided to specify it explicitly, in case other distributions get tried.

You will see I have also set train_samples_per_iteration to be 0, which means it will score after every epoch (which in this case means after every 600 to 625 training samples). And score_interval has been slightly reduced to 3 seconds, from the default of 5 seconds. This means that early stopping should react more quickly. In particular, because the data set is so small, I was finding the default settings meant it was doing so few scoring events that it kept reaching maximum epochs.

To see the effect of simply adding more epochs, how about we give the previous settings a go, with everything else still set to default? I did, and the results¹¹ were so good, I had to run it again to see if it had somehow just got lucky. So, here is a comparison of the default model, with those two models that used 20 times more epochs:

           Default  Early#1  Early#2
Train-MSE   5.587    0.223    0.092
   CV-MSE  17.510    4.854    4.908
 Test-MSE   7.089    0.580    0.437
   Epochs  11.519  194.000  192.000

The model is way better on all three data sets. The cross-validation models ranged from using 117 to 327 epochs. The standard deviation on the 10 cross-validation model scores was around 0.60. You may be asking why the “CV-MSE” row is so high; see the sidebar at the end of this section (“The CV Metric Mystery”).

Note

If you look at a plot of the scoring history, you might be surprised to see very few entries: no nice curve, just one or two straight lines. This is because of using the combination of cross-validation and early stopping: it can see how many epochs were needed in the cross-validation models, so it uses that many, and switches early stopping off. If you look at any of the 10 cross-validation models you will see more of a curve in the scoring history.

So, from that good start, how much further can we take it? The first thing to experiment with is the network layout: how many layers, and how many neurons in each. I know the problem is nonlinear, so I feel I need two hidden layers, but I don’t see it as so complicated that it will need more than three layers. We have only 18 input neurons¹² so I’m going to try first hidden layer sizes of 54, 162, and 324 (18 times 3, 9, and 18, respectively).¹³ I then tried halving (except 54), doubling (except 324), or keeping the second layer the same size. And where I tried a third layer, I kept it the same size as the second layer. That gave 14 combinations. To speed things up a bit, I dropped from nfolds = 10 to nfolds = 6.

If I lost you with all those combinations, refer to the next table, where the results are shown (ordered by cross-validation results):

        hidden train.mse xval.mse     sd epochs   time
1  324,324,324     0.347    4.368  0.494    177   23.4
2        54,54     0.087    4.712  0.556    982    5.2
3      162,162     0.059    4.744  0.446    525   10.4
4      324,162     0.096    4.788  0.503    236    7.0
5  324,162,162     0.162    4.946  0.479    171    9.9
6    162,81,81     0.270    4.978  0.457    418    8.0
7   54,108,108     0.022    5.005  0.581    518    7.4
8  162,324,324     0.135    5.026  0.322    158   17.5
9  162,162,162     0.152    5.065  0.629    243   10.0
10      162,81     0.035    5.077  0.649    623    7.9
11     162,324     0.087    5.147  0.520    244    9.3
12      54,108     0.015    5.242  0.573    836    6.3
13     324,324     0.131    5.456  0.627    205    8.4
14    54,54,54     0.049    5.983  1.173    720    6.5

Not much to conclude from all that, is there! I actually ran it again and, unhelpfully, everything shuffled around. From these results I feel no strong need to give it more neurons or more layers. However, more epochs might bear fruit, as even with quite strict early stopping a few of the models are hitting the ceiling.

The next thing we want to consider is the best value for activation, and if dropout and/or regularization helps. There are two types of dropout: between the input neurons and the first hidden layer (input_dropout_ratio), and then leaving each of the hidden layers (hidden_dropout_ratios). There are only 8 input columns (18 input neurons) so high values of input_dropout_ratio can be expected to do badly.

Tip

When experimenting with both hidden and hidden_drop_ratios in the same grid, you must use a constant number of layers. If you want to compare, say, some 2-layer networks with 3-layer networks in the same grid, run h2o.grid twice, with the same grid ID each time: once for 2-layer, then again for the 3-layer ones. That is what I will do here.

We will use the following hyper-parameters for the grid:

If 3 layers then just 324,162,162; if 2 layers try both 54,54 and 162,162
RectifierWithDropout, TanhWithDropout, or MaxoutWithDropout
Hidden dropout ratios of 0 (no dropout), 0.1 (a little dropout each time), 0.2 (drop 20%), and 0.5 (drop 50%—this is the default)
Input dropout ratios of 0 (no dropout) or 0.1 (10% of inputs ignored)
L1 regularization of 0 (none) or 0.00001 (1e-05)
L2 regularization of 0 (none), 0.00001 (1e-05), or 0.0001 (1e-04)

If you have already read Chapter 7, you might remember L1 and L2 regularization. L1 regularization, in neural nets, causes the neurons to use fewer of their inputs (the most significant ones, hopefully!); this might make them more resistant to noise (not an issue in this data set, so the expectation is that L1 regularization will not help—that is why I only try one value). L2 regularization reminds me of Tall Poppy Syndrome, because the biggest weights get knocked down, and the smaller weights survive unscathed. It encourages the network to use all its inputs a bit, and not just use a few of them. L1 and L2 seem in conflict, yet one-third of the models in this grid will try both together. We could use two grids to avoid this, but how about we just try it and see what happens?

That was a lot of combinations, so I set the grid search to use strategy = "RandomDiscrete" with max_models = 50 (for each of 2 layer and 3 layer). It took rather a long time.¹⁴

The results were quite clear: the best models used no dropout at all. The top 3 (and 9 of the top 12) were all zeros for the hidden layer dropout. The top 6 (and again 9 of the top 12) were zero for the input dropout. The models with hidden_dropout_ratios=0.5 were definitely the worst performers.

The choice of activation function seemed quite minor. For two hidden layers, Tanh or Maxout, and no hidden dropout. For three hidden layers, Rectifier or Maxout, and either no dropout, or a little. Rectifier is notably quicker than Tanh or Maxout.

For L2 regularization, half the top models use 0.00001, half use 0. So it seems to neither help nor hinder. For L1, it is again about 50-50 between 0 and 0.00001; however 0.0001 seems to make results worse.

At this stage I am quite happy with this, and don’t feel the need to try other parameters. I’m going to take our best model, and run it again, with nfolds=10, and less severe early stopping, and allow it up to 2000 epochs, then use that on the test data:

m <- h2o.deeplearning(
  x, y, train,
  nfolds = 10,
  model_id = "DL_best",

  activation = "Tanh",
  l2 = 0.00001,  #1e-05
  hidden = c(162,162),

  stopping_metric = "MSE",
  stopping_tolerance = 0.0005,
  stopping_rounds = 5,
  epochs = 2000,
  train_samples_per_iteration = 0,
  score_interval = 3
  )

This model ended up using “only” 479 epochs. On the training data it managed an MSE of 0.148, both the best we’ve seen in this book. “Best on training data” can be another way to say “most overfitted,” but that is not the case here, as it also gives excellent results on the test set; here it is shown next to those we got earlier:

           Default  Early#1  Early#2    Best
Train-MSE   5.587    0.223    0.092    0.148
   CV-MSE  17.510    4.854    4.908    4.619
 Test-MSE   7.089    0.580    0.437    0.434
   Epochs  11.519  194.000  192.000  196.000

Yes, our Best model is best, but only just: all the benefit came from giving more epochs. In fact I’ve made this model four times now, and the test-MSE has been 0.425, 0.434, 0.581, and 0.605. Out of the 143 samples, just three are more than 8% too low, and none were too high. Figure 8-4 shows just the first 75 samples, and has just one down triangle.

Note

The model took 3 to 4 minutes to train (including making the 10-fold cross-validation). While it is the best performer, it is also the most CPU-intensive.

The CV Metric Mystery

Earlier I promised I’d take a look at why the cross-validation results are so different. These numbers are from a run when the MSE on the test set was 0.425 (slightly better than shown elsewhere).

Here are the MSEs for the 10 cross-validation models:

3.486  5.882  4.903  4.172  5.186
5.232  4.087  3.328  6.048  4.285

The mean is 4.661, and the standard deviation is 0.63.

Under the model information I see another, slightly different, MSE metric: 4.724274. It is described as “10-fold cross-validation on training data (Metrics computed for combined holdout predictions.”

But just above that I see “MSE: 0.128,” which is even better than the 0.425 on the test data set. This is the model that was made based on all the training data, and that score is the score when evaluated on all the training data.

This might be a good time to remind you that the cross-validation metrics are solely about scoring your models;¹⁵ they don’t do anything else. They are just thrown away, and the actual model that is returned is trained on 100% of your training data.

This data set is quite unusual in that there is no noise, no repeats: it contains exactly one sample of each building type. Of the 768 rows (in the full data), 143 (20%) are test, 562 (80% x 90%) are (constantly changing) training data, and 63 (80% x 10%) are (constantly changing) validation data. What I think is going on is that 562 is not quite enough to be representative, but when it is given the whole 625 (the full 80%) training rows, it crosses a threshold and is able to make a jump in understanding, and apply these new generalizations to the unseen test data.

Deep learning was the only algorithm that behaved this way: in all the others I saw mean cross-validation MSEs about the same as the MSE on test data.

MNIST: Default Deep Learning

This is a pattern-recognition problem (see “Data Set: Handwritten Digits”, if you are not familiar with it), and so I have high hopes that deep learning is going to give the best results on it. It is a multinomial classification, trying to decide which of the digits 0 to 9 a set of 784 pixels represents.

First run either Example 3-3 or Example 3-4 from the earlier chapter, which sets up H2O, loads the data, and has defined train, valid, test, x, and y (no cross-validation this time because we instead have a validation data set). Then run either Example 8-1 or Example 8-2.

Example 8-1. Default deep learning on MNIST (Python)

m = h2o.estimators.H2ODeepLearningEstimator(model_id="DL_defaults")
m.train(x, y, train, validation_frame=valid)

Example 8-2. Default deep learning on MNIST (in R)

m <- h2o.deeplearning(x, y, train,
  model_id = "DL_defaults", validation_frame = valid)

It took just over three minutes, maxing out all eight cores on my machine. It first tells me it has dropped 67 constant columns (they were the pixels we identified as being zero in all samples when we first looked at the data). So there will be 717 input neurons, rather than 784.

Doing summary(m) (m.summary() in Python) gives details of each layer, followed by metrics such as MSE, then the cross-validation table for each of train and valid. There is a random element, so the exact numbers will be different on each run, but they will be similar. The standout feature of the result, for me, is summarized by Figure 8-5, a screenshot from the Flow interface.¹⁶

There is quite a lot to discuss here, but the big thing is the gap between the blue (lower) line, which is performance on the training data, and the orange (upper) line, which is performance on the validation set.

Those charts are interesting in other ways:

They wobble about, so don’t give up at the first uptick.
For the validation line, logloss and MSE are quite distinctly different (not so much for the training line).
That uptick/downtick at the end. What is that all about?

What it is all about is that the overwrite_with_best_model parameter is true. It is using logloss to judge, and according to logloss the best model was just before epoch 5, so at the end of training it went back and used that model. That explains the downtick in the top line in the logloss chart; the uptick in the lower line is because on our training data it didn’t think that was the best model. In the MSE chart, both jumped up at the end. This difference comes down to logloss and MSE disagreeing about the best model. (See “Classification Metrics” for a reminder about the available metrics for multinomial classifications.)

Let’s quickly run h2o.performance(m, test) to get performance on the test set. So, we have an error rate of 0.55% (train) versus 3.32%/3.04% (valid/test), and MSE of 0.005 versus 0.028/0.027. That ghastly demon, overfitting, has decided to show up and spoil the party. When tuning we will, therefore, want to be very aware of this.

Note

In the model summary, it told me the “Metrics reported on temporary training frame with 9960 samples” for the training data, rather than all 50,000 samples. It was doing this for speed. It gave an error rate of 0.45% and MSE of 0.0039.

If you want to see the results on all 50,000 samples, the command is h2o.performance(m,newdata=train). It was the numbers from that command that I quoted above; they are slightly worse, but close enough to make no real difference.

By the way, just as with the other models on default parameters that we have looked at in this book, the top 10 hit ratios chart shows it was still having trouble getting a few of them right even on its eighth or ninth guess.

Note

You might notice the validation and test results were very close. I did a test on a large (40) and realistic (each taking at least 15 minutes to build) set of tuned models on this data set, and I got 0.97 correlation on the error rate with validation and test results (0.98 for MSE and logloss). This is very good: it means we can tune for improvements in the validation set, safe in the knowledge that those improvements will carry over to our unseen test data. In absolute terms the difference in errors on 10,000 test cases ranged from +18 (model got 18 more right on the validation set) to –19 (model get 19 more correct on the test set), on the better models. The gap was a bit bigger on the weakest half a dozen models (up to –30). That means the stronger the model is, the more reliably validation data metrics indicate performance on test data.

MNIST: Tuned Deep Learning

We are going to be using the enhanced data, so use "load.mnist_enhanced.R" or "load.mnist_enhanced.py" (see “Helping the Models” for more on what was added). First question: what difference did enhanced data make on the default settings, still sticking with the default of just 10 epochs? I tried two more runs on the “raw” MNIST data, and got (validation data set) errors of 417 and 390. (It was 332 on our run in the previous section. This high variance is common when not using enough epochs.) On the enhanced data I got 317 and 313 errors (and 292 with a later run), so over a 20% improvement.¹⁷ The reduction in training set errors was even greater. Summary: deep learning finds it much easier to overfit our enhanced data, but the benefits on unseen data are also significant.

The next thing to try is to use more epochs, and therefore we want to set early stopping to keep computation under control:

stopping_metric = "misclassification",
stopping_tolerance = 0.01,
stopping_rounds = 3,
epochs = 500,
classification_stop = -1,

These settings say it will only stop if there has been less than a 1% improvement, in misclassification,¹⁸ over a sliding window of three scoring rounds. Oh, and if it just keeps on getting better and better, pull the plug after 500 epochs. That is strict: we’ll let it train a bit more once we narrow the best parameters down. classification_stop is zero by default, meaning it stops learning once it perfectly classifies the whole of the training data. But the model will keep improving its validation score even once this happens, so we want it off. Consider always switching it off (setting it to –1) when using early stopping.

Let’s jump straight in, with enhanced data and the early stopping, and see how it does.

I did two runs and they hit the early stopping after 40 and 46 epochs. Both sucked all the marrow out of the training data—in fact the first run got a perfect score when evaluated on the training set. Validation errors were 268 and 272, respectively (out of 10,000). This compares to 317 and 313 when only given 10 epochs, so we’ve got another 17% improvement.

The next thing to think about is how many hidden layers, and how many neurons in each? And what other parameters are going to be important?

The challenge we are setting for this deep learning model is to look at groups of those pixels and form concepts that can be used to decide which digit it is likely to be. Because each layer is fully connected to every neuron in the previous level, as long as you have enough neurons in a hidden layer, it can learn lots of concepts in parallel. However, more layers is also going to help, so it can build more and more advanced concepts on top of lower-level ones.

There is an article on deep learning performance by Arno Candel¹⁹ where he has tuned many parameters on the MNIST data set. Much of that article’s emphasis is on tuning for speed, and most experiments are on just 0.1 epochs (representative for testing speed-ups, not representative for evaluating quality), but I will shamelessly steal what I can. His very best model used 1024,1024,2048 neurons. Think back to that idea of building advanced concepts on lower-level ones, in the previous paragraph—does the final layer need more neurons to handle those advanced concepts, perhaps?

Tip

One practical reason to favor increasing neurons between layers, rather than decreasing, is that we have 898 input neurons, and only 10 output neurons. The number of input neurons gets multiplied by the number of neurons in the first hidden layer. So a 10,20,30 model needs 10,080 weights, but a 30,20,10 model needs 27,840 weights. Fewer weights means quicker training times.

As for other parameters, I already see overfitting on the training data, so I want to tackle that. There are both more training samples, and more columns, than in the building energy data set we previously looked at. So I will experiment with all of the following in my first grid:

l1: L1 regularization. Might help make the model more resistant to noise. Trying 0 (no regularization) and 1e-5 (a little).
input_dropout_ratio: Drop some of the input neurons. 0.1 means it will randomly set to zero 10% of the pixels. A different random set of pixels on each sample. The grid will compare 10% with 20%.
hidden_dropout_ratios: Trying 0%, 10%, and 50%. If 10% then it means at each neuron, for each training sample, there is a 10% chance it will pass on zero to the next layer instead of its actual value.
max_w2: I chose to only use this for the 4-layer networks, and experiment at comparing the default value of infinity (given as Inf in R, and float("inf") in Python), with a value of 20 (arbitrarily chosen).

In this section, I am going to stick with an activation function of Rectifier (WithDropout). The aforementioned article concluded it was better than Maxout, and perhaps slightly better than Tanh, while being notably quicker than Tanh.

You may remember from a previous section that we cannot experiment with hidden_dropout_ratios in a grid unless we fix the number of layers of hidden neurons. But I want this first grid to experiment with 2-, 3-, and 4-layer neural networks. So the grid had to be made in three steps (as long as the same grid ID is chosen, H2O will combine the results for you). In the first step I compared these 2-hidden-layer models. The number in brackets is how many weights.

200 x 200 (221,600)
512 x 512 (727,040)
1024 x 1024 (1,978,368)

In the second step I tried one 3-layer model and, later on, added another one:

400 x 800 x 800 (1,327,200)
1024 x 1024 x 2048 (4,085,760)

And in the third step I tried two 4-layer models:

200 x 200 x 200 x 200 (301,600)
300 x 400 x 500 x 600 (895,400)

The final hyper-parameter I added was seed. This was solely for telling the difference between different runs, not for reproducibility: with H2O’s implementation of deep learning you cannot expect the same model even if you try again with the same seed.

Naturally, with so many combinations of hyper-parameters, I set it to do “RandomDiscrete.” By the way, if you are still following along on a notebook, you are not going to get many models built: this is more a “go away for the weekend” grid, than a “go and get a cup of tea” grid.

So, what did we learn? The clearest thing: l1=0.00001 is way better than 0; all the top 13 models used L1 regularization. Everything else is a bit fuzzier. max_w2 seems to have no effect, and input_dropout_ratio of 0.1 versus 0.2 seems to make little difference (the best model, and four of the top- ive, use 0.2, though).

hidden_dropout_ratios is a bit more confusing. The best model used 0.1 for its 4 layers. We can say that 0.0 is poor: in one direct comparison, 0.5 had 124 validation errors (out of 10,000) while hidden_dropout_ratios=0.0 had 150. But the only direct comparison I got between 0.1 and 0.5 was that 0.5 was better by just 3 (an error rate per 10,000 of 167 versus 170).

So, what about hidden neurons? Table 8-1 shows the best 10 models, with their validation errors (per 10,000), and also the logloss on the validation data set. The right two columns show the relative time²⁰ it took to build the model, and the number of epochs; the asterisk marks those that hit the limit, and didn’t early-stop.

Table 8-1. Best 10 models from first grid
Hidden	Errors	Logloss	Time	Epochs
300,400,500,600	121	0.0701	796	210
1024,1024	124	0.0602	2079	410
1024,1024,2048	126	0.0720	3382	459
400,800,800	127	0.0673	1266	501*
1024,1024	134	0.0655	1660	200
400,800,800	137	0.0646	1419	272
512,512	137	0.0634	943	476
1024,1024	150	0.0671	2249	240
300,400,500,600	161	0.0699	1037	501*
300,400,500,600	167	0.0820	1099	501*

Well, no clear conclusion about which is best! Another way to view this table is that all of these hidden layer alternatives have about the same potential to learn. However, the 4-layer model has the best score (even if it also has the two worst scores), telling me it has the capacity to learn well on this data, so I will go with that.

The final model is shown next, and this code shows how I am giving it a less strict early-stopping criteria (not just lowering the tolerance, but increasing stopping rounds—important if it is going to wobble noisily on its way to improving) and up to 2000 epochs:

DLt <- h2o.deeplearning(x, y, train, validation_frame = valid,
  model_id = "DL_tuned", seed = seed,
  hidden = c(300,400,500,600),
  activation = "RectifierWithDropout",
  l1 = 0.00001,
  input_dropout_ratio = 0.2,
  hidden_dropout_ratios = c(0.1, 0.1, 0.1, 0.1),

  classification_stop = -1,
  stopping_metric = "misclassification",
  stopping_tolerance = 0.001,
  stopping_rounds = 8,
  epochs = 2000
  )
)

Tip

Giving a “best” model from a grid some less strict early-stopping criteria is a great time to use checkpoints (see “Checkpoints”); in this case that would have given me a 796-epoch head-start, and guaranteed a model with at least the score it got in the grid.

It took approximately two hours to build on my machine,²¹ and ended up with a validation score of 130 (not as good as in the grid—see the tip for what I should have done), and on the test data it had 138 errors: easily the best of the models we have built in this book. It used 642 epochs in the end, though the model it returned was actually from epoch 275 (at 26 minutes), so it spent a lot of time bouncing around after that. The results of all four learning algorithms will be compared in “MNIST Results” in the final chapter of this book. There is also a section in that chapter on how to improve the result further.

Football: Default Deep Learning

This data set, predicting football results based on a mix of recent performance and betting odds (see “Data Set: Football Scores” in Chapter 3), has a lot of noise, so a bit like the MNIST data set we can perhaps expect dealing with overfitting to be our main challenge? But it is also a different kind of noise: in the MNIST data all the clues were there, and a human could expect to score over 99.5%. With the football result prediction, the human experts only reach an accuracy of 0.634.²² That is, the clues are not all there.

If you are following along on your own machine, run either Example 3-6 or Example 3-7 from the earlier chapter, but make sure you load the csv files with the missing data removed/patched,²³ i.e., football.train2.csv, football.valid2.csv, and football.test2.csv. Deep learning would otherwise be handicapped by missing data. Running those listings will set up H2O, load the data, and define train, valid, test, x, xNoOdds, and y. Because valid is defined, cross-validation will not be used with this data set.

As in other chapters we are making two versions of the model, and then using compareModels() (see Example 5-1 in Chapter 5; it is also found in football_helper.R in the online code) to get some metrics on them. We are trying to predict if each match will be “home-win” or “draw-or-away-win.” We try it two ways:

Pre-match team strength estimates + pre-match bookmaker odds
Just the pre-match team strength estimates

It is a binomial classification so we are focusing on the AUC and accuracy scores:

m1 <- h2o.deeplearning(x, y, train,
  model_id = "DL_defaults_Odds",
  validation_frame = valid, seed = seed)

m2 <- h2o.deeplearning(xNoOdds, y, train,
  model_id = "DL_defaults_NoOdds",
  validation_frame = valid, seed = seed)

The AUC scores are summarized here:

         Odds    NoOdds
train   0.647     0.607
valid   0.672     0.630
test    0.645     0.606

And the accuracy results:

      HomeWin HW-NoOdds
train   0.612     0.586
valid   0.648     0.617
test    0.624     0.601

You can see the results are not great, but not bad. We do not seem to have any overfitting; instead what this seems to show is the validation data set is a bit easier to make predictions on. Remember that this was time-series data, so the three data sets are consecutive in time, not randomly sampled. It may be that the particular seasons we use for valid and test were more predictable, e.g., fewer upsets, more matches following the form book. And vice versa: sometimes what smells like overfitting can just be that the test data set has more noise.

“Football Data” in the final chapter will compare the results of all models.

Football: Tuned Deep Learning

The first thing is to allow more epochs, but to keep that under control by using early stopping. As you have already seen in this chapter, this can be the biggest improvement we will get, so let’s first try that and nothing else. I will try this for both models. Here is the code for the second model; pay attention to the last four lines:

m2es <- h2o.deeplearning(xNoOdds, y, train,
  model_id = "DL_ES_NoOdds",
  validation_frame = valid, seed = seed,
  replicate_training_data = TRUE,

  stopping_metric = "AUC",
  stopping_tolerance = 0.01,
  stopping_rounds = 3,
  epochs = 1000
  )

That is saying it can have 1000 epochs (100 times more effort than the default model), but if it goes three scoring rounds with less than a 1% improvement in the AUC metric, then stop.

Tip

I’ve also set replicate_training_data to true; because our data set it fairly small, this will speed things up if you run on a multinode cluster, and won’t have any effect if not.

Running with more epochs, on each model, gave these results (AUC):

         Odds    NoOdds
train   0.622     0.584
valid   0.632     0.605
test    0.601     0.577

And these accuracy numbers:

         Odds    NoOdds
train   0.593     0.571
valid   0.637     0.611
test    0.588     0.589

In the case of m2es it ran for about 15 times as long as the default, and did 832 epochs instead of 10. So why are the results so disappointing? The scoring history chart (again, for m2es) can answer that; see Figure 8-6 (a screenshot from the Flow interface).

As it got better at learning the training data (bottom line), it got worse at the validation data (the top line). At the end it chose the best model on the validation set, which was the first one it scored. Incidentally, over in the training data results the AUC had reached 0.9844 by epoch 331; at that point the AUC on the validation set was 0.5589, much worse than the 0.6323 it returned.

I will come back to tackling those diverging training/validation results, but next let’s consider the number of hidden layers, and the number of neurons. All of the previous default models used two hidden layers, with 200 neurons in each. I will experiment with just one of our models: I’ve chosen the second model (predicting a home win, but without the help of the betting odds). At the end we will apply it to the data with the betting odds, and hope the best hidden layer choice applies equally well to that.

I have three questions:

Are three layers better than two?
Are more neurons needed in the first layer?
Are more neurons needed in the final layer?

To answer those three questions, the three topologies I will try are 200x200x200, 400x200, and 200x400, respectively. The first one adds 40,000 weights (200x200). We have N inputs, so the second one adds 200*N weights. And we have two outputs²⁴ so the third one adds just 200*2 weights.

But, I want to consider dropouts at this point too. No, not those nerdy losers who drop out of school to start massively successful companies. If you’d been paying attention you’d know I mean these two parameters:

hidden_dropout_ratios
input_dropout_ratio

Rather than use a grid, I will try 0.3 for input_dropout_ratio (throw 30% of the inputs away each time) and 0.5,0.3 for hidden_dropout_ratios (drop 50% of the outputs from the first hidden layer, then drop 30% of each subsequent layer).

I will go with RectifierWithDropout for the activation function. On a hunch. Why not Tanh? Because it gives similar results to Rectifier but is slower. Why not Maxout? Because I don’t think summing inputs will be so useful here. But mainly because I don’t think the type of activation function will matter too much, and I want to focus my CPU cycles elsewhere. If you experiment and discover one of the others is better, let me know.

Though I am using dropouts, I will use L1 and L2 regularization too. I’ll try 0.0005 (5e-4) for each. And set balance_classes to true. And I’m going to set shuffle_training_data to true, too, because the documentation tells me to do that when I set balance_classes.

To get a baseline measurement, so the network topologies can be fairly evaluated, I first try all those new parameters with the default of two hidden layers, with 200 neurons in each.

Example 8-3 is what that looks like. Later versions will just change the hidden line:

Example 8-3. Deep learning first tuned model

m2_200x200 <- h2o.deeplearning(xNoOdds, "HomeWin", train,
  model_id = "DL_200x200",
  validation_frame = valid, seed = seed,

  replicate_training_data = TRUE,
  balance_classes = TRUE,
  shuffle_training_data = TRUE,

  hidden = c(200,200),

  activation = "RectifierWithDropout",
  hidden_dropout_ratios = c(0.5, 0.3),
  input_dropout_ratio = 0.3,
  l1 = 0.0005,
  l2 = 0.0005,

  stopping_metric = "AUC",
  stopping_tolerance = 0.01,
  stopping_rounds = 3,
  epochs = 1000
  )

Here are the results, compared with the default m2 model, and the version that was just given more epochs. First AUC:

      Defaults MoreEpochs Drop+Reg
train    0.607      0.584    0.594
valid    0.630      0.605    0.630
test     0.606      0.577    0.604

Then accuracy:

      Defaults MoreEpochs Drop+Reg
train    0.586      0.571    0.568
valid    0.617      0.611    0.619
test     0.601      0.589    0.609

So dropping out is not just good for socially awkward geniuses. Figure 8-7 is the MSE chart, over time, for the model doing dropout and regularization. You might miss this if you are looking at it in black-and-white, but the validation result is the lower (better) line!

Note

It chose the “best” model at the end, so why didn’t it choose one of those at around 750 or 1000 epochs? Because this is the MSE chart, but the decision of best was made using AUC. And the best result according to AUC was at 203 epochs. If you ever find one of these scoring charts confusing, go and look at the more detailed scoring history.

Next I used those results to make the models with the different topologies previously described. After taking a first look, I decided to also try two hidden layers with 400 neurons in each.

Here is the comparison table for AUC:

      200x200 200x200x200 200x400 400x200 400x400
train   0.594       0.597   0.591   0.598   0.595
valid   0.630       0.631   0.633   0.629   0.630

And then for accuracy:

      200x200 200x200x200 200x400 400x200 400x400
train   0.568       0.571   0.563   0.570   0.565
valid   0.619       0.621   0.618   0.615   0.618

So more input neurons in the first layer didn’t help. And it didn’t really help in the second layer either. But a third layer does seem to have helped, if only slightly. (Though, without running more experiments I cannot say for sure if this isn’t just random variation we are seeing.)

Well, if three is good, then four must be even better? AUC then accuracy:

      200x200 200x200x200 200x200x200x200
train   0.594       0.597             0.5
valid   0.630       0.631             0.5

      200x200 200x200x200 200x200x200x200
train   0.568       0.571           0.501
valid   0.619       0.621           0.421

Oh. That’s a “no” then. Those are some really bad results. It looks like there is just too much noise in the data set to be able to train a 4-layer network.

To make my final model I will go with 200x200x200, but relax the early-stopping criteria, so it can have up to 2000 epochs (up from 1000 epochs), unless it fails to improve by 0.1% over 4 scoring rounds (was: 1% over 3 scoring rounds). This model will be tried on the test data to predict home wins, and I will make versions both with and without the odds data.

Here are those new early-stopping criteria; the rest of the code is identical to the version shown previously:

  stopping_metric = "AUC",
  stopping_tolerance = 0.001,
  stopping_rounds = 4,
  epochs = 2000

Here are the final results for our two models. First the AUCs:

      HomeWin HW-NoOdds
train   0.635     0.596
valid   0.678     0.632
test    0.648     0.616

Next, the accuracy values:

      HomeWin HW-NoOdds
train   0.595     0.567
valid   0.649     0.618
test    0.617     0.606

How have we done? Well, extra epochs improved the train and validation scores, but our reference AUCs were 0.675 for the validation data set, and 0.650 for the test data. One above, one below. And on accuracy, the targets were 0.650 and 0.634, so we’ve done poorly there. A reminder that “Football Data” in the last chapter of this book compares all the algorithms on this data set.

Summary

Tuning deep learning can feel more like art than science, and with so many parameters it can always leave you feeling like you missed something important.

The superior performance on the MNIST data was what we expected. The poor performance on the football data was a bit unexpected: it tells me that with difficult, noisy, data sets deep learning can fail to outperform the other algorithms, and take longer doing it. The building energy results were the most interesting. It needed both the full training data (not just the 90% that cross-validation gave it), and the extra epochs from switching to early stopping, to get a big jump in comprehension—one that none of the algorithms managed. Similarly, “Deep Learning Auto-Encoder” (in the next chapter) shows a trend between increasing the number of hidden neurons and MSE but, again, it really only appeared when given enough epochs.

So, I’m still wondering if I missed something important on the football data.

The remainder of this chapter gives a list of all the other parameters that might be that “something important.” Or you can jump ahead to the next chapter, to see what H2O offers for unsupervised learning.

Appendix: More Deep Learning Parameters

My main criterion for putting a parameter here, rather than at the top of the chapter, was that I didn’t use it in this book:

missing_values_handling: Handling of missing values. If “Skip” then rows with missing values are ignored. If “MeanImputation,” which is the default, then missing values are assigned the mean value of that column.
use_all_factor_levels: This is true by default, meaning there is one input neuron for every level for each enum (categorical, factor) variable. If you set it to false then the first level in each enum is dropped. This can be done with no loss in accuracy (and a small speed-up in training speed, due to one less neuron). However, if you have set variable_importances to true, then you should keep use_all_factor_levels as true.
max_categorical_features: The maximum number of categorical features, enforced via hashing (Experimental). The default is to have no limit.
single_node_mode: The default is false. If true, then it will run on a single node of your cluster. I find cluster scaling to be quite efficient for deep learning, so it is hard to imagine a need for it that is not better served by tweaking target_ratio_comm_to_comp.
fast_mode: Enable fast mode (minor approximation in backpropagation). It defaults to true.
force_load_balance: Force extra load balancing to increase training speed. Defaults to true.
standardize: Defaults to true, meaning the data will automatically be normalized. If false then you need to scale the data as part of your data preparation.
sparse: Defaults to false. Set it to true if your data has lots of zero values, to make it more efficient.
sparsity_beta: Sparsity regularization (Experimental). The default is 0.0.

You can specify the initial state of the neural network; you might do this if you had previously trained the neural net. (In fact, if you just wanted to load a previously trained network, and not train it any more, set these parameters, and also set epochs to zero.)

initial_biases: A list of H2OFrame IDs to initialize the bias vectors of this model with.
initial_weights: A list of H2OFrame IDs to initialize the weight matrices of this model with.

But normally you will let the weights be initialized randomly. The following two parameters allow you control over that:

initial_weight_distribution: This defaults to “UniformAdaptive,” but the alternatives are “Uniform” and “Normal.” (UniformAdaptive is an optimized initialization that considers the size of the network.)
initial_weight_scale: This is a double. If initial_weight_distribution is “Uniform,” then this is the range. For example, if you give 0.5, then the initial weights will be randomly between –0.5 and +0.5. If initial_weight_distribution is “Normal” then this is the standard deviation for weights. That is, the same 0.5 would have 68% of weights between –0.5 and +0.5, but 16% would be above +0.5, and 16% would be below –0.5. The default value is 1.0.

The remaining parameters have to do with the learning rate. I find the H2O default behavior to be intelligent enough that I would rather spend my tuning time on other things. Then there is the fear of not knowing if I’ve made things worse! If you want to learn more about these, many neural net video and book courses cover them:

rate: Learning rate. Higher will be less stable, while lower means it will take longer (more epochs) to converge. The default is 0.005.
rate_annealing: Learning rate annealing: rate / (1 + rate_annealing * samples). Default is 1e–6.
rate_decay: Learning rate decay factor between layers (N-th layer: rate*alpha^(N-1)). Default is 1, meaning it is not used by default.

adaptive_rate: Adaptive learning. Defaults to true.
epsilon: Adaptive learning rate smoothing factor (to avoid divisions by zero and allow progress). The default is 1e–8.
rho: Adaptive learning rate time decay factor. Defaults to 0.99.
momentum_ramp: Number of training samples for which momentum increases. The default is 1e–6 (one million training samples).
momentum_stable: Final momentum after the ramp is over. The default is 0.0.
momentum_start: Initial momentum at the beginning of training. The default is 0.0.
nesterov_accelerated_gradient: Use Nesterov accelerated gradient. It is a boolean and defaults to true; there is not usually a good reason to try false .

¹ The Deep Water project, under development as I write this, will allow leveraging other deep learning libraries, such as Tensorflow, Caffe, and Mxnet, and so will gain GPU support from them.

² This process is called backpropagation, but H2O takes care of the whole process for us. Explanations of the sums involved are easy to find with your favorite search engine, e.g., https://en.wikipedia.org/wiki/Backpropagation.

³ Fractional epochs are allowed. For instance, requesting 10.5 epochs means all training data samples are processed 10 times, but the first half of the training data is also processed an eleventh time.

⁴ Possibly five: each categorical input also gets an extra input neuron for NAs and missing data. It depends if gender-unknown was encoded as NA, or as the literal string “Unknown.” And possibly three: if you set use_all_factor_levels=false (it is true by default) then one of the categories can be implied by setting all of the others to zero, so one of the input neurons can be dropped.

⁵ Search for the universal approximation theorem.

⁶ As an aside, an interesting approach using H2O and an extra data set to reduce the dimensions of large factors such as zip code is shown in this H2O World video by Madeleine Udell (specifically from about 19:00 to 25:00).

⁷ Not to be confused with hyperbole and going off at a tangent, the mainstay of modern politics.

⁸ Due to Hogwild!, a lock-free multithreaded stochastic gradient descent algorithm.

⁹ With quite a bit of variance. On another run I get MSEs 20% higher on all of train, valid, and test data, meaning the model is worse. This is simply because 10 epochs is too few, as we’ll see as soon as we start tuning.

¹⁰ We could test that theory by repeating the grids with stopping_rounds set to 4, then 5, and see if the relative ordering of the best models stays the same.

¹¹ I know we are not supposed to test on the test data set until the end, but the knowledge gained here is not being used to tune: the best model will still be chosen based on the results from cross-validation.

¹² Eight parameters, but X6 and X8 are enums with 4 and 6 levels, respectively, plus the spare input neuron for unseen values. 6 + (4 + 1) + (6 + 1) = 18.

¹³ In some earlier experiments I also tried just 18 neurons in the first layer. They were better than you might expect, but still inferior to the bigger models.

¹⁴ I ran this grid on a 2-node cluster on Amazon EC2, totalling 72 cores, and it averaged 40 seconds per 2-layer model, and 120 seconds per 3-layer model.

¹⁵ Well, sometimes it influences things such as automatically choosing the best numbers for early stopping.

¹⁶ Though you could equally well have made this in R or Python; the data is all there in the m model object that H2O returns.

¹⁷ If you’re curious, on the test data I get an even lower 273 errors.

¹⁸ Meaning, we want to see improvement in the digit recognition, not just in the MSE or logloss.

¹⁹ Chief Architect at H2O.ai, and the main author of the deep learning code.

²⁰ Seconds, but on a 3-node, 108-core cluster I set up for this test.

²¹ Estimated, based on taking 43 minutes on a 2-node, 72-core cluster.

²² This was the result, in “Football: Default GLM”, of using a linear model based on the average bookmaker odds of a win as the only input.

²³ See “Missing Data” in Chapter 9 for how they were made, and also why h2o.deeplearning()’s default of mean imputation is undesirable.

²⁴ You might have expected just one output? Binomial classifications have two outputs, one for the likelihood of it being true, and one for the likelihood of it being false. They get used together for the final model output.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 8. Deep Learning (Neural Nets)

Create new playlist

Sign In

Sign Up

Chapter 8. Deep Learning (Neural Nets)

Tip

What Are Neural Nets?

Note

Figure 8-1. Network, layers, neurons

Numbers Versus Categories

Network Layers

Tip

Activation Functions

Figure 8-2. Rectifier and Tanh activation functions

Parameters

Deep Learning Regularization

Deep Learning Scoring

Tip

Building Energy Efficiency: Default Deep Learning

Figure 8-3. Default performance of deep learning on test data

Building Energy Efficiency: Tuned Deep Learning

Note

Note

Tip

Figure 8-4. Tuned performance of deep learning on test data

Note

MNIST: Default Deep Learning

Example 8-1. Default deep learning on MNIST (Python)

Example 8-2. Default deep learning on MNIST (in R)

Figure 8-5. Logloss and MSE for train and valid data

Note

Note

MNIST: Tuned Deep Learning

Tip

Tip

Football: Default Deep Learning

Football: Tuned Deep Learning

Tip

Figure 8-6. Scoring history (validation on top, training below, lower is better)

Example 8-3. Deep learning first tuned model

Figure 8-7. Scoring history with dropout (training on top, validation below!)

Note

Summary

Appendix: More Deep Learning Parameters

Table of Contents for
8. Deep Learning (Neural Nets)