© Umberto Michelucci 2022
U. Michelucci, Applied Deep Learning with TensorFlow 2, https://doi.org/10.1007/978-1-4842-8020-1_6

6. Hyper-Parameter Tuning

Umberto Michelucci
Dübendorf, Switzerland

Hyper-parameters can be loosely defined as parameters that do not change during training: for example, the number of layers in an FFNN, the number of neurons in each layer, the activation functions, the learning rate, and so on.

This chapter looks at the problem of finding the best hyper-parameters to get the best results from your models, a process called hyper-parameter tuning. We first describe what a black-box optimization problem is and how this class of problems relates to hyper-parameter tuning. We then look at the three most common methods for tackling such problems: grid search, random search, and Bayesian optimization. You will learn, with examples, which one works under which conditions. At the end of the chapter, you will see how to use these techniques to tune a deep model on a real-world dataset (the Zalando dataset). The goal of this chapter is not to give you an overview of all possible Python libraries that can help with hyper-parameter tuning; rather, by the end of the chapter, you should know what hyper-parameter tuning is, why it's important, and how it works, at least in its most fundamental form.

Black-Box Optimization

The problem of hyper-parameter tuning is just a subclass of a much more general class of problems: black-box optimization. A black-box function f(x)
$$ f(x): \mathbb{R}^n \to \mathbb{R} $$
is a function whose analytic form is unknown. A black-box function can be evaluated to obtain its value for all values of x on which it’s defined, but no other information, such as its gradient, can be obtained. Generally, we talk about the global optimization of a black-box function (sometimes called a black-box problem) when we try to find the maximum or minimum of f(x) given certain constraints. Here are some examples of these kinds of problems:
  • Finding the hyper-parameters for a given machine learning model that maximize the chosen optimizing metric

  • Finding a maximum or minimum of a function that can only be evaluated numerically, or with code that we cannot inspect. In some industrial contexts, for example, there is very complicated legacy code with functions whose outcomes must be maximized.

  • Finding the best place to drill for oil. Your function is how much oil you can find, and your x is your location.

  • Finding the best combination of parameters for situations that are too complex to model, such as launching a rocket into space. How do you optimize the amount of fuel, the diameter of each stage of the rocket, the precise trajectory, and so on?

This is a very fascinating class of problems that has produced smart solutions. We will look at three of them: grid search, random search, and Bayesian optimization.

Why is finding the best hyper-parameters for neural networks a black-box problem? Because we cannot calculate information like the gradients of our network's output with respect to the hyper-parameters (for example, the number of layers in an FFNN), especially when using complex optimizers or custom functions, we need other approaches to find the hyper-parameters that maximize the chosen optimizing metric. Note that if we had the gradients, we could use an algorithm such as gradient descent to find the maximum or minimum.

Note

Our black-box function f will be our neural network model (including things like the optimizer, cost function form, etc.) which gives as output our optimizing metric, given the hyper-parameters as input. In this case, x will be the array containing the hyper-parameters.

The problem may seem quite trivial, so why not try all the possibilities? Well, this may be possible in the examples we have looked at in the past chapters, but if you are working on a problem and training your model takes a week, this may present a challenge. Since typically you will have several hyper-parameters, trying all possibilities will not be feasible. Let’s look at an example to understand it better. Let’s suppose we are training a model of a neural network with several layers. We may decide to consider the following hyper-parameters to see which combination works best:
  • Learning rate: Let's suppose we want to try the values n · 10−4 for n = 1, …, 100 (100 values)

  • Regularization parameter: 0, 0.1, 0.2, 0.3, 0.4, and 0.5 (six values)

  • Choice of optimizer: GD, RMSProp, or Adam (three values)

  • Number of hidden layers: 1, 2, 3, 5, and 10 (five values)

  • Number of neurons in the hidden layers: 100, 200, and 300 (three values)

Consider that you will need to train your network
$$ 100 \times 6 \times 3 \times 5 \times 3 = 27000 $$

times if you want to test all possible combinations. If each training run takes five minutes, you will need 13.4 weeks of computing time. If the training takes hours or days, it becomes impossible: at one day per run, for example, you would need 73.9 years to try all the possibilities. Many of the hyper-parameter choices will come from experience; for example, you can almost always safely use Adam, since it's the best optimizer available in most cases. But you will not be able to avoid tuning other parameters, like the number of hidden layers or the learning rate. So you reduce the number of combinations with experience (as with the optimizer) or with some smart algorithms, as you will see later in this chapter.
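Before committing to a grid, it pays off to estimate the cost with a couple of lines of code. The following sketch (the variable names are ours, purely for illustration) reproduces the arithmetic above:

# hypothetical search space from the example above
learning_rates = [n * 1e-4 for n in range(1, 101)]   # 100 values
reg_params = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]          # 6 values
optimizers = ['GD', 'RMSProp', 'Adam']               # 3 values
hidden_layers = [1, 2, 3, 5, 10]                     # 5 values
neurons = [100, 200, 300]                            # 3 values

n_combinations = (len(learning_rates) * len(reg_params) *
                  len(optimizers) * len(hidden_layers) * len(neurons))
print(n_combinations)                                # 27000
minutes_per_training = 5.0
print(n_combinations * minutes_per_training
      / (60.0 * 24.0 * 7.0))                         # roughly 13.4 weeks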

Notes on Black-Box Functions

Black-box functions are usually classified into two main classes:
  • Cheap functions: Functions that can be evaluated thousands of times

  • Costly functions: Functions that can only be evaluated a few times, usually fewer than 100 times

If the black-box function is cheap, the choice of the optimization method is not critical: we can evaluate the gradient with respect to x numerically, or simply search for the maximum by evaluating the function at a large number of points. If the function is costly, we need much smarter approaches. One of these is Bayesian optimization, which we discuss later in this chapter to give you an idea of how those methods work and how complex they are.

Note

Neural networks are almost always, especially in the deep learning world, costly functions.

For costly functions, we need methods that solve the problem with the smallest number of evaluations possible.

The Problem of Hyper-Parameter Tuning

Before looking at how we can find the best hyper-parameters, we need to quickly go back to neural networks and discuss what we can tune in deep models. Typically, when talking about hyper-parameters, beginners think only of numerical parameters, such as the learning rate or the regularization parameter. Remember that the following can also be varied to see if you can get better results:
  • Number of epochs: Sometimes simply training your network longer will give you better results.

  • Choice of optimizer: You can try choosing a different optimizer. If you are using plain gradient descent, you may try Adam and see if you get better results.

  • Varying the regularization method: As discussed previously, there are several ways of applying regularization. Varying the method may well be worth trying.

  • Choice of activation function: Although the activation function used in the past chapters for neurons in hidden layers was ReLU, others may work a lot better. Trying for example sigmoid or Swish may help you get better results.

  • Number of layers and number of neurons in each layer: Try different configurations. Try layers with different numbers of neurons for example.

  • Learning rate decay methods: Try different learning rate decay methods (if you are not using optimizers that do this already).

  • Mini-batch size: Vary the size of the mini-batches. When you have a small amount of data you can use batch gradient descent. When you have a lot of data, mini-batches are more efficient.

  • Weight initialization methods.

Let’s classify the parameters we can tune in our models in the following three categories:
  1. Parameters that are continuous real numbers, or in other words, that can assume any value. Examples: learning rate and regularization parameter.

  2. Parameters that are discrete but can theoretically assume an infinite number of values. Examples: number of hidden layers, number of neurons in each layer, or number of epochs.

  3. Parameters that are discrete and can only assume a finite number of possibilities. Examples: optimizer, activation function, and learning rate decay method.

For category 3, there is not much to do except try all the possibilities. Such parameters typically change the model completely, making their effect impossible to model, so testing is the only option. This is also the category where experience helps the most: it's widely known that the Adam optimizer is almost always the best choice, so you may concentrate your efforts elsewhere at the beginning. For categories 1 and 2, it's a bit more difficult, and you need some smart approach to find the best values.
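To make the three categories concrete, here is one possible way of writing down a search space in code; the structure and names are purely illustrative, not part of any library:

search_space = {
    # Category 1: continuous real parameters
    'learning_rate': ('continuous', 1e-4, 1e-1),
    'reg_lambda':    ('continuous', 0.0, 0.5),
    # Category 2: discrete, in principle unbounded
    'hidden_layers': ('integer', 1, 10),
    'neurons':       ('integer', 50, 500),
    # Category 3: finite set of choices; just try them all
    'optimizer':     ('categorical', ['GD', 'RMSProp', 'Adam']),
    'activation':    ('categorical', ['relu', 'sigmoid', 'swish']),
}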

Sample Black-Box Problem

To try our hand at solving a black-box problem, let's create a “fake” one: find the maximum of the function f(x) given by the formula
$$ \begin{array}{l} g(x) = -\cos\frac{x}{4} - \sin\frac{x}{4} - \frac{5}{2}\cos\frac{x}{2} + \frac{1}{2}\sin\frac{x}{2} \\ h(x) = -\cos\frac{x}{3} - \sin\frac{x}{3} - \frac{5}{2}\cos\frac{2x}{3} + \frac{1}{2}\sin\frac{2x}{3} \\ f(x) = 10 + g(x) + \frac{1}{2}h(x) \end{array} $$
pretending not to know the formula itself. Knowing the formula lets us check the results, but we will treat it as unknown. You may wonder why we use such a complicated formula: we wanted something with a few maxima and minima, to show how the methods work on a non-trivial example. f(x) can be implemented in Python with the following code
import numpy as np

def f(x):
    tmp1 = (-np.cos(x/4.0) - np.sin(x/4.0)
            - 2.5*np.cos(2.0*x/4.0) + 0.5*np.sin(2.0*x/4.0))
    tmp2 = (-np.cos(x/3.0) - np.sin(x/3.0)
            - 2.5*np.cos(2.0*x/3.0) + 0.5*np.sin(2.0*x/3.0))
    return 10.0 + tmp1 + 0.5*tmp2
Figure 6-1 shows how f(x) looks.
Figure 6-1

A plot of the function f(x) described in the text

The maximum is at an approximate value of x = 69.18 and has a value of 15.027. Our challenge is to find this maximum in the most efficient way possible, without knowing anything about f(x) except its value at any point. When we say “efficient,” we mean with the smallest number of evaluations possible.

Grid Search

The first method we look at, grid search, is also the least “intelligent” one. Grid search entails simply evaluating the function at regular intervals and seeing for which x the function f(x) assumes the highest value. In this example, we want to find the maximum of the function f(x) between two values, xmin and xmax. We simply take n points equally spaced between xmin and xmax and evaluate the function at those points. We define a vector of points
$$ \boldsymbol{x} = \left( x_{min},\ x_{min} + \frac{\Delta x}{n},\ \dots,\ x_{min} + (n-1)\frac{\Delta x}{n} \right) $$
where we defined Δx = xmax − xmin. Then we evaluate the function f(x) at those points, obtaining a vector f of values
$$ \boldsymbol{f} = \left( f(x_{min}),\ f\left(x_{min} + \frac{\Delta x}{n}\right),\ \dots,\ f\left(x_{min} + (n-1)\frac{\Delta x}{n}\right) \right) $$
The estimate of the maximum $$ (\tilde{x}, \tilde{f}) $$ will then be
$$ \tilde{f} = \max_{0 \le i \le n-1} f_i $$
and supposing the maximum is found at $$ i = \tilde{i} $$, we will also have
$$ \tilde{x} = x_{min} + \frac{\tilde{i}\,\Delta x}{n} $$
Now, as you may imagine, the more points you use, the more accurate your estimate of the maximum will be. The problem is that, if the evaluation of f(x) is costly, you will not be able to test as many points as you might like; you will need to find a balance between the number of points and the accuracy. Let's consider an example with the function f(x) we described earlier. Let's take xmax = 80, xmin = 0, and n = 40 points, so that $$ \frac{\Delta x}{n} = 2 $$. We can create the vector x easily in Python with the code:
gridsearch = np.arange(0,80,2)
The array gridsearch will look like this
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78])
Figure 6-2 shows the function f(x) as a continuous line; the crosses mark the points we sample in the grid search, and the black square marks the exact maximum of the function. The right plot shows a zoom around the maximum.
Figure 6-2

The function f(x) on the range [0, 80]. The crosses mark the points we sample in the grid search, and the black square marks the maximum

You can see how the points we sample in Figure 6-2 get close to the maximum but don't hit it exactly. Of course, sampling more points would get us closer to the maximum, but would cost us more evaluations of f(x). We can find the maximum $$ (\tilde{x}, \tilde{f}) $$ easily with the trivial code
x = 0
m = 0.0
for i, val in np.ndenumerate(f(gridsearch)):
    if (val > m):
        m = val
        x = gridsearch[i]
print(x)
print(m)
That gives us
70
14.6335957578
That is close to the actual maximum (69.18, 15.027), but not quite right. Let's repeat the previous example, varying the step Δx and thus the number of sampled points n. For each case, we find the maximum and its location as described earlier. We can do it with this code
xlistg = []
flistg = []
for step in np.arange(1, 20, 0.5):
    gridsearch = np.arange(0, 80, step)
    x = 0
    m = 0.0
    for i, val in np.ndenumerate(f(gridsearch)):
        if (val > m):
            m = val
            x = gridsearch[i]
    xlistg.append(x)
    flistg.append(m)

In the lists xlistg and flistg, we find the position of the maximum and its value for the various values of n.

Figure 6-3 shows a plot of the distribution of the results. The black vertical line marks the correct value of the maximum.
Figure 6-3

The distribution of the results for $$ \tilde{f} $$ obtained by varying the number of points n sampled in the grid search. The black vertical line indicates the real maximum of f(x)

As you can see, the results vary quite a lot and can be very far from the correct value, by as much as 10. This tells us that using the wrong number of points can return very wrong results. As you can imagine, the best results are the ones with the smallest step Δx, since a sampled point is then more likely to be close to the maximum. Figure 6-4 shows how the value of the found maximum varies with the step Δx.
Figure 6-4

The behavior of the found value of the maximum vs. the step Δx

In the zoom in the right plot of Figure 6-4, it's evident that smaller values of Δx give better values of $$ \tilde{f} $$. Note that a step of 1.0 means sampling 80 values of f(x). If, for example, one evaluation takes one day, you will need to wait 80 days to get all the measurements you need.

Tip

Grid search is efficient only when the black-box function is cheap. To get good results, a large number of sampling points is usually needed.

To make sure you are really getting the maximum, you should decrease the step Δx, or increase the number of sampling points, until the maximum value you find no longer changes appreciably. In the previous example, as the right plot in Figure 6-4 shows, we can be confident we are close to the maximum once the step Δx gets smaller than roughly 2.0, or in other words, once the number of sampled points is greater than or roughly equal to 40. Remember that 40 may seem like a small number at first sight, but if f(x) evaluates the metric of your deep learning model and the training takes, say, two hours, you are looking at 3.3 days of computing time.
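One way to put this advice into practice is a simple convergence loop that keeps doubling the number of sampled points until the estimated maximum stops moving. This is only a sketch; the function f and the range are the ones from our example, and the tolerance is an arbitrary choice you would adapt:

import numpy as np

def grid_max(f, xmin, xmax, n):
    xs = np.linspace(xmin, xmax, n)
    fs = f(xs)
    i = np.argmax(fs)
    return xs[i], fs[i]

n = 10
tol = 1e-3
x_best, f_best = grid_max(f, 0.0, 80.0, n)
while True:
    n = n * 2
    x_new, f_new = grid_max(f, 0.0, 80.0, n)
    if abs(f_new - f_best) < tol:  # the maximum no longer changes appreciably
        break
    x_best, f_best = x_new, f_new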

Normally, in the deep learning world, two hours is not much for training a model, so make a quick calculation before starting a long grid search. Additionally, keep in mind that when doing hyper-parameter tuning, you are moving in a multi-dimensional space (you are not optimizing only one parameter, but many), so the number of evaluations you need grows very quickly.

Let’s look at a quick example. Let’s suppose you decide you can afford 50 evaluations of your black-box function. If you decide you want to try the following hyper-parameters:
  • Optimizer (RMSProp, Adam or plain GD) (three values)

  • Number of epochs (1000, 5000 or 10000) (three values)

You are already looking at nine evaluations. How many values of the learning rate can you then afford to try? Only five! And with five values, it's improbable that you'll get close to the optimal value. This shows how grid search is viable only for cheap black-box functions. Remember that time is often not the only problem: if you are using the Google Cloud Platform to train your network, for example, you are paying for the hardware by the second. You may have lots of time at your disposal, but costs can go over budget very quickly.

Random Search

A strategy that is as “dumb” as grid search, but that works amazingly better, is random search. Instead of sampling x points regularly in the range (xmin, xmax), you sample them randomly. We can do it with this code
import numpy as np
randomsearch = np.random.random([40])*80.0
The randomsearch array will look like this
array([ 0.84639256, 66.45122608, 74.12903502, 36.68827838, 61.71538757, 69.29592273, 48.76918387, 69.81017465, 1.91224209, 21.72761762, 22.17756662, 9.65059426, 72.85707634, 2.43514133, 53.80488236, 5.70717498, 28.8624395 , 33.44796341, 14.51234312, 41.68112826, 42.79934087, 25.36351055, 58.96704476, 12.81619285, 15.40065752, 28.36088144, 30.27009067, 16.50286852, 73.49673641, 66.24748556, 8.55013954, 29.55887325, 18.61368765, 36.08628824, 22.1053749 , 40.14455129, 73.80825225, 30.60089111, 52.01026629, 47.64968904])
Depending on the seed you use, the actual numbers you get may be different. As we did for grid search, Figure 6-5 shows the plot of f(x), where the crosses mark the sampled points and the black square marks the maximum. The right plot shows a zoom around the maximum.
Figure 6-5

The function f(x) on the range [0, 80]. The crosses mark the points we sampled with random search, and the black square marks the maximum

The risk with this method is that, if you are very unlucky, your randomly chosen points are nowhere close to the real maximum, but that probability is quite low. Note that if you use a uniform probability distribution for your random points, every location is equally likely to be sampled. Now it is interesting to see how this method performs. Let's consider 200 different random sets of 40 points, obtained by varying the random seed used in the code. The distributions of the maxima found $$ \tilde{f} $$ are plotted in Figure 6-6.
Figure 6-6

The distribution of the results for $$ \tilde{f} $$ obtained from 200 different random sets of 40 points sampled in the random search. The black vertical line indicates the real maximum of f(x)
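As a sketch, the experiment behind Figure 6-6 can be reproduced along these lines (assuming the function f defined earlier):

import numpy as np

maxima = []
for seed in range(200):
    np.random.seed(seed)
    points = np.random.random([40])*80.0
    maxima.append(f(points).max())  # best value found by this random set
# maxima now holds the 200 estimates whose distribution is plotted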

As you can see, regardless of the random set used, you get very close to the real maximum. Figure 6-7 shows the distributions of the maxima found with random search when varying the number of sampled points from 10 to 80.
Figure 6-7

The distribution of the results for $$ \tilde{f} $$ obtained by varying the number of points n sampled in the random search, from 10 to 80. The black vertical line indicates the real maximum of f(x)

If you compare this with grid search, you can see that random search is better at consistently getting results closer to the real maximum. Figure 6-8 compares the distribution you get for your maximum $$ \tilde{f} $$ when using a different number of sampling points n with random and with grid search. In both cases, the plots were generated with 38 different sets, so the total count is the same.
Figure 6-8

A comparison of the distribution of $$ \tilde{f} $$ between grid (right) and random (left) search while varying the number of sampling points n. In both plots the counts sum to 38, the number of different values of n used. The correct value of the maximum is marked with the vertical black line in both plots

It’s easy to see how, on average, random search is better than grid search. The values you get are consistently closer to the right maximum.

Tip

Random search is consistently better than grid search, and you should use it whenever possible. The difference between random search and grid search becomes even more marked when dealing with a multi-dimensional space for your variable x, and hyper-parameter tuning is nearly always a multi-dimensional optimization problem.

If you are interested in a very good paper on how random search scales to high-dimensional problems, read the paper by J. Bergstra and Y. Bengio, “Random Search for Hyper-Parameter Optimization” [1].

Coarse to Fine Optimization

There is one more optimization trick that helps with grid or random search, called “coarse to fine optimization.” Let's suppose we want to find the maximum of f(x) between xmin and xmax. We will describe the idea for random search, but it works in the same way with grid search. The following steps give the algorithm:
  1. Do a random search in the region R1 = (xmin, xmax). Indicate the maximum found with (x1, f1).

  2. Consider a smaller region around x1, R2 = (x1 − δx1, x1 + δx1), for some δx1 (which we discuss later), and do a random search in this region. The hypothesis is of course that the real maximum lies in this region. Indicate the maximum you find here with (x2, f2).

  3. Repeat Step 2 around x2, in a region R3 with a δx2 smaller than δx1, and indicate the maximum you find in this step with (x3, f3).

  4. Repeat Step 2 again around x3, in a region R4 with a δx3 smaller than δx2.

  5. Continue in the same way as many times as needed, until the maximum (xi, fi) found in the region Ri+1 does not change anymore.
Typically just one or two iterations are needed, although theoretically you could go on for many more. The problem with this method is that you cannot be sure that your real maximum actually lies in the regions Ri; but when it does, this optimization has a big advantage, as the sketch below shows.
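Before working through the numbers, here is a minimal sketch of the whole procedure as a function; the number of iterations, the points per stage, and the shrink factor are arbitrary choices you must tune to your problem:

import numpy as np

def coarse_to_fine(f, xmin, xmax, n_points = 10, n_iter = 3, shrink = 0.1):
    lo, hi = xmin, xmax
    x_best, f_best = None, -np.inf
    for _ in range(n_iter):
        # random search in the current region
        points = lo + np.random.random([n_points])*(hi - lo)
        vals = f(points)
        i = np.argmax(vals)
        if vals[i] > f_best:
            x_best, f_best = points[i], vals[i]
        # the next region is a fraction `shrink` of the current one,
        # centered on the best point found so far
        half = (hi - lo)*shrink/2.0
        lo, hi = x_best - half, x_best + half
    return x_best, f_best

# for example: coarse_to_fine(f, 0.0, 80.0)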

Let’s consider the case where we do a standard random search. If we want to have on average a distance between the sampled points of 1% of xmax − xmin we would need around 100 points if we decided to perform only one random search, and consequently we had to perform 100 evaluations. Now let’s consider the algorithm we just described. We could start with just ten points in region R1 = (xmin,  xmax), where we indicate the maximum we find with (x1, f1). Then let’s take $$ 2delta x=frac{x_{max}-{x}_{mathrm{min}}}{10} $$ and let’s use ten points in region R2 = (x1 − δx, x1 + δx). In the interval (x1 − δx, x1 + δx), we will have on average a distance between the points of 1% of xmax − xmin, but we just sampled our functions only 20 times instead of 100, so a factor of five fewer! For example, let’s sample the function we used previously between xmin = 0 and xmax = 80 with ten points with this code
np.random.seed(5)
randomsearch = np.random.random([10])*80
x = 0
m = 0.0
for i, val in np.ndenumerate(f(randomsearch)):
    if (val > m):
        m = val
        x = randomsearch[i]
This gives us the maximum location and value of x1 = 69.65 and f1 = 14.89: not bad, but not as precise as the real values of 69.18 and 15.027. Now let's sample ten points around the maximum we found, in the region R2 = (x1 − 4, x1 + 4)
randomsearch = x + (np.random.random([10])-0.5)*8
x = 0
m = 0.0
for i, val in np.ndenumerate(f(randomsearch)):
    if (val > m):
        m = val
        x = randomsearch[i]

This gives us the result 69.189 and 15.027: quite a precise result with only 20 evaluations of the function. If we do a plain random search with 500 sampling points (25 times more than what we just used), we get x1 = 69.08 and f1 = 15.022. This shows how helpful this trick can be. But remember the risk: if your maximum is not in your regions Ri, you will never find it, since you are still dealing with random numbers. So it is always a good idea to choose regions (xi − δxi, xi + δxi) that are relatively big, to make sure they contain your maximum.

How big they need to be depends, as with almost everything in the deep learning world, on your dataset and the problem at hand, and may be impossible to know in advance. Testing is unfortunately required. Figure 6-9 shows the sampled points on the function f(x). The plot on the left shows the first ten points; the plot on the right shows the region R2 with the additional ten points. The small rectangle in the left plot marks the region R2.
Figure 6-9

The function f(x). The crosses mark the sampled points: on the left, the ten points sampled in the region R1 (entire range); on the right, the ten points sampled in R2. The black square marks the real maximum. The plot on the right is a zoom of the region R2

The decision of how many points to sample at the beginning is crucial. We were lucky here. Let's use a different seed when choosing our initial ten random points and see what can happen. As you can see in Figure 6-10, choosing the wrong initial points leads to disaster!
Figure 6-10

The function f(x). The crosses mark the sampled points. On the left the ten points sampled in the region R1 (entire range), on the right the ten points sampled in R2. The black square marks the real maximum. The algorithm finds a maximum very well around 16, but it's not the absolute maximum. The small rectangle on the left marks the region R2. The plot on the right is the zoom on the region R2

In Figure 6-10, note how the algorithm settles on the maximum around 16, since among the initial sampled points the highest value is around x = 16, as you can see in the left plot. No points are close to the real maximum around x = 69. The algorithm finds a maximum really well, simply not the absolute maximum. That is the danger you face when using this trick. It can get even worse than that. Consider Figure 6-11, where we sampled just one single point at the beginning. As the left plot of Figure 6-11 shows, the algorithm completely misses every maximum. It simply returns the highest of the values marked by crosses in the right plot: (58.4, 9.78).
Figure 6-11

The function f(x). The crosses mark the sampled points: on the left, the one point sampled in the region R1 (entire range); on the right, the ten points sampled in R2. The black square marks the real maximum. The algorithm does not find a maximum, since none is present in the region R2. The plot on the right is a zoom of the region R2

If you decide to use this method, keep in mind that you will still need a good number of points at the beginning to get close to the maximum before refining your search. After you are relatively sure you have points around the maximum, you can use this technique to refine your search.

Bayesian Optimization

Nadaraya-Watson Regression

To understand Bayesian optimization, we have to look at some new mathematical concepts. Let's start with Nadaraya-Watson regression, an idea from 1964 [2]. The basic idea is quite simple. Given an unknown function f(x) and N points xi, i = 1, …, N, we indicate with yi = f(xi), i = 1, …, N, the values of the function calculated at the different xi. The idea of Nadaraya-Watson regression is that we can estimate the unknown function f at a new point x using the formula
$$ y(x) = \sum_{i=1}^N w_i(x)\, y_i $$
where wi(x) are weights that are calculated according to the formula
$$ w_i(x) = \frac{K(x, x_i)}{\sum_{j=1}^N K(x, x_j)} $$
where K(x, xi) is called a kernel. Note that, given how the weights are defined, we have
$$ \sum_{i=1}^N w_i(x) = 1 $$
In the literature, you can find several kernels, but the one we are interested in is the Gaussian one, often called the Radial Basis Function (RBF)
$$ K(x, x_i) = \sigma^2 e^{-\frac{1}{2l^2} {\left\| x - x_i \right\|}^2} $$

The parameter l makes the Gaussian shape wider or narrower. σ² is typically the variance of your data, but here it plays no role, since the weights are normalized to 1. This regression is at the base of Bayesian optimization, as you will see later.
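To make the formulas concrete, here is a minimal sketch of Nadaraya-Watson regression with the RBF kernel; the sample points, the test function, and the value of l are arbitrary choices for illustration:

import numpy as np

def rbf(x, xi, l = 1.0, sigm = 1.0):
    # Gaussian (RBF) kernel; sigm cancels out after normalization
    return sigm**2*np.exp(-0.5*(x - xi)**2/l**2)

def nadaraya_watson(x, x_samples, y_samples, l = 1.0):
    weights = rbf(x, x_samples, l)
    weights = weights/weights.sum()  # the weights sum to 1
    return np.dot(weights, y_samples)

# example: estimate sin(x) at a new point from 20 samples
x_samples = np.linspace(0, 2*np.pi, 20)
y_samples = np.sin(x_samples)
print(nadaraya_watson(np.pi/3, x_samples, y_samples, l = 0.4))
# a smoothed estimate of sin(pi/3), whose exact value is about 0.866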

Gaussian Process

Before talking about Gaussian processes, we must define what a random process is. We have a random process when to every point $$ x \in \mathbb{R}^d $$ we assign a random variable $$ f(x) \in \mathbb{R} $$. A random process is Gaussian if, for any finite number of points, their joint distribution is normal. That means that $$ \forall n \in \mathbb{N} $$ and $$ \forall x_1, \dots, x_n \in \mathbb{R}^d $$, the vector
$$ \boldsymbol{f} \equiv \left( \begin{array}{c} f(x_1) \\ \vdots \\ f(x_n) \end{array} \right) \sim \mathcal{N} $$
where the notation means that the vector components follow a normal distribution, indicated with $$ \mathcal{N} $$. Remember that a random variable with a Gaussian distribution is said to be normally distributed; from here comes the name Gaussian process. The probability density of the normal distribution is given by the function
$$ g(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$
where μ is the mean, or expectation, of the distribution and σ is the standard deviation. From here on, we will indicate the mean value of f with m,
and the covariance of the random values will be indicated by K
$$ \mathrm{Cov}\left[ f(x_1), f(x_2) \right] = K(x_1, x_2) $$

The choice of the letter K is purposeful: we will assume in what follows that the covariance has a Gaussian shape, and we will use for K the RBF function defined earlier.

Stationary Process

For simplicity, we will consider only stationary processes. A random process is stationary if its joint probability distribution does not change with time; in particular, the mean and variance do not change when shifted in time. We will also consider only processes whose distribution depends solely on the relative position of the points. This leads to the conditions
$$ \begin{array}{l} K(x_1, x_2) = \tilde{K}(x_1 - x_2) \\ \mathrm{Var}\left[ f(x) \right] = \tilde{K}(0) \end{array} $$

Note that to apply the method we are describing, you must first make your data stationary, for example by eliminating seasonality or trends in time.

Prediction with Gaussian Processes

Now we have reached the interesting part: given the vector f, how can we estimate f(x) at an arbitrary point x? Since we are dealing with random processes, we will estimate the probability that the unknown function assumes a given value f(x). Mathematically we will predict the following quantity
$$ p(f(x) \mid \boldsymbol{f}) $$

Or, in other words, the probability of getting the value f(x) given the vector f, composed of all the points f(x1), …, f(xn).

Assuming that f(x) is a stationary Gaussian process with m = 0, the prediction can be shown to be given by
$$ p(f(x) \mid \boldsymbol{f}) = \frac{p(f(x), f(x_1), \dots, f(x_n))}{p(f(x_1), \dots, f(x_n))} = \frac{\mathcal{N}(f(x), f(x_1), \dots, f(x_n) \mid 0, \tilde{C})}{\mathcal{N}(f(x_1), \dots, f(x_n) \mid 0, C)} $$

where with $$ \mathcal{N}(f(x), f(x_1), \dots, f(x_n) \mid 0, \tilde{C}) $$ we indicate the normal distribution calculated on the points, with mean 0 and covariance matrix $$ \tilde{C} $$ of dimensions (n + 1) × (n + 1), since we have n + 1 points in the numerator. The derivation is somewhat involved and is based on several theorems, such as Bayes' theorem; for more details, you can refer to the (advanced) explanation by C.B. Do [3]. To understand Bayesian optimization, we can simply use the formula without derivation. C has dimensions n × n, since we have only n points in the denominator.

We have
$$ C = \left( \begin{array}{ccc} K(0) & K(x_1 - x_2) & \dots \\ K(x_2 - x_1) & \ddots & \vdots \\ \vdots & \dots & K(0) \end{array} \right) $$
and
$$ \tilde{C} = \left( \begin{array}{cc} K(0) & \boldsymbol{k}^T \\ \boldsymbol{k} & C \end{array} \right) $$
where we have defined
$$ \boldsymbol{k} = \left( \begin{array}{c} K(x - x_1) \\ \vdots \\ K(x - x_n) \end{array} \right) $$
It can be shown that this ratio of two normal distributions is again a normal distribution, which means we can write
$$ p(f(x) \mid \boldsymbol{f}) = \mathcal{N}(f(x) \mid \mu, \sigma^2) $$
with
$$ \begin{array}{l} \mu = \boldsymbol{k}^T C^{-1} \boldsymbol{f} \\ \sigma^2 = K(0) - \boldsymbol{k}^T C^{-1} \boldsymbol{k} \end{array} $$
The derivation of the exact forms of μ and σ² is quite long and would go beyond the scope of this book. The essential point is that, on average, our unknown function assumes the value μ at x, with variance σ². Let's now see how this method works in practice in Python. Let's define our kernel K(x)
def K(x, l, sigm = 1):
    return sigm**2*np.exp(-1.0/2.0/l**2*(x**2))
Let’s simulate our unknown function with an easy one
$$ f(x) = x^2 - x^3 + 3 + 10x + 0.07x^4 $$
that can be implemented as follows
def f(x):
    return x**2-x**3+3+10*x+0.07*x**4
Let’s consider the function in the range (0, 12). Figure 6-12 shows how the function looks.
Figure 6-12

A plot of the unknown test function as described in the text

Let’s build first our f vector with five points
randompoints = np.random.random([5])*12.0
f_ = f(randompoints)
where we have used the seed 42 for the random numbers: np.random.seed(42). Figure 6-13 shows the random points marked with a cross on the plot.
Figure 6-13

A plot of the unknown function. The crosses mark the random points chosen in the text

We can apply the method described earlier with the following code
xsampling = np.arange(0,14,0.2)
ybayes_ = []
sigmabayes_ = []
for x in xsampling:
    f1 = f(randompoints)
    sigm_ = np.std(f1)**2
    f_ = (f1-np.average(f1))
    k = K(x-randompoints, 2.0, sigm_)
    C = np.zeros([randompoints.shape[0],
                  randompoints.shape[0]])
    Ctilde = np.zeros([randompoints.shape[0]+1,
                       randompoints.shape[0]+1])
    for i1,x1_ in np.ndenumerate(randompoints):
        for i2,x2_ in np.ndenumerate(randompoints):
            C[i1,i2] = K(x1_-x2_, 2.0, sigm_)
    Ctilde[0,0] = K(0, 2.0, sigm_)
    Ctilde[0,1:randompoints.shape[0]+1] = k.T
    Ctilde[1:,1:] = C
    Ctilde[1:randompoints.shape[0]+1,0] = k
    mu = np.dot(np.dot(np.transpose(k), np.linalg.inv(C)), f_)
    sigma2 = (K(0, 2.0, sigm_)
              - np.dot(np.dot(np.transpose(k),
                              np.linalg.inv(C)), k))
    ybayes_.append(mu)
    sigmabayes_.append(np.abs(sigma2))
ybayes = np.asarray(ybayes_)+np.average(f1)
sigmabayes = np.asarray(sigmabayes_)
Take some time to study this code so you understand it. In the ybayes array, we find the values of μ(x) evaluated at the values contained in the xsampling array. Here are some hints that will help you understand the code:
  • We loop over a range of x points where we want to evaluate our function, with the code for x in xsampling.

  • We build our k and f vectors element-wise with k = K(x-randompoints, 2.0, sigm_) and f1 = f(randompoints). For the kernel, we chose the value 2.0 for the parameter l. We subtract the average from the vector f to obtain m(x) = 0, so that we can apply the formulas as derived.

  • We build the matrices C and $$ \tilde{C} $$.

  • We calculate μ with mu = np.dot(np.dot(np.transpose(k), np.linalg.inv(C)), f_) and the standard deviation.

  • We reapply the transformations that made our process stationary, in the opposite order, simply adding the average of f(x) back to the evaluated surrogate function: ybayes = np.asarray(ybayes_)+np.average(f1).

Figure 6-14 shows how this method works. The dashed line is the predicted function, obtained by plotting μ(x) as calculated in the code with five points (n = 5). The gray area is the region within +/- σ of the estimated function.
Figure 6-14

The predicted function, dashed line, evaluating μ(x). The gray area is the region between the estimated function and +/- σ

Given the few points we have, this is actually not a bad result. Keep in mind that you still need a few points to get a reasonable approximation. Figure 6-15 shows the result when we have only two points; it is not as good. The gray area is again the region within +/- σ of the estimated function. You can see that the farther you are from the sampled points, the higher the uncertainty, or variance, of the predicted function.
Figure 6-15

The predicted function, the dashed line, when we have only two points. The gray area is the region between the estimate function +/- σ

Let’s stop for a second and think about why we do all this. The idea behind it is to find a surrogate function $$ overset{sim }{f} $$ that approximates our function f and that has the property
$$ max overset{sim }{f}approx max f $$

This surrogate function must have another very important property: it must be cheap to evaluate. That way, we can easily find the maximum of $$ \tilde{f} $$ and, by the previous property, obtain the maximum of f, which is by hypothesis costly to evaluate.

As you have seen, it may be very difficult to know if we have enough points to find the right value of the maximum. After all, by definition, we do not have any idea how our black-box function looks. So how do we solve this problem?

The main idea behind the method can be described by the following algorithm:
  1. Start with a small number of randomly chosen sample points (how many will depend on your problem).

  2. Use this set of points to get a surrogate function, as described previously.

  3. Add a point to the set, chosen with a specific method that we discuss later, and reevaluate the surrogate function.

  4. If the maximum found with the surrogate function continues to change, keep adding points as in Step 3 until the maximum does not change anymore, or until you run out of time or budget and cannot perform any more evaluations.
If the method we hinted at in Step 3 is smart enough, you will be able to find the maximum relatively quickly and accurately.

Acquisition Function

How do you choose the additional points mentioned in Step 3 of the previous section? The idea is to use an acquisition function. The algorithm works in this way:
  1. Choose a function (we will see a few possibilities in a moment) called an “acquisition function.”

  2. Choose as the additional point x the one at which the acquisition function has its maximum.

There are several acquisition functions we can use. We will describe only one, the upper confidence bound, to show how the method works; others you may want to check out include entropy search, probability of improvement, and expected improvement.

Upper Confidence Bound (UCB)
In the literature you find two variations of this acquisition function. We can write the first as
$$ a_{UCB}(x) = \mathbb{E}\tilde{f}(x) + \eta\, \sigma(x) $$
where we indicate with $$ \mathbb{E}\tilde{f}(x) $$ the expected value of the surrogate function over the x range of our problem; the expected value is nothing more than the average of the function over the given x range. σ(x) is the standard deviation of the surrogate function that we calculate with our method at the point x. The new point we select is the one at which aUCB(x) is maximum. η > 0 is a tradeoff parameter. This acquisition function basically selects the points where the variance is biggest. Look at Figure 6-15 again: the method would select the points where the variance is greatest, so points as far as possible from the points we already have. In this way, the approximation tends to get better and better. Another variation of the UCB acquisition function is the following
$$ \tilde{a}_{UCB}(x) = \tilde{f}(x) + \eta\, \sigma(x) $$

This time, the acquisition function makes a tradeoff between choosing points around the surrogate function's maximum and points where its variance is biggest. This second variant works best for quickly finding a good approximation of the maximum of f, while the first tends to give a good approximation of f over the entire x range. The next section shows how these methods work in practice; a small code sketch of the two variants follows.
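As a sketch, the two variants differ only in whether the first term varies with x. Here, mu and sigma are arrays holding the surrogate mean and standard deviation evaluated on a grid of candidate points; the names are ours, not from any library:

import numpy as np

def ucb(mu, sigma, eta):
    # a_UCB: a constant expected value plus eta*sigma(x);
    # effectively selects the point of highest variance
    return np.average(mu) + eta*sigma

def ucb_tilde(mu, sigma, eta):
    # a~_UCB: trades off the surrogate value at x against its variance
    return mu + eta*sigma

# in both cases, the next point to sample is where the acquisition
# function peaks, e.g. x_grid[np.argmax(ucb_tilde(mu, sigma, eta))]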

Example
Let’s start with the complex trigonometric function, as described at the beginning of the chapter and consider the x range [0, 40]. Our goal is to find its maximum and approximate the function. To facilitate the code, let’s define two functions: one to evaluate the surrogate function and one to evaluate the new point. To evaluate the surrogate function, we can use the following function
def get_surrogate(randompoints):
    ybayes_ = []
    sigmabayes_ = []
    for x in xsampling:
        f1 = f(randompoints)
        sigm_ = np.std(f1)**2
        f_ = (f1-np.average(f1))
        k = K(x-randompoints, 2.0, sigm_)
        C = np.zeros([randompoints.shape[0],
                      randompoints.shape[0]])
        Ctilde = np.zeros([randompoints.shape[0]+1,
                           randompoints.shape[0]+1])
        for i1,x1_ in np.ndenumerate(randompoints):
            for i2,x2_ in np.ndenumerate(randompoints):
                C[i1,i2] = K(x1_-x2_, 2.0, sigm_)
        Ctilde[0,0] = K(0, 2.0, sigm_)
        Ctilde[0,1:randompoints.shape[0]+1] = k.T
        Ctilde[1:,1:] = C
        Ctilde[1:randompoints.shape[0]+1,0] = k
        mu = np.dot(np.dot(np.transpose(k),
                           np.linalg.inv(C)), f_)
        sigma2 = (K(0, 2.0, sigm_)
                  - np.dot(np.dot(np.transpose(k),
                                  np.linalg.inv(C)), k))
        ybayes_.append(mu)
        sigmabayes_.append(np.abs(sigma2))
    ybayes = np.asarray(ybayes_)+np.average(f1)
    sigmabayes = np.asarray(sigmabayes_)
    return ybayes, sigmabayes
This function contains the same code we discussed in the previous sections, but packed into a function that returns the surrogate function, in the array ybayes, and σ², in the array sigmabayes. Additionally, we need a function that selects the new point using the acquisition function aUCB(x). We can use this one
def get_new_point(ybayes, sigmabayes, eta):
    idxmax = np.argmax(np.average(ybayes)
                       + eta*np.sqrt(sigmabayes))
    newpoint = xsampling[idxmax]
    return newpoint
To make things simpler, we define the array containing all the x values we want to sample at the beginning, outside the functions. Let's start with six randomly selected points and the following definitions
xmax = 40.0
numpoints = 6
xsampling = np.arange(0,xmax,0.2)
eta = 1.0
np.random.seed(8)
randompoints1 = np.random.random([numpoints])*xmax
The randompoints1 array contains our first six randomly selected points. We can easily get the surrogate function with
ybayes1, sigmabayes1 = get_surrogate(randompoints1)
Figure 6-16 shows the result. The dotted line is the acquisition function aUCB(x) normalized to fit in the plot.
Figure 6-16

An overview of the black-box function f(x) (the solid line), the randomly selected points (marked by crosses), the surrogate function (the dashed line), and the acquisition function aUCB(x) (the dotted line, shifted to fit in the plot). The gray area is the region contained between the lines $$ \tilde{f}(x) + \sigma(x) $$ and $$ \tilde{f}(x) - \sigma(x) $$

The surrogate function is not very good yet, since we don't have enough points; the big variance (the gray area) makes this evident. The only region that is well approximated is x ≳ 35. You can see how the acquisition function is large where the surrogate approximates the black-box function poorly, and small where it approximates it well, as for x ≳ 35. So choosing a new point where aUCB(x) is at its maximum is equivalent to choosing the point where the function is least well approximated, or in more mathematical terms, where the variance is biggest. For comparison, Figure 6-17 shows the same plot as Figure 6-16, but with the acquisition function $$ \tilde{a}_{UCB}(x) $$ and with η = 3.0.
Figure 6-17

An overview of the black-box function f(x) (the solid line), the randomly selected points (marked by crosses), the surrogate function (the dashed line), and the acquisition function $$ \tilde{a}_{UCB}(x) $$ (the dotted line, shifted to fit in the plot). The gray area is the region contained between the lines $$ \tilde{f}(x) + \sigma(x) $$ and $$ \tilde{f}(x) - \sigma(x) $$

As you can see, $$ \tilde{a}_{UCB}(x) $$ tends to have its maximum around the maximum of the surrogate function. Keep in mind that if η is big, the maximum of the acquisition function shifts toward regions with high variance. This acquisition function tends to find “a” maximum slightly faster. We say “a” maximum because it depends on where the maximum of the surrogate function is, not on where the maximum of the black-box function is.

Let’s see now what happens while using aUCB(x) with η = 1.0. For the first additional point, we need to run the following three lines of code
newpoint = get_new_point(ybayes1, sigmabayes1, eta)
randompoints2 = np.append(randompoints1, newpoint)
ybayes2, sigmabayes2 = get_surrogate(randompoints2)
For the sake of simplicity, we name the arrays differently at each step instead of creating a list; typically, you would make these iterations automatic, as sketched after Figure 6-19. Figure 6-18 shows the result with the additional point, marked with a black circle.
Figure 6-18

An overview of the black-box function f(x) (the solid line), the randomly selected points (marked with crosses) together with the newly selected point around x ≈ 27 (marked by a circle), the surrogate function (the dashed line), and the acquisition function aUCB(x) (the dotted line, shifted to fit in the plot). The gray area is the region contained between the lines $$ \tilde{f}(x) + \sigma(x) $$ and $$ \tilde{f}(x) - \sigma(x) $$

The new point is around x ≈ 27. Let’s continue to add points. Figure 6-19 shows the results after adding five points.
Figure 6-19

An overview of the black-box function f(x), the randomly selected points together with the five new points, the surrogate function, and the acquisition function $$ \tilde{a}_{UCB}(x) $$. The gray area is the region contained between the lines $$ \tilde{f}(x) + \sigma(x) $$ and $$ \tilde{f}(x) - \sigma(x) $$

Look at the dashed line. The surrogate function now approximates the black-box function quite well, especially around the real maximum. Using this surrogate function, we can find a very good approximation of our original function with just 11 evaluations in total! Keep in mind that we don't have any information about f except these 11 evaluations.
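The manual steps shown so far can be wrapped in a loop. Here is a minimal sketch reusing get_surrogate and get_new_point from earlier, with a simple fixed evaluation budget as the stopping criterion:

points = randompoints1.copy()
for step in range(5):  # fixed budget of additional evaluations
    ybayes, sigmabayes = get_surrogate(points)
    newpoint = get_new_point(ybayes, sigmabayes, eta)
    points = np.append(points, newpoint)  # one new (costly) evaluation of f
ybayes, sigmabayes = get_surrogate(points)
x_best = xsampling[np.argmax(ybayes)]  # maximum of the surrogate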

Now let’s see what happens with the acquisition function $$ {overset{sim }{a}}_{UCB}(x) $$, and let’s check how fast we can find the maximum. In this case, let’s use η = 3.0, to get a better balance between the surrogate function’s maximum and its variance. Figure 6-20 shows the result after adding one additional point, marked by a black circle. We already have a good approximation of the real maximum!
Figure 6-20

An overview of the black-box function f(x), the randomly selected points with the additional selected point, the surrogate function, and the acquisition function $$ \tilde{a}_{UCB}(x) $$. The gray area is the region contained between the lines $$ \tilde{f}(x) + \sigma(x) $$ and $$ \tilde{f}(x) - \sigma(x) $$

Now let’s add another point. Figure 6-21 shows that the additional point is still close to the maximum but has shifted in the direction of the area with a high variance around 30.
Figure 6-21

An overview of the black-box function f(x), the randomly selected points with the additional selected points, the surrogate function, and the acquisition function $$ \tilde{a}_{UCB}(x) $$. The gray area is the region contained between the lines $$ \tilde{f}(x) + \sigma(x) $$ and $$ \tilde{f}(x) - \sigma(x) $$

Had we chosen a smaller η, the point would be closer to the maximum; with a bigger one, it would be closer to the point of highest variance, between 25 and roughly 32. Let's add another point and see what happens. Figure 6-22 shows how the method now chooses a point in another region with high variance, between 10 and roughly 22, again marked by a black circle.
Figure 6-22

An overview of the black-box function f(x), the randomly selected points, the surrogate function, and the acquisition function $$ \tilde{a}_{UCB}(x) $$. The additional selected point is marked by the black circle. The gray area is the region contained between the lines $$ \tilde{f}(x) + \sigma(x) $$ and $$ \tilde{f}(x) - \sigma(x) $$

And finally, the method refines the maximum area around 15, as you can see in Figure 6-23, adding a point around 14, marked by the black circle.
Figure 6-23

An overview of the black-box function f(x), the randomly selected points with the additional selected points, the surrogate function, and the acquisition function $$ \tilde{a}_{UCB}(x) $$. The gray area is the region contained between the lines $$ \tilde{f}(x) + \sigma(x) $$ and $$ \tilde{f}(x) - \sigma(x) $$

As this discussion and comparison of the two types of acquisition functions should have made clear by now, choosing the right acquisition function depends on the strategy you want to apply to approximate your black-box function.

Note

Different types of acquisition functions give different strategies for approximating the black-box function. For example, aUCB(x) adds points in the regions with the highest variance, while $$ \tilde{a}_{UCB}(x) $$ adds points by finding a balance, regulated by η, between the maximum of the surrogate function and the areas with high variance.

An analysis of all the different types of acquisition functions would go far beyond the scope of this book. A good deal of research and reading of published papers is required to gain enough experience to understand how different acquisition functions work and behave.

If you want to use Bayesian optimization with your TensorFlow model, you do not have to develop the method completely from scratch. You can try the library called GPflowOpt from N. Knudde et al. [4].

Sampling on a Logarithmic Scale

There is one last small subtlety that we need to discuss. Sometimes you will find yourself in a situation where you want to try a big range of possible values for a parameter, but you know from experience that the best value probably lies in a specific subrange. Suppose you want to find the best learning rate for your model, and you decide to test values from 10−4 to 1, but you know, or at least expect, that the best value lies between 10−4 and 10−3. Now suppose you use grid search and sample 1,000 points. You may think you have enough points, but you will get
  • 0 points between 10−4 and 10−3

  • 8 points between 10−3 and 10−2

  • 89 points between 10−2 and 10−1

  • 899 points between 10−1 and 1

You get many points in the less interesting ranges and none where you want them. Figure 6-24 shows the distribution of the points; note that the x-axis uses a logarithmic scale. You can clearly see how you get many more points for bigger values of the learning rate.
Figure 6-24

The distribution of 1,000 points selected with grid search on a logarithmic x-scale

You probably want to sample much more finely for smaller values of the learning rate than for bigger ones. What you should do is sample your points on a logarithmic scale. The basic idea is to sample the same number of points between 10−4 and 10−3, between 10−3 and 10−2, between 10−2 and 10−1, and between 10−1 and 1. To do that in Python, first build an array of exponents evenly spaced between 0 and −4, the most negative power of ten in the range.
r = - np.arange(0,1,0.001)*4.0
Then your array with the selected points can be created with the following
points2 = 10**r
Figure 6-25 shows that the distribution of the points contained in the array points2 is now completely flat, as we wanted.
Figure 6-25

The distribution of 1,000 points selected with grid search on a logarithmic x-scale with the modified selection method

You get 250 points in each decade. You can easily check this with the following code for the range between 10−4 and 10−3; for the other ranges, simply change the numbers.
print(np.sum((points2 <= 1e-3) & (points2 > 1e-4)))

Now you can see how you get the same number of points between the different powers of ten. With this simple trick, you ensure that you get enough points in every region of your chosen range, where otherwise you would get almost none. Remember that in this example, with 1,000 points, the standard method gives zero points between 10−4 and 10−3; this range is the most interesting for the learning rate, so you want enough points there to optimize your model. Note that the same applies to random search, as shown below; it works in the exact same way.
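For random search, the only change is to draw the exponents uniformly at random instead of on a regular grid:

r = -np.random.random([1000])*4.0  # uniform exponents in (-4, 0]
points_random = 10**r              # log-uniform samples between 1e-4 and 1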

Hyper-Parameter Tuning with the Zalando Dataset

To give you a concrete example of how hyper-parameter tuning works, let's apply what you have learned to a simple case. Let's start with the data, as usual, using the Zalando dataset from Chapter 3 (refer to that chapter for a complete discussion). Let's quickly load and prepare the data and then discuss tuning.

First, as usual, we need the necessary libraries
# general libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
from random import *
import time
# tensorflow libraries
from tensorflow.keras.datasets import fashion_mnist
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_docs as tfdocs
import tensorflow_docs.modeling
Then, to retrieve the dataset, we can simply run the following command
((trainX, trainY), (testX, testY)) = fashion_mnist.load_data()

Remember that we have 60,000 observations in the train dataset and 10,000 in the dev dataset. After executing the command, we have two NumPy matrices (trainX and testX) containing all the pixel values that describe each of the training and test images, and two NumPy arrays (trainY and testY) containing the associated labels.

Let’s print the training dataset dimensions, as an example
print('Dimensions of the training dataset: ', trainX.shape)
print('Dimensions of the training labels: ', trainY.shape)
That will return as output
Dimensions of the training dataset:  (60000, 28, 28)
Dimensions of the training labels:  (60000,)

Remember that each image has 28 × 28 pixels; flattened, this gives 784 grayscale values per image.

Now we need to modify the data to obtain a “flattened” version of each image, meaning an array of 784 pixels instead of a matrix of 28 × 28 pixels.

The following lines reshape the matrices' dimensions
data_train = trainX.reshape(60000, 784)
data_test = testX.reshape(10000, 784)
Finally let’s normalize the input so that instead of having values from 0 to 255 (the grayscale values), it has only values between 0 and 1. This is very easy to do with this code
data_train_norm = np.array(data_train/255.0)
data_test_norm = np.array(data_test/255.0)
Before moving on to developing our network, we need to solve a final problem: when performing a multiclass classification task, the labels must be provided in one-hot encoded form. The Python code to do this is very simple:
labels_train = np.zeros((60000, 10))
labels_train[np.arange(60000), trainY] = 1
labels_test = np.zeros((10000, 10))
labels_test[np.arange(10000), testY] = 1
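As a side note, the same one-hot encoding can also be obtained with a Keras utility; this is an equivalent alternative, not what the code above uses:

from tensorflow.keras.utils import to_categorical

labels_train = to_categorical(trainY, num_classes = 10)
labels_test = to_categorical(testY, num_classes = 10)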

Now we have the data prepared as we need it. For a complete discussion, refer to Chapter 3, where we spent a lot of time preparing this dataset. Let's move on to the model and start with something easy. As the metric for this example, let's use accuracy, since the dataset is balanced. Let's consider a network with just one hidden layer and see what number of neurons provides the best accuracy.

Our hyper-parameter in this example will be the number of neurons in the hidden layer. Basically, we need to build a new network for each value of the hyper-parameter and train it. We need two functions: one to build the network and one to train it. To build the model, we can define the following function
def build_model(number_neurons, learning_rate):
  # create model
  model = keras.Sequential()
  # add first hidden layer and set input dimensions
  model.add(layers.Dense(number_neurons, input_dim = 784,
                         activation = 'relu'))
  # add output layer
  model.add(layers.Dense(10, activation = 'softmax'))
  # compile model
  model.compile(loss = 'categorical_crossentropy',
                optimizer = tf.keras.optimizers.SGD(
                  learning_rate = learning_rate),
                metrics = ['categorical_accuracy'])
  return model
You should understand this function, since we have used similar code several times in this book already. It has two input parameters: number_neurons, which contains, as the name indicates, the number of neurons in the hidden layer, and learning_rate, the learning rate used by the SGD optimizer.
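As a quick sanity check (purely illustrative and not needed for what follows), you can build an instance and inspect its architecture
model = build_model(15, 0.001)
model.summary()
The function to train the model will look like this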
def train_model(number_neurons, learning_rate, number_epochs,
                mb_size):
  # build model
  model = build_model(number_neurons, learning_rate)
  # train model
  history = model.fit(
    data_train_norm, labels_train,
    epochs = number_epochs, verbose = 0,
    batch_size = mb_size,
    callbacks = [tfdocs.modeling.EpochDots()])
  # test model
  train_loss, train_accuracy = model.evaluate(
                               data_train_norm, labels_train,
                               verbose = 0)
  test_loss, test_accuracy = model.evaluate(
                               data_test_norm, labels_test,
                               verbose = 0)
  return train_accuracy, test_accuracy
You have seen a very similar function several times already. The main parts should be clear, but a few things are new. First, we build the model in the function itself with
model = build_model(number_neurons, learning_rate)

Additionally, we evaluate the accuracy on the train and dev datasets and return the values to the caller. In this way, we can run a loop over several values for the number of neurons in the hidden layer and collect the accuracies. Note that this time the function has an additional input parameter, number_neurons, which it passes on to the function that builds the model.

Let’s suppose we choose the following parameters: mini-batch size = 50, we train for 100 epochs, learning rate λ = 0.001, and we build our model with 15 neurons in the hidden layer.

We then run the model
train_accuracy, test_accuracy = train_model(15, 0.001, 100, 50)
print(train_accuracy)
print(test_accuracy)
For the train dataset, we get 0.8549 and for the dev dataset, we get 0.8396 accuracy. Can we do better? Well, we can surely do a grid search to start with
nn = [1, 5, 10, 15, 25, 30, 50, 150, 300, 1000, 3000]
for nn_ in nn:
    train_accuracy, test_accuracy = train_model(nn_, 0.001, 100, 50)
    print('Number of neurons:', nn_,
          'Acc. Train:', train_accuracy,
          'Acc. Test:', test_accuracy)
Keep in mind that this will take a lot of time; 3,000 neurons is a large number, so be warned in case you want to try it. We get the results shown in Table 6-1.
Table 6-1

Overview of the Accuracy on the Training and Test Datasets for a Different Number of Neurons

Number of Neurons   Accuracy on the Train Dataset   Accuracy on the Test Dataset
1                   0.1000                          0.1000
5                   0.8242                          0.8123
10                  0.8476                          0.8323
15                  0.8549                          0.8373
25                  0.8604                          0.8444
30                  0.8577                          0.8413
50                  0.8661                          0.8470
150                 0.8677                          0.8513
300                 0.8694                          0.8512
1000                0.8740                          0.8555
3000                0.8754                          0.8569

Not surprisingly, more neurons deliver better accuracy, with no sign of overfitting the train dataset, since the accuracy on the dev dataset is almost equal to that on the train dataset. Figure 6-26 shows a plot of the accuracy on the test dataset vs. the number of neurons in the hidden layer. Note that the x-axis uses a logarithmic scale to make the changes more evident.
Figure 6-26

Accuracy on the test dataset vs. the number of neurons in the hidden layer
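If you want to reproduce a plot like Figure 6-26 yourself, a short sketch using the values from Table 6-1 would look like this
# reproduce the plot of Figure 6-26 from the values in Table 6-1
nn = [1, 5, 10, 15, 25, 30, 50, 150, 300, 1000, 3000]
test_acc = [0.1000, 0.8123, 0.8323, 0.8373, 0.8444, 0.8413,
            0.8470, 0.8513, 0.8512, 0.8555, 0.8569]
plt.semilogx(nn, test_acc, 'o-')
plt.xlabel('Number of neurons in the hidden layer')
plt.ylabel('Accuracy on the test dataset')
plt.show()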

If your goal is to reach 80% accuracy, you could well stop here. But there are a few things to consider: first, we may be able to do better, and second, training the network with 3,000 neurons takes a lot of time, roughly 35 minutes on a medium-performance laptop. We should see if we can get the same result in a fraction of the time. We want a model that trains as fast as possible! Let's try a slightly different approach. Since we want to be faster, let's consider a model with four layers. We could also tune the number of layers, but let's stick to four for this example and tune the other parameters. We will try to find the optimal values for the learning rate, the mini-batch size, the number of neurons in each layer, and the number of epochs. We will use random search. For each parameter, we will randomly select ten values:
  • Number of neurons: Between 35 and 60

  • Learning rate: We will search on the logarithmic scale between 10^-1 and 10^-3

  • Mini-batch size: Between 20 and 80

  • Number of epochs: Between 40 and 100

We can create arrays with the possible values using this code
neurons_ = np.random.randint(low = 35, high = 60, size = 10)
# exponents uniformly in (-3, -1] give learning rates
# between 10^-3 and 10^-1, as described above
r = - np.random.random([10])*2.0 - 1.0
learning_ = 10**r
mb_size_ = np.random.randint(low = 20, high = 80, size = 10)
epochs_ = np.random.randint(low = 40, high = 100, size = 10)
Note that we will not try all possible combinations. We consider only ten possible combinations: the first value of each array, the second value of each array, and so on. Let’s see how efficient random search can be with just ten evaluations. We can test this model with the following loop
for i in range(len(neurons_)):
    train_accuracy, test_accuracy = train_model(
        neurons_[i], learning_[i], epochs_[i], mb_size_[i])
    print()
    print('Epochs:', epochs_[i],
          'Number of neurons:', neurons_[i],
          'Learning rate:', learning_[i],
          'Minibatch size:', mb_size_[i],
          'Training accuracy:', train_accuracy,
          'Test accuracy:', test_accuracy)
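If you prefer not to scan the printed output by eye, a small variation of the loop (a sketch, reusing the same train_model function) keeps track of the best combination automatically
# track the combination with the highest accuracy on the dev dataset
best_test_accuracy = 0.0
best_params = None
for i in range(len(neurons_)):
    train_accuracy, test_accuracy = train_model(
        neurons_[i], learning_[i], epochs_[i], mb_size_[i])
    if test_accuracy > best_test_accuracy:
        best_test_accuracy = test_accuracy
        best_params = (neurons_[i], learning_[i],
                       epochs_[i], mb_size_[i])
print('Best test accuracy:', best_test_accuracy)
print('Best (neurons, learning rate, epochs, mini-batch size):',
      best_params)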

You will find that the combination with 44 epochs, 36 neurons in each layer, a learning rate of 0.07537, and a mini-batch size of 54 gives an accuracy on the dev dataset of 0.86. Not bad, considering that this run took 1.5 minutes, roughly 23 times faster than the model with one layer and 3,000 neurons. Our naïve initial test gave us an accuracy of 0.84, so with hyper-parameter tuning we did 2% better than our initial guess. In deep learning, this is considered a good result. What we did should give you an idea of how powerful hyper-parameter tuning can be when done properly. Keep in mind that you should spend quite some time doing it, and especially thinking about how to do it.

Note

Always think about how you want to do your hyper-parameter tuning and use your experience or ask for help from someone with experience. It is useless to invest time and resources to try combinations of parameters that you know will not work. For example, it's better to spend time testing learning rates that are very small than to test learning rates around 1. Remember that every training round of your network will take time, even if the results are not useful!

The point of this last section is not to get the best model possible, but to give you an idea about how the tuning process may work. You could continue trying different optimizers (for example Adam), considering wider ranges for the parameters, more parameter combinations, and so on.

A Quick Note about the Radial Basis Function

Before finishing this chapter, let’s discuss a minor point about the Radial Basis Function
$$ K\left(x,{x}_i\right)={\sigma}^2{e}^{-\frac{1}{2{l}^2}{\left\Vert x-{x}_i\right\Vert}^2} $$
It is important that you understand the role of the parameter l. In our examples, we chose l = 2, but we did not discuss why. Choosing an l that's too small makes the acquisition function develop very narrow peaks around the points we already have, as you can see in the left plot of Figure 6-27. Large values of l have a smoothing effect on the acquisition function, as you can see in the middle and right plots of Figure 6-27.
Figure 6-27

The effect of changing the l parameter in the radial basis function

Usually, it's good practice to avoid values of l that are too small or too big, so that the variance varies smoothly between known points, as in Figure 6-27 for l = 1. A very small l makes the variance between known points almost constant, and therefore the algorithm almost always chooses the midpoint between them, as you can check from the acquisition function. An l that's too big makes the variance small and therefore hard to exploit: as you can see for l = 5 in Figure 6-27, the acquisition function is almost constant. Typical values are around 1 or 2.
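To build some intuition about l, here is a minimal sketch (assuming σ² = 1) that evaluates the kernel for two fixed points and a few values of l
import numpy as np

def rbf_kernel(x, xi, sigma2 = 1.0, l = 2.0):
    # K(x, x_i) = sigma^2 * exp(-||x - x_i||^2 / (2*l^2))
    return sigma2 * np.exp(-np.sum((x - xi)**2) / (2.0 * l**2))

# kernel value between two points at distance 1.5, for different l
x, xi = np.array([0.5]), np.array([2.0])
for l in [0.2, 1.0, 5.0]:
    print('l =', l, '-> K =', rbf_kernel(x, xi, l = l))
For small l, the correlation between two points a fixed distance apart is essentially zero (narrow peaks); for large l, it stays close to one (strong smoothing).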

Exercises

Exercise 1 (Level: Easy)

Try different optimizers (for example, Adam) for the examples in this chapter, consider wider ranges for the parameters, more parameter combinations, and so on. See how the results change.

Exercise 2 (Level: Difficult)

Try to implement Bayesian optimization from scratch, as described in this chapter.

Exercise 3 (Level: Medium)
Try to optimize a multiclass classification model like the one you saw in this chapter, but using the MNIST database of handwritten digits (http://yann.lecun.com/exdb/mnist/). To download the dataset from TensorFlow, use the following lines of code:
from tensorflow import keras
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

