
5. Advanced Optimizers

Umberto Michelucci
Dübendorf, Switzerland

In general, optimizers are algorithms that minimize a given function. Remember that training a neural network means simply minimizing the loss function. Chapter 1 looked at the gradient descent optimizer and its variations (mini-batch and stochastic). In this chapter, we look at more advanced and efficient optimizers. We look in particular at Momentum, RMSProp, and Adam. We cover the mathematics behind them and then explain how to implement and use them in Keras.

Available Optimizers in Keras in TensorFlow 2.5

TensorFlow has evolved a lot over the last few years. In early versions you could find plain gradient descent as a class, but it is no longer available directly. As of TensorFlow 2.5, Keras offers only more advanced algorithms. In particular, you will find the following: Adadelta, Adagrad, Adam, Adamax, Ftrl, Nadam, RMSProp, and SGD (gradient descent with momentum). So plain gradient descent is no longer available out of the box in Keras.
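As a quick orientation, here is how a few of these optimizers are instantiated in Keras (a minimal sketch; the parameter values are just illustrative):

import tensorflow as tf

# All optimizers live in tf.keras.optimizers. Plain (mini-batch) gradient
# descent is obtained by using SGD with momentum=0.0, which is the default.
sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.0)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)
adam = tf.keras.optimizers.Adam(learning_rate=0.001)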

Note

In general, you should determine which optimizers work best for your problem, but if you don’t know where to start, Adam is always a good choice.

Advanced Optimizers

Up to now we have only discussed how gradient descent works. That is not the most efficient optimizer (although it’s the easiest to understand), and there are some modifications to the algorithm that can make it faster and more efficient. This is a very active area of research, and you will find an incredible number of algorithms based on different ideas. This chapter looks at the most instructive and famous ones: Momentum, RMSProp, and Adam.

Additional material that investigates the more exotic optimizers can be found in the paper by S. Ruder, “An overview of gradient descent optimization algorithms” (see https://goo.gl/KgKVgG). The paper is not for beginners and requires a strong mathematical background, but it provides an overview of the more exotic algorithms, including Adagrad, Adadelta, and Nadam. Additionally, it reviews weight-update schemes applicable in distributed environments, like Hogwild!, Downpour SGD, and many more. It is surely worth your time.

To understand the basic idea of momentum (and partially also of RMSProp and Adam), you first need to understand what exponentially weighted averages are. So, let’s start with some mathematics.

Exponentially Weighted Averages

Let’s suppose you are measuring a quantity θ (the temperature where you live, for example) over time, say once a day. You will have a series of measurements that you can indicate with θi, where i goes from 1 to a certain number N. Bear with me if this does not seem to make much sense at first. Let’s define a quantity vn recursively, as follows
$$ \begin{array}{l} v_0 = 1 \\ v_1 = \beta v_0 + (1-\beta)\theta_1 \\ v_2 = \beta v_1 + (1-\beta)\theta_2 \end{array} $$
and so on with β a real number and 0 < β < 1. Generally, we could write the nth term with a recursive formula, as follows
$$ v_n = \beta v_{n-1} + (1-\beta)\theta_n $$
Now let’s write all the terms v1, v2 and so on as a function of β and θi (so not recursively). For v2 we have
$$ v_2 = \beta\left(\beta v_0 + (1-\beta)\theta_1\right) + (1-\beta)\theta_2 = \beta^2 + (1-\beta)\left(\beta\theta_1 + \theta_2\right) $$
For v3 we have
$$ v_3 = \beta^3 + (1-\beta)\left[\beta^2\theta_1 + \beta\theta_2 + \theta_3\right] $$
Generalizing, we obtain
$$ v_n = \beta^n + (1-\beta)\left[\beta^{n-1}\theta_1 + \beta^{n-2}\theta_2 + \dots + \theta_n\right] $$
Or in a more elegant way (without the three dots), we get
$$ v_n = \beta^n + (1-\beta)\sum_{i=1}^{n}\beta^{n-i}\theta_i $$
Now let’s try to understand what this formula means. First of all, note that the term β^n disappears when you choose v0 = 0. Let’s do that (set v0 = 0) and consider what remains
$$ v_n = (1-\beta)\sum_{i=1}^{n}\beta^{n-i}\theta_i $$
Are you still with me? Now comes the interesting part. Let’s define the convolution between two sequences.1 Let’s consider two sequences, xn and hn. The convolution between the two (which we indicate with the symbol ∗) is defined by
$$ x_n \ast h_n = \sum_{k=-\infty}^{\infty} x_k\, h_{n-k} $$
Since we only have a finite number of measurements for our quantity θi, we can define
$$ \theta_k = 0 \qquad \text{for } k > n \text{ and } k \le 0 $$
Therefore, we can write vn as a convolution as follows
$$ v_n = \theta_n \ast b_n $$
Where we have defined
$$ b_n = (1-\beta)\beta^n $$
To get an idea of what that means, let’s plot θn, bn, and vn. To do that, let’s assume θn has a Gaussian shape (the exact form is not relevant; it is just for illustrative purposes), and let’s use β = 0.9 (see Figure 5-1).
Figure 5-1

The left side is a plot showing θn (solid line) and bn (dotted line) together. The right side is a plot showing the points that need to be summed to obtain vn for n = 50

Let’s briefly discuss Figure 5-1. The Gaussian curve (θn) is convolved with bn to obtain vn. The result can be seen in the right plot: all the terms (1 − β)β^(50 − i)θi for i = 1, …, 50 are summed to obtain v50. Note that v50 is a weighted average of all the θi for i = 1, …, 50. Each term θi is multiplied by a weight that is largest for i = 50 and decreases rapidly as i decreases toward 1. Basically, this is a weighted average with exponentially decreasing weights: the terms farther from n = 50 are less and less relevant, while the terms close to n = 50 get more weight. This is also a moving average, since for each n all the preceding terms are weighted by the sequence bn and added.
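To make this concrete, here is a minimal NumPy sketch (the variable names and the Gaussian-shaped signal are my own choices) that computes vn both with the recursive formula and with the weighted sum, and verifies that the two agree:

import numpy as np

beta = 0.9
# A Gaussian-shaped signal theta_1, ..., theta_100, as in Figure 5-1.
theta = np.exp(-0.5 * ((np.arange(1, 101) - 50) / 10.0) ** 2)

# Recursive definition: v_n = beta * v_{n-1} + (1 - beta) * theta_n, with v_0 = 0.
v_rec = []
v = 0.0
for t in theta:
    v = beta * v + (1.0 - beta) * t
    v_rec.append(v)
v_rec = np.array(v_rec)

# Closed form: v_n = (1 - beta) * sum_{i=1}^{n} beta**(n - i) * theta_i.
idx = np.arange(1, len(theta) + 1)
v_sum = np.array([(1.0 - beta) * np.sum(beta ** (n - idx[:n]) * theta[:n]) for n in idx])

print(np.allclose(v_rec, v_sum))  # True: the two formulations coincide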

I would now like to show you why the factor 1 − β appears in bn. Why not just choose β^n? The reason is simple: the sum of bn over all n ≥ 0 is equal to 1. Let’s see why.
$$ \sum_{k=0}^{\infty} b_k = (1-\beta)\sum_{k=0}^{\infty}\beta^k = (1-\beta)\lim_{N\to\infty}\frac{1-\beta^{N+1}}{1-\beta} = (1-\beta)\,\frac{1}{1-\beta} = 1 $$
We used the fact that for β < 1 we have $\lim_{N\to\infty}\beta^{N+1}=0$ and that for a geometric series we have
$$ \sum_{k=1}^{n} a r^{k-1} = \frac{a\left(1-r^n\right)}{1-r} $$

The algorithm we described to calculate vn is nothing other than the convolution of our quantity θi with a sequence of the form (1 − β)β^i whose sum is equal to one.

Note

The exponentially weighted average vn of a quantity θn is the convolution vn = θn ∗ bn of θn with bn = (1 − β)β^n, where bn has the property that its sum over n ≥ 0 is equal to 1. It is a moving average in which each term is weighted by the sequence bn.

As you choose smaller and smaller values of β, the number of points θn with a weight significantly different from zero decreases, as you can see in Figure 5-2, where the series bn is plotted for different values of β.
Figure 5-2

The series bn for three values of β: 0.9, 0.8, and 0.3. Note that as β gets smaller, the series is significantly different from zero for an increasingly smaller number of values
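A useful rule of thumb, not stated explicitly in the text but easy to verify, is that roughly the last 1/(1 − β) terms carry most of the weight, because for β close to 1
$$ \beta^{1/(1-\beta)} \approx \frac{1}{e} \approx 0.37 $$
For β = 0.9, for example, roughly the last ten measurements dominate the average (0.9^10 ≈ 0.35), while for β = 0.3 only the last one or two do.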

This method is at the very core of the Momentum optimizer and more advanced learning algorithms, and you will see in the next sections how it works in practice.

Momentum

Recall that in plain gradient descent the weight updates are calculated with these equations
$$ \left\{ \begin{array}{l} \boldsymbol{w}_{[n+1]} = \boldsymbol{w}_{[n]} - \gamma\, \nabla_{\boldsymbol{w}} L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right) \\[4pt] b_{[n+1]} = b_{[n]} - \gamma\, \dfrac{\partial L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right)}{\partial b} \end{array} \right. $$
Where the subscript in square brackets indicates the iteration. The idea behind the Momentum optimizer is to use an exponentially weighted average of the gradients for the weight updates. Mathematically, we formulate this as follows
$$ \left\{ \begin{array}{l} \boldsymbol{v}_{w,[n+1]} = \beta\, \boldsymbol{v}_{w,[n]} + (1-\beta)\, \nabla_{\boldsymbol{w}} L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right) \\[4pt] v_{b,[n+1]} = \beta\, v_{b,[n]} + (1-\beta)\, \dfrac{\partial L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right)}{\partial b} \end{array} \right. $$
Then perform the weight and bias updates with these equations
$$ \left\{ \begin{array}{l} \boldsymbol{w}_{[n+1]} = \boldsymbol{w}_{[n]} - \gamma\, \boldsymbol{v}_{w,[n+1]} \\[4pt] b_{[n+1]} = b_{[n]} - \gamma\, v_{b,[n+1]} \end{array} \right. $$

Where vw, [0] = 0 and vb, [0] = 0 are usually chosen. This means that, instead of using the derivatives of the cost function with respect to the weights directly, we update the weights with a moving average of the derivatives. Experience shows that the bias correction (which compensates for initializing the averages at zero) can usually be neglected.
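As an illustration (not the TensorFlow implementation; the names and the one-dimensional toy loss are my own), a single momentum step can be sketched like this:

def momentum_step(w, v, grad, gamma=0.01, beta=0.9):
    # v is the exponentially weighted average of the gradients.
    v = beta * v + (1.0 - beta) * grad
    w = w - gamma * v
    return w, v

# Toy usage: minimize L(w) = w**2, whose gradient is 2w.
w, v = 5.0, 0.0
for _ in range(500):
    w, v = momentum_step(w, v, grad=2.0 * w)
print(w)  # close to the minimum at w = 0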

Note

The Momentum algorithm uses an exponentially weighted average of the derivatives of the cost function with respect to the weights for the weight updates. This way, in addition to the derivatives at the current iteration, the past behavior is also considered. The algorithm may oscillate around the minimum instead of converging directly, but it can escape plateaus much more efficiently than standard gradient descent.

Sometimes you find a slightly different formulation in books or blogs, which is (we report only the equation for the weights w for brevity)
$$ \boldsymbol{v}_{w,[n+1]} = \gamma\, \boldsymbol{v}_{w,[n]} + \eta\, \nabla_{\boldsymbol{w}} L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right) $$
The idea and meaning remain the same; it is simply a slightly different mathematical formulation. We find the method described above, based on the notions of sequence convolution and weighted averages, easier to understand than this second formulation. Another formulation that you will find (the one that TensorFlow uses) is
$$ \boldsymbol{v}_{w,[n+1]} = \eta^t\, \boldsymbol{v}_{w,[n]} + \nabla_{\boldsymbol{w}} L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right) $$
Where η^t is what TensorFlow calls momentum (the superscript t indicates that this variable is the one used by TensorFlow). In this formulation, the weight update assumes the form
$$ \boldsymbol{w}_{[n+1]} = \boldsymbol{w}_{[n]} - \gamma^t\, \boldsymbol{v}_{w,[n+1]} = \boldsymbol{w}_{[n]} - \gamma^t\left(\eta^t\, \boldsymbol{v}_{w,[n]} + \nabla_{\boldsymbol{w}} L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right)\right) = \boldsymbol{w}_{[n]} - \gamma^t\eta^t\, \boldsymbol{v}_{w,[n]} - \gamma^t\, \nabla_{\boldsymbol{w}} L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right) $$
Where again the superscript t indicates that the variable is the one used by TensorFlow. Although it seems different, this formulation is equivalent to the formulation shown earlier:
$$ \boldsymbol{w}_{[n+1]} = \boldsymbol{w}_{[n]} - \gamma\beta\, \boldsymbol{v}_{w,[n]} - \gamma(1-\beta)\, \nabla_{\boldsymbol{w}} L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right) $$
The TensorFlow formulation and the one discussed earlier are equivalent if we choose
$$ \left\{ \begin{array}{l} \eta^t = \dfrac{\beta}{1-\beta} \\[6pt] \gamma^t = \gamma\,(1-\beta) \end{array} \right. $$

This can be seen by simply comparing the two equations for the weight updates. Values around η^t = 0.9 are typically used in TensorFlow implementations and work well. Momentum almost always converges faster than plain gradient descent.
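In Keras, momentum is available through the SGD class; its momentum argument corresponds to η^t above. A minimal example with the typical value just mentioned:

import tensorflow as tf

# SGD with momentum; momentum=0.9 corresponds to eta^t = 0.9 in the text.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)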

Note

Comparing parameter values across different optimizers is misleading. The learning rate, for example, has a different meaning in different algorithms. What you should compare is the best convergence speed you can achieve with each optimizer, regardless of the specific parameter values. Comparing GD with a learning rate of 0.01 to Adam (which you will see later) with the same learning rate does not make much sense. Tune the parameters of each optimizer and then compare which one gives you the best and fastest convergence.

RMSProp

Let’s move on to something a bit more complex, but usually more efficient. Let’s look at the mathematical equations and then we will compare them to the others we have seen so far. At each iteration, we need to calculate
$$ \left\{ \begin{array}{l} \boldsymbol{S}_{\boldsymbol{w},[n+1]} = \beta_2\, \boldsymbol{S}_{\boldsymbol{w},[n]} + (1-\beta_2)\, \nabla_{\boldsymbol{w}} L(\boldsymbol{w},b) \circ \nabla_{\boldsymbol{w}} L(\boldsymbol{w},b) \\[6pt] S_{b,[n+1]} = \beta_2\, S_{b,[n]} + (1-\beta_2)\, \dfrac{\partial L(\boldsymbol{w},b)}{\partial b} \circ \dfrac{\partial L(\boldsymbol{w},b)}{\partial b} \end{array} \right. $$
where the symbol ∘ indicates an element-wise product. Then we will update our weights with the equations
$$ \left\{ \begin{array}{l} \boldsymbol{w}_{[n+1]} = \boldsymbol{w}_{[n]} - \dfrac{\gamma\, \nabla_{\boldsymbol{w}} L(\boldsymbol{w},b)}{\sqrt{\boldsymbol{S}_{\boldsymbol{w},[n+1]} + \epsilon}} \\[10pt] b_{[n+1]} = b_{[n]} - \gamma\, \dfrac{\partial L(\boldsymbol{w},b)}{\partial b}\, \dfrac{1}{\sqrt{S_{b,[n+1]} + \epsilon}} \end{array} \right. $$
So you first compute the exponentially weighted averages S_{w,[n+1]} and S_{b,[n+1]}, and then you use them to rescale the derivatives in the weight updates. The ϵ, usually ϵ = 10^−8, prevents the denominator from going to zero when S_{w,[n+1]} and S_{b,[n+1]} go to zero. The idea is that if the derivatives are big, the S quantities are big, so the factors $1/\sqrt{\boldsymbol{S}_{\boldsymbol{w},[n+1]}+\epsilon}$ and $1/\sqrt{S_{b,[n+1]}+\epsilon}$ are small and the learning slows down. The other way around is also true: if the derivatives are small, the learning is faster. In other words, the algorithm speeds up learning along the parameters that are slowing it down. In TensorFlow, RMSProp is particularly easy to use with this code:
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.1)
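For intuition, a single RMSProp step can be sketched in NumPy as follows (an illustrative sketch of the equations above, not the Keras implementation; the names are my own):

import numpy as np

def rmsprop_step(w, s, grad, gamma=0.001, beta2=0.9, eps=1e-8):
    # s is the exponentially weighted average of the squared gradients.
    s = beta2 * s + (1.0 - beta2) * grad * grad
    w = w - gamma * grad / np.sqrt(s + eps)
    return w, s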

Adam

The last algorithm we look at is called Adam (Adaptive Moment Estimation). It combines the ideas of RMSProp and Momentum into one optimizer. Like Momentum, it uses an exponentially weighted average of past derivatives and, like RMSProp, it uses an exponentially weighted average of past squared derivatives.

You need to calculate the same quantities that you need for Momentum and for RMSProp and then calculate the following quantities
$$ \begin{array}{c} \boldsymbol{v}_{\boldsymbol{w},[n]}^{corrected} = \dfrac{\boldsymbol{v}_{\boldsymbol{w},[n]}}{1-\beta_1^n} \\[10pt] v_{b,[n]}^{corrected} = \dfrac{v_{b,[n]}}{1-\beta_1^n} \end{array} $$
And in the same way
$$ \begin{array}{c} \boldsymbol{S}_{\boldsymbol{w},[n]}^{corrected} = \dfrac{\boldsymbol{S}_{\boldsymbol{w},[n]}}{1-\beta_2^n} \\[10pt] S_{b,[n]}^{corrected} = \dfrac{S_{b,[n]}}{1-\beta_2^n} \end{array} $$
Where β1 is the hyper-parameter used in Momentum and β2 is the one used in RMSProp. Then, similar to what we did in RMSProp, we update the weights with the equations
$$ \left\{ \begin{array}{l} \boldsymbol{w}_{[n+1]} = \boldsymbol{w}_{[n]} - \dfrac{\gamma\, \boldsymbol{v}_{\boldsymbol{w},[n]}^{corrected}}{\sqrt{\boldsymbol{S}_{\boldsymbol{w},[n]}^{corrected} + \epsilon}} \\[12pt] b_{[n+1]} = b_{[n]} - \gamma\, \dfrac{v_{b,[n]}^{corrected}}{\sqrt{S_{b,[n]}^{corrected} + \epsilon}} \end{array} \right. $$
TensorFlow does everything for you if you simply use the following line
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07)

where in this case the typical values for the parameters have been chosen: γ = 0.001, β1 = 0.9, β2 = 0.999, and ϵ = 10^−7. Note that, since this algorithm adapts the effective learning rate during training, we can start with a bigger learning rate to speed up convergence.
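Putting the pieces together, a single Adam step mirrors the equations above. Here is a minimal NumPy sketch (illustrative only; the names are my own and n counts iterations starting at 1):

import numpy as np

def adam_step(w, v, s, grad, n, gamma=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    v = beta1 * v + (1.0 - beta1) * grad          # Momentum-style average of gradients
    s = beta2 * s + (1.0 - beta2) * grad * grad   # RMSProp-style average of squared gradients
    v_corr = v / (1.0 - beta1 ** n)               # bias corrections
    s_corr = s / (1.0 - beta2 ** n)
    w = w - gamma * v_corr / np.sqrt(s_corr + eps)
    return w, v, s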

Comparison of the Optimizers’ Performance

It is interesting to see how the optimizers behave to better understand why, for example, Adam is so efficient. To do that, let’s create a toy problem. Let’s consider a dataset of 30 tuples (xi, yi) with
$$ x_i = -1 + \frac{2}{30}\, i $$
And
$$ y_i = 2 + \frac{1}{2}\, x_i $$
Let’s compare gradient descent, Adam, and RMSProp. The goal of the problem is, starting from the 30 data points, to determine the linear relationship between xi and yi; in other words, to determine the two parameters, 2 and 1/2, in the last equation. For this example, we can write the linear relation as follows
$$ y_i = w_0 + w_1 x_i $$
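As a concrete sketch (the index convention and variable names are my own; I reuse the names x_ and y_ that appear in the training loop later), the dataset can be generated with:

import numpy as np

# 30 points with x_i = -1 + (2/30) i and y_i = 2 + 0.5 x_i.
i = np.arange(1, 31)
x_ = (-1.0 + 2.0 / 30.0 * i).reshape(-1, 1)
y_ = 2.0 + 0.5 * x_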
The optimizers will try to minimize the MSE with respect to w0 and w1. You will find the entire code to do this comparison in the online version of the book at https://adl.toelt.ai. In the online code, you will see an implementation of the gradient descent from scratch. In Figure 5-3, you can see the path that the GD algorithm takes in parameter space to reach the minimum of the MSE function in 200 iterations.
Figure 5-3

The path that the GD algorithm follows in parameter space while minimizing the MSE when starting from (0,0). The number of iterations used for this plot is 200

Figure 5-4 shows the different paths taken by the different optimizers when solving the same problem.
Figure 5-4

The path that the GD algorithm, Adam, and RMSProp optimizers follow in parameter space while minimizing the MSE when starting from (0,0). The number of iterations used for this plot is 200

One striking difference in Figure 5-4 is that while the GD path is rather direct, the others tend to be less direct, with Adam making loops around the minimum and RMSProp oscillating when close to it. From the plot, it is difficult to see which one is faster; the best way of checking that is to plot $w_0^{[i]}$ and $w_1^{[i]}$ vs. i, where i indicates the iteration number. In Figure 5-5, you can see how quickly $w_0^{[i]}$ converges to the expected value of 2.0 with the different algorithms. In this case, Adam and GD are on par, while RMSProp seems to be slower.
Figure 5-5

The plot of $w_0^{[i]}$ vs. the number of iterations with different optimizers

A different picture emerges when considering $w_1^{[i]}$. Figure 5-6 shows how quickly the different algorithms converge.
Figure 5-6

The plot of $w_1^{[i]}$ vs. the number of iterations with different optimizers

In this case, you can see how much faster Adam and RMSProp converge. It is interesting to note how Adam oscillates around the expected value of 0.5. This property makes Adam very efficient at escaping regions of parameter space where training can get stuck. But you should notice something else that is very interesting if you zoom in on the picture after iteration 150, as in Figure 5-7.
Figure 5-7

Zoom of Figure 5-6. The plot of $w_1^{[i]}$ vs. the number of iterations with different optimizers

In Figure 5-7, you can see how, while GD is still converging, Adam is already at the expected value and RMSProp oscillates around it, probably because it cannot converge completely. This should convince you of how efficient Adam is compared to other optimizers.

Small Coding Digression

I want to briefly explain how to set up the optimizers to use (0,0) as a starting point for the updates. To do that, after you have created and compiled the model, you can set the weights of the model with
model.set_weights([np.array([w1_start]).reshape(1,1), np.array([w0_start]).reshape(1,)])

where w0_start = 0.0 and w1_start = 0.0, the chosen starting point. Note that the first array passed to set_weights() is the kernel of the Dense layer (the slope w1) and the second is its bias (the intercept w0).

After doing that, you need to save the values of the weights after each iteration. The easiest way to do that is to implement your update as a custom training loop using GradientTape(). You will find the entire code online, but your training loop will look like this
for epoch in range(200):
    with tf.GradientTape() as tape:
        # Run the forward pass of the layer.
        ypred = model(x_, training=True)
        loss_value = loss_fn(y_, ypred)
    # Compute the gradients of the loss with respect to the weights
    # and apply one optimizer update.
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    # Save the current weights: get_weights()[0] is the kernel (w1),
    # get_weights()[1] is the bias (w0).
    w1_rmsprop_list.append(float(model.get_weights()[0][0]))
    w0_rmsprop_list.append(float(model.get_weights()[1][0]))

where w1_rmsprop_list and w0_rmsprop_list are lists that contain the values of the weights for each iteration. In this way, you can test any other optimizer you are interested in.
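For completeness, here is a minimal sketch of the setup the loop above assumes (the online code may differ in details; x_ and y_ are the toy dataset generated earlier):

import numpy as np
import tensorflow as tf

# Linear model y = w0 + w1 * x: the kernel of the Dense layer is w1, the bias is w0.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(loss='mse')

loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.1)

# Start from (w0, w1) = (0, 0) and collect the weights at each iteration.
w0_start, w1_start = 0.0, 0.0
model.set_weights([np.array([w1_start]).reshape(1, 1), np.array([w0_start]).reshape(1,)])
w0_rmsprop_list, w1_rmsprop_list = [], []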

Which Optimizer Should You Use?

To make a long story short, you should use Adam, as it is generally considered faster and better than the other methods. That does not mean this is always the case. There are recent research papers indicating that such adaptive optimizers can generalize poorly on new datasets (check out, for example, https://goo.gl/Nzc8bQ), and there are other papers that simply use GD with a dynamic learning rate decay. It mostly depends on your problem. But generally, Adam is a very good starting point.

Note

If you are unsure which optimizer to start with, use Adam. It’s generally considered faster and better than the other methods.

To give you an idea of how good it can be, let’s apply it to the Zalando dataset. We will use a network with four hidden layers, each with 20 neurons. The model we use is the one discussed at the end of Chapter 3. Figure 5-8 shows that the cost function converges faster when using Adam optimization in comparison to GD. Additionally, in 100 epochs, GD reaches an accuracy of 86%, while Adam reaches 90%. Note that we have not changed anything in the model except the optimizer!
Figure 5-8

The cost function for the Zalando dataset for a network with four hidden layers, each with 20 neurons. The continuous line is plain GD with a learning rate of γ = 0.01, and the dashed line is Adam optimization with γ = 0.1, β1 = 0.9, β2 = 0.999, and ϵ = 10^−8
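As a rough sketch of the setup (the exact architecture follows Chapter 3; the activations, input shape, and data loading are assumptions on my part), switching from GD to Adam only changes the optimizer passed to compile():

import tensorflow as tf

# Four hidden layers with 20 neurons each, for the 10-class Zalando (Fashion-MNIST) data.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(20, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),  # learning rate as in the figure caption
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])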

As suggested, when testing complex networks on big datasets, the Adam optimizer is a good place to start. But you should not limit your tests to this optimizer alone. A test of other methods is always worthwhile. Maybe for your problem other approaches will work better!
