
5. Advanced Optimizers

Umberto Michelucci
Dübendorf, Switzerland

In general, optimizers are algorithms that minimize a given function. Remember that training a neural network means simply minimizing the loss function. Chapter 1 looked at the gradient descent optimizer and its variations (mini-batch and stochastic). In this chapter, we look at more advanced and efficient optimizers. We look in particular at Momentum, RMSProp, and Adam. We cover the mathematics behind them and then explain how to implement and use them in Keras.

Available Optimizers in Keras in TensorFlow 2.5

TensorFlow has evolved a lot over the last few years. In early versions you could find plain gradient descent as a class, but it is no longer available directly. As of TensorFlow 2.5, Keras offers only more advanced algorithms. In particular, you will find the following: Adadelta, Adagrad, Adam, Adamax, Ftrl, Nadam, RMSProp, and SGD (gradient descent with momentum). So plain gradient descent is no longer available out of the box in Keras.
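As a quick orientation, here is how a few of these optimizers are instantiated in Keras (a minimal sketch; the parameter values are just illustrative):

import tensorflow as tf

# All optimizers live in tf.keras.optimizers. Plain (mini-batch) gradient
# descent is obtained by using SGD with momentum=0.0, which is the default.
sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.0)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)
adam = tf.keras.optimizers.Adam(learning_rate=0.001)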

Note

In general, you should determine which optimizers work best for your problem, but if you don’t know where to start, Adam is always a good choice.

Advanced Optimizers

Up to now we have only discussed how gradient descent works. That is not the most efficient optimizer (although it’s the easiest to understand), and there are some modifications to the algorithm that can make it faster and more efficient. This is a very active area of research, and you will find an incredible number of algorithms based on different ideas. This chapter looks at the most instructive and famous ones: Momentum, RMSProp, and Adam.

Additional material that investigates the more exotic optimizers can be found in the paper by S. Ruder, “An overview of gradient descent optimization algorithms” (see https://goo.gl/KgKVgG). The paper is not for beginners and requires a strong mathematical background, but it provides an overview of the more exotic algorithms, including Adagrad, Adadelta, and Nadam. Additionally, it reviews weight-update schemes applicable in distributed environments, like Hogwild!, Downpour SGD, and many more. It is surely worth your time.

To understand the basic idea of momentum (and partially also of RMSProp and Adam), you first need to understand what exponentially weighted averages are. So, let’s start with some mathematics.

Exponentially Weighted Averages

Let’s suppose you are measuring a quantity θ (the temperature where you live, for example) over time, say once a day. You will have a series of measurements that you can indicate with θi, where i goes from 1 to a certain number N. Bear with me if this does not seem to make much sense at first. Let’s define a quantity vn recursively, as follows
$$ \begin{array}{l} v_0 = 1 \\ v_1 = \beta v_0 + (1-\beta)\theta_1 \\ v_2 = \beta v_1 + (1-\beta)\theta_2 \end{array} $$
and so on with β a real number and 0 < β < 1. Generally, we could write the nth term with a recursive formula, as follows
$$ v_n = \beta v_{n-1} + (1-\beta)\theta_n $$
Now let’s write all the terms v1, v2 and so on as a function of β and θi (so not recursively). For v2 we have
$$ v_2 = \beta\left(\beta v_0 + (1-\beta)\theta_1\right) + (1-\beta)\theta_2 = \beta^2 + (1-\beta)\left(\beta\theta_1 + \theta_2\right) $$
For v3 we have
$$ v_3 = \beta^3 + (1-\beta)\left[\beta^2\theta_1 + \beta\theta_2 + \theta_3\right] $$
Generalizing, we obtain
$$ v_n = \beta^n + (1-\beta)\left[\beta^{n-1}\theta_1 + \beta^{n-2}\theta_2 + \dots + \theta_n\right] $$
Or in a more elegant way (without the three dots), we get
$$ v_n = \beta^n + (1-\beta)\sum_{i=1}^{n}\beta^{n-i}\theta_i $$
Now let’s try to understand what this formula means. First of all, note that the term β^n disappears when you choose v0 = 0. Let’s do that (set v0 = 0) and consider what remains
$$ v_n = (1-\beta)\sum_{i=1}^{n}\beta^{n-i}\theta_i $$
Are you still with me? Now comes the interesting part. Let’s define the convolution between two sequences.1 Let’s consider two sequences, xn and hn. The convolution between the two (which we indicate with the symbol ∗) is defined by
$$ x_n \ast h_n = \sum_{k=-\infty}^{\infty} x_k\, h_{n-k} $$
Since we only have a finite number of measurements for our quantity θi, we can define
$$ \theta_k = 0 \qquad \text{for } k > n \text{ and } k \le 0 $$
Therefore, we can write vn as a convolution as follows
$$ v_n = \theta_n \ast b_n $$
Where we have defined
$$ b_n = (1-\beta)\beta^n $$
To get an idea of what that means, let’s plot θn, bn, and vn. To do that, let’s assume θn has a Gaussian shape (the exact form is not relevant; it is just for illustrative purposes), and let’s use β = 0.9 (see Figure 5-1).
Figure 5-1

The left side is a plot showing θn (solid line) and bn (dotted line) together. The right side is a plot showing the points that need to be summed to obtain vn for n = 50

Let’s briefly discuss Figure 5-1. The Gaussian curve (θn) is convolved with bn to obtain vn. The result can be seen in the right plot: all the terms (1 − β)β^(50 − i)θi for i = 1, …, 50 are summed to obtain v50. Note that v50 is a weighted average of all the θi for i = 1, …, 50. Each term θi is multiplied by a weight that is largest for i = 50 and decreases rapidly as i decreases toward 1. Basically, this is a weighted average with exponentially decreasing weights: the terms farther from n = 50 are less and less relevant, while the terms close to n = 50 get more weight. This is also a moving average, since for each n all the preceding terms are weighted by the sequence bn and added.
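To make this concrete, here is a minimal NumPy sketch (the variable names and the Gaussian-shaped signal are my own choices) that computes vn both with the recursive formula and with the weighted sum, and verifies that the two agree:

import numpy as np

beta = 0.9
# A Gaussian-shaped signal theta_1, ..., theta_100, as in Figure 5-1.
theta = np.exp(-0.5 * ((np.arange(1, 101) - 50) / 10.0) ** 2)

# Recursive definition: v_n = beta * v_{n-1} + (1 - beta) * theta_n, with v_0 = 0.
v_rec = []
v = 0.0
for t in theta:
    v = beta * v + (1.0 - beta) * t
    v_rec.append(v)
v_rec = np.array(v_rec)

# Closed form: v_n = (1 - beta) * sum_{i=1}^{n} beta**(n - i) * theta_i.
idx = np.arange(1, len(theta) + 1)
v_sum = np.array([(1.0 - beta) * np.sum(beta ** (n - idx[:n]) * theta[:n]) for n in idx])

print(np.allclose(v_rec, v_sum))  # True: the two formulations coincide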

I would now like to show you why the factor 1 − β appears in bn. Why not just choose β^n? The reason is simple: the sum of bn over all n ≥ 0 is equal to 1. Let’s see why.
$$ \sum_{k=0}^{\infty} b_k = (1-\beta)\sum_{k=0}^{\infty}\beta^k = (1-\beta)\lim_{N\to\infty}\frac{1-\beta^{N+1}}{1-\beta} = (1-\beta)\,\frac{1}{1-\beta} = 1 $$
We used the fact that for β < 1 we have $\lim_{N\to\infty}\beta^{N+1}=0$ and that for a geometric series we have
$$ \sum_{k=1}^{n} a r^{k-1} = \frac{a\left(1-r^n\right)}{1-r} $$

The algorithm we described to calculate vn is nothing other than the convolution of our quantity θi with a sequence of the form (1 − β)β^i whose sum is equal to one.

Note

The exponentially weighted average vn of a quantity θn is the convolution vn = θn ∗ bn of θn with bn = (1 − β)β^n, where bn has the property that its sum over n ≥ 0 is equal to 1. It is a moving average in which each term is weighted by the sequence bn.

As you choose smaller and smaller values of β, the number of points θn with a weight significantly different from zero decreases, as you can see in Figure 5-2, where the series bn is plotted for different values of β.
Figure 5-2

The series bn for three values of β: 0.9, 0.8, and 0.3. Note that as β gets smaller, the series is significantly different from zero for an increasingly smaller number of values
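A useful rule of thumb, not stated explicitly in the text but easy to verify, is that roughly the last 1/(1 − β) terms carry most of the weight, because for β close to 1
$$ \beta^{1/(1-\beta)} \approx \frac{1}{e} \approx 0.37 $$
For β = 0.9, for example, roughly the last ten measurements dominate the average (0.9^10 ≈ 0.35), while for β = 0.3 only the last one or two do.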

This method is at the very core of the Momentum optimizer and more advanced learning algorithms, and you will see in the next sections how it works in practice.

Momentum

Recall that in plain gradient descent the weight updates are calculated with these equations
$$ \left\{ \begin{array}{l} \boldsymbol{w}_{[n+1]} = \boldsymbol{w}_{[n]} - \gamma\, \nabla_{\boldsymbol{w}} L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right) \\[4pt] b_{[n+1]} = b_{[n]} - \gamma\, \dfrac{\partial L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right)}{\partial b} \end{array} \right. $$
Where the subscript in square brackets indicates the iteration. The idea behind the Momentum optimizer is to use an exponentially weighted average of the gradients for the weight updates. Mathematically, we formulate this as follows
$$ \left\{ \begin{array}{l} \boldsymbol{v}_{w,[n+1]} = \beta\, \boldsymbol{v}_{w,[n]} + (1-\beta)\, \nabla_{\boldsymbol{w}} L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right) \\[4pt] v_{b,[n+1]} = \beta\, v_{b,[n]} + (1-\beta)\, \dfrac{\partial L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right)}{\partial b} \end{array} \right. $$
Then perform the weight and bias updates with these equations
$$ \left\{ \begin{array}{l} \boldsymbol{w}_{[n+1]} = \boldsymbol{w}_{[n]} - \gamma\, \boldsymbol{v}_{w,[n+1]} \\[4pt] b_{[n+1]} = b_{[n]} - \gamma\, v_{b,[n+1]} \end{array} \right. $$

Where vw, [0] = 0 and vb, [0] = 0 are usually chosen. This means that, instead of using the derivatives of the cost function with respect to the weights directly, we update the weights with a moving average of the derivatives. Experience shows that the bias correction (which compensates for initializing the averages at zero) can usually be neglected.
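As an illustration (not the TensorFlow implementation; the names and the one-dimensional toy loss are my own), a single momentum step can be sketched like this:

def momentum_step(w, v, grad, gamma=0.01, beta=0.9):
    # v is the exponentially weighted average of the gradients.
    v = beta * v + (1.0 - beta) * grad
    w = w - gamma * v
    return w, v

# Toy usage: minimize L(w) = w**2, whose gradient is 2w.
w, v = 5.0, 0.0
for _ in range(500):
    w, v = momentum_step(w, v, grad=2.0 * w)
print(w)  # close to the minimum at w = 0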

Note

The Momentum algorithm uses an exponentially weighted average of the derivatives of the cost function with respect to the weights for the weight updates. This way, in addition to the derivatives at the current iteration, the past behavior is also considered. The algorithm may oscillate around the minimum instead of converging directly, but it can escape plateaus much more efficiently than standard gradient descent.

Sometimes you find a slightly different formulation in books or blogs, which is (we report only the equation for the weights w for brevity)
$$ \boldsymbol{v}_{w,[n+1]} = \gamma\, \boldsymbol{v}_{w,[n]} + \eta\, \nabla_{\boldsymbol{w}} L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right) $$
The idea and meaning remain the same; it is simply a slightly different mathematical formulation. We find the method described above, based on the notions of sequence convolution and weighted averages, easier to understand than this second formulation. Another formulation that you will find (the one that TensorFlow uses) is
$$ \boldsymbol{v}_{w,[n+1]} = \eta^t\, \boldsymbol{v}_{w,[n]} + \nabla_{\boldsymbol{w}} L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right) $$
Where η^t is what TensorFlow calls momentum (the superscript t indicates that this variable is the one used by TensorFlow). In this formulation, the weight update assumes the form
$$ \boldsymbol{w}_{[n+1]} = \boldsymbol{w}_{[n]} - \gamma^t\, \boldsymbol{v}_{w,[n+1]} = \boldsymbol{w}_{[n]} - \gamma^t\left(\eta^t\, \boldsymbol{v}_{w,[n]} + \nabla_{\boldsymbol{w}} L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right)\right) = \boldsymbol{w}_{[n]} - \gamma^t\eta^t\, \boldsymbol{v}_{w,[n]} - \gamma^t\, \nabla_{\boldsymbol{w}} L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right) $$
Where again the superscript t indicates that the variable is the one used by TensorFlow. Although it seems different, this formulation is equivalent to the formulation shown earlier:
$$ \boldsymbol{w}_{[n+1]} = \boldsymbol{w}_{[n]} - \gamma\beta\, \boldsymbol{v}_{w,[n]} - \gamma(1-\beta)\, \nabla_{\boldsymbol{w}} L\left(\boldsymbol{w}_{[n]}, b_{[n]}\right) $$
The TensorFlow formulation and the one discussed earlier are equivalent if we choose
$$ \left\{ \begin{array}{l} \eta^t = \dfrac{\beta}{1-\beta} \\[6pt] \gamma^t = \gamma\,(1-\beta) \end{array} \right. $$

This can be seen by simply comparing the two equations for the weight updates. Values around η^t = 0.9 are typically used in TensorFlow implementations and work well. Momentum almost always converges faster than plain gradient descent.
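In Keras, momentum is available through the SGD class; its momentum argument corresponds to η^t above. A minimal example with the typical value just mentioned:

import tensorflow as tf

# SGD with momentum; momentum=0.9 corresponds to eta^t = 0.9 in the text.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)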

Note

Comparing parameter values across different optimizers is misleading. The learning rate, for example, has a different meaning in different algorithms. What you should compare is the best convergence speed you can achieve with each optimizer, regardless of the specific parameter values. Comparing GD with a learning rate of 0.01 to Adam (which you will see later) with the same learning rate does not make much sense. Tune the parameters of each optimizer and then compare which one gives you the best and fastest convergence.

RMSProp

Let’s move on to something a bit more complex, but usually more efficient. Let’s look at the mathematical equations and then we will compare them to the others we have seen so far. At each iteration, we need to calculate
$$ \left\{ \begin{array}{l} \boldsymbol{S}_{\boldsymbol{w},[n+1]} = \beta_2\, \boldsymbol{S}_{\boldsymbol{w},[n]} + (1-\beta_2)\, \nabla_{\boldsymbol{w}} L(\boldsymbol{w},b) \circ \nabla_{\boldsymbol{w}} L(\boldsymbol{w},b) \\[6pt] S_{b,[n+1]} = \beta_2\, S_{b,[n]} + (1-\beta_2)\, \dfrac{\partial L(\boldsymbol{w},b)}{\partial b} \circ \dfrac{\partial L(\boldsymbol{w},b)}{\partial b} \end{array} \right. $$
where the symbol ∘ indicates an element-wise product. Then we will update our weights with the equations
$$ \left\{ \begin{array}{l} \boldsymbol{w}_{[n+1]} = \boldsymbol{w}_{[n]} - \dfrac{\gamma\, \nabla_{\boldsymbol{w}} L(\boldsymbol{w},b)}{\sqrt{\boldsymbol{S}_{\boldsymbol{w},[n+1]} + \epsilon}} \\[10pt] b_{[n+1]} = b_{[n]} - \gamma\, \dfrac{\partial L(\boldsymbol{w},b)}{\partial b}\, \dfrac{1}{\sqrt{S_{b,[n+1]} + \epsilon}} \end{array} \right. $$
So you first compute the exponentially weighted averages S_{w,[n+1]} and S_{b,[n+1]}, and then you use them to rescale the derivatives in the weight updates. The ϵ, usually ϵ = 10^−8, prevents the denominator from going to zero when S_{w,[n+1]} and S_{b,[n+1]} go to zero. The idea is that if the derivatives are big, the S quantities are big, so the factors $1/\sqrt{\boldsymbol{S}_{\boldsymbol{w},[n+1]}+\epsilon}$ and $1/\sqrt{S_{b,[n+1]}+\epsilon}$ are small and the learning slows down. The other way around is also true: if the derivatives are small, the learning is faster. In other words, the algorithm speeds up learning along the parameters that are slowing it down. In TensorFlow, RMSProp is particularly easy to use with this code:
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.1)
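For intuition, a single RMSProp step can be sketched in NumPy as follows (an illustrative sketch of the equations above, not the Keras implementation; the names are my own):

import numpy as np

def rmsprop_step(w, s, grad, gamma=0.001, beta2=0.9, eps=1e-8):
    # s is the exponentially weighted average of the squared gradients.
    s = beta2 * s + (1.0 - beta2) * grad * grad
    w = w - gamma * grad / np.sqrt(s + eps)
    return w, s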

Adam

The last algorithm we look at is called Adam (Adaptive Moment Estimation). It combines the ideas of RMSProp and Momentum into one optimizer. Like Momentum, it uses an exponentially weighted average of past derivatives and, like RMSProp, it uses an exponentially weighted average of past squared derivatives.

You need to calculate the same quantities that you need for Momentum and for RMSProp and then calculate the following quantities
$$ \begin{array}{c} \boldsymbol{v}_{\boldsymbol{w},[n]}^{corrected} = \dfrac{\boldsymbol{v}_{\boldsymbol{w},[n]}}{1-\beta_1^n} \\[10pt] v_{b,[n]}^{corrected} = \dfrac{v_{b,[n]}}{1-\beta_1^n} \end{array} $$
And in the same way
$$ \begin{array}{c} \boldsymbol{S}_{\boldsymbol{w},[n]}^{corrected} = \dfrac{\boldsymbol{S}_{\boldsymbol{w},[n]}}{1-\beta_2^n} \\[10pt] S_{b,[n]}^{corrected} = \dfrac{S_{b,[n]}}{1-\beta_2^n} \end{array} $$
Where β1 is the hyper-parameter used in Momentum and β2 is the one used in RMSProp. Then, similar to what we did in RMSProp, we update the weights with the equations
$$ \left\{ \begin{array}{l} \boldsymbol{w}_{[n+1]} = \boldsymbol{w}_{[n]} - \dfrac{\gamma\, \boldsymbol{v}_{\boldsymbol{w},[n]}^{corrected}}{\sqrt{\boldsymbol{S}_{\boldsymbol{w},[n]}^{corrected} + \epsilon}} \\[12pt] b_{[n+1]} = b_{[n]} - \gamma\, \dfrac{v_{b,[n]}^{corrected}}{\sqrt{S_{b,[n]}^{corrected} + \epsilon}} \end{array} \right. $$
TensorFlow does everything for you if you simply use the following line
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07)

where in this case the typical values for the parameters have been chosen: γ = 0.001, β1 = 0.9, β2 = 0.999, and ϵ = 10^−7. Note that, since this algorithm adapts the effective learning rate during training, we can start with a bigger learning rate to speed up convergence.
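Putting the pieces together, a single Adam step mirrors the equations above. Here is a minimal NumPy sketch (illustrative only; the names are my own and n counts iterations starting at 1):

import numpy as np

def adam_step(w, v, s, grad, n, gamma=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    v = beta1 * v + (1.0 - beta1) * grad          # Momentum-style average of gradients
    s = beta2 * s + (1.0 - beta2) * grad * grad   # RMSProp-style average of squared gradients
    v_corr = v / (1.0 - beta1 ** n)               # bias corrections
    s_corr = s / (1.0 - beta2 ** n)
    w = w - gamma * v_corr / np.sqrt(s_corr + eps)
    return w, v, s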

Comparison of the Optimizers’ Performance

It is interesting to see how the optimizers behave to better understand why, for example, Adam is so efficient. To do that, let’s create a toy problem. Let’s consider a dataset of 30 tuples (xi, yi) with
$$ x_i = -1 + \frac{2}{30}\, i $$
And
$$ y_i = 2 + \frac{1}{2}\, x_i $$
Let’s compare gradient descent, Adam, and RMSProp. The goal of the problem is, starting from the 30 data points, to determine the linear relationship between xi and yi; in other words, to determine the two parameters, 2 and 1/2, in the last equation. For this example, we can write the linear relation as follows
$$ y_i = w_0 + w_1 x_i $$
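As a concrete sketch (the index convention and variable names are my own; I reuse the names x_ and y_ that appear in the training loop later), the dataset can be generated with:

import numpy as np

# 30 points with x_i = -1 + (2/30) i and y_i = 2 + 0.5 x_i.
i = np.arange(1, 31)
x_ = (-1.0 + 2.0 / 30.0 * i).reshape(-1, 1)
y_ = 2.0 + 0.5 * x_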
The optimizers will try to minimize the MSE with respect to w0 and w1. You will find the entire code to do this comparison in the online version of the book at https://adl.toelt.ai. In the online code, you will see an implementation of the gradient descent from scratch. In Figure 5-3, you can see the path that the GD algorithm takes in parameter space to reach the minimum of the MSE function in 200 iterations.
Figure 5-3

The path that the GD algorithm follows in parameter space while minimizing the MSE when starting from (0,0). The number of iterations used for this plot is 200

Figure 5-4 shows the different paths taken by the different optimizers when solving the same problem.
Figure 5-4

The path that the GD algorithm, Adam, and RMSProp optimizers follow in parameter space while minimizing the MSE when starting from (0,0). The number of iterations used for this plot is 200

One striking difference in Figure 5-4 is that while the GD path is rather direct, the others tend to be less direct, with Adam making loops around the minimum and RMSProp oscillating when close to it. From the plot, it is difficult to see which one is faster; the best way of checking that is to plot $w_0^{[i]}$ and $w_1^{[i]}$ vs. i, where i indicates the iteration number. In Figure 5-5, you can see how quickly $w_0^{[i]}$ converges to the expected value of 2.0 with the different algorithms. In this case, Adam and GD are on par, while RMSProp seems to be slower.
Figure 5-5

The plot of $w_0^{[i]}$ vs. the number of iterations with different optimizers

A different picture emerges when considering $w_1^{[i]}$. Figure 5-6 shows how quickly the different algorithms converge.
Figure 5-6

The plot of $w_1^{[i]}$ vs. the number of iterations with different optimizers

In this case, you can see how much faster Adam and RMSProp converge. It is interesting to note how Adam oscillates around the expected value of 0.5. This property makes Adam very efficient at escaping regions of parameter space where training can get stuck. But you should notice something else that is very interesting if you zoom in on the picture after iteration 150, as in Figure 5-7.
Figure 5-7

Zoom of Figure 5-6. The plot of $w_1^{[i]}$ vs. the number of iterations with different optimizers

In Figure 5-7, you can see how, while GD is still converging, Adam is already at the expected value and RMSProp oscillates around it, probably because it cannot converge completely. This should convince you of how efficient Adam is compared to other optimizers.

Small Coding Digression

I want to briefly explain how to set up the optimizers to use (0,0) as a starting point for the updates. To do that, after you have created and compiled the model, you can set the weights of the model with
model.set_weights([np.array([w1_start]).reshape(1,1), np.array([w0_start]).reshape(1,)])

where w0_start = 0.0 and w1_start = 0.0, the chosen starting point. Note that the first array passed to set_weights() is the kernel of the Dense layer (the slope w1) and the second is its bias (the intercept w0).

After doing that, you need to save the values of the weights after each iteration. The easiest way to do that is to implement your update as a custom training loop using GradientTape(). You will find the entire code online, but your training loop will look like this
for epoch in range(200):
    with tf.GradientTape() as tape:
        # Run the forward pass of the layer.
        ypred = model(x_, training=True)
        loss_value = loss_fn(y_, ypred)
    # Compute the gradients of the loss with respect to the weights
    # and apply one optimizer update.
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    # Save the current weights: get_weights()[0] is the kernel (w1),
    # get_weights()[1] is the bias (w0).
    w1_rmsprop_list.append(float(model.get_weights()[0][0]))
    w0_rmsprop_list.append(float(model.get_weights()[1][0]))

where w1_rmsprop_list and w0_rmsprop_list are lists that contain the values of the weights for each iteration. In this way, you can test any other optimizer you are interested in.
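For completeness, here is a minimal sketch of the setup the loop above assumes (the online code may differ in details; x_ and y_ are the toy dataset generated earlier):

import numpy as np
import tensorflow as tf

# Linear model y = w0 + w1 * x: the kernel of the Dense layer is w1, the bias is w0.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(loss='mse')

loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.1)

# Start from (w0, w1) = (0, 0) and collect the weights at each iteration.
w0_start, w1_start = 0.0, 0.0
model.set_weights([np.array([w1_start]).reshape(1, 1), np.array([w0_start]).reshape(1,)])
w0_rmsprop_list, w1_rmsprop_list = [], []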

Which Optimizer Should You Use?

To make a long story short, you should use Adam, as it is generally considered faster and better than the other methods. That does not mean this is always the case. There are recent research papers indicating that such adaptive optimizers can generalize poorly on new datasets (check out, for example, https://goo.gl/Nzc8bQ), and there are other papers that simply use GD with a dynamic learning rate decay. It mostly depends on your problem. But generally, Adam is a very good starting point.

Note

If you are unsure which optimizer to start with, use Adam. It’s generally considered faster and better than the other methods.

To give you an idea of how good it can be, let’s apply it to the Zalando dataset. We will use a network with four hidden layers, each with 20 neurons. The model we use is the one discussed at the end of Chapter 3. Figure 5-8 shows that the cost function converges faster when using Adam optimization in comparison to GD. Additionally, in 100 epochs, GD reaches an accuracy of 86%, while Adam reaches 90%. Note that we have not changed anything in the model except the optimizer!
Figure 5-8

The cost function for the Zalando dataset for a network with four hidden layers, each with 20 neurons. The continuous line is plain GD with a learning rate of γ = 0.01, and the dashed line is Adam optimization with γ = 0.1, β1 = 0.9, β2 = 0.999, and ϵ = 10^−8
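As a rough sketch of the setup (the exact architecture follows Chapter 3; the activations, input shape, and data loading are assumptions on my part), switching from GD to Adam only changes the optimizer passed to compile():

import tensorflow as tf

# Four hidden layers with 20 neurons each, for the 10-class Zalando (Fashion-MNIST) data.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(20, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),  # learning rate as in the figure caption
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])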

As suggested, when testing complex networks on big datasets, the Adam optimizer is a good place to start. But you should not limit your tests to this optimizer alone. A test of other methods is always worthwhile. Maybe for your problem other approaches will work better!
