Objective functions for highly non-linear DNNs have extremely steep regions resembling cliffs, as shown in the following figure. Moving in the direction of the negative gradient near such a cliff structure can carry the weights so far that we jump off the cliff altogether, missing a minimum we were very close to and nullifying much of the work done to reach the current solution:
We can avoid such bad moves in gradient descent by clipping the gradient, that is, setting an upper bound on its magnitude. Recall that gradient descent is based on the first-order Taylor approximation of the function. This approximation holds well only in a small region around the point where the gradient is computed; if we jump outside this region, the cost function may curve back upward. The gradient still gives approximately the correct direction, so rather than discarding it, we restrict the length of the move: the update must be small enough to avoid traversing too much upward curvature. One way to achieve this is to clip the norm of the gradient by setting a threshold v as an upper bound: whenever ||g|| > v, the gradient g is rescaled to g·v/||g||, which caps the step size while preserving its direction.
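The rescaling rule above can be sketched in a few lines of NumPy; `clip_by_norm` is a hypothetical helper name used here for illustration, not a library function:

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        # Scale down to the threshold; the direction is unchanged.
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])        # norm is 5.0
clipped = clip_by_norm(g, 1.0)  # rescaled to norm 1.0, same direction
```

Gradients whose norm is already below the threshold pass through unchanged, so clipping only intervenes near the steep cliff regions.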
In Keras, this can be implemented as follows:
# The clipnorm and clipvalue parameters can be used with all optimizers
# to control gradient clipping:
from keras import optimizers

# All parameter gradients will be clipped to a maximum norm of 1.0
sgd = optimizers.SGD(lr=0.01, clipnorm=1.)

# Similarly for the Adam optimizer
adam = optimizers.Adam(clipnorm=1.)
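The clipvalue parameter mentioned above works differently: instead of rescaling the whole gradient vector, it clips each component element-wise to a fixed range. A minimal NumPy sketch of that behavior, with `clip_by_value` as an illustrative helper name:

```python
import numpy as np

def clip_by_value(grad, clip_value):
    """Clip each gradient component to [-clip_value, clip_value]."""
    return np.clip(grad, -clip_value, clip_value)

g = np.array([3.0, -0.5])
clipped = clip_by_value(g, 1.0)  # only the large component is clipped
```

Note that, unlike norm clipping, element-wise clipping can change the direction of the gradient, since large components are truncated while small ones are left untouched.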