Overcoming vanishing gradient

From the preceding explanation of the vanishing gradient, it follows that the root cause of this problem is choosing the sigmoid function as the activation function. A similar problem arises when tanh is chosen as the activation function.

In order to counter such a scenario, the ReLU function comes to the rescue:

ReLU(x) = max(0, x)

If the input is negative (that is, less than zero), the function outputs zero. If the input is greater than zero, the output is equal to the input.
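As a quick sketch, this is how ReLU can be written with NumPy (the relu helper and the sample array are only for illustration):

import numpy as np

def relu(x):
    # Element-wise max(0, x): negative inputs become 0, positive inputs pass through unchanged
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # [0.  0.  0.  1.5 3. ]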

Let's take the derivative of this function and see what happens:

Case 1: x < 0: ReLU'(x) = 0

Case 2: x > 0: ReLU'(x) = 1

If we plot this derivative, we get a step function: it is 0 for x < 0 and 1 for x > 0.

So, the derivative of ReLU is either 0 or 1. Because the derivative never takes a small value between 0 and 1, repeatedly multiplying these gradients does not shrink them toward zero, so we won't face the vanishing gradient problem.
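A minimal sketch of this derivative with NumPy (the relu_derivative helper, and the convention of returning 0 at exactly x = 0, are assumptions for illustration):

import numpy as np

def relu_derivative(x):
    # Step function: 1 where x > 0, 0 elsewhere (the gradient at exactly x = 0 is set to 0 by convention here)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]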

However, this is not entirely true. We can still face the problem when the input to the activation happens to be negative, because, as we just saw, the derivative is zero in that scenario. Typically, the weighted sum does not end up negative, and if we are concerned about this issue occurring, we can initialize the weights to be positive and/or normalize the input between 0 and 1, as sketched below.
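For example, here is a minimal min-max normalization sketch that scales each input feature to the [0, 1] range (the sample matrix X is only for illustration):

import numpy as np

X = np.array([[3.0, 10.0],
              [7.0,  2.0],
              [5.0,  6.0]])

# Min-max normalization: scale every feature column into the [0, 1] range
X_normalized = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_normalized)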

There is still a workaround for this kind of scenario: another function, called Leaky ReLU, which is given by the following formula:

LeakyReLU(x) = max(εx, x)

Here, the value of ε is typically 0.2–0.3. If we plot it, we get a line with slope ε for x < 0 and slope 1 for x > 0.
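A minimal Leaky ReLU sketch with NumPy, using ε = 0.2 from the range above (the leaky_relu helper and the sample array are only for illustration):

import numpy as np

def leaky_relu(x, epsilon=0.2):
    # max(εx, x): negative inputs are scaled by ε instead of being zeroed out
    return np.maximum(epsilon * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(leaky_relu(x))  # [-0.4 -0.1  0.   1.5  3. ]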
