Defining Gradient Descent 

The purpose of training a neural network model is to find the right values for its weights. We start training with random or default values for the weights. Then, we iteratively apply an optimization algorithm, such as gradient descent, to change the weights in such a way that the predictions improve.

The starting point of the gradient descent algorithm is the set of random weight values that we want to optimize. In each subsequent iteration, the algorithm changes the weight values in such a way that the cost is reduced.

The following diagram explains the logic of the gradient descent algorithm:

In the preceding diagram, the input is the feature vector X. The actual value of the target variable is Y and the predicted value of the target variable is Y'. We determine the deviation of the predicted value from the actual value, which is quantified by the cost function. We then update the weights and repeat these steps until the cost is minimized.
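As a concrete example (the diagram does not commit to a particular cost function, so squared error is an assumption made here), the deviation for a single prediction could be measured as cost = (Y - Y')², which is zero only when the prediction matches the target exactly.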

How the weights are varied in each iteration of the algorithm depends on the following two factors (combined in the update rule shown after this list):

  • Direction: Which direction to go in to get the minimum of the loss function
  • Learning Rate: How big the change should be in the direction we have chosen 
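
Both factors are typically combined into a single update rule (a standard formulation rather than anything specific to the diagrams in this section): new weight = old weight - learning rate × (gradient of the cost with respect to that weight). The sign of the gradient supplies the direction of the step, and the learning rate scales its size.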

A simple iterative process is shown in the following diagram:

The diagram shows how, by varying the weights, gradient descent tries to find the minimum cost. The learning rate and chosen direction will determine the next point on the graph to explore. 

Selecting the right value for the learning rate is important. If the learning rate is too small, the algorithm may take a long time to converge. If the learning rate is too high, the algorithm may not converge at all: in the preceding diagram, the dot representing our current solution would keep overshooting the minimum and oscillating between the two opposite sides of the curve.
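
As a rough illustration of this effect (a minimal sketch using the simple cost function C(w) = w², which is an assumption made here for demonstration), the following Python snippet runs the same update with three different learning rates; the very small rate barely makes progress, the moderate rate approaches the minimum at zero, and the overly large rate makes the weight oscillate and grow:

def gradient_descent(learning_rate, steps=10, w=5.0):
    # The gradient of C(w) = w**2 is 2 * w
    for _ in range(steps):
        gradient = 2 * w
        w = w - learning_rate * gradient   # step against the gradient
    return w

print(gradient_descent(learning_rate=0.01))   # too small: still far from 0 after 10 steps
print(gradient_descent(learning_rate=0.1))    # moderate: close to the minimum at 0
print(gradient_descent(learning_rate=1.1))    # too large: the weight oscillates and diverges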

Now, let's see how the gradient is used to minimize the cost. Consider a cost function C that depends on only two variables, x and y. The gradient with respect to x and y is calculated as the vector of partial derivatives:

∇C(x, y) = (∂C/∂x, ∂C/∂y)

To minimize the cost, we drive the gradient toward zero. For a single variable, the following approach can be used:

while gradient != 0:
    if gradient < 0: move right   # increase the variable
    if gradient > 0: move left    # decrease the variable

This algorithm can also be used to find the optimal or near-optimal values of weights for a neural network.  
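
As a minimal runnable sketch of the same idea (the cost function, starting point, and learning rate below are illustrative assumptions, not values taken from the text), the following Python code applies this update to a cost of two variables, stepping x and y against their partial derivatives until the gradient is close to zero:

def cost(x, y):
    # Illustrative cost function with its minimum at (3, -1)
    return (x - 3) ** 2 + (y + 1) ** 2

def gradient(x, y):
    # Partial derivatives of the cost with respect to x and y
    return 2 * (x - 3), 2 * (y + 1)

x, y = 0.0, 0.0             # arbitrary starting point
learning_rate = 0.1
for step in range(100):
    dx, dy = gradient(x, y)
    if abs(dx) < 1e-6 and abs(dy) < 1e-6:   # gradient is (almost) zero: stop
        break
    x -= learning_rate * dx   # move against the gradient in x
    y -= learning_rate * dy   # move against the gradient in y

print(round(x, 3), round(y, 3))   # approximately (3, -1), the minimum of the cost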

Note that the calculation of the gradients proceeds backward through the network. We calculate the gradient of the final layer first, then the second-to-last layer, and then the one before that, until we reach the first layer. This is called backpropagation, which was introduced by Rumelhart, Hinton, and Williams in 1986.
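
To make this backward flow concrete, here is a minimal sketch of backpropagation for a network with a single hidden layer (the layer sizes, sigmoid activations, squared-error cost, and learning rate are assumptions chosen for illustration, not details taken from the text); the error signal is computed at the output layer first and then propagated back to the hidden layer before the weights are updated:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A tiny network: 3 inputs -> 4 hidden units -> 2 outputs (sizes are arbitrary)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

x = np.array([0.5, -1.0, 2.0])   # one training example (illustrative values)
y = np.array([1.0, 0.0])         # its target

learning_rate = 0.5
for step in range(1000):
    # Forward pass: compute the prediction layer by layer
    h = sigmoid(W1 @ x + b1)
    y_hat = sigmoid(W2 @ h + b2)

    # Backward pass: gradients of the squared-error cost, last layer first
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # error signal at the output layer
    delta1 = (W2.T @ delta2) * h * (1 - h)       # propagated back to the hidden layer

    # Gradient descent update for every weight and bias
    W2 -= learning_rate * np.outer(delta2, h)
    b2 -= learning_rate * delta2
    W1 -= learning_rate * np.outer(delta1, x)
    b1 -= learning_rate * delta1

print(np.round(y_hat, 3))   # after training, the prediction is close to the target [1, 0]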

Next, let's look into activation functions. 
