Optimization algorithms

When discussing the back-propagation algorithm, we showed how the SGD strategy can be easily employed to train deep networks with large datasets. This method is quite robust and effective; however, the function to optimize is generally non-convex and the number of parameters is extremely large. These conditions dramatically increase the probability of finding saddle points (instead of local minima) and can slow down the training process when the surface is almost flat.
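
As a reference for the rest of the section, the following Python sketch shows the vanilla SGD update rule (θ ← θ - η∇L(θ)) on a deliberately simple problem; the toy regression dataset, the learning rate, and the batch size are illustrative assumptions introduced only for this example.

import numpy as np

# Minimal sketch of a vanilla SGD loop on a toy linear-regression problem.
# All constants (dataset, eta, batch size, number of epochs) are assumptions
# chosen for illustration.

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=256)

theta = np.zeros(1)      # single parameter to learn
eta = 0.05               # learning rate
batch_size = 32

for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        # Gradient of the mean squared error computed on the mini-batch only
        pred = X[batch, 0] * theta[0]
        grad = 2.0 * np.mean((pred - y[batch]) * X[batch, 0])
        # Vanilla SGD step: theta <- theta - eta * gradient
        theta[0] -= eta * grad

print(theta)  # converges towards the true slope (2.0)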

A common result of applying a vanilla SGD algorithm to these systems is shown in the following diagram:

Instead of reaching the optimal configuration, θopt, the algorithm reaches a sub-optimal parameter configuration, θsubopt, and loses the ability to perform further corrections. To mitigate these problems and their consequences, many SGD optimization algorithms have been proposed, with the purpose of speeding up convergence (even when the gradients become extremely small) and avoiding the instabilities of ill-conditioned systems.
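
As an illustration of the kind of refinement these methods introduce, the following sketch adds a classical momentum term to the previous update; this is only one possible strategy among those discussed later, and the momentum coefficient and learning rate are illustrative assumptions.

import numpy as np

# Hedged sketch: the same toy regression problem, optimized with a classical
# momentum term on top of SGD. The velocity accumulates an exponential moving
# average of past gradients, which keeps the parameters moving through almost
# flat regions where the raw gradient is very small. The values mu = 0.9 and
# eta = 0.05 are assumptions, not prescriptions.

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=256)

theta = np.zeros(1)
velocity = np.zeros(1)
eta, mu = 0.05, 0.9      # learning rate and momentum coefficient
batch_size = 32

for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        pred = X[batch, 0] * theta[0]
        grad = 2.0 * np.mean((pred - y[batch]) * X[batch, 0])
        # Momentum update: v <- mu * v - eta * grad, then theta <- theta + v
        velocity = mu * velocity - eta * grad
        theta = theta + velocity

print(theta)  # approaches 2.0, typically in fewer updates than plain SGD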
