Training algorithms

Until now, we've seen that nodes act like functions and that they are chained together in a network. Another way to think about NNs is as nested functions, which is exactly what they are. By settling on the best weights, they can approximate almost any function and properly handle the given task; this is why they are called universal function approximators.

Finding the best weights is essentially an optimization task. The nesting and the usually huge number of parameters make this particular task challenging. Neural nets only became feasible after the first efficient training algorithms came out. Backpropagation was the method that first rose to the task.

Several variations of backpropagation have been developed over the years, and if I had to bet, I would say that many more will come out in the near future. These variations are very competitive. Personally, one that I like very much is Adam.

Backpropagation applies gradient descent to the network in a layered manner. Instead of getting lost in notation, I will only give an intuition here about how gradient descent and backpropagation work. Looking into further details is certainly worthwhile, especially if you wish to do some research on the subject.

The idea of gradient descent is pretty straightforward. Imagine that you are walking along the surface of a function. If you know the slope at the point where you currently stand, you know which direction to walk to reach higher or lower ground. Walking in the direction opposite to the slope will eventually lead to one of the following outcomes:

  • You will get very close to a local minimum, which could be a global minimum (or not)
  • You may walk miles and miles until you get tired and give up

Walking down a function is essentially what gradient descent does. It evaluates the slope at a given point by computing the derivative of the function in question and then takes a step in the opposite direction. With this approach, gradient descent is likely to arrive at a local minimum, which may or may not be a global one, or to walk in the same direction for ages until it decides to terminate, after covering a very long path.
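To make this concrete, here is a minimal sketch of gradient descent in plain Python with NumPy. The function, its derivative, the starting point, and the step size are all assumptions chosen only for illustration; they are not tied to any particular network.

```python
import numpy as np

def f(x):
    # An illustrative cost surface: a simple bowl with its minimum at x = 3
    return (x - 3) ** 2

def df(x):
    # The derivative (slope) of f at x
    return 2 * (x - 3)

x = np.float64(10.0)   # arbitrary starting point
step_size = 0.1        # how far we move at each step

for step in range(50):
    slope = df(x)
    x = x - step_size * slope  # walk in the direction opposite to the slope

print(x)  # ends up very close to 3, the minimum of f
```

Because the function here has a single valley, the walk always ends near the global minimum; on the bumpy cost surfaces of real networks, the same procedure may settle in a local minimum instead.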

The step size is also something to decide upon. Gradient descent depends on a cost function to work its magic. The quadratic cost is the traditional one, but there are also cross-entropy, categorical cross-entropy, exponential, Itakura-Saito distance, and many others.
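As a quick illustration, here is how two of those costs can be computed for a single prediction, using one common form of each. The arrays y_true and y_pred are made-up values used only for this sketch.

```python
import numpy as np

y_true = np.array([1.0, 0.0, 0.0])   # made-up one-hot target
y_pred = np.array([0.7, 0.2, 0.1])   # made-up network output

# Quadratic (sum-of-squares) cost
quadratic = 0.5 * np.sum((y_true - y_pred) ** 2)

# Categorical cross-entropy cost
cross_entropy = -np.sum(y_true * np.log(y_pred))

print(quadratic, cross_entropy)
```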

This explains why we look for activation functions that can be easily differentiated, and it also relates to why the algorithm is likely to get stuck in local optima if the data is not properly scaled. Assuming that there is a global minimum, gradient descent is not guaranteed to reach it, but as long as it gets close enough, the ANN will be fine.

Gradient descent is the pillar that supports the algorithms used to train NNs. On top of it, the differences between them are either requirements for translating such optimization into computational and network contexts, or strategies designed to deliver more efficiency by demanding fewer steps to converge or by converging with more accuracy.

Additionally, it's important to note that training algorithms for ANNs usually come with their own share of parameters that are very likely to interfere with the training, so this is another thing to keep an eye on.

NNs are a bunch of nested functions. Applying gradient descent to a neural net requires computing the partial derivative of the error with respect to every single weight; that is what backpropagation does, and it is what allows the proper training of NNs.

To compute such partials, backpropagation relies heavily on a very popular differentiation rule, the chain rule.
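The sketch below is one way to picture this for a tiny network with a single hidden layer, sigmoid activations, and a quadratic cost. Every array shape and value here is an assumption made only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up toy data: 4 samples, 3 inputs, 1 output
X = np.random.rand(4, 3)
y = np.random.rand(4, 1)

# Randomly initialized weights for one hidden layer of 5 nodes
W1 = np.random.randn(3, 5)
W2 = np.random.randn(5, 1)

# Forward pass: the network as nested functions
hidden = sigmoid(X @ W1)
output = sigmoid(hidden @ W2)

# Derivative of the quadratic cost with respect to the output
error = output - y

# Backward pass: the chain rule, applied layer by layer
delta_out = error * output * (1 - output)       # error at the output layer
grad_W2 = hidden.T @ delta_out                  # partials for the output weights

delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)
grad_W1 = X.T @ delta_hidden                    # partials for the hidden weights

# One gradient descent step on every weight
learning_rate = 0.1
W1 -= learning_rate * grad_W1
W2 -= learning_rate * grad_W2
```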

Although there are many learning algorithms, backpropagation and a few related concepts are common to most of them. The following list briefly explains the most popular of these concepts:

  • Batch: Updating the weights using the whole dataset at once would be computationally impractical for the majority of networks. What is done instead is to update the weights using random subsets called mini-batches. Batch-size is usually an accessible hyperparameter in training algorithms.
  • Epoch: One complete pass of the data through the network. Many might be required during training, but too many will cause the network to generalize badly, resulting in overfitting. Setting aside a validation dataset usually helps to pick the right number of epochs.
  • Learning rate: It reduces the possibility of overshooting by allowing only a fraction of the gradient to be used to update the weights. It's a common hyperparameter, and although it requires some testing, typical values range from 0.01 to 0.001. The sketch after this list shows where these three hyperparameters appear in a training loop.
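Here is a minimal sketch of how these three hyperparameters typically show up in a mini-batch training loop. The update_weights function and the dataset are placeholders assumed only for illustration; the structure of the loop is what matters.

```python
import numpy as np

# Made-up dataset: 1,000 samples with 10 features each
X = np.random.rand(1000, 10)
y = np.random.rand(1000, 1)

epochs = 20           # number of complete passes through the data
batch_size = 32       # size of each mini-batch
learning_rate = 0.01  # fraction of the gradient applied at each update

def update_weights(X_batch, y_batch, learning_rate):
    # Placeholder: a real implementation would run a forward pass,
    # backpropagate the error, and apply the gradient step here.
    pass

for epoch in range(epochs):
    indices = np.random.permutation(len(X))        # shuffle before each epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]  # one random mini-batch
        update_weights(X[batch], y[batch], learning_rate)
```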

Choosing hyperparameters such as batch-size, epochs, and learning rate can be difficult and require some testing and time, but choosing the proper training algorithm may not be that hard. Adam (Adaptive Moment Estimation) frequently outperforms resilient backpropagation, momentum, Stochastic Gradient Descent (SGD), and Nesterov.
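In Keras, for instance, swapping training algorithms usually comes down to changing a single object. The learning-rate and momentum values below are assumptions chosen only for illustration.

```python
import tensorflow as tf

# Adam with an explicit learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# Some of the alternatives mentioned above can be swapped in the same way:
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)                               # plain SGD
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)                 # momentum
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)  # Nesterov

# The chosen optimizer is then passed to model.compile(optimizer=optimizer, ...)
```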

Judging by how fast-paced this research field has proved to be, I bet that Adam will become obsolete someday. A good way to keep up is to read research frequently and stay tuned for updates to your favorite deep learning framework, since really good algorithms are likely to make their way into it.

Don't ever be afraid to experiment with new approaches or try new solutions.

NNs are relatively young and have set a pretty fast pace of evolution. Lots of opportunities are available, from both a business and a research perspective. At this point, this chapter has presented a lengthy list of hyperparameters to look after while designing a NN. To summarize, some decisions that are relatively easy to make prior to training are these:

  • How many input and output nodes to have
  • Which activation functions to use
  • Which training algorithm (strategy) to use

Some other choices may require a little more testing (the sketch after this list shows where each of these decisions appears in code):

  • How many hidden layers to have
  • How many hidden nodes to have in each layer
  • Which types of hidden layers to have
  • How to select batch-size, epochs, and learning rate
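To tie both lists together, here is a hedged Keras-style sketch showing where each decision appears. The layer sizes, activations, and hyperparameter values are assumptions for illustration only, anticipating the more thorough treatment in the next section.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # How many input and output nodes, how many hidden layers and nodes,
    # which types of hidden layers, and which activation functions to use
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Which training algorithm (strategy) to use, and the learning rate
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy")

# How to select batch-size and epochs
# (X_train and y_train are placeholders for your own data)
# model.fit(X_train, y_train, batch_size=32, epochs=20, validation_split=0.2)
```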

Finding good values for the hyperparameters in the latter list may require some testing, since they highly depend on the specific problem to be solved and on the computational power available. The task is not to find the best configuration ever, but to find one that will solve your problem in a feasible time.

If you are worried about the large number of decisions you might have to make, don't be. Experience is a great teacher, and there is much you can learn from experimentation. Before you realize it, you will be spending ages cleaning data while designing very good networks in the blink of an eye.

While this section was theoretical, the next one takes a more practical approach. We will see how Keras can be used to build NNs from scratch.
