Summary

In this chapter, we started exploring the world of deep learning by introducing the basic concepts that led the first researchers to improve their algorithms until they achieved the outstanding results we have nowadays. The first part explained the structure of a basic artificial neuron, which performs a linear operation followed by an optional non-linear scalar function. A single layer of linear neurons was initially proposed as the first neural network, under the name of perceptron.
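As a quick reference, the following is a minimal sketch (using NumPy; all names and values are purely illustrative) of such a neuron: a linear combination of the inputs followed by an optional activation, with a step function playing the role of the original perceptron's threshold.

```python
import numpy as np

def neuron(x, w, b, activation=None):
    # Linear operation: weighted sum of the inputs plus a bias
    z = np.dot(w, x) + b
    # Optional non-linear scalar function applied to the result
    return activation(z) if activation is not None else z

# Step function, as used by the original perceptron
step = lambda z: 1.0 if z >= 0.0 else 0.0

x = np.array([0.5, -1.2, 3.0])   # illustrative input vector
w = np.array([0.4, 0.1, -0.2])   # illustrative weights
b = 0.1                          # illustrative bias

print(neuron(x, w, b, activation=step))
```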

Even though it was quite powerful for many problems, this model soon showed its limitations when working with non-linearly separable datasets. A perceptron is not very different from a logistic regression, and there's no concrete reason to prefer it. Nevertheless, this model opened the doors to a family of extremely powerful models obtained by combining multiple non-linear layers. The multilayer perceptron (MLP), which has been proven to be a universal approximator, is able to manage almost any kind of dataset, achieving high performance where other methods fail.
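In Keras, such a model can be expressed by stacking several fully connected layers with non-linear activations. The following is only a sketch; the layer sizes and the assumed 10-feature, 3-class problem are illustrative, not taken from the chapter's examples.

```python
import tensorflow as tf

# A minimal MLP: multiple non-linear Dense layers stacked sequentially.
mlp = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])

mlp.compile(optimizer='adam',
            loss='categorical_crossentropy',
            metrics=['accuracy'])
mlp.summary()
```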

In the next section, we analyzed the building blocks of an MLP. We started with the activation functions, describing their structure and features, and focusing on the reasons that drive the choice of a specific function for particular problems. Then, we discussed the training process, considering the basic idea behind the back-propagation algorithm and how it can be implemented using the stochastic gradient descent (SGD) method. Even though this approach is quite effective, it can be slow when the complexity of the network is very high. For this reason, many optimization algorithms have been proposed. In this chapter, we analyzed the role of momentum and how it's possible to manage adaptive corrections using RMSProp. Then, we combined momentum and RMSProp to derive a very powerful algorithm called Adam. In order to provide a complete picture, we also presented two slightly different adaptive algorithms, called AdaGrad and AdaDelta.
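All of these algorithms are available in Keras and can be passed directly to model.compile(). The following sketch shows how they can be instantiated; the hyperparameter values are only common illustrative choices, not prescriptions.

```python
import tensorflow as tf

# SGD with momentum
sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# RMSProp: adaptive corrections based on a moving average of squared gradients
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)

# Adam: combines momentum (beta_1) with RMSProp-style scaling (beta_2)
adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

# The two other adaptive algorithms presented in the chapter
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)
adadelta = tf.keras.optimizers.Adadelta(learning_rate=1.0, rho=0.95)

# Any of these objects can be passed as model.compile(optimizer=...)
```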

In the following sections, we discussed regularization methods and how they can be plugged into a Keras model. An important section was dedicated to a widespread technique called dropout, which consists in setting to zero (dropping) a randomly selected fixed percentage of units. This method, although very simple, prevents the overfitting of very deep networks and encourages the exploration of different regions of the sample space, obtaining a result not very dissimilar to the ones analyzed in Chapter 8, Ensemble Learning. The last topic was the batch normalization technique, which is a method for reducing the mean and variance shift (called covariate shift) caused by subsequent neural transformations. This phenomenon can slow down the training process, as each layer requires different adaptations and it's more difficult to move all the weights in the best direction. Applying batch normalization means that very deep networks can be trained in a shorter time, thanks also to the possibility of employing higher learning rates.
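Both techniques are implemented as standard Keras layers and can simply be inserted into a model. The following is a minimal sketch; the architecture, the 0.5 drop rate, and the input/output sizes are illustrative assumptions.

```python
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(100,)),
    tf.keras.layers.Dense(256),
    tf.keras.layers.BatchNormalization(),  # reduce the covariate shift between layers
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dropout(0.5),          # randomly set 50% of the units to zero
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```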

In the next chapter, we are going to continue this exploration, analyzing very important advanced layers, such as convolutions (which achieve extraordinary performance in image-oriented tasks) and recurrent units (for processing time series), and discussing some practical applications that can be experimented with and readapted using Keras and TensorFlow.
