Deep network performance

In theory, it seems clear that the more layers we add to a neural network, the better it should perform. This is what the green line suggests, as shown in the following graph: as we add more layers, the error rate goes down, ideally toward zero. Unfortunately, in practice, this holds only partially. It's true that the error rate goes down as we add more layers, but only up to a certain point. Beyond that point, adding more layers actually increases the error rate:

By adding many layers, we add a lot of weights and parameters, which makes it harder for the neural network to learn good values for all of them. This is one reason why the error rate may go up. There is also another problem, which we saw in the first section: the vanishing gradient. We saw a way to mitigate that problem, but not to solve it, and adding more layers makes it more pronounced and harder to fix. One way to address this problem is to use residual neural networks:

Let's suppose that we have a normal neural network, with m pixels as the input and k outputs. The k could be 1,000, as for ImageNet, and between them we add many hidden layers. These hidden layers could be either convolutional or fully connected layers. Each layer produces activations. Recall how an activation is computed: we apply the chosen activation function, such as ReLU, to the product of the layer's weights and the previous layer's activations.
For example, we calculate a2 in a similar manner to a1; that is, the previous layer's activations are multiplied by the weights of that layer, and then we apply an activation function such as ReLU to the result:
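As a sketch in notation consistent with the text, where g denotes the chosen activation function (ReLU here, an assumption based on the description above), this is:

a_2 = g(W_2 a_1), \quad \text{where } g(z) = \mathrm{ReLU}(z) = \max(0, z)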

For a3, we take the previous activation, a2, and multiply it by the weights of the third layer:
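In the same notation, this would be:

a_3 = g(W_3 a_2)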

And for the tenth layer, we have the previous activation, a9, multiplied by the weights of the tenth layer.

It is depicted as follows: 
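Since the original figure is not reproduced here, the general rule it illustrates can be sketched as follows, with l indexing the layers:

a_l = g(W_l a_{l-1}), \quad \text{so for the tenth layer} \quad a_{10} = g(W_{10} a_9)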

Here, a and w, the activations and the weights respectively, are matrices. The multiplication is a matrix multiplication, which contains a sum inside it (the definition we gave previously), so each resulting value is still a sum of weights multiplied by the previous layer's activations.
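As a sketch of that sum for a single output value (the index convention here is an assumption; w_{ij} is the entry of W_l connecting input j to output i):

(W_l a_{l-1})_i = \sum_j w_{ij} \, a_{l-1,j}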

The problem with the vanishing gradient, for example, was that the gradients became so small that this product ended up being really small, which made it very hard for the neural network to find good weights. This is why we see the increase in the error rate. Residual networks address this problem by forwarding earlier activations to deeper layers. Suppose we forward a2 to the tenth layer; by forward, we mean that we simply add a2 to this product. Notice how this deals with the vanishing gradient: even if the product is small, a2 is big enough, because it comes from an earlier layer. At the same time, even a small product still has an impact on the learning process; when that impact is positive, it is preserved rather than lost, since we add it (however small it is) to a2.
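To make the skip connection concrete, here is a minimal NumPy sketch; the layer shapes, variable names, and magnitudes are illustrative assumptions, not code from this book:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def plain_layer(a_prev, W):
    # Ordinary feed-forward rule: a_l = ReLU(W_l a_{l-1})
    return relu(W @ a_prev)

def residual_layer(a_prev, W, a_skip):
    # Residual rule: the earlier activation a_skip (for example a_2)
    # is added to the product before applying ReLU, so its
    # contribution is preserved even when W @ a_prev is tiny.
    return relu(W @ a_prev + a_skip)

rng = np.random.default_rng(0)
a2 = rng.normal(size=4)                # an earlier, healthy activation
a9 = rng.normal(size=4) * 1e-3         # a deep activation that has shrunk
W10 = rng.normal(size=(4, 4)) * 1e-3   # small weights make the product tiny

print(plain_layer(a9, W10))            # near zero: the signal has vanished
print(residual_layer(a9, W10, a2))     # dominated by a2: the signal survives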

By preserving these small contributions in the deeper layers, residual networks keep improving even if we keep adding layers. In contrast to a normal neural network, where performance degrades as we add more layers, residual networks keep improving, and that is basically because of these contributions from the early layers to the deeper layers. For example, a11 may look something like the following:
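A reconstruction of that expression, under the same notational assumptions as before (g is the chosen activation function, and a3 is the forwarded earlier activation):

a_{11} = g(W_{11} a_{10} + a_3)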

Here, a3 comes in to help a11 have sufficiently large values if we are facing vanishing gradients, and if a10 multiplied by w11 makes any useful contribution, that contribution is preserved as well.
