Regularization and dropout

Overfitting is a common issue in deep models. Their extremely high capacity can become problematic even with very large datasets, because the ability to learn the structure of the training set is not always related to the ability to generalize. A deep neural network can easily become an associative memory, but the final internal configuration might not be the most suitable one to manage samples that belong to the same distribution but were never presented during the training process. It goes without saying that this behavior is proportional to the complexity of the separation hypersurface: a linear classifier has a minimal chance of overfitting, while a polynomial classifier is far more prone to it. A combination of hundreds, thousands, or more non-linear functions yields a separation hypersurface that is beyond any possible analysis. In 1991, Hornik (in Approximation Capabilities of Multilayer Feedforward Networks, Hornik K., Neural Networks, 4/2) generalized a very important result obtained two years earlier by the mathematician Cybenko (published in Approximations by Superpositions of Sigmoidal Functions, Cybenko G., Mathematics of Control, Signals, and Systems, 2/4). Without going into the mathematical details (which are, however, not very complex), the theorem states that an MLP (not the most complex architecture!) can approximate any function that is continuous on a compact subset of ℝⁿ. Such a result formalized what almost every researcher already knew intuitively, but its power goes beyond the first impression, because the MLP is a finite system (not a mathematical series) and the theorem assumes a finite number of layers and neurons. Obviously, the precision is proportional to the complexity; however, there are no unacceptable limitations for almost any problem. However, our goal is not to learn an existing continuous function, but to manage samples drawn from an unknown data generating process, with the purpose of maximizing the accuracy when a new sample is presented. There are no guarantees that the underlying function is continuous or that the domain is a compact subset.

In Chapter 1, Machine Learning Models Fundamentals, we presented the main regularization techniques based on a slightly modified cost function:

L̃(θ) = L(θ) + λg(θ) with λ > 0

The additional term g(θ) is a non-negative function of the weights (such as the L2 norm) that forces the optimization process to keep the parameters as small as possible. When working with saturating functions (such as tanh), regularization methods based on the L2 norm try to limit the operating range of the function to the linear part, de facto reducing its capacity. Of course, the final configuration won't be the optimal one (which could be the result of an overfitted model), but the suboptimal trade-off between training and validation accuracy (alternatively, we can say between bias and variance). A system with a bias close to 0 (and a training accuracy close to 1.0) could be extremely rigid in the classification, succeeding only when the samples are very similar to the ones evaluated during the training process. That's why this price is often paid, considering the advantages obtained when working with new samples. L2 regularization can be employed with any kind of activation function, but the effect can be different. For example, ReLU units have an increased probability of becoming linear (or constantly null) when the weights are very large. Trying to keep them close to 0.0 means forcing the function to exploit its non-linearity without the risk of extremely large outputs (which can negatively affect very deep architectures). This result can sometimes be more useful, because it allows training bigger models in a smoother way and obtaining better final performances. In general, it's almost impossible to decide whether a regularization technique can improve the result without several tests, but there are some scenarios where it's very common to introduce a dropout (we discuss this approach in the next section) and tune its hyperparameter. This is more an empirical choice than a precise architectural decision, because many real-life examples (including state-of-the-art models) have obtained outstanding results by employing this regularization technique. I suggest that the reader prefer rational skepticism to blind trust and double-check their models before picking a specific solution. Sometimes, an extremely high-performing network turns out to be ineffective when a different (but analogous) dataset is chosen. That's why testing different alternatives can provide the best experience for solving specific problem classes.
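
To make the role of g(θ) and λ concrete, the following minimal NumPy sketch computes the L1 and L2 penalty terms for a weight vector (both the weights and the unregularized loss value are purely illustrative):

import numpy as np

# Purely illustrative weight vector and unregularized loss value
theta = np.array([0.5, -1.2, 3.0, 0.1])
loss = 0.8

lmbd = 0.05  # regularization strength (hyperparameter)

# L2 penalty: g(theta) is the squared L2 norm of the weights
l2_penalty = lmbd * np.sum(theta ** 2)

# L1 penalty: g(theta) is the L1 norm (enforces sparsity)
l1_penalty = lmbd * np.sum(np.abs(theta))

# Regularized cost, as in the formula above
regularized_loss = loss + l2_penalty

Large weights (such as 3.0 in this example) dominate the penalty, so minimizing the regularized cost pushes the optimizer toward smaller parameters.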

Before moving on, I want to show how it's possible to implement an L1 (useful to enforce sparsity), L2, or ElasticNet (the combination of L1 and L2) regularization using Keras. The framework provides a fine-grained approach that allows imposing a different regularization on each layer. For example, the following snippet shows how to add an l2 regularizer with the strength parameter set to 0.05 to a generic fully connected layer:

from keras.layers import Dense
from keras.regularizers import l2

...

model.add(Dense(128, kernel_regularizer=l2(0.05)))

The keras.regularizers package contains the functions l1(), l2(), and l1_l2(), which can be applied to Dense and convolutional layers (we're going to discuss them in the next chapter). These layers allow us to impose a regularization on the weights (kernel_regularizer), on the bias (bias_regularizer), and on the activation output (activity_regularizer), even though the first one is normally the most widely employed.
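
As an example, the following snippet (the strength values are arbitrary and chosen purely for illustration) shows how to combine an ElasticNet penalty on the weights with a separate L2 penalty on the bias of the same layer:

from keras.layers import Dense
from keras.regularizers import l1_l2, l2

...

# ElasticNet penalty on the weights (L1 and L2 combined)
# and an independent L2 penalty on the bias
model.add(Dense(128,
                kernel_regularizer=l1_l2(l1=0.01, l2=0.05),
                bias_regularizer=l2(0.05)))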

Alternatively, it's possible to impose specific constraints on the weights and biases in a more selective way. The following snippet shows how to set a maximum norm (equal to 1.5) on the weights of a layer:

from keras.layers import Dense
from keras.constraints import maxnorm

...

model.add(Dense(128, kernel_constraint=maxnorm(1.5)))

Keras, in the keras.constraints package, provides some functions that can be used to impose a maximum norm on the weights or biases (maxnorm()), a unit norm along an axis (unit_norm()), non-negativity (non_neg()), and upper and lower bounds for the norm (min_max_norm()). The difference between this approach and regularization is that a constraint is applied only when necessary. Considering the previous example, imposing an L2 regularization always has an effect, while a constraint on the maximum norm remains inactive as long as the norm is lower than the predefined threshold.
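
For instance, the following sketch (the bound values are arbitrary and chosen only for illustration) shows how to combine a unit norm constraint on the weights with a bounded norm on the biases of the same layer:

from keras.layers import Dense
from keras.constraints import unit_norm, min_max_norm

...

# Weights rescaled to unit norm whenever they drift away from it;
# bias norm kept between 0.5 and 2.0
model.add(Dense(128,
                kernel_constraint=unit_norm(),
                bias_constraint=min_max_norm(min_value=0.5, max_value=2.0)))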
