Batch normalization

Let's consider a mini-batch of k samples:

X = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k\} \quad \text{with} \quad \bar{x}_i \in \mathbb{R}^n

Before traversing the network, we can measure a mean and a variance:

\mu_X = \frac{1}{k}\sum_{i=1}^{k} \bar{x}_i \quad \text{and} \quad \sigma_X^2 = \frac{1}{k}\sum_{i=1}^{k} \left(\bar{x}_i - \mu_X\right)^2

After the first layer (for simplicity, let's suppose that the activation function, f(•), is always the same), the batch is transformed into the following:

X_1 = \{f(W^T \bar{x}_1 + \bar{b}), f(W^T \bar{x}_2 + \bar{b}), \ldots, f(W^T \bar{x}_k + \bar{b})\}

In general, there's no guarantee that the new mean and variance are the same. On the contrary, it's easy to observe a drift of these statistics that increases throughout the network. This phenomenon is called (internal) covariate shift, and it's responsible for a progressive decay in training speed due to the different adaptations needed in each layer. Ioffe and Szegedy (in Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Ioffe S., Szegedy C., arXiv:1502.03167 [cs.LG]) proposed a method to mitigate this problem, which has been called batch normalization (BN).
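
To make the phenomenon tangible, the following NumPy sketch (an illustrative example not present in the original text, with arbitrary layer sizes, weight scales, and activation) propagates a mini-batch through a stack of randomly initialized layers and prints how the activation statistics change at each step:

import numpy as np

np.random.seed(1000)

# Hypothetical mini-batch: k samples with n features
k, n = 64, 100
Z = np.random.normal(0.0, 1.0, size=(k, n))
print('Input: mean={:.3f}, std={:.3f}'.format(Z.mean(), Z.std()))

# A stack of identical, randomly initialized fully connected layers
for i in range(10):
    W = np.random.normal(0.0, 0.2, size=(n, n))
    b = np.random.normal(0.0, 0.1, size=(1, n))
    Z = np.maximum(0.0, Z @ W + b)   # ReLU activation
    # The statistics progressively drift away from the input ones
    print('Layer {}: mean={:.3f}, std={:.3f}'.format(i + 1, Z.mean(), Z.std()))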

The idea is to renormalize the linear output of a layer (before or after applying the activation function), so that the batch has null mean and unit variance. Therefore, the first task of a BN layer is to compute, for each feature j:

\mu_j = \frac{1}{k}\sum_{i=1}^{k} x_i^{(j)} \quad \text{and} \quad \sigma_j^2 = \frac{1}{k}\sum_{i=1}^{k} \left(x_i^{(j)} - \mu_j\right)^2

Then each sample is transformed into a normalized version (the parameter δ is included to improve the numerical stability):

\hat{x}_i^{(j)} = \frac{x_i^{(j)} - \mu_j}{\sqrt{\sigma_j^2 + \delta}}

However, as batch normalization has no computational purpose other than speeding up the training process, the transformation must be able to behave as an identity (in order to avoid distorting and biasing the data); therefore, the actual output will be obtained by applying the linear operation:

y_i^{(j)} = \alpha^{(j)} \hat{x}_i^{(j)} + \beta^{(j)}
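
The following NumPy sketch summarizes the training-time transformation described above (the function name and the value δ = 1e-5 are illustrative assumptions, not part of the original text):

import numpy as np

def batch_norm_train(X, alpha, beta, delta=1e-5):
    # X has shape (k, n): k samples, n features
    mu = X.mean(axis=0)                            # per-feature mean
    sigma2 = X.var(axis=0)                         # per-feature (biased) variance
    X_hat = (X - mu) / np.sqrt(sigma2 + delta)     # null mean, unit variance
    # Learnable rescaling, so that the layer can also recover the identity
    return alpha * X_hat + beta, mu, sigma2

# Example with a random mini-batch and identity-like parameters
k, n = 32, 10
X = np.random.normal(1.0, 2.0, size=(k, n))
Y, mu, sigma2 = batch_norm_train(X, alpha=np.ones(n), beta=np.zeros(n))
print(Y.mean(axis=0).round(3), Y.std(axis=0).round(3))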

The two parameters α(j) and β(j) are variables optimized by the SGD algorithm; therefore, each transformation is guaranteed not to alter the scale and the position of the data. These layers are active only during the training phase (like dropout), but, contrary to other algorithms, they cannot simply be discarded when the model is used to make predictions on new samples, because the output would be constantly biased. To avoid this problem, the authors suggest approximating both the mean and the variance of X by averaging over the batches (assuming that there are Nb batches, each with k samples):

E[\mu_j] \approx \frac{1}{N_b}\sum_{b=1}^{N_b} \mu_j^{(b)} \quad \text{and} \quad E[\sigma_j^2] \approx \frac{k}{k-1} \cdot \frac{1}{N_b}\sum_{b=1}^{N_b} \sigma_j^{2\,(b)}

Using these values, the batch normalization layers can be transformed into the following fixed linear operations:

y^{(j)} = \frac{\alpha^{(j)}}{\sqrt{E[\sigma_j^2] + \delta}}\, x^{(j)} + \left(\beta^{(j)} - \frac{\alpha^{(j)}\, E[\mu_j]}{\sqrt{E[\sigma_j^2] + \delta}}\right)
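
A self-contained sketch of this inference-time behavior could look like the following (the synthetic batches and all the names are illustrative; the variance uses the unbiased correction k/(k − 1) introduced above):

import numpy as np

np.random.seed(1000)

k, n, Nb = 32, 10, 200
alpha, beta, delta = np.ones(n), np.zeros(n), 1e-5

# Collect the per-batch statistics during training
mus, sigma2s = [], []
for _ in range(Nb):
    Xb = np.random.normal(1.0, 2.0, size=(k, n))
    mus.append(Xb.mean(axis=0))
    sigma2s.append(Xb.var(axis=0))

# Approximated global statistics (unbiased variance correction)
E_mu = np.mean(mus, axis=0)
E_sigma2 = (k / (k - 1)) * np.mean(sigma2s, axis=0)

# At prediction time, the layer collapses into a fixed linear operation
scale = alpha / np.sqrt(E_sigma2 + delta)
shift = beta - scale * E_mu

X_new = np.random.normal(1.0, 2.0, size=(5, n))
Y_new = scale * X_new + shift
print(Y_new.round(3))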

It's not difficult to prove that this approximation becomes more and more accurate as the number of batches increases and that the error is normally negligible. However, when the batch size is very small, the statistics can be quite inaccurate; therefore, this method should be used only after considering how representative a batch is. If the data generating process is simple, even a small batch can be enough to describe the actual distribution. When, instead, pdata is more complex, batch normalization requires larger batches to avoid wrong adjustments (a feasible strategy is to compare the global mean and variance with the ones computed by sampling some batches, and to set the batch size that minimizes the discrepancy, as sketched below).

However, this simple process can dramatically reduce the covariate shift and improve the convergence speed of very deep networks (including the famous residual networks). Moreover, it allows employing higher learning rates, as the renormalized outputs are implicitly bounded and can never explode. Additionally, it has been proven that batch normalization also has a secondary regularization effect, even if it doesn't work on the weights. The reason is not very different from the one proposed for L2 but, in this case, there's a residual effect due to the transformation itself (partially caused by the variability of the parameters α(j) and β(j)) that can encourage the exploration of different regions of the sample space. However, this is not the primary effect, and it's not good practice to employ this method as a regularizer.
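
The batch size selection strategy mentioned above can be sketched as follows (a simplified illustration on synthetic data; the discrepancy measure, the candidate sizes, and the number of sampled batches are arbitrary choices):

import numpy as np

np.random.seed(1000)

# Synthetic dataset standing in for the real training set
X = np.random.normal(0.0, 1.0, size=(10000, 20))
global_mu, global_sigma2 = X.mean(axis=0), X.var(axis=0)

for batch_size in (8, 16, 32, 64, 128, 256):
    # Sample a few batches and measure how far their statistics are
    # from the global ones
    discrepancies = []
    for _ in range(50):
        idx = np.random.choice(X.shape[0], size=batch_size, replace=False)
        Xb = X[idx]
        d = np.mean(np.abs(Xb.mean(axis=0) - global_mu)) + \
            np.mean(np.abs(Xb.var(axis=0) - global_sigma2))
        discrepancies.append(d)
    print('Batch size {}: average discrepancy = {:.4f}'.format(
        batch_size, float(np.mean(discrepancies))))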
