Controlling variance with regularization

Regularization is another way to control overfitting. It penalizes the weights of the model as they grow larger; if you're familiar with linear models such as linear and logistic regression, it's exactly the same technique applied at the neuron level. Two flavors of regularization, called L1 and L2, can be used to regularize neural networks: L1 penalizes the sum of the absolute values of the weights, while L2 penalizes the sum of their squares. However, because it is more computationally efficient, L2 regularization is almost always the one used in neural networks.

First, we need to regularize our cost function. If we imagine C0 as the original cost function, categorical cross-entropy in this case, then the L2 regularized cost function would be as follows:

C = C0 + (λ / 2n) * Σ w²

Here, the sum runs over every weight in the network and n is the number of training examples. λ is a regularization parameter that can be increased or decreased to change the amount of regularization applied. This regularization parameter penalizes large values for weights, resulting in a network that hopefully has smaller weights overall.
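To make the formula concrete, here is a minimal NumPy sketch of the penalty term. The function name, arguments, and example values are illustrative and not part of Keras, which computes this penalty for you when you attach a regularizer to a layer:

import numpy as np

def l2_penalty(weights, lam, n):
    # (lambda / 2n) * sum of squared weights, summed over every weight
    # array in the network; lam and n are the lambda and training-set
    # size from the formula above
    return (lam / (2 * n)) * sum(np.sum(w ** 2) for w in weights)

# regularized cost = original cost + penalty, for example:
# cost = cross_entropy_value + l2_penalty(model_weights, lam=0.01, n=50000)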

For more in-depth coverage of regularization in neural networks, check out Chapter 3 of Michael Nielsen's Neural Networks and Deep Learning at http://neuralnetworksanddeeplearning.com/chap3.html.

Regularization can be applied to the weights, biases, and activations in a Keras layer. I'll demonstrate this technique using L2, with the default parameters. In the following example I've applied regularization to each hidden layer:

from keras.layers import Input, Dense
from keras.models import Model

def build_network(input_features=None):
    # first we specify an input layer, with a shape == features
    inputs = Input(shape=(input_features,), name="input")
    # each hidden layer uses the default L2 kernel regularizer
    x = Dense(512, activation='relu', name="hidden1",
              kernel_regularizer='l2')(inputs)
    x = Dense(256, activation='relu', name="hidden2",
              kernel_regularizer='l2')(x)
    x = Dense(128, activation='relu', name="hidden3",
              kernel_regularizer='l2')(x)
    prediction = Dense(10, activation='softmax', name="output")(x)
    model = Model(inputs=inputs, outputs=prediction)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=["accuracy"])
    return model
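Here, kernel_regularizer only covers the weights. If you also wanted to regularize the biases or the layer activations, Keras exposes bias_regularizer and activity_regularizer on the same layer. The following sketch is illustrative only; the 0.01 rates and the 784-feature input shape are assumptions, not values from this example:

from keras import regularizers
from keras.layers import Input, Dense

# illustrative only: one layer using all three regularization hooks
inputs = Input(shape=(784,), name="input")
x = Dense(512, activation='relu', name="hidden1",
          kernel_regularizer=regularizers.l2(0.01),
          bias_regularizer=regularizers.l2(0.01),
          activity_regularizer=regularizers.l2(0.01))(inputs)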

So, let's compare default L2 regularization to our other two models. The following figure shows the comparison:

Our new L2-regularized network is, unfortunately, easy to spot. In this case, it seems that L2 regularization works a little too well: our network is now high bias and hasn't learned as much as the other two.

If I were really determined to use regularization for this problem, I would start by changing the regularization rate and trying to find a more suitable value, but we're so far off that I'm skeptical we would do better than our dropout model.
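For reference, here is one way that experiment might look. This is a sketch, assuming we keep the same architecture and simply expose the L2 rate as a parameter; the 0.0001 default below is a starting point, not a tuned value:

from keras import regularizers
from keras.layers import Input, Dense
from keras.models import Model

def build_network(input_features=None, l2_rate=0.0001):
    # same architecture as before, but with an explicit, tunable lambda
    # instead of the default rate you get from the string 'l2'
    reg = regularizers.l2(l2_rate)
    inputs = Input(shape=(input_features,), name="input")
    x = Dense(512, activation='relu', name="hidden1", kernel_regularizer=reg)(inputs)
    x = Dense(256, activation='relu', name="hidden2", kernel_regularizer=reg)(x)
    x = Dense(128, activation='relu', name="hidden3", kernel_regularizer=reg)(x)
    prediction = Dense(10, activation='softmax', name="output")(x)
    model = Model(inputs=inputs, outputs=prediction)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=["accuracy"])
    return model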
