Training an artificial neural network

Now that we have seen a neural network in action and have gained a basic understanding of how it works by looking over the code, let's dig a little bit deeper into some of the concepts, such as the logistic cost function and the backpropagation algorithm that we implemented to learn the weights.

Computing the logistic cost function

The logistic cost function that we implemented as the _get_cost method is actually pretty simple to follow since it is the same cost function that we described in the logistic regression section in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn.

$$J(\mathbf{w}) = -\sum_{i=1}^{n}\left[ y^{(i)} \log\left(a^{(i)}\right) + \left(1 - y^{(i)}\right)\log\left(1 - a^{(i)}\right)\right]$$

Here, $a^{(i)}$ is the sigmoid activation of the $i$th unit in one of the layers, which we compute in the forward propagation step:

$$a^{(i)} = \phi\left(z^{(i)}\right)$$
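
For reference, here is a minimal NumPy sketch of this cost; the function names sigmoid and logistic_cost are illustrative helpers rather than the methods of our MLP class, and the labels are assumed to be binary for now:

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid activation: phi(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(y, a):
    # Cross-entropy cost J(w), summed over all n training samples;
    # y are binary labels in {0, 1}, a are the sigmoid activations a^(i).
    return -np.sum(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

# Example: two samples with activations 0.9 and 0.2 and labels 1 and 0.
print(logistic_cost(np.array([1, 0]), np.array([0.9, 0.2])))
```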

Now, let's add a regularization term, which allows us to reduce the degree of overfitting. As you will recall from earlier chapters, the L2 and L1 regularization terms are defined as follows (remember that we don't regularize the bias units):

$$L2 = \frac{\lambda}{2}\lVert\mathbf{w}\rVert_2^2 = \frac{\lambda}{2}\sum_{j=1}^{m} w_j^2 \qquad \text{and} \qquad L1 = \frac{\lambda}{2}\lVert\mathbf{w}\rVert_1 = \frac{\lambda}{2}\sum_{j=1}^{m} \lvert w_j \rvert$$

Although our MLP implementation supports both L1 and L2 regularization, we will now only focus on the L2 regularization term for simplicity. However, the same concepts apply to the L1 regularization term. By adding the L2 regularization term to our logistic cost function, we obtain the following equation:

$$J(\mathbf{w}) = -\left[\sum_{i=1}^{n} y^{(i)} \log\left(a^{(i)}\right) + \left(1 - y^{(i)}\right)\log\left(1 - a^{(i)}\right)\right] + \frac{\lambda}{2}\lVert\mathbf{w}\rVert_2^2$$
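
As a rough sketch of how these penalties translate to code, the following hypothetical helpers compute the two terms (the factor lambda/2 follows the regularized cost equation above):

```python
import numpy as np

def l2_penalty(w, lambda_):
    # L2 term: (lambda / 2) * ||w||_2^2
    return (lambda_ / 2.0) * np.sum(w ** 2)

def l1_penalty(w, lambda_):
    # L1 term: (lambda / 2) * ||w||_1
    return (lambda_ / 2.0) * np.abs(w).sum()

# The regularized cost is then the cross-entropy term plus the chosen penalty:
# cost = logistic_cost(y, a) + l2_penalty(w, lambda_)
```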

Since we implemented an MLP for multi-class classification, this returns an output vector of $t$ elements, which we need to compare with the $t \times 1$ dimensional target vector in the one-hot encoding representation. For example, the activation of the third layer and the target class (here: class 2) for a particular sample may look like this:

$$a^{(3)} = \begin{bmatrix} 0.1 \\ 0.9 \\ \vdots \\ 0.3 \end{bmatrix}, \qquad y = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}$$
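
A small sketch of how such a one-hot target matrix could be built; the onehot helper is illustrative and not part of the MLP implementation discussed earlier:

```python
import numpy as np

def onehot(y, n_classes):
    # Encode integer class labels into a (n_classes x n_samples) one-hot matrix.
    out = np.zeros((n_classes, y.shape[0]))
    for idx, label in enumerate(y):
        out[label, idx] = 1.0
    return out

# Class label 1 (the second class) becomes the unit column vector [0, 1, 0]^T:
print(onehot(np.array([1, 0, 2]), n_classes=3))
```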

Thus, we need to generalize the logistic cost function to all $t$ activation units in our network. So our cost function (without the regularization term) becomes:

$$J(\mathbf{W}) = -\sum_{i=1}^{n}\sum_{j=1}^{t}\left[ y_j^{(i)} \log\left(a_j^{(i)}\right) + \left(1 - y_j^{(i)}\right)\log\left(1 - a_j^{(i)}\right)\right]$$

Here, the superscript $i$ is the index of a particular sample in our training set.
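
Putting the two sums into code, a hedged NumPy version of this generalized cost could look as follows; y_enc and output are assumed to be $t \times n$ arrays matching the one-hot representation above:

```python
import numpy as np

def multiclass_cost(y_enc, output):
    # Cross-entropy cost summed over all samples and all t output units;
    # y_enc and output are both (t x n_samples) arrays.
    term1 = -y_enc * np.log(output)
    term2 = -(1.0 - y_enc) * np.log(1.0 - output)
    return np.sum(term1 + term2)
```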

The following generalized regularization term may look a little complicated at first, but here we are just summing the squared weights of each layer $l$ (excluding the bias units that we added to the first column of each weight matrix):

$$J(\mathbf{W}) = -\left[\sum_{i=1}^{n}\sum_{j=1}^{t} y_j^{(i)} \log\left(a_j^{(i)}\right) + \left(1 - y_j^{(i)}\right)\log\left(1 - a_j^{(i)}\right)\right] + \frac{\lambda}{2}\sum_{l=1}^{L-1}\sum_{i=1}^{u_l}\sum_{j=1}^{u_{l+1}}\left(w_{j,i}^{(l)}\right)^2$$

Here, $u_l$ refers to the number of units in layer $l$, and the following expression represents the L2-penalty term:

$$\frac{\lambda}{2}\sum_{l=1}^{L-1}\sum_{i=1}^{u_l}\sum_{j=1}^{u_{l+1}}\left(w_{j,i}^{(l)}\right)^2$$
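
A possible NumPy translation of this penalty for a network with one hidden layer, assuming the matrices w1 and w2 store the bias units in their first columns as described above (the function name is illustrative):

```python
import numpy as np

def l2_reg_term(lambda_, w1, w2):
    # Sum of squared weights over both layers, skipping the first column
    # of each matrix because it holds the (unregularized) bias units.
    return (lambda_ / 2.0) * (np.sum(w1[:, 1:] ** 2) + np.sum(w2[:, 1:] ** 2))
```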

Remember that our goal is to minimize the cost function $J(\mathbf{W})$. Thus, we need to calculate the partial derivative of $J(\mathbf{W})$ with respect to each weight for every layer in the network:

$$\frac{\partial}{\partial w_{j,i}^{(l)}} J(\mathbf{W})$$

In the next section, we will talk about the backpropagation algorithm, which allows us to calculate these partial derivatives to minimize the cost function.

Note that $\mathbf{W}$ consists of multiple matrices. In a multi-layer perceptron with one hidden layer, we have the weight matrix $\mathbf{W}^{(1)}$, which connects the input to the hidden layer, and $\mathbf{W}^{(2)}$, which connects the hidden layer to the output layer. An intuitive visualization of the matrix $\mathbf{W}$ is provided in the following figure:

[Figure: visualization of the matrix $\mathbf{W}$, composed of the weight matrices $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$]

In this simplified figure, it may seem that both $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ have the same number of rows and columns, which is typically not the case unless we initialize an MLP with the same number of hidden units, output units, and input features.

If this sounds confusing, stay tuned for the next section, where we will discuss the dimensionality of $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ in more detail in the context of the backpropagation algorithm.
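
As a quick preview of those dimensions, here is a small sketch with hypothetical layer sizes (784 input features, 50 hidden units, and 10 output classes); the +1 columns hold the bias units:

```python
import numpy as np

# Hypothetical layer sizes: 784 input features, 50 hidden units, 10 output classes.
n_features, n_hidden, n_output = 784, 50, 10

# The +1 accounts for the bias unit stored in the first column of each matrix.
w1 = np.random.uniform(-1.0, 1.0, size=(n_hidden, n_features + 1))
w2 = np.random.uniform(-1.0, 1.0, size=(n_output, n_hidden + 1))

print(w1.shape)  # (50, 785) -- connects the input layer to the hidden layer
print(w2.shape)  # (10, 51)  -- connects the hidden layer to the output layer
```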

Training neural networks via backpropagation

In this section, we will go through the math of backpropagation to understand how you can learn the weights in a neural network very efficiently. Depending on how comfortable you are with mathematical representations, the following equations may seem relatively complicated at first. Many people prefer a bottom-up approach and like to go over the equations step by step to develop an intuition for algorithms. However, if you prefer a top-down approach and want to learn about backpropagation without all the mathematical notation, I recommend that you read the next section, Developing your intuition for backpropagation, first and revisit this section later.

In the previous section, we saw how to calculate the cost as the difference between the activation of the last layer and the target class label. Now, we will see how the backpropagation algorithm works to update the weights in our MLP model, which we implemented in the _get_gradient method. As we recall from the beginning of this chapter, we first need to apply forward propagation in order to obtain the activation of the output layer, which we formulated as follows:

$$\mathbf{Z}^{(2)} = \mathbf{W}^{(1)}\left[\mathbf{A}^{(1)}\right]^{T} \quad \text{(net input of the hidden layer)}$$
$$\mathbf{A}^{(2)} = \phi\left(\mathbf{Z}^{(2)}\right) \quad \text{(activation of the hidden layer)}$$
$$\mathbf{Z}^{(3)} = \mathbf{W}^{(2)}\mathbf{A}^{(2)} \quad \text{(net input of the output layer)}$$
$$\mathbf{A}^{(3)} = \phi\left(\mathbf{Z}^{(3)}\right) \quad \text{(activation of the output layer)}$$
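
A minimal NumPy sketch of this forward pass is shown below. The helper names are illustrative; the extra bias row added to $\mathbf{A}^{(2)}$ before computing $\mathbf{Z}^{(3)}$ is an implementation detail that mirrors the bias column in $\mathbf{W}^{(2)}$ rather than something spelled out in the equations above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias_unit(X):
    # Prepend a column of 1s so the bias can live in the first column of W.
    return np.hstack((np.ones((X.shape[0], 1)), X))

def feedforward(X, w1, w2):
    # X: (n_samples, n_features), w1: (n_hidden, n_features + 1),
    # w2: (n_output, n_hidden + 1)
    a1 = add_bias_unit(X)                             # A(1): (n_samples, n_features + 1)
    z2 = w1.dot(a1.T)                                 # Z(2) = W(1) [A(1)]^T
    a2 = sigmoid(z2)                                  # A(2) = phi(Z(2))
    a2 = np.vstack((np.ones((1, a2.shape[1])), a2))   # add a bias row for the next layer
    z3 = w2.dot(a2)                                   # Z(3) = W(2) A(2)
    a3 = sigmoid(z3)                                  # A(3) = phi(Z(3))
    return a1, z2, a2, z3, a3
```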

Concisely, we just forward propagate the input features through the connections in the network, as shown here:

[Figure: forward propagation of the input features from the input layer through the hidden layer to the output layer]

In backpropagation, we propagate the error from right to left. We start by calculating the error vector of the output layer:

$$\boldsymbol{\delta}^{(3)} = \mathbf{a}^{(3)} - \mathbf{y}$$

Here, $\mathbf{y}$ is the vector of the true class labels.

Next, we calculate the error term of the hidden layer:

$$\boldsymbol{\delta}^{(2)} = \left[\mathbf{W}^{(2)}\right]^{T}\boldsymbol{\delta}^{(3)} * \frac{\partial \phi\left(z^{(2)}\right)}{\partial z^{(2)}}$$

Here, $\frac{\partial \phi\left(z^{(2)}\right)}{\partial z^{(2)}}$ is simply the derivative of the sigmoid activation function, which we implemented as _sigmoid_gradient:

$$\frac{\partial \phi(z)}{\partial z} = \left(a^{(2)} * \left(1 - a^{(2)}\right)\right)$$

Note that the asterisk symbol (*) means element-wise multiplication in this context.
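
A compact sketch of these two error terms in NumPy, using $t$ output units, $h$ hidden units, and a batch of samples; sigmoid_gradient plays the role of _sigmoid_gradient, and skipping the first column of w2 simply drops the bias unit, which receives no error signal:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_gradient(z):
    # Derivative of the sigmoid: phi(z) * (1 - phi(z))
    sg = sigmoid(z)
    return sg * (1.0 - sg)

def backprop_errors(a3, y_enc, w2, z2):
    # a3, y_enc: (t x n_samples); w2: (t x h+1); z2: (h x n_samples)
    sigma3 = a3 - y_enc                                       # error of the output layer
    # Skip the bias column of w2 so the shapes line up with z2.
    sigma2 = w2[:, 1:].T.dot(sigma3) * sigmoid_gradient(z2)   # error of the hidden layer
    return sigma2, sigma3
```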

Note

Although it is not important to follow the next equations, you may be curious as to how I obtained the derivative of the activation function. I summarized the derivation step by step here:

$$\phi'(z) = \frac{\partial}{\partial z}\left(\frac{1}{1 + e^{-z}}\right)$$
$$= \frac{e^{-z}}{\left(1 + e^{-z}\right)^2}$$
$$= \frac{1 + e^{-z}}{\left(1 + e^{-z}\right)^2} - \left(\frac{1}{1 + e^{-z}}\right)^2$$
$$= \frac{1}{1 + e^{-z}} - \left(\frac{1}{1 + e^{-z}}\right)^2$$
$$= \phi(z) - \left(\phi(z)\right)^2$$
$$= \phi(z)\left(1 - \phi(z)\right)$$
$$= a\left(1 - a\right)$$

To better understand how we compute the $\boldsymbol{\delta}^{(2)}$ term, let's walk through it in more detail. In the preceding equation, we multiplied the transpose $\left[\mathbf{W}^{(2)}\right]^{T}$ of the $t \times h$ dimensional matrix $\mathbf{W}^{(2)}$ (where $t$ is the number of output class labels and $h$ is the number of hidden units) by the $t \times 1$ dimensional error vector $\boldsymbol{\delta}^{(3)}$, which yields an $h \times 1$ dimensional vector. We then performed an element-wise multiplication between this vector and $\frac{\partial \phi\left(z^{(2)}\right)}{\partial z^{(2)}}$, which is also an $h \times 1$ dimensional vector. Eventually, after obtaining the $\boldsymbol{\delta}$ terms, we can write the derivative of the cost function as follows:

$$\frac{\partial}{\partial w_{i,j}^{(l)}} J(\mathbf{W}) = a_j^{(l)} \delta_i^{(l+1)}$$

Next, we need to accumulate the partial derivative of every $j$th node in layer $l$ and the $i$th error of the node in layer $l+1$:

$$\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$$

Remember that we need to compute $\Delta_{i,j}^{(l)}$ for every sample in the training set. Thus, it is easier to implement it as a vectorized version, as in our preceding MLP code implementation:

$$\boldsymbol{\Delta}^{(l)} = \boldsymbol{\Delta}^{(l)} + \boldsymbol{\delta}^{(l+1)}\left[\mathbf{A}^{(l)}\right]^{T}$$
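
Continuing the earlier sketches, the batch-wise accumulation boils down to two matrix products; a1 and a2 are assumed to carry the bias units added during the forward pass:

```python
import numpy as np

def get_gradients(a1, a2, sigma2, sigma3):
    # sigma2: (h x n), a1: (n x f+1)  -> grad1: (h x f+1), same shape as w1
    # sigma3: (t x n), a2: (h+1 x n)  -> grad2: (t x h+1), same shape as w2
    grad1 = sigma2.dot(a1)
    grad2 = sigma3.dot(a2.T)
    return grad1, grad2
```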

After we have accumulated the partial derivatives, we can add the regularization term as follows:

$$\boldsymbol{\Delta}^{(l)} := \boldsymbol{\Delta}^{(l)} + \lambda^{(l)}\mathbf{W}^{(l)} \quad \text{(except for the bias term)}$$

Lastly, after we have computed the gradients, we can update the weights by taking a step in the opposite direction of the gradient:

$$\mathbf{W}^{(l)} := \mathbf{W}^{(l)} - \eta\,\boldsymbol{\Delta}^{(l)}$$
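
Combining the last two steps, a sketch of the regularized gradient and the update rule might look like this; lambda_ and eta stand for the regularization strength and the learning rate, and the first columns (the bias units) are excluded from the penalty:

```python
def regularize_and_update(w1, w2, grad1, grad2, lambda_, eta):
    # Add the L2 term to the gradients; the bias columns are skipped.
    grad1[:, 1:] += lambda_ * w1[:, 1:]
    grad2[:, 1:] += lambda_ * w2[:, 1:]
    # Gradient descent step in the opposite direction of the gradient.
    w1 -= eta * grad1
    w2 -= eta * grad2
    return w1, w2
```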

To bring everything together, let's summarize backpropagation in the following figure:

[Figure: summary of the backpropagation algorithm: forward propagation, computation of the error terms, and the weight updates]