Training an artificial neural network

Now that we have seen a neural network in action and have gained a basic understanding of how it works by looking over the code, let's dig a little bit deeper into some of the concepts, such as the logistic cost function and the backpropagation algorithm that we implemented to learn the weights.

Computing the logistic cost function

The logistic cost function that we implemented as the _get_cost method is actually pretty simple to follow since it is the same cost function that we described in the logistic regression section in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn.

$$J(\mathbf{w}) = -\sum_{i=1}^{n}\left[ y^{(i)} \log\left(a^{(i)}\right) + \left(1 - y^{(i)}\right)\log\left(1 - a^{(i)}\right)\right]$$

Here, $a^{(i)}$ is the sigmoid activation of the $i$th unit in one of the layers, which we compute in the forward propagation step:

$$a^{(i)} = \phi\left(z^{(i)}\right)$$
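
For reference, here is a minimal NumPy sketch of this cost; the function names sigmoid and logistic_cost are illustrative helpers rather than the methods of our MLP class, and the labels are assumed to be binary for now:

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid activation: phi(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(y, a):
    # Cross-entropy cost J(w), summed over all n training samples;
    # y are binary labels in {0, 1}, a are the sigmoid activations a^(i).
    return -np.sum(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

# Example: two samples with activations 0.9 and 0.2 and labels 1 and 0.
print(logistic_cost(np.array([1, 0]), np.array([0.9, 0.2])))
```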

Now, let's add a regularization term, which allows us to reduce the degree of overfitting. As you will recall from earlier chapters, the L2 and L1 regularization terms are defined as follows (remember that we don't regularize the bias units):

$$L2 = \frac{\lambda}{2}\lVert\mathbf{w}\rVert_2^2 = \frac{\lambda}{2}\sum_{j=1}^{m} w_j^2 \qquad \text{and} \qquad L1 = \frac{\lambda}{2}\lVert\mathbf{w}\rVert_1 = \frac{\lambda}{2}\sum_{j=1}^{m} \lvert w_j \rvert$$

Although our MLP implementation supports both L1 and L2 regularization, we will now only focus on the L2 regularization term for simplicity. However, the same concepts apply to the L1 regularization term. By adding the L2 regularization term to our logistic cost function, we obtain the following equation:

$$J(\mathbf{w}) = -\left[\sum_{i=1}^{n} y^{(i)} \log\left(a^{(i)}\right) + \left(1 - y^{(i)}\right)\log\left(1 - a^{(i)}\right)\right] + \frac{\lambda}{2}\lVert\mathbf{w}\rVert_2^2$$
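
As a rough sketch of how these penalties translate to code, the following hypothetical helpers compute the two terms (the factor lambda/2 follows the regularized cost equation above):

```python
import numpy as np

def l2_penalty(w, lambda_):
    # L2 term: (lambda / 2) * ||w||_2^2
    return (lambda_ / 2.0) * np.sum(w ** 2)

def l1_penalty(w, lambda_):
    # L1 term: (lambda / 2) * ||w||_1
    return (lambda_ / 2.0) * np.abs(w).sum()

# The regularized cost is then the cross-entropy term plus the chosen penalty:
# cost = logistic_cost(y, a) + l2_penalty(w, lambda_)
```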

Since we implemented an MLP for multi-class classification, this returns an output vector of $t$ elements, which we need to compare with the $t \times 1$ dimensional target vector in the one-hot encoding representation. For example, the activation of the third layer and the target class (here: class 2) for a particular sample may look like this:

$$a^{(3)} = \begin{bmatrix} 0.1 \\ 0.9 \\ \vdots \\ 0.3 \end{bmatrix}, \qquad y = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}$$
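
A small sketch of how such a one-hot target matrix could be built; the onehot helper is illustrative and not part of the MLP implementation discussed earlier:

```python
import numpy as np

def onehot(y, n_classes):
    # Encode integer class labels into a (n_classes x n_samples) one-hot matrix.
    out = np.zeros((n_classes, y.shape[0]))
    for idx, label in enumerate(y):
        out[label, idx] = 1.0
    return out

# Class label 1 (the second class) becomes the unit column vector [0, 1, 0]^T:
print(onehot(np.array([1, 0, 2]), n_classes=3))
```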

Thus, we need to generalize the logistic cost function to all $t$ activation units in our network. So our cost function (without the regularization term) becomes:

$$J(\mathbf{W}) = -\sum_{i=1}^{n}\sum_{j=1}^{t}\left[ y_j^{(i)} \log\left(a_j^{(i)}\right) + \left(1 - y_j^{(i)}\right)\log\left(1 - a_j^{(i)}\right)\right]$$

Here, the superscript $i$ is the index of a particular sample in our training set.
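
Putting the two sums into code, a hedged NumPy version of this generalized cost could look as follows; y_enc and output are assumed to be $t \times n$ arrays matching the one-hot representation above:

```python
import numpy as np

def multiclass_cost(y_enc, output):
    # Cross-entropy cost summed over all samples and all t output units;
    # y_enc and output are both (t x n_samples) arrays.
    term1 = -y_enc * np.log(output)
    term2 = -(1.0 - y_enc) * np.log(1.0 - output)
    return np.sum(term1 + term2)
```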

The following generalized regularization term may look a little complicated at first, but here we are just summing the squared weights of each layer $l$ (excluding the bias units that we added to the first column of each weight matrix):

$$J(\mathbf{W}) = -\left[\sum_{i=1}^{n}\sum_{j=1}^{t} y_j^{(i)} \log\left(a_j^{(i)}\right) + \left(1 - y_j^{(i)}\right)\log\left(1 - a_j^{(i)}\right)\right] + \frac{\lambda}{2}\sum_{l=1}^{L-1}\sum_{i=1}^{u_l}\sum_{j=1}^{u_{l+1}}\left(w_{j,i}^{(l)}\right)^2$$

Here, $u_l$ refers to the number of units in layer $l$, and the following expression represents the L2-penalty term:

$$\frac{\lambda}{2}\sum_{l=1}^{L-1}\sum_{i=1}^{u_l}\sum_{j=1}^{u_{l+1}}\left(w_{j,i}^{(l)}\right)^2$$
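
A possible NumPy translation of this penalty for a network with one hidden layer, assuming the matrices w1 and w2 store the bias units in their first columns as described above (the function name is illustrative):

```python
import numpy as np

def l2_reg_term(lambda_, w1, w2):
    # Sum of squared weights over both layers, skipping the first column
    # of each matrix because it holds the (unregularized) bias units.
    return (lambda_ / 2.0) * (np.sum(w1[:, 1:] ** 2) + np.sum(w2[:, 1:] ** 2))
```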

Remember that our goal is to minimize the cost function $J(\mathbf{W})$. Thus, we need to calculate the partial derivative of $J(\mathbf{W})$ with respect to each weight for every layer in the network:

$$\frac{\partial}{\partial w_{j,i}^{(l)}} J(\mathbf{W})$$

In the next section, we will talk about the backpropagation algorithm, which allows us to calculate these partial derivatives to minimize the cost function.

Note that $\mathbf{W}$ consists of multiple matrices. In a multi-layer perceptron with one hidden layer, we have the weight matrix $\mathbf{W}^{(1)}$, which connects the input to the hidden layer, and $\mathbf{W}^{(2)}$, which connects the hidden layer to the output layer. An intuitive visualization of the matrix $\mathbf{W}$ is provided in the following figure:

[Figure: visualization of the matrix $\mathbf{W}$, composed of the weight matrices $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$]

In this simplified figure, it may seem that both $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ have the same number of rows and columns, which is typically not the case unless we initialize an MLP with the same number of hidden units, output units, and input features.

If this sounds confusing, stay tuned for the next section, where we will discuss the dimensionality of $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ in more detail in the context of the backpropagation algorithm.
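
As a quick preview of those dimensions, here is a small sketch with hypothetical layer sizes (784 input features, 50 hidden units, and 10 output classes); the +1 columns hold the bias units:

```python
import numpy as np

# Hypothetical layer sizes: 784 input features, 50 hidden units, 10 output classes.
n_features, n_hidden, n_output = 784, 50, 10

# The +1 accounts for the bias unit stored in the first column of each matrix.
w1 = np.random.uniform(-1.0, 1.0, size=(n_hidden, n_features + 1))
w2 = np.random.uniform(-1.0, 1.0, size=(n_output, n_hidden + 1))

print(w1.shape)  # (50, 785) -- connects the input layer to the hidden layer
print(w2.shape)  # (10, 51)  -- connects the hidden layer to the output layer
```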

Training neural networks via backpropagation

In this section, we will go through the math of backpropagation to understand how you can learn the weights in a neural network very efficiently. Depending on how comfortable you are with mathematical representations, the following equations may seem relatively complicated at first. Many people prefer a bottom-up approach and like to go over the equations step by step to develop an intuition for algorithms. However, if you prefer a top-down approach and want to learn about backpropagation without all the mathematical notation, I recommend that you read the next section, Developing your intuition for backpropagation, first and revisit this section later.

In the previous section, we saw how to calculate the cost as the difference between the activation of the last layer and the target class label. Now, we will see how the backpropagation algorithm works to update the weights in our MLP model, which we implemented in the _get_gradient method. As we recall from the beginning of this chapter, we first need to apply forward propagation in order to obtain the activation of the output layer, which we formulated as follows:

$$\mathbf{Z}^{(2)} = \mathbf{W}^{(1)}\left[\mathbf{A}^{(1)}\right]^{T} \quad \text{(net input of the hidden layer)}$$
$$\mathbf{A}^{(2)} = \phi\left(\mathbf{Z}^{(2)}\right) \quad \text{(activation of the hidden layer)}$$
$$\mathbf{Z}^{(3)} = \mathbf{W}^{(2)}\mathbf{A}^{(2)} \quad \text{(net input of the output layer)}$$
$$\mathbf{A}^{(3)} = \phi\left(\mathbf{Z}^{(3)}\right) \quad \text{(activation of the output layer)}$$
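
A minimal NumPy sketch of this forward pass is shown below. The helper names are illustrative; the extra bias row added to $\mathbf{A}^{(2)}$ before computing $\mathbf{Z}^{(3)}$ is an implementation detail that mirrors the bias column in $\mathbf{W}^{(2)}$ rather than something spelled out in the equations above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias_unit(X):
    # Prepend a column of 1s so the bias can live in the first column of W.
    return np.hstack((np.ones((X.shape[0], 1)), X))

def feedforward(X, w1, w2):
    # X: (n_samples, n_features), w1: (n_hidden, n_features + 1),
    # w2: (n_output, n_hidden + 1)
    a1 = add_bias_unit(X)                             # A(1): (n_samples, n_features + 1)
    z2 = w1.dot(a1.T)                                 # Z(2) = W(1) [A(1)]^T
    a2 = sigmoid(z2)                                  # A(2) = phi(Z(2))
    a2 = np.vstack((np.ones((1, a2.shape[1])), a2))   # add a bias row for the next layer
    z3 = w2.dot(a2)                                   # Z(3) = W(2) A(2)
    a3 = sigmoid(z3)                                  # A(3) = phi(Z(3))
    return a1, z2, a2, z3, a3
```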

Concisely, we just forward propagate the input features through the connections in the network, as shown here:

[Figure: forward propagation of the input features from the input layer through the hidden layer to the output layer]

In backpropagation, we propagate the error from right to left. We start by calculating the error vector of the output layer:

$$\boldsymbol{\delta}^{(3)} = \mathbf{a}^{(3)} - \mathbf{y}$$

Here, $\mathbf{y}$ is the vector of the true class labels.

Next, we calculate the error term of the hidden layer:

$$\boldsymbol{\delta}^{(2)} = \left[\mathbf{W}^{(2)}\right]^{T}\boldsymbol{\delta}^{(3)} * \frac{\partial \phi\left(z^{(2)}\right)}{\partial z^{(2)}}$$

Here, $\frac{\partial \phi\left(z^{(2)}\right)}{\partial z^{(2)}}$ is simply the derivative of the sigmoid activation function, which we implemented as _sigmoid_gradient:

$$\frac{\partial \phi(z)}{\partial z} = \left(a^{(2)} * \left(1 - a^{(2)}\right)\right)$$

Note that the asterisk symbol (*) means element-wise multiplication in this context.
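
A compact sketch of these two error terms in NumPy, using $t$ output units, $h$ hidden units, and a batch of samples; sigmoid_gradient plays the role of _sigmoid_gradient, and skipping the first column of w2 simply drops the bias unit, which receives no error signal:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_gradient(z):
    # Derivative of the sigmoid: phi(z) * (1 - phi(z))
    sg = sigmoid(z)
    return sg * (1.0 - sg)

def backprop_errors(a3, y_enc, w2, z2):
    # a3, y_enc: (t x n_samples); w2: (t x h+1); z2: (h x n_samples)
    sigma3 = a3 - y_enc                                       # error of the output layer
    # Skip the bias column of w2 so the shapes line up with z2.
    sigma2 = w2[:, 1:].T.dot(sigma3) * sigmoid_gradient(z2)   # error of the hidden layer
    return sigma2, sigma3
```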

Note

Although it is not important to follow the next equations, you may be curious as to how I obtained the derivative of the activation function. I summarized the derivation step by step here:

$$\phi'(z) = \frac{\partial}{\partial z}\left(\frac{1}{1 + e^{-z}}\right)$$
$$= \frac{e^{-z}}{\left(1 + e^{-z}\right)^2}$$
$$= \frac{1 + e^{-z}}{\left(1 + e^{-z}\right)^2} - \left(\frac{1}{1 + e^{-z}}\right)^2$$
$$= \frac{1}{1 + e^{-z}} - \left(\frac{1}{1 + e^{-z}}\right)^2$$
$$= \phi(z) - \left(\phi(z)\right)^2$$
$$= \phi(z)\left(1 - \phi(z)\right)$$
$$= a\left(1 - a\right)$$

To better understand how we compute the $\boldsymbol{\delta}^{(2)}$ term, let's walk through it in more detail. In the preceding equation, we multiplied the transpose $\left[\mathbf{W}^{(2)}\right]^{T}$ of the $t \times h$ dimensional matrix $\mathbf{W}^{(2)}$ (where $t$ is the number of output class labels and $h$ is the number of hidden units) by the $t \times 1$ dimensional error vector $\boldsymbol{\delta}^{(3)}$, which yields an $h \times 1$ dimensional vector. We then performed an element-wise multiplication between this vector and $\frac{\partial \phi\left(z^{(2)}\right)}{\partial z^{(2)}}$, which is also an $h \times 1$ dimensional vector. Eventually, after obtaining the $\boldsymbol{\delta}$ terms, we can write the derivative of the cost function as follows:

$$\frac{\partial}{\partial w_{i,j}^{(l)}} J(\mathbf{W}) = a_j^{(l)} \delta_i^{(l+1)}$$

Next, we need to accumulate the partial derivative of every $j$th node in layer $l$ and the $i$th error of the node in layer $l+1$:

$$\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$$

Remember that we need to compute $\Delta_{i,j}^{(l)}$ for every sample in the training set. Thus, it is easier to implement it as a vectorized version, as in our preceding MLP code implementation:

$$\boldsymbol{\Delta}^{(l)} = \boldsymbol{\Delta}^{(l)} + \boldsymbol{\delta}^{(l+1)}\left[\mathbf{A}^{(l)}\right]^{T}$$
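
Continuing the earlier sketches, the batch-wise accumulation boils down to two matrix products; a1 and a2 are assumed to carry the bias units added during the forward pass:

```python
import numpy as np

def get_gradients(a1, a2, sigma2, sigma3):
    # sigma2: (h x n), a1: (n x f+1)  -> grad1: (h x f+1), same shape as w1
    # sigma3: (t x n), a2: (h+1 x n)  -> grad2: (t x h+1), same shape as w2
    grad1 = sigma2.dot(a1)
    grad2 = sigma3.dot(a2.T)
    return grad1, grad2
```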

After we have accumulated the partial derivatives, we can add the regularization term as follows:

$$\boldsymbol{\Delta}^{(l)} := \boldsymbol{\Delta}^{(l)} + \lambda^{(l)}\mathbf{W}^{(l)} \quad \text{(except for the bias term)}$$

Lastly, after we have computed the gradients, we can update the weights by taking a step in the opposite direction of the gradient:

$$\mathbf{W}^{(l)} := \mathbf{W}^{(l)} - \eta\,\boldsymbol{\Delta}^{(l)}$$
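
Combining the last two steps, a sketch of the regularized gradient and the update rule might look like this; lambda_ and eta stand for the regularization strength and the learning rate, and the first columns (the bias units) are excluded from the penalty:

```python
def regularize_and_update(w1, w2, grad1, grad2, lambda_, eta):
    # Add the L2 term to the gradients; the bias columns are skipped.
    grad1[:, 1:] += lambda_ * w1[:, 1:]
    grad2[:, 1:] += lambda_ * w2[:, 1:]
    # Gradient descent step in the opposite direction of the gradient.
    w1 -= eta * grad1
    w2 -= eta * grad2
    return w1, w2
```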

To bring everything together, let's summarize backpropagation in the following figure:

[Figure: summary of the backpropagation algorithm: forward propagation, computation of the error terms, and the weight updates]