Multilayer perceptron networks

Multilayer neural networks are models that chain together many neurons to create a larger neural architecture. Individually, neurons are very basic units, but by organizing many of them together, we can build a model significantly more powerful than any single neuron.

As touched upon in the previous section, we build neural networks in layers and we distinguish between different kinds of neural networks primarily on the basis of the connections that exist between these layers and the types of neurons used. The following diagram shows the general structure of a multilayer perceptron (MLP) neural network, shown here for two hidden layers:

[Figure: A multilayer perceptron network with an input layer, two hidden layers of four neurons each, and an output layer]

The first characteristic of the MLP network is that the information flows in a single direction from input layer to output layer. Thus, it is known as a feedforward neural network. This is in contrast to other neural network types, in which there are cycles that allow information to flow back to earlier neurons in the network as a feedback signal. These networks are known as feedback neural networks or recurrent neural networks. Recurrent neural networks are generally very difficult to train and often do not scale well with the number of inputs. Nonetheless, they do find a number of applications, in particular with problems involving a time component such as forecasting and signal processing.

Returning to the MLP architecture shown in the diagram, we note that the first group of neurons on the left are known as the input neurons and form the input layer. We always have as many input neurons as there are input features. The input neurons are said to produce the values of our input features as outputs. For this reason, we often don't refer to them as input neurons, but rather as input sources or input nodes. At the far right of the diagram, we have the output layer with the output neurons. We usually have as many output neurons as outputs that we are modeling. Thus, our neural network can naturally learn to predict more than one thing at a time. One exception to this rule is that when we are modeling a multiclass classification problem, we usually have one binary output neuron for every class. In this case, all the output neurons are a dummy encoding of a single multiclass factor output.

Between the input and output layers, we have the hidden layers. Neurons are organized into layers depending on how many neurons are between them and an input neuron. For example, neurons in the first hidden layer are directly connected to at least one neuron in the input layer, whereas neurons in the second hidden layer are directly connected to one or more neurons in the first hidden layer. Our diagram is an example of a 4-4 architecture, which means that there are two hidden layers with four neurons each. Even though they are not neurons themselves, the diagram explicitly shows the bias units for all the neurons. We saw in our equation for the output of a single neuron that we can treat the bias unit as a dummy input feature with a value of 1 that has a weight on it that corresponds to the bias or threshold.
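The equivalence between an explicit bias and a dummy input fixed at 1 can be verified numerically. The following is a minimal sketch (the feature values and weights are made up for illustration):

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])   # three input features (arbitrary values)
w = np.array([0.1, 0.4, -0.2])   # one weight per feature
b = 0.7                          # bias (threshold)

# Explicit bias added to the weighted sum...
z_explicit = np.dot(w, x) + b

# ...versus a dummy feature fixed at 1 whose weight is the bias
x_aug = np.concatenate(([1.0], x))
w_aug = np.concatenate(([b], w))
z_dummy = np.dot(w_aug, x_aug)
```

Both expressions compute the same pre-activation value, which is why diagrams and implementations are free to treat the bias unit as just another input.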

Not all the neurons in the architecture are assumed to have the same activation function. In general, we pick the activation function for the neurons in the hidden layers separately from that of the output layer. As we've already seen, the activation function for the output layer is chosen based on the type of output we would like, which in turn depends on whether we are performing regression or classification.

The activation function for the hidden layer neurons is generally nonlinear, because chaining together linear neurons can be algebraically simplified to a single linear neuron with different weights and so this does not add any power to the network. The most common activation function is the logistic function, but others such as the hyperbolic tangent function are also used.
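The collapse of chained linear neurons can be demonstrated directly: composing two linear layers is algebraically identical to a single linear layer whose weight matrix is the product of the two. The following sketch (with arbitrary random weights) shows this, along with the two common nonlinear activation functions mentioned above:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two purely linear layers collapse into one linear layer
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))     # first "linear layer"
W2 = rng.normal(size=(2, 4))     # second "linear layer"
x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x        # a single equivalent weight matrix

# Common nonlinear choices for hidden neurons
y_logistic = logistic(0.0)       # 0.5 at z = 0, range (0, 1)
y_tanh = np.tanh(0.0)            # 0.0 at z = 0, range (-1, 1)
```

Inserting a nonlinearity between the layers breaks this equivalence, which is precisely what gives the hidden layers their added representational power.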

The output of the neural network can be calculated by successively computing the outputs of the neurons of each layer. The output of the units of the first hidden layer can be computed using the equations for the output of a neuron that we have seen thus far. These outputs become inputs to the neurons of the second hidden layer and thus, are effectively the new features with respect to that layer.

One of the strengths of neural networks is this power to learn new features through the learning of weights in the hidden layers. This process repeats for every layer in the neural network until the final layer, where we obtain the output of the neural network as a whole. This process of propagating the signals from the input to the output layer is known as forward propagation.
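Forward propagation can be sketched as a simple loop over the layers: each layer's outputs become the next layer's inputs. Here is a minimal NumPy version, assuming logistic activations throughout and a hypothetical 3-4-4-1 architecture matching the two-hidden-layer diagram (the random weights are for illustration only):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward-propagate input x through the layers, returning the
    activations of every layer; the last entry is the network output."""
    activations = [x]
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b        # weighted sum for each neuron
        activations.append(logistic(z))    # apply the activation function
    return activations

# Hypothetical network: 3 inputs, two hidden layers of 4, 1 output
rng = np.random.default_rng(42)
sizes = [3, 4, 4, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

acts = forward(np.array([0.2, -0.5, 1.0]), weights, biases)
print(acts[-1])    # the output of the network as a whole
```

Note that `acts[1]` and `acts[2]` are exactly the "new features" the text describes: the inputs as seen by the second hidden layer and the output layer, respectively.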

Training multilayer perceptron networks

Multilayer perceptron networks are more complicated to train than a single perceptron. The famous algorithm used to train them—that has been around since the 1980s—is known as the backpropagation algorithm. We'll give a sketch of how this algorithm works here, but the reader interested in neural networks is strongly encouraged to read up on this algorithm in more depth.

There are two very important insights to understand about this algorithm. The first is that for every observation, it proceeds in two steps. The forward propagation step begins at the input layer and ends at the output layer, and computes the predicted output of the network for this observation. This is relatively straightforward to do using the equation for the output of each neuron, which is just the application of its activation function on the linear weighted sum of its inputs.

The backward propagation step is designed to modify the weights of the network when the predicted output does not match the desired output. This step begins at the output layer, computing the error on the output nodes and the necessary updates to the weights of the output neurons. Then, it moves backwards through the network, updating the weights of each hidden layer in reverse until it reaches the first hidden layer, which is processed last. Thus, there is a forward pass through the network, followed by a backward pass.

The second important insight to understand is that updating the weights of the neurons in the hidden layer is substantially trickier than updating the weights in the output layer. To see this, consider that when we want to update the weights of neurons in the output layer, we know precisely what the desired output for that neuron should be for a given input. This is because the desired outputs of the output neurons are the outputs of the network itself, which are available to us in our training data. By contrast, at first glance, we don't actually know what the right output of a neuron in a hidden layer should be for a particular input. Additionally, this output is distributed to all the neurons of the next layer in the network and hence impacts all of their outputs as well.

The key insight here is that we propagate the error made in the output neurons back to the neurons in the hidden layers. We do this by finding the gradient of the cost function to adjust the weights of the neurons in the direction of the greatest error reduction and apply the chain rule of differentiation to express this gradient in terms of the output of the individual neuron we are interested in. This process results in a general formula for updating the weights of any neuron in the network, known as the delta update rule:

$$w_{ji}^{(n+1)} = w_{ji}^{(n)} + \eta\,\delta_j^{(n)}\,y_i^{(n)}$$

Let's understand this equation by assuming that we are currently processing the weights for all the neurons in layer l. This equation tells us how to update the weight between the jth neuron in layer l and the ith neuron in the layer before it (layer l-1). The (n) superscripts all denote the fact that we are currently updating the weight as a result of processing the nth observation in our data set. We will drop these from now on, and assume they are implied.

In a nutshell, the delta rule tells us that to obtain the new value of the neuron weight, we must add a product of three terms to the old value. The first of these terms is the learning rate η. The second is known as the local gradient, δj, and is the product of the error, ej, of neuron j and the gradient of its activation function, g():

$$\delta_j = e_j\,g'(z_j)$$

Here, we denote the output of neuron j before applying its activation function by zj, so that the following relation holds:

$$y_j = g(z_j)$$

It turns out that the local gradient is also the gradient of the cost function of the network computed with respect to zj. Finally, the third term in the delta update rule is the input to neuron j from neuron i, which is just the output of neuron i, yi. The only term that differs between output layer neurons and hidden layer neurons is the local gradient term. As an illustrative example, we'll consider neural networks that perform classification and use logistic neurons throughout. When neuron j is an output neuron, the local gradient is given by:

$$\delta_j = (t_j - y_j)\,y_j\,(1 - y_j)$$

The first term in brackets is just the known error of the output neuron, this being the difference between the target output, tj, and the actual output, yj. The other two terms arise from the differentiation of the logistic activation function. When neuron j is a hidden layer neuron, the gradient of the logistic activation function is the same, but the error term is computed as the weighted sum of the local gradients of the k neurons in the next layer that receive input from neuron j:

$$\delta_j = y_j\,(1 - y_j)\sum_k \delta_k\,w_{kj}$$
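The whole procedure, forward pass followed by a backward pass applying the delta rule at every layer, can be sketched in a few lines of NumPy. This is a minimal illustration for a logistic network, assuming a squared-error cost and made-up toy data; it is not an efficient or production implementation:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the outputs of every layer; ys[0] is the input itself."""
    ys = [x]
    for W, b in zip(weights, biases):
        ys.append(logistic(W @ ys[-1] + b))
    return ys

def backprop_step(x, t, weights, biases, eta=0.5):
    """One forward and backward pass for a single observation,
    applying the delta rule w_ji <- w_ji + eta * delta_j * y_i."""
    ys = forward(x, weights, biases)

    # Output layer: delta_j = (t_j - y_j) * y_j * (1 - y_j)
    delta = (t - ys[-1]) * ys[-1] * (1 - ys[-1])
    for l in range(len(weights) - 1, -1, -1):
        grad_W = np.outer(delta, ys[l])     # delta_j * y_i for each weight
        grad_b = delta.copy()               # bias input is fixed at 1
        if l > 0:
            # Hidden layer: error term is the weighted sum of the next
            # layer's local gradients, times the logistic derivative
            delta = (weights[l].T @ delta) * ys[l] * (1 - ys[l])
        weights[l] += eta * grad_W
        biases[l] += eta * grad_b

# Toy demo: repeated updates on one observation shrink its error
rng = np.random.default_rng(0)
sizes = [2, 4, 1]                           # 2 inputs, 4 hidden, 1 output
weights = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
x, t = np.array([0.5, -0.3]), np.array([1.0])

err_before = abs(t - forward(x, weights, biases)[-1])[0]
for _ in range(50):
    backprop_step(x, t, weights, biases)
err_after = abs(t - forward(x, weights, biases)[-1])[0]
```

Note that each hidden layer's local gradient is computed from the *old* weights of the layer above before those weights are updated, matching the backward sweep described in the text.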