Understanding the backpropagation algorithm

The backpropagation learning algorithm is used to train a multilayer perceptron ANN from a given set of sample values. In brief, the algorithm first calculates the output value for a given set of input values and then calculates the amount of error in this output. The error is determined by comparing the predicted output value of the ANN to the expected output value for the given inputs, as supplied by the training data. The calculated error is then used to modify the weights of the ANN. Thus, after training the ANN with a reasonable number of samples, the ANN will be able to predict the output value for a set of input values. The algorithm comprises three distinct phases, which are as follows:

  • A forward propagation phase
  • A backpropagation phase
  • A weight update phase

The weights of the synapses in the ANN are first initialized to random values within the range $-\epsilon$ to $\epsilon$, for some positive constant $\epsilon$. We initialize the weights to values within this range to avoid symmetry in the weight matrices. This avoidance of symmetry is called symmetry breaking, and it is performed so that each iteration of the backpropagation algorithm produces a noticeable change in the weights of the synapses in the ANN. This is desirable because each node in the ANN should learn independently of the other nodes. If all the nodes had identical weights, the estimated learning model would be either overfit or underfit.

The backpropagation learning algorithm also requires two additional parameters: the learning rate $\rho$ and the learning momentum $\mu$. We will see the effects of these parameters in the example later in this section.

The forward propagation phase of the algorithm simply calculates the activation values of all nodes in the various layers of the ANN. As we mentioned earlier, the activation values of the nodes in the input layer are the input values and the bias input of the ANN. This can be formally defined by using the following equation:

$$a^{(1)} = \begin{bmatrix} 1 \\ x \end{bmatrix}$$

Here, $x$ is the vector of input values from a given sample, and the leading $1$ is the bias input of the ANN.

Using these activation values from the input layer of the ANN, the activation of the nodes in the other layers of the ANN is determined. This is done by applying the activation function to the products of the weight matrix of a given layer and the activation values from the previous layer in the ANN. This can be formally expressed as follows:

$$a^{(l+1)} = g\left( \Theta^{(l)} a^{(l)} \right)$$

Here, $g$ is the activation function and $\Theta^{(l)}$ is the weight matrix of the synapse layer between layers $l$ and $l+1$.

The preceding equation states that the activation values of a layer $l+1$ are equal to the activation function applied to the product of the given synapse layer's weight matrix and the activation values of the previous layer. Next, the activation values of the output layer are backpropagated. By this, we mean that the activation values are traversed from the output layer through the hidden layers to the input layer of the ANN. During this phase, we determine the amount of error, or delta, in each node in the ANN. The delta values of the output layer are determined by calculating the difference between the activation values of the output layer, $a^{(L)}$, and the expected output values, $y$. This difference calculation can be summarized by the following equation:

$$\delta^{(L)} = a^{(L)} - y$$

Here, $L$ denotes the output layer of the ANN and $y$ is the vector of expected output values from the training sample.

The term $\delta^{(l)}$ of a layer $l$ is a matrix of size $j \times 1$, where $j$ is the number of nodes in layer $l$. This term can be formally defined as follows:

$$\delta^{(l)} = \begin{bmatrix} \delta^{(l)}_1 & \delta^{(l)}_2 & \cdots & \delta^{(l)}_j \end{bmatrix}^T$$

Here, $\delta^{(l)}_i$ is the error of the $i$-th node in layer $l$.

The delta terms of the layers other than the output layer of the ANN are determined by the following equality:

$$\delta^{(l)} = \left( \Theta^{(l)} \right)^T \delta^{(l+1)} \circ g'\!\left( a^{(l)} \right)$$

In the preceding equation, the binary operation $\circ$ is used to represent an element-wise multiplication of two matrices of equal size. Note that this operation is different from matrix multiplication, and an element-wise multiplication will return a matrix composed of the products of the elements with the same position in two matrices of equal size. The term $g'(a^{(l)})$ represents the derivative of the activation function used in the ANN. As we are using the sigmoid function as our activation function, the term $g'(a^{(l)})$ has the value $a^{(l)} \circ (1 - a^{(l)})$.
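
For illustration, consider an ANN with two nodes in the input layer, three nodes in a single hidden layer, and one node in the output layer, which is the structure we will use for the XOR example later in this section. Counting the bias entry in each non-output layer, $a^{(1)}$ is a $3 \times 1$ matrix (two inputs plus the bias), $\Theta^{(1)}$ is a $3 \times 3$ matrix (three hidden nodes by two inputs plus the bias), $\Theta^{(2)}$ is a $1 \times 4$ matrix (one output node by three hidden nodes plus the bias), and $a^{(3)}$ and $\delta^{(3)}$ are both $1 \times 1$ matrices.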

Thus, we can calculate the delta values of all nodes in the ANN. We can use these delta values to determine the gradients of the synapses of the ANN. We now move on to the final weight update phase of the backpropagation algorithm.

The gradients of the various synapses are first initialized to matrices with all the elements as 0. The gradient matrix of a given synapse has the same dimensions as the weight matrix of that synapse. The gradient term $\Delta^{(l)}$ represents the gradients of the synapse layer that is present immediately after layer $l$ in the ANN. The initialization of the gradients of the synapses in the ANN is formally expressed as follows:

$$\Delta^{(l)}_{ij} = 0 \quad \text{for all } i \text{ and } j$$

For each sample value in the training data, we calculate the deltas and activation values of all nodes in the ANN. These values are added to the gradients of the synapses using the following equation:

$$\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} \left( a^{(l)} \right)^T$$

We then calculate the average of the gradients over all the sample values and use these averaged gradients, together with the weight changes applied in the previous iteration (the momentum term), to update the weight matrix as follows:

$$\Theta^{(l)} := \Theta^{(l)} - \left( \rho \, \tfrac{1}{m} \Delta^{(l)} + \mu \, \Delta\Theta^{(l)}_{prev} \right)$$

Here, $m$ is the number of samples, $\rho$ is the learning rate, $\mu$ is the learning momentum, and $\Delta\Theta^{(l)}_{prev}$ is the change that was applied to $\Theta^{(l)}$ in the previous iteration (a matrix of zeros in the first iteration).

Thus, the learning rate and learning momentum parameters of the algorithm come into play only in the weight update phase. The preceding three equations represent a single iteration of the backpropagation algorithm. A large number of iterations must be performed until the overall error in the ANN converges to a small value. We can now summarize the backpropagation learning algorithm using the following steps:

  1. Initialize the weights of the synapses of the ANN to random values.
  2. Select a sample from the training data and forward propagate its input values through the layers of the ANN to generate the activations of every node in the ANN.
  3. Backpropagate the activations generated by the last layer of the ANN through the hidden layers and to the input layer of the ANN. Through this step, we calculate the error or delta of every node in the ANN.
  4. Calculate the products of the errors generated in step 3 with the input activations of the corresponding nodes in the ANN. This step produces the gradient for each weight in the network, which indicates how much, and in which direction, that weight should be adjusted.
  5. Calculate the changes in the weights of the synapse layers in the ANN using the gradients, the learning rate, and the learning momentum. These changes are then subtracted from the weights of the synapses in the ANN. This is essentially the weight update step of the backpropagation algorithm.
  6. Repeat steps 2 to 5 for the rest of the samples in the training data.

There are several distinct parts in the backpropagation learning algorithm, and we will now implement each part and combine them into a complete implementation. As the deltas and weights of the synapses and activations in an ANN can be represented by matrices, we can write a vectorized implementation of this algorithm.

Note

Note that for the following example, we require functions from the incanter.core namespace from the Incanter library. The functions in this namespace actually use the Clatrix library for the representation of a matrix and its manipulation.

Let's assume that we need to implement an ANN to model a logical XOR gate. The sample data is simply the truth table of the XOR gate and can be represented as a vector, shown as follows:

;; truth table for XOR logic gate
(def sample-data [[[0 0] [0]]
                  [[0 1] [1]]
                  [[1 0] [1]]
                  [[1 1] [0]]])

Each element defined in the preceding vector sample-data is itself a vector comprising other vectors for the input and output values of an XOR gate. We will use this vector as our training data for building an ANN. This is essentially a classification problem, and we will use ANNs to model it. In abstract terms, an ANN should be capable of performing both binary and multiclass classifications. We can define the protocol of an ANN as follows:

(defprotocol NeuralNetwork
  (run        [network inputs])
  (run-binary [network inputs])
  (train-ann  [network samples]))

The NeuralNetwork protocol defined in the preceding code has three functions. The train-ann function can be used to train the ANN and requires some sample data. The run and run-binary functions can be used on this ANN to perform multiclass and binary classifications, respectively. Both the run and run-binary functions require a set of input values.

The first step of the backpropagation algorithm is the initialization of the weights of the synapses of the ANN. We can use the rand and matrix functions to generate these weights as a matrix, shown as follows:

(defn rand-list
  "Create a list of random doubles between 
  -epsilon and +epsilon."
  [len epsilon]
  (map (fn [x] (- (rand (* 2 epsilon)) epsilon))
         (range 0 len)))

(defn random-initial-weights
  "Generate random initial weight matrices for given layers.
  layers must be a vector of the sizes of the layers."
  [layers epsilon]
  (for [i (range 0 (dec (length layers)))]
    (let [cols (inc (get layers i))
          rows (get layers (inc i))]
      (matrix (rand-list (* rows cols) epsilon) cols))))

The rand-list function shown in the preceding code creates a list of random elements in the positive and negative range of epsilon. As we described earlier, we choose this range to break the symmetry of the weight matrix.

The random-initial-weights function generates several weight matrices for the different layers of the ANN. As defined in the preceding code, the layers argument must be a vector of the sizes of the layers of the ANN. For an ANN with two nodes in the input layer, three nodes in the hidden layer, and one node in the output layer, we pass layers as [2 3 1] to the random-initial-weights function. Each weight matrix has a number of rows equal to the number of nodes in the next layer of the ANN and a number of columns equal to the number of nodes in the current layer plus one; the extra column accounts for the bias input of the neural layer. Note that we use a slightly different form of the matrix function. This form takes a single vector and partitions it into a matrix with the number of columns specified by the second argument to this function. Thus, the vector passed to this form of the matrix function must have (* rows cols) elements, where rows and cols are the number of rows and columns, respectively, in the weight matrix.
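
For instance, calling random-initial-weights with the layer sizes [2 3 1] and an epsilon of 0.25 (a value chosen purely for illustration) should produce a 3 x 3 weight matrix for the synapse between the input and hidden layers and a 1 x 4 weight matrix for the synapse between the hidden and output layers. The element values differ on every call, but the dimensions can be checked with the dim function:

user> (map dim (random-initial-weights [2 3 1] 0.25))
([3 3] [1 4])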

As we will need to apply the sigmoid function to all the activations of a layer in the ANN, we must define a function that applies the sigmoid function on all the elements in a given matrix. We can use the div, plus, exp, and minus functions from the incanter.core namespace to implement such a function, as shown in the following code:

(defn sigmoid
  "Apply the sigmoid function 1/(1+exp(-z)) to all 
  elements in the matrix z."
  [z]
  (div 1 (plus 1 (exp (minus z)))))

Note

Note that the div, plus, exp, and minus functions apply the corresponding arithmetic operation to all the elements in a given matrix and return a new matrix.
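
For example, applying the sigmoid function to a column matrix of the values 0 and 1 should produce values of approximately 0.5 and 0.731, since $1/(1+e^{0}) = 0.5$ and $1/(1+e^{-1}) \approx 0.731$. The printed representation of a matrix varies with the Incanter version, so the result is shown here as a comment:

user> (sigmoid (matrix [0 1]))
;; => a 2x1 matrix with values of approximately 0.5 and 0.731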

We will also need to implicitly add a bias node to each layer in an ANN. This can be done by wrapping the bind-rows function, which adds a row of elements to a matrix, as shown in the following code:

(defn bind-bias
  "Add the bias input to a vector of inputs."
  [v]
  (bind-rows [1] v))

Since the bias value is always 1, we specify the row of elements as [1] to the bind-rows function.
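
For example, applying bind-bias to a column matrix of the input values 0 and 1 should return a 3 x 1 matrix with the bias input prepended (again shown as a comment, since the printed form of a matrix varies):

user> (bind-bias (matrix [0 1]))
;; => a 3x1 matrix containing the values 1, 0, and 1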

Using the functions defined earlier, we can implement forward propagation. We essentially have to multiply the weight matrix of a given synapse between two layers in the ANN by the activation values of the previous layer and then apply the sigmoid function to each of the generated activation values, as shown in the following code:

(defn matrix-mult
  "Multiply two matrices and ensure the result is also a matrix."
  [a b]
  (let [result (mmult a b)]
    (if (matrix? result)
      result
      (matrix [result]))))

(defn forward-propagate-layer
  "Calculate activations for layer l+1 given weight matrix 
  of the synapse between layer l and l+1 and layer l activations."
  [weights activations]
  (sigmoid (matrix-mult weights activations)))

(defn forward-propagate
  "Propagate activation values through a network's
  weight matrix and return output layer activation values."
  [weights input-activations]
  (reduce #(forward-propagate-layer %2 (bind-bias %1))
          input-activations weights))

In the preceding code, we first define a matrix-mult function, which performs matrix multiplication and ensures that the result is a matrix. Note that to define matrix-mult, we use the mmult function instead of the mult function, which multiplies the corresponding elements in two matrices of the same size.

Using the matrix-mult and sigmoid functions, we can implement the forward propagation step between two layers in the ANN. This is done in the forward-propagate-layer function, which simply multiplies the matrices representing the weights of the synapse between two layers in the ANN and the input activation values while ensuring that the returned value is always a matrix. To propagate a given set of values through all the layers of an ANN, we must add a bias input and apply the forward-propagate-layer function for each layer. This can be done concisely using the reduce function over a closure of the forward-propagate-layer function as shown in the forward-propagate function defined in the preceding code.
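
For instance, we can forward propagate an input vector through a freshly initialized, and therefore untrained, network. The var w and the epsilon value of 0.25 below are used purely for illustration; with small random weights, the output activation will lie fairly close to 0.5, and its exact value will vary from run to run:

user> (def w (random-initial-weights [2 3 1] 0.25))
#'user/w
user> (forward-propagate w (matrix [0 1]))
;; => a 1x1 matrix with a value fairly close to 0.5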

Although the forward-propagate function can determine the output activations of the ANN, we actually require the activations of all the nodes in the ANN to use backpropagation. We can do this by translating the reduce function to a recursive function and introducing an accumulator variable to store the activations of every layer in the ANN. The forward-propagate-all-activations function, which is defined in the following code, implements this idea and uses the loop form to recursively apply the forward-propagate-layer function:

(defn forward-propagate-all-activations
  "Propagate activation values through the network 
  and return all activation values for all nodes."
  [weights input-activations]
  (loop [all-weights     weights
         activations     (bind-bias input-activations)
         all-activations [activations]]
    (let [[weights
           & all-weights']  all-weights
           last-iter?       (empty? all-weights')
           out-activations  (forward-propagate-layer
                             weights activations)
           activations'     (if last-iter? out-activations
                                (bind-bias out-activations))
           all-activations' (conj all-activations activations')]
      (if last-iter? all-activations'
          (recur all-weights' activations' all-activations')))))

The forward-propagate-all-activations function defined in the preceding code requires all the weights of the nodes in the ANN and the input values to pass through the ANN as activation values. We first use the bind-bias function to add the bias input to the input activations of the ANN. We then store this value in an accumulator, that is, the variable all-activations, as a vector of all the activations in the ANN. The forward-propagate-layer function is then applied over the weight matrices of the various layers of the ANN, and each iteration adds a bias input to the input activations of the corresponding layer in the ANN.

Note

Note that we do not add the bias input in the last iteration as it computes the output layer of the ANN. Thus, the forward-propagate-all-activations function applies forward propagation of input values through an ANN and returns the activations of every node in the ANN. Note that the activation values in this vector are in the order of the layers of the ANN.

We will now implement the backpropagation phase of the backpropagation learning algorithm. First, we have to implement a function to calculate the error term $\delta^{(l)}$ of a layer from the equation $\delta^{(l)} = (\Theta^{(l)})^T \delta^{(l+1)} \circ a^{(l)} \circ (1 - a^{(l)})$. We will do this with the help of the following code:

(defn back-propagate-layer
  "Back propagate deltas (from layer l+1) and 
  return layer l deltas."
  [deltas weights layer-activations]
  (mult (matrix-mult (trans weights) deltas)
        (mult layer-activations (minus 1 layer-activations))))

The back-propagate-layer function defined in the preceding code calculates the errors, or deltas, of a layer l in the ANN from the weight matrix of the synapse that follows layer l, the deltas of the next layer, and the activations of layer l itself.

Note

Note that we use matrix multiplication only to calculate the term $(\Theta^{(l)})^T \delta^{(l+1)}$, via the matrix-mult function. All other multiplication operations are element-wise multiplications of matrices, which are done using the mult function.

Essentially, we have to apply this function from the output layer back to the input layer, through the various hidden layers of the ANN, to produce the delta values of every node in the ANN. These delta values are then multiplied with the activations of the nodes to produce the gradients by which we must adjust the weights of the synapses in the ANN. We can do this in a manner similar to the forward-propagate-all-activations function, that is, by recursively applying the back-propagate-layer function over the various layers of the ANN. Of course, we have to traverse the layers of the ANN in the reverse order, that is, starting from the output layer, through the hidden layers, to the input layer. We will do this with the help of the following code:

(defn calc-deltas
  "Calculate hidden deltas for back propagation.
  Returns all deltas including output-deltas."
  [weights activations output-deltas]
  (let [hidden-weights     (reverse (rest weights))
        hidden-activations (rest (reverse (rest activations)))]
    (loop [deltas          output-deltas
           all-weights     hidden-weights
           all-activations hidden-activations
           all-deltas      (list output-deltas)]
      (if (empty? all-weights) all-deltas
        (let [[weights
               & all-weights']      all-weights
               [activations
                & all-activations'] all-activations
              deltas'        (back-propagate-layer
                               deltas weights activations)
              all-deltas'    (cons (rest deltas') 
                                    all-deltas)]
          (recur deltas' all-weights' 
                 all-activations' all-deltas'))))))

The calc-deltas function determines the delta values of all the perceptron nodes in the ANN. For this calculation, the input and output activations are not needed; only the hidden activations, bound to the hidden-activations variable, are needed to calculate the delta values. Similarly, the weight matrix of the first synapse layer is skipped when binding the hidden-weights variable. The calc-deltas function then applies the back-propagate-layer function to the weight matrices of each remaining synapse layer in the ANN, thus determining the deltas of all the nodes in the ANN. Note that we don't add the delta of the bias node to a computed set of deltas. This is done using the rest function, (rest deltas'), on the calculated deltas of a given synapse layer, as the first delta is that of the bias input of a given layer.

By definition, the gradient terms for a given synapse layer, $\Delta^{(l)}$, are determined by multiplying the matrices $\delta^{(l+1)}$ and $(a^{(l)})^T$, which represent the deltas of the next layer and the transposed activations of the given layer, respectively. We will do this with the help of the following code:

(defn calc-gradients
  "Calculate gradients from deltas and activations."
  [deltas activations]
  (map #(mmult %1 (trans %2)) deltas activations))

The calc-gradients function shown in the preceding code is a concise implementation of the term $\delta^{(l+1)} (a^{(l)})^T$. As we will be dealing with a sequence of delta and activation terms, we use the map function to apply the preceding equality to the corresponding deltas and activations in the ANN. Using the calc-deltas and calc-gradients functions, we can determine the total error in the weights of all nodes in the ANN for a given training sample. We will do this with the help of the following code:

(defn calc-error
  "Calculate deltas and squared error for given weights."
  [weights [input expected-output]]
  (let [activations    (forward-propagate-all-activations 
                        weights (matrix input))
        output         (last activations)
        output-deltas  (minus output expected-output)
        all-deltas     (calc-deltas 
                        weights activations output-deltas)
        gradients      (calc-gradients all-deltas activations)]
    (list gradients
          (sum (pow output-deltas 2)))))

The calc-error function defined in the preceding code requires two parameters: the weight matrices of the synapse layers in the ANN and a sample training value, which is shown as [input expected-output]. The activations of all the nodes in the ANN are first calculated using the forward-propagate-all-activations function, and the delta values of the output layer are calculated as the difference between the actual output values produced by the ANN and the expected output values. The output value calculated by the ANN is simply the last activation value produced by the ANN, shown as (last activations) in the preceding code. Using the calculated activations, the deltas of all the perceptron nodes are determined via the calc-deltas function. These delta values are in turn used to determine the gradients of the weights in the various layers of the ANN using the calc-gradients function. The squared error of the ANN for the given sample value is also calculated by summing the squares of the delta values of the output layer; these squared errors are later averaged over all samples to produce the Mean Square Error (MSE) of the ANN.

For a given weight matrix of a layer in the ANN, we must initialize the gradients for the layer as a matrix with the same dimensions as the weight matrix, and all the elements in the gradient matrix must be set to 0. This can be implemented using a composition of the dim function, which returns the size of a matrix as a vector, and a variant form of the matrix function, as shown in the following code:

(defn new-gradient-matrix
  "Create accumulator matrix of gradients with the
  same structure as the given weight matrix
  with all elements set to 0."
  [weight-matrix]
  (let [[rows cols] (dim weight-matrix)]
    (matrix 0 rows cols)))

In the new-gradient-matrix function defined in the preceding code, the matrix function expects a value, the number of rows and the number of columns to initialize a matrix. This function produces an initialized gradient matrix with the same structure as the supplied weight matrix.
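
For example, given a hypothetical 2 x 3 weight matrix, the new-gradient-matrix function should return a 2 x 3 matrix of zeros:

user> (new-gradient-matrix (matrix [[0.1 0.2 0.3] [0.4 0.5 0.6]]))
;; => a 2x3 matrix with all elements equal to 0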

We now implement the calc-gradients-and-error function to apply the calc-error function on a set of weight matrices and sample values. We must basically apply the calc-error function to each sample and accumulate the sums of the gradient and squared error values. We then calculate the averages of these accumulated values to return the gradient matrices and the MSE for the given sample values and weight matrices. We will do this with the help of the following code:

(defn calc-gradients-and-error' [weights samples]
  (loop [gradients   (map new-gradient-matrix weights)
         total-error 0          ; accumulated sum of squared errors
         samples     samples]
    (let [[sample & samples']  samples
          [new-gradients
           squared-error]      (calc-error weights sample)
          gradients'           (map plus new-gradients gradients)
          total-error'         (+ total-error squared-error)]
      (if (empty? samples')
        (list gradients' total-error')
        (recur gradients' total-error' samples')))))

(defn calc-gradients-and-error
  "Calculate gradients and MSE for sample
  set and weight matrix."
  [weights samples]
  (let [num-samples   (length samples)
        [gradients
         total-error] (calc-gradients-and-error'
                       weights samples)]
    (list
      (map #(div % num-samples) gradients)    ; gradients
      (/ total-error num-samples))))          ; MSE

The calc-gradients-and-error function defined in the preceding code relies on the calc-gradients-and-error' helper function. The calc-gradients-and-error' function initializes the gradient matrices, performs the application of the calc-error function, and accumulates the calculated gradient values and MSE. The calc-gradients-and-error function simply calculates the average of the accumulated gradient matrices and MSE returned from the calc-gradients-and-error' function.

Now, the only missing piece in our implementation is modifying the weights of the nodes in the ANN using calculated gradients. In brief, we must repeatedly update the weights until a convergence in the MSE is observed. This is actually a form of gradient descent applied to the nodes of an ANN. We will now implement this variant of gradient descent in order to train the ANN by repeatedly modifying the weights of the nodes in the ANN, as shown in the following code:

(defn gradient-descent-complete?
  "Returns true if gradient descent is complete."
  [network iter mse]
  (let [options (:options network)]
    (or (>= iter (:max-iters options))
        (< mse (:desired-error options)))))

The gradient-descent-complete? function defined in the preceding code simply checks for the termination condition of gradient descent. This function assumes that the ANN, represented as a network, is a map or record that contains the :options keyword. The value of this key is in turn another map that contains the various configuration options of the ANN. The gradient-descent-complete? function checks whether the total MSE of the ANN is less than the desired MSE, which is specified by the :desired-error option. Also, we add another condition to check if the number of iterations performed exceeds the maximum number of iterations specified by the :max-iters option.

Now, we will implement a gradient-descent function for multilayer perceptron ANNs. In this implementation, the changes in the weights are calculated by a step function that is supplied to the gradient descent function. These calculated changes are then simply added to the existing weights of the synapse layers of the ANN. We will implement the gradient-descent function for multilayer perceptron ANNs with the help of the following code:

(defn apply-weight-changes
  "Applies changes to corresponding weights."
  [weights changes]
  (map plus weights changes))

(defn gradient-descent
  "Perform gradient descent to adjust network weights."
  [step-fn init-state network samples]
  (loop [network network
         state init-state
         iter 0]
    (let [iter     (inc iter)
          weights  (:weights network)
          [gradients
           mse]    (calc-gradients-and-error weights samples)]
      (if (gradient-descent-complete? network iter mse)
        network
        (let [[changes state] (step-fn network gradients state)
              new-weights     (apply-weight-changes 
                               weights changes)
              network         (assoc network 
                              :weights new-weights)]
          (recur network state iter))))))

The apply-weight-changes function defined in the preceding code simply adds the weights and the calculated changes in the weights of the ANN. The gradient-descent function requires a step function (specified as step-fn), an initial state, the ANN itself, and the sample data with which to train the ANN. The step-fn function must calculate the changes in the weights from the ANN, the calculated gradient matrices, and the current state, and it also returns the new state. The weights of the ANN are then updated using the apply-weight-changes function, and this iteration is performed repeatedly until the gradient-descent-complete? function returns true. The weights of the ANN are specified by the :weights keyword in the network map, and they are updated by simply overwriting the value of this key using the assoc function.

In the context of the backpropagation algorithm, we need to specify the learning rate and learning momentum by which the ANN must be trained. These parameters are needed to determine the changes in the weights of the nodes in the ANN. A function implementing this calculation must then be specified as the step-fn parameter to the gradient-descent function, as shown in the following code:

(defn calc-weight-changes
  "Calculate weight changes:
  changes = learning rate * gradients + 
            learning momentum * deltas."
  [gradients deltas learning-rate learning-momentum]
  (map #(plus (mult learning-rate %1)
              (mult learning-momentum %2))
       gradients deltas))

(defn bprop-step-fn [network gradients deltas]
  (let [options             (:options network)
        learning-rate       (:learning-rate options)
        learning-momentum   (:learning-momentum options)
        changes             (calc-weight-changes
                             gradients deltas
                             learning-rate learning-momentum)]
    [(map minus changes) changes]))

(defn gradient-descent-bprop [network samples]
  (let [gradients (map new-gradient-matrix (:weights network))]
    (gradient-descent bprop-step-fn gradients
                      network samples)))

The calc-weight-changes function defined in the preceding code calculates the change in the weights, $\Delta\Theta^{(l)}$, of a given layer as $\rho \, \Delta^{(l)} + \mu \, \Delta\Theta^{(l)}_{prev}$, where $\Delta^{(l)}$ is the averaged gradient matrix and $\Delta\Theta^{(l)}_{prev}$ is the weight change from the previous iteration, passed in as the deltas parameter. The bprop-step-fn function extracts the learning rate and learning momentum parameters from the ANN that is represented by network and uses the calc-weight-changes function. As the changes will be added to the weights by the gradient-descent function, we return the changes in the weights as negative values using the minus function.
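
As a small worked example with hypothetical single-element matrices, a gradient of 0.1, a previous weight change of 0.2, a learning rate of 0.3, and a learning momentum of 0.01 yield a change of 0.3 * 0.1 + 0.01 * 0.2 = 0.032:

user> (calc-weight-changes [(matrix [0.1])] [(matrix [0.2])] 0.3 0.01)
;; => a sequence containing a single 1x1 matrix with the value 0.032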

The gradient-descent-bprop function simply initializes the gradient matrices for the given weights of the ANN and calls the gradient-descent function by specifying bprop-step-fn as the step function to be used. Using the gradient-descent-bprop function, we can implement the abstract NeuralNetwork protocol we had defined earlier, as follows:

(defn round-output
  "Round outputs to nearest integer."
  [output]
  (mapv #(Math/round ^Double %) output))

(defrecord MultiLayerPerceptron [options]
  NeuralNetwork

  ;; Calculates the output values for the given inputs.
  (run [network inputs]
    (let [weights (:weights network)
          input-activations (matrix inputs)]
      (forward-propagate weights input-activations)))

  ;; Rounds the output values to binary values for
  ;; the given inputs.
  (run-binary [network inputs]
    (round-output (run network inputs)))

  ;; Trains a multilayer perceptron ANN from sample data.
  (train-ann [network samples]
    (let [options         (:options network)
          hidden-neurons  (:hidden-neurons options)
          epsilon         (:weight-epsilon options)
          [first-in
           first-out]     (first samples)
          num-inputs      (length first-in)
          num-outputs     (length first-out)
          sample-matrix   (map #(list (matrix (first %)) 
                                      (matrix (second %)))
                               samples)
          layer-sizes     (conj (vec (cons num-inputs 
                                           hidden-neurons))
                                num-outputs)
          new-weights     (random-initial-weights 
                           layer-sizes epsilon)
          network         (assoc network :weights new-weights)]
      (gradient-descent-bprop network sample-matrix))))

The MultiLayerPerceptron record defined in the preceding code trains a multilayer perceptron ANN using the gradient-descent-bprop function. The train-ann function first extracts the number of hidden neurons and the constant $\epsilon$ (the :weight-epsilon option) from the options map specified for the ANN. The sizes of the various synapse layers in the ANN are then determined from the sample data and bound to the layer-sizes variable. The weights of the ANN are initialized using the random-initial-weights function and updated in the record network using the assoc function. Finally, the gradient-descent-bprop function is called to train the ANN using the backpropagation learning algorithm.

The ANN defined by the MultiLayerPerceptron record also implements two other functions, run and run-binary, from the NeuralNetwork protocol. The run function uses the forward-propagate function to determine the output values of a trained MultiLayerPerceptron ANN. The run-binary function simply rounds the value of the output returned by the run function for the given set of input values.

An ANN created using the MultiLayerPerceptron record requires a single options parameter containing the various options we can specify for the ANN. We can define the default options for such an ANN as follows:

(def default-options
  {:max-iters 100
   :desired-error 0.20
   :hidden-neurons [3]
   :learning-rate 0.3
   :learning-momentum 0.01
   :weight-epsilon 50})

(defn train [samples]
  (let [network (MultiLayerPerceptron. default-options)]
    (train-ann network samples)))

The map defined by the default-options variable contains the following keys that specify the options for the MultiLayerPerceptron ANN:

  • :max-iters: This key specifies the maximum number of iterations that the gradient-descent function will run.
  • :desired-error: This key specifies the expected or acceptable MSE of the ANN.
  • :hidden-neurons: This key specifies the number of hidden neural nodes in the network. The value [3] represents a single hidden layer with three neurons.
  • :learning-rate and :learning-momentum: These keys specify the learning rate and learning momentum for the weight update phase of the backpropagation learning algorithm.
  • :weight-epsilon: This key specifies the constant $\epsilon$ used by the random-initial-weights function to initialize the weights of the ANN.

We also define a simple helper function train to create an ANN of the MultiLayerPerceptron type and train the ANN using the train-ann function and the sample data specified by the samples parameter. We can now create a trained ANN from the training data specified by the sample-data variable as follows:

user> (def MLP (train sample-data))
#'user/MLP

We can then use the trained ANN to predict the output of some input values. The output generated by the ANN defined by MLP closely matches the output of an XOR gate as follows:

user> (run-binary MLP  [0 1])
[1]
user> (run-binary MLP  [1 0])
[1]

However, the trained ANN produces incorrect outputs for some set of inputs as follows:

user> (run-binary MLP  [0 0])
[0]
user> (run-binary MLP  [1 1]) ;; incorrect output generated
[1]

There are several measures we can take to improve the accuracy of the trained ANN. First, we can regularize the calculated gradients using the weight matrices of the ANN; this modification would produce a noticeable improvement in the preceding implementation. We can also increase the maximum number of iterations to be performed, and we can tune the algorithm further by tweaking the learning rate, the learning momentum, and the number of hidden nodes in the ANN. These modifications are left as an exercise for the reader.
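
As a minimal sketch of the first of these measures, the averaged gradients could be penalized with a term proportional to the corresponding weights before the weight update. The regularize-gradients function and the lambda coefficient below are hypothetical additions that are not part of the preceding implementation, and for brevity this sketch does not exclude the bias column, which a more careful implementation would:

(defn regularize-gradients
  "Add the regularization term (lambda * weights) to the
  averaged gradient matrices of the ANN (hypothetical sketch)."
  [gradients weights lambda]
  (map (fn [gradient weight-matrix]
         (plus gradient (mult lambda weight-matrix)))
       gradients weights))

Such a function could be applied to the averaged gradients returned by calc-gradients-and-error before they are passed to the step function.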

Instead of implementing an ANN from scratch, we can use the Enclog library (http://github.com/jimpil/enclog), which is a Clojure wrapper for the Encog library of machine learning algorithms and ANNs. The Encog library (http://github.com/encog) has two primary implementations: one in Java and one in .NET. We can use the Enclog library to easily generate customized ANNs to model both supervised and unsupervised machine learning problems.

Note

The Enclog library can be added to a Leiningen project by adding the following dependencies to the project.clj file:

[org.encog/encog-core "3.1.0"]
[enclog "0.6.3"]

Note that the Enclog library requires the Encog Java library as a dependency.

For the example that will follow, the namespace declaration should look similar to the following declaration:

(ns my-namespace
  (:use [enclog nnets training]))

We can create an ANN from the Enclog library using the neural-pattern and network functions from the enclog.nnets namespace. The neural-pattern function is used to specify a neural network model for the ANN. The network function accepts a neural network model returned from the neural-pattern function and creates a new ANN. We can provide several options to the network function depending on the specified neural network model. A feed-forward multilayer perceptron network is defined as follows:

(def mlp (network (neural-pattern :feed-forward)
                  :activation :sigmoid
                  :input      2
                  :output     1
                  :hidden     [3]))

For a feed-forward neural network, we can specify the activation function with the :activation key passed to the network function. For our example, we use the sigmoid function, specified as :sigmoid, as the activation function for the ANN's nodes. We also specify the number of nodes in the input, output, and hidden layers of the ANN using the :input, :output, and :hidden keys.

To train an ANN created by the network function with some sample data, we use the trainer and train functions from the enclog.training namespace. The learning algorithm to be used to train the ANN must be specified as the first parameter to the trainer function. For the backpropagation algorithm, this parameter is the :back-prop keyword. The value returned by the trainer function represents an ANN as well as the learning algorithm to be used to train the ANN. The train function is then used to actually run the specified training algorithm on the ANN. We will do this with the help of the following code:

(defn train-network [network data trainer-algo]
  (let [trainer (trainer trainer-algo
                         :network network
                         :training-set data)]
    (train trainer 0.01 1000 []))) ;; 0.01 is the expected error

The train-network function defined in the preceding code takes three parameters. The first parameter is an ANN created by the network function, the second parameter is the training data, and the third parameter specifies the learning algorithm with which the ANN must be trained. As shown in the preceding code, we can specify the ANN and the training data to the trainer function using the :network and :training-set keyword arguments. The train function is then used to run the training algorithm on the ANN using the sample data. We specify the expected error in the ANN and the maximum number of iterations to run the training algorithm as the second and third arguments to the train function, after the trainer itself. In the preceding example, the desired error is 0.01, and the maximum number of iterations is 1000. The last parameter passed to the train function is a vector specifying the behaviors of the training algorithm, and we ignore it by passing an empty vector.

The training data to be used by the training algorithm can be created using Enclog's data function. For example, we can create the training data for a logical XOR gate using the data function as follows:

(def dataset
  (let [xor-input [[0.0 0.0] [1.0 0.0] [0.0 1.0] [1.0 1.0]]
        xor-ideal [[0.0]     [1.0]     [1.0]     [0.0]]]
        (data :basic-dataset xor-input xor-ideal)))

The data function requires the type of data as the first parameter of the function, followed by the input and output values of the training data as vectors. For our example, we will use the :basic-dataset and :basic parameters. The :basic-dataset keyword can be used to create training data, and the :basic keyword can be used to specify a set of input values.

Using the data defined by the dataset variable and the train-network function, we can train the ANN mlp to model the output of an XOR gate as follows:

user> (def MLP (train-network mlp dataset :back-prop))
Iteration # 1 Error: 26.461526% Target-Error: 1.000000%
Iteration # 2 Error: 25.198031% Target-Error: 1.000000%
Iteration # 3 Error: 25.122343% Target-Error: 1.000000%
Iteration # 4 Error: 25.179218% Target-Error: 1.000000%
...
...
Iteration # 999 Error: 3.182540% Target-Error: 1.000000%
Iteration # 1,000 Error: 3.166906% Target-Error: 1.000000%
#'user/MLP

As shown by the preceding output, the trained ANN has an error of about 3.16 percent. We can now use the trained ANN to predict the output of a set of input values. To do this, we use the Java compute and getData methods, which are specified by .compute and .getData respectively. We can define a simple helper function to call the .compute method for a vector of input values and round the output to a binary value as follows:

(defn run-network [network input]
  (let [input-data (data :basic input)
        output     (.compute network input-data)
        output-vec (.getData output)]
    (round-output output-vec)))

We can now use the run-network function to test the trained ANN using a vector of input values, as follows:

user> (run-network MLP [1 1])
[0]
user> (run-network MLP [1 0])
[1]
user> (run-network MLP [0 1])
[1]
user> (run-network MLP [0 0])
[0]

As shown in the preceding code, the trained ANN represented by MLP completely matches the behavior of an XOR gate.

In conclusion, the Enclog library gives us a small set of powerful functions that can be used to build ANNs. In the preceding example, we explored a feed-forward multilayer perceptron model. The library provides several other ANN models, such as Adaptive Resonance Theory (ART), Self-Organizing Maps (SOM), and Elman networks. The Enclog library also allows us to customize the activation function of the nodes in a particular neural network model. For the feed-forward network in our example, we've used the sigmoid function. Several mathematical functions, such as sine, hyperbolic tan, logarithmic, and linear functions, are also supported by the library. There are also several machine learning algorithms supported by the Enclog library that can be used to train an ANN.
