© Donald J. Norris 2017

Donald J. Norris, Beginning Artificial Intelligence with the Raspberry Pi, 10.1007/978-1-4842-2743-5_8

8. Machine Learning: Deep Learning

Donald J. Norris

(1)Barrington, New Hampshire, USA

This is the third chapter in the series on machine learning. The focus is on generalized artificial neural networks (ANNs). Covering this topic requires an extensive background discussion containing a fair amount of math, so be forewarned. The Python implementation on the Raspberry Pi also takes a good deal of discussion. I try my best to keep it all interesting and to the point.

Let’s start with a brief review of some fundamentals, and then move on to some calculations for a larger three-layer, nine-node ANN using Python and matrix algorithms imported from the numpy library. You’ll also look at some propagation examples, followed by a discussion on gradient descent (GD).

Two demonstrations are provided later in this chapter. The first one shows you how to create an untrained ANN. The second demonstration shows you how to train an ANN to generate useful results. Several practical ANN demonstrations using the techniques presented in this chapter are shown in Chapter 9. There is simply too much ANN content to present in a single chapter.

When you complete this chapter, you will have gained a good amount of theoretical and practical knowledge on how to create a useful ANN.

Generalized ANN

At this point in the book, I have covered quite a bit on the subject of ANN, but there is still a considerable amount to discuss. What should be clear to you at this stage in the book is that an ANN is a mathematical representation or model of the many neurons and their interconnections in a human brain. ANN basics were discussed in Chapter 2. I introduced you to a specialized ANN in Chapter 7 that was well suited for a fairly simple robotic application. However, the field of ANNs is quite broad and there is still much to cover.

In this chapter’s title, I used the phrase deep learning, which I also briefly mentioned in Chapter 2. Deep learning is commonly used by AI practitioners to refer to multilayer ANNs, which are able to learn by having repeated training data sets applied to them. Figure 8-1 shows a three-layer ANN.

Figure 8-1. Three-layer ANN

The following explains the layers shown in Figure 8-1.

  • Input: Inputs are applied to this layer.

  • Hidden: All layers that are not classified as input or output are hidden.

  • Output: Outputs appear at this layer.

All the neurons or nodes are interconnected to each other, layer by layer. This means that the input layer connects to all the nodes in the first hidden layer. Likewise, all the nodes in the last hidden layer connect to the output nodes.

I have also referred to this network configuration as a generalized ANN to differentiate from the Hopfield network, which is a special case from the general. The Hopfield network only consists of a single layer where all the nodes serve as both inputs and outputs, and there are no hidden layers. From this point on when I speak of an ANN, I am referring to the generalized type with multiple layers.

There are two broad categories for ANNs:

  • FeedForward: Data flow is unidirectional. Nodes send data from one layer to the next one.

  • Feedback: Data is bidirectional using feedback loops.

Figure 8-2 shows models for both of these ANN types.

Figure 8-2. FeedForward and Feedback ANN models

An input to an ANN is just a pattern of numbers that propagates through the network. Each node sums its weighted inputs and, if the sum exceeds a threshold value, the node fires and outputs a number to the next connected node. The connection strength between nodes is known as the weighting, as I previously described for the Hopfield network. Determining the weight values is the key element in how an ANN learns. ANN learning usually happens when many training data sets are applied to the network. These training data sets contain both input and output data. The input data creates output data, which is then compared to the true output data, and error values are created when the two do not agree. This error data is then fed back through the ANN, and the weights are adjusted in an incremental fashion in accordance with a pre-programmed learning algorithm. Over many training cycles, often thousands, the ANN is trained to compute the desired output for a given input. This learning technique is called back propagation.

Figure 8-3 shows a three-layer ANN with all the associated weights interconnecting the nodes. The weights are shown in a wi,j notation where i is the source node and j is the receiving or destination node. The stronger the weight the more the source node affects the destination node. Of course, the reverse is also true.

Figure 8-3. Three-layer ANN with weights

If you examine Figure 8-3 closely, you see that not all layer-to-layer nodes are interconnected. For instance, input layer node 1 has no connection with hidden layer node 3. This can be remedied if it is determined that the network cannot be adequately trained. Adding more node-to-node connections is quite easy to do using matrix operations, as you will see shortly. Adding more connections does no real harm because the connection weights are adjusted during training; the network learns to assign unnecessary connections a weight of 0, which effectively removes them from the network.

At this point, it is useful to actually follow a signal path through a simplified ANN so that you have a good understanding of the inner workings of this type of network. I use a very simple two-layer, four-node network for this example because it more than suffices for this purpose. Figure 8-4 shows this network, which consists of only one input and one output layer. No hidden layers are necessary in this ANN.

Figure 8-4. Two-layer ANN

Now, let’s assign the following values to the inputs and weights shown in Figure 8-4, as listed in Table 8-1.

Table 8-1. Input and Weight Values for Example ANN

Symbol    Value
in1       0.8
in2       0.4
w1,1      0.8
w1,2      0.1
w2,2      0.4
w2,1      0.9

These values were selected randomly and do not represent or model any physical situation. Oftentimes, weights are randomly assigned because random starting values tend to promote a rapid convergence to an optimal, trained solution. With so few inputs and weights involved, I did not see it as an issue to omit a diagram with these real values. You can easily scribble out a diagram with the values if that helps you understand the following steps.

I start the calculation with node 1 in layer 2 because there are no modifications that take place between the data input and the input nodes. The input nodes exist as a convenience for network computations. There is no weighting directly applied by the input layer nodes to the data input set. Recall from Chapter 2 that the node sums all the weighted inputs from all of its interconnected nodes. In this case, node 1 in layer 2 has inputs from both nodes in layer 1. The weighted sum is therefore

w1,1 * in1 + w2,1 * in2 = 0.8 * 0.8 + 0.9 * 0.4 = 0.64 + 0.36 = 1.00

Let’s next assume that the activation function is the standard sigmoid expression that I also described in Chapter 2. The sigmoid equation is:

$$ y = 1/\left(1 + e^{-x}\right) $$ where e = math constant 2.71828…

With x = 1.0, this equation becomes:

$$ y = 1/\left(1 + e^{-1}\right) = 1/(1.3679) = 0.7310 $$, or out1 = 0.7310

Repeating the preceding steps for the other node in layer 2 yields the following:

w2,2 * in2 + w1,2 * in1 = 0.4*0.4 + 0.1*0.8 = 0.16 + 0.08 = 0.24

Letting x = 0.24 yields this:

$$ y = 1/\left(1 + e^{-0.24}\right) = 1/(1.7866) = 0.5597 $$, or out2 = 0.5597

The two ANN outputs have now been determined for the specific input data set. This was a fair amount of manual calculations to perform for this extremely simple two-layer, four-node ANN. I believe you can easily see that it is nearly impossible to manually perform these calculations on much larger networks without generating errors. That is where the computer excels in performing these tedious calculations without error for large ANNs with many layers. I used numpy matrices in the last chapter when doing the Hopfield network multiplications and dot products. Similar matrix operations are applied to this network. The input vector for this example is just the two values: in1 and in2. They are expressed in a vector format as
$$ \left\{\begin{array}{c} in1 \\ in2 \end{array}\right\} $$

Likewise, the following is the weighting matrix: $$ \left\{\begin{array}{cc} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \end{array}\right\} $$

Figure 8-5 shows these matrix operations being applied in an interactive Python session. Notice that it only takes a few statements to come up with exactly the same results as was done with the manual calculations.

Figure 8-5. Interactive Python session
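If you want to recreate that session yourself, the following short sketch reproduces the same arithmetic. One assumption to note: the weight matrix is entered with one row per receiving node (that is, the transpose of the matrix written above), which is what makes np.dot produce the same weighted sums as the manual calculation.

import numpy as np

# weights arranged with one row per layer-2 node:
# row 1 holds w1,1 and w2,1; row 2 holds w1,2 and w2,2
wtg = np.array([[0.8, 0.9],
                [0.1, 0.4]])

inp = np.array([0.8, 0.4])

x = np.dot(wtg, inp)         # weighted sums: [1.00, 0.24]
out = 1/(1 + np.exp(-x))     # sigmoid outputs: approx [0.7311, 0.5597]
print(out)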

The next example involves a larger ANN that is handled entirely by a Python script.

Larger ANN

This example involves a three-layer ANN that has three nodes in each layer. The ANN model is shown in Figure 8-6 with an input data set and only a portion of the weighting values, in an effort not to obscure the diagram.

Figure 8-6. Larger ANN

Let’s start with the input data set as that is quite simple. This is shown in a vector format as follows:
$$ \mathrm{input} = \left\{\begin{array}{c} 0.8 \\ 0.2 \\ 0.7 \end{array}\right\} $$

There are two weighting matrices in this example. One represents the weights between the input layer and the hidden layer (wtgih), and the other represents the weights between the hidden layer and the output layer (wtgho). The weights are randomly assigned, as was done for the previous examples.
$$ \mathrm{wtg}_{\mathrm{ih}} = \left\{\begin{array}{ccc} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \\ w_{3,1} & w_{3,2} & w_{3,3} \end{array}\right\} = \left\{\begin{array}{ccc} 0.8 & 0.6 & 0.3 \\ 0.2 & 0.9 & 0.3 \\ 0.2 & 0.5 & 0.8 \end{array}\right\} $$
$$ \mathrm{wtg}_{\mathrm{ho}} = \left\{\begin{array}{ccc} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \\ w_{3,1} & w_{3,2} & w_{3,3} \end{array}\right\} = \left\{\begin{array}{ccc} 0.4 & 0.8 & 0.4 \\ 0.5 & 0.7 & 0.2 \\ 0.9 & 0.1 & 0.6 \end{array}\right\} $$

Figure 8-7 shows the matrix multiplication for the input to the hidden layer. The resultant matrix is shown as X1 in the screenshot.

Figure 8-7. First matrix multiplication

The sigmoid activation function has to be applied next to this resultant. I called the transformed matrix O1 to indicate it was an output from the hidden layer to the real output layer. The resultant O1 matrix is:

matrix([[ 0.69423634,  0.73302015,  0.70266065]])

These values are then multiplied by the weighting matrix wtgho. Figure 8-8 shows this multiplication. I called the resultant matrix X2 to differentiate it from the first one. The final sigmoid calculation is also shown in the screenshot, which I named O2.

Figure 8-8. Second matrix multiplication

The matrix O2 is also the final output from the ANN, which is

matrix([[ 0.78187033,  0.7574536,  0.69970531]])
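The full feed-forward pass for this larger network can be sketched in a few lines of numpy. As in the previous example, I assume each weight matrix is applied in transposed form because wi,j is written source-to-destination; with that assumption the script reproduces the O1 and O2 values shown above.

import numpy as np

def sigmoid(x):
    return 1/(1 + np.exp(-x))

inp = np.array([0.8, 0.2, 0.7])

wtgih = np.array([[0.8, 0.6, 0.3],
                  [0.2, 0.9, 0.3],
                  [0.2, 0.5, 0.8]])

wtgho = np.array([[0.4, 0.8, 0.4],
                  [0.5, 0.7, 0.2],
                  [0.9, 0.1, 0.6]])

X1 = np.dot(wtgih.T, inp)    # input to the hidden layer
O1 = sigmoid(X1)             # approx [0.694, 0.733, 0.703]

X2 = np.dot(wtgho.T, O1)     # input to the output layer
O2 = sigmoid(X2)             # approx [0.782, 0.757, 0.700]
print(O2)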

This output should reflect the input, so let’s compare the two and calculate the error, or difference, between them. All of this is shown in Table 8-2.

Table 8-2. Comparison of ANN Outputs with the Inputs

Input    Output        Error
0.8      0.78187033    0.01812967
0.2      0.7574536     -0.5574536
0.7      0.69970531    0.00029469

The results are actually quite remarkable as two of the three outputs are very close to the respective input values. However, the middle value is way off, which indicates that at least some of the ANN weights must be modified. But, how do you do it?

Before I show you how that is done, please consider the situation shown in Figure 8-9.

Figure 8-9. Error allocation problem

In Figure 8-9, two nodes are connected to one output node, which has an error value. How can the error be reflected back to the weights interconnecting the nodes? You could simply split the error evenly between the input nodes. However, that would not accurately represent the true error contribution from the input nodes because node 1 has twice the weight, or impact, of node 2. A moment's thought should lead you to the correct solution: the error should be divided in direct proportion to the weighting values connecting the nodes. In the case of the two input nodes shown in Figure 8-9, node 1 should be responsible for two-thirds of the error, while node 2 should contribute one-third, which is precisely the ratio of each node's weight to the sum of the weights applied to the output node.

Using the weights in this fashion is an additional role for the weighting matrix. Normally, weights are applied to signals propagating in a forward direction through the ANN. This approach, however, applies the weights to the error values, which are then propagated in a backward direction. This is the reason that this error determination technique is also called back propagation.

Consider next what would happen if there were errors appearing at more than one output node, which is likely the case in most initial ANN startups. Figure 8-10 shows this situation.

Figure 8-10. Error allocation problem for multiple output nodes

It turns out that the process for multiple output nodes is identical to the one for a single node. This is true because the output nodes are independent of one another, with no interconnecting links. If this were not true, it would be very difficult to back propagate errors from interlinked output nodes.

The equation to apportion the error is also very simple. It is just a fraction based on the weights connected to the output node. For instance, to determine the correction for e1 in Figure 8-10, the fractions applied to w1,1 and w2,1 are as follows:

  • w1,1/( w1,1 + w2,1)  and   w2,1/( w1,1 + w2,1)

Similarly, the following are the errors for e2.

  • w1,2/( w1,2 + w2,2)  and   w2,2/( w1,2 + w2,2)

So far, the process to adjust the weights based on the output errors has been quite simple. The errors are easy to determine because the training data provides the correct answers. For two-layer ANNs, this is all that is needed. But how do you handle a three-layer ANN where there are most certainly errors in the hidden layer output, yet there is no training data available that can be used to determine those error values?

Back Propagation In Three-layer ANNs

Figure 8-11 shows a three-layer, six-node ANN with two nodes per layer. I deliberately simplified this ANN so that it is relatively easy to focus on the limited back propagation required for the network.

Figure 8-11. Three-layer, six-node ANN with error values

In Figure 8-11, you should be able to see the output error values that were arbitrarily created for this example. Individual error contributions from nodes 1 and 2 of the hidden layer are shown at the inputs to each of the output nodes. These normalized values were calculated as follows:

  • e1output1 = eoutput1 * w1,1/(w1,1 + w2,1) = 0.96 * 2/(2 + 3) = 0.96 * 0.4 = 0.38

  • e1output2 = eoutput1 * w2,1/(w1,1 + w2,1) = 0.96 * 3/(2 + 3) = 0.96 * 0.6 = 0.58

  • e2output1 = eoutput2 * w1,2/(w1,2 + w2,2) = 0.8 * 2/(2 + 1) = 0.8 * 0.66 = 0.53

  • e2output2 = eoutput2 * w2,2/(w1,2 + w2,2) = 0.8 * 1/(2 + 1) = 0.8 * 0.33 = 0.27

The total normalized error value for each hidden node is the sum of the individual error contributions allocated to it from each output node, and is calculated as follows:

  • e1 = e1output1 + e2output1 = 0.38 + 0.53 = 0.91

  • e2 = e1output2 + e2output2 = 0.58 + 0.27 = 0.85

These values are shown next to each of the hidden nodes in Figure 8-11.
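The same proportional split can be checked with a few lines of Python. This is only a sketch of the arithmetic above; the results differ from 0.91 and 0.85 in the last digit because the manual version rounds each contribution before summing.

# output node errors and weights from Figure 8-11
e_out1, e_out2 = 0.96, 0.8
w11, w21 = 2.0, 3.0          # weights into output node 1
w12, w22 = 2.0, 1.0          # weights into output node 2

# each hidden node receives a share of each output error in
# proportion to its own weight
e_hidden1 = e_out1*w11/(w11 + w21) + e_out2*w12/(w12 + w22)
e_hidden2 = e_out1*w21/(w11 + w21) + e_out2*w22/(w12 + w22)
print("%.2f %.2f" % (e_hidden1, e_hidden2))   # 0.92 0.84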

The preceding process may be continued as needed to calculate all the combined error values for any remaining hidden layers. There is no need to calculate error values for the input layer because the error must be 0 for all input nodes, as they simply pass the input values along without any modification.

The preceding process for calculating the hidden layer error outputs is quite tedious since it was done manually. It would be much nicer if it could be automated using matrices, in a similar way to how the feed-forward calculations were done. The following would result if the matrices were translated on a one-to-one basis from the manual method:
$$ \mathrm{e}_{\mathrm{hidden}} = \left\{\begin{array}{cc} \frac{w_{1,1}}{w_{1,1} + w_{2,1}} & \frac{w_{1,2}}{w_{1,2} + w_{2,2}} \\ \frac{w_{2,1}}{w_{2,1} + w_{1,1}} & \frac{w_{2,2}}{w_{2,2} + w_{1,2}} \end{array}\right\} * \left\{\begin{array}{c} e_1 \\ e_2 \end{array}\right\} $$

Unfortunately, there is no reasonable way to enter the fractions shown in the preceding matrix using simple matrix operations. But all hope is not lost if you think about what the fractions actually do. They normalize the node's error contribution, meaning that each fraction converts the contribution to a number ranging between 0 and 1.0. The relative error contribution can also be expressed as an unnormalized number by simply keeping the weight in the numerator and dropping the denominator. The result is still acceptable because all that really matters is calculating a combined error value that is useful for the weight updates. I cover this in the next section.

Removing the denominators of all the fractions yields
$$ \mathrm{e}_{\mathrm{hidden}} = \left\{\begin{array}{cc} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \end{array}\right\} * \left\{\begin{array}{c} e_1 \\ e_2 \end{array}\right\} $$

The preceding matrix can easily be handled using numpy’s matrix operations. The only catch is that the transpose of the matrix must be used in the multiplication, which again is not an issue. Figure 8-12 shows the actual matrix operations for this error backpropagation example.

Figure 8-12. Hidden layer error matrix multiplication
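A sketch of what that matrix multiplication looks like in numpy follows, using the weights and output errors from Figure 8-11. The results are the unnormalized hidden layer errors; they are larger than the 0.91 and 0.85 computed earlier only because the denominators have been dropped.

import numpy as np

# hidden-to-output weights from Figure 8-11
# (rows are output nodes, columns are hidden nodes, matching the
#  wtgho layout used by the ANN class later in this chapter)
wtgho = np.array([[2.0, 3.0],
                  [2.0, 1.0]])

# output node errors
e_output = np.array([0.96, 0.8])

# the transpose routes each output error back along its own links
e_hidden = np.dot(wtgho.T, e_output)
print(e_hidden)   # [3.52 3.68] -- unnormalized hidden layer errors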

It is now time to discuss how the weighting matrix values are updated once the errors have been determined.

Updating the Weighting Matrix

Updating the weighting matrix is the heart of the ANN learning process. The quality of this matrix determines how effective the ANN is in solving its particular AI problem. However, there is a very significant problem in trying to mathematically determine a node’s output given its input and weights. Consider the following equation, which is applicable to a three-layer, nine-node ANN that determines the value appearing at a particular output node:
$$ O_k = \frac{1}{1 + e^{-\sum_{j=1}^{3}\left( w_{j,k} \cdot \frac{1}{1 + e^{-\sum_{i=1}^{3} w_{i,j}\, x_i}} \right)}} $$

Ok is the output at the kth node.

wj,k are the weights between the hidden layer and the selected output node, and wi,j are the weights between the input layer and the hidden layer.

xi are the input values.

This certainly is a formidable equation even though it only deals with a relatively simple three-layer, nine-node ANN. You can probably imagine the horrendous equation that would model a six-input, five-layer ANN, which in itself is not that large of an ANN. Larger ANN equations are likely beyond human comprehension. So how is this conundrum solved?

You could try a brute-force approach, where a fast computer simply tries a series of different values for each weight. Let’s say there are 1000 values to test for each weight, ranging from –1 to 1 in increments of 0.002. Negative weights are allowed in an ANN, and the 0.002 increment is probably fine for determining an accurate weight. However, for our three-layer, nine-node ANN, there are 18 weighting links. With 1000 values per link, there are 18,000 possibilities to test. That means it would take approximately five hours to go through them all if the computer took one second for each test. Five hours is not that bad for a simple ANN; however, the elapsed time grows exponentially for larger ANNs. For example, in a very practical 500-node ANN, there are about 500 million weight combinations. Testing at one second per combination would take approximately 16 years to complete. And that is only for one training set. Imagine the time taken for thousands of training sets. Obviously, there must be a better way than the brute-force approach.

The solution to this difficult problem comes from the application of a mathematical approach named steepest descent. It was first proposed in 1847 by the French mathematician Augustin-Louis Cauchy in a treatise concerning the solution of a system of simultaneous equations. However, more than 120 years elapsed before mathematicians and AI researchers applied it to ANNs. The field of ANN research developed rapidly once this technique became well known and understood.

The technique is also commonly referred to as gradient descent, which is the term I use from this point forward. The underlying mathematics for gradient descent can be a bit confusing and somewhat obscure, especially when it is applied to ANNs. The following sidebar delves into the details of the gradient descent technique in order to provide interested readers with a brief background on the subject.

The Gradient Descent Applied to an ANN

Figure 8-21 nicely summarizes the gradient descent technique as it applies to an ANN. The goal is to find the global minimum of the error function by adjusting the weights wi,j so as to minimize the overall errors present in the ANN.

Figure 8-21. ANN global minimum

This adjustment becomes a function of the partial derivative of the error function with respect to a weight, wj,k. This partial derivative is shown by these symbols: $$ \frac{\partial e}{\partial w_{j,k}} $$. This derivative is also the slope of the error function. It is the gradient descent algorithm that follows the slope down to the global minimum.

Figure 8-22 shows the three-layer, six-node ANN that is the basis network for the following discussion. Note the i, j, and k indices because they are important to follow as you go through the procedure.

Figure 8-22. Three-layer, six-node ANN

There is one additional symbol required beyond those shown in Figure 8-22: the output node error, which is expressed as
$$ {e}_k = {t}_k - {o}_k $$
tk is the true or target value from the training set.

ok is the output resulting from the training set input xi values.

The total error at any given node n is the preceding equation with n substituted for k. Consequently, the total error for the entire ANN is the sum of all errors for all of the individual nodes. The errors are also squared for reasons mentioned in the sidebar. This leads to the following equation for the error function:
$$ e = \sum_{n=1}^{N} \left( t_n - o_n \right)^2 $$

N is the total number of nodes in the ANN.

This error function is the exact one that is required to differentiate with respect to wj,k, leading to the following form:
$$ \frac{\partial e}{\partial w_{j,k}} = \frac{\partial}{\partial w_{j,k}} \sum_{n=1}^{N} \left( t_n - o_n \right)^2 $$

This equation may be considerably simplified by taking note that the error at any specific node is due solely to its input connections. This means the output for the k th node only depends on the w j,k weights on its input connections. What this realization does is to remove the summation from the error function because no other nodes contribute to the k th node’s output. This yields a much simpler error function :
$$ \frac{\partial e}{\partial w_{j,k}} = \frac{\partial}{\partial w_{j,k}} \left( t_k - o_k \right)^2 $$

The next step is to do the actual partial differentiation on the function. I simply go through the steps with minimal comments to get to the final equation without prolonging this whole derivation .

  1. Apply the chain rule: $$ \frac{\partial e}{\partial w_{j,k}} = \frac{\partial e}{\partial o_k} * \frac{\partial o_k}{\partial w_{j,k}} $$

  2. The target tk does not depend on the weights, so the first partial is simply $$ \frac{\partial e}{\partial o_k} = -2\left( t_k - o_k \right) $$

  3. The output ok is the sigmoid of its weighted input sum, so the second partial is $$ \frac{\partial o_k}{\partial w_{j,k}} = \frac{\partial}{\partial w_{j,k}}\, \mathrm{sigmoid}\left( \sum_j w_{j,k} * o_j \right) $$

  4. The differential of the sigmoid is $$ \frac{\partial}{\partial x}\, \mathrm{sigmoid}(x) = \mathrm{sigmoid}(x) * \left( 1 - \mathrm{sigmoid}(x) \right) $$

  5. Combining: $$ \frac{\partial e}{\partial w_{j,k}} = -2\left( t_k - o_k \right) * \mathrm{sigmoid}\left( \sum_j w_{j,k} * o_j \right) * \left( 1 - \mathrm{sigmoid}\left( \sum_j w_{j,k} * o_j \right) \right) * \frac{\partial}{\partial w_{j,k}}\left( \sum_j w_{j,k} * o_j \right) $$. Note that the last term is necessary because of the summation inside the sigmoid function; it is just another application of the chain rule.

  6. Simplifying (the last partial evaluates to oj): $$ \frac{\partial e}{\partial w_{j,k}} = -2\left( t_k - o_k \right) * \mathrm{sigmoid}\left( \sum_j w_{j,k} * o_j \right) * \left( 1 - \mathrm{sigmoid}\left( \sum_j w_{j,k} * o_j \right) \right) * o_j $$

Take a breath, which I often did after a rigorous calculus session. This is the final equation that is used to adjust the weights:
$$ \frac{\partial e}{\partial w_{j,k}} = -\left( t_k - o_k \right) * \mathrm{sigmoid}\left( \sum_j w_{j,k} * o_j \right) * \left( 1 - \mathrm{sigmoid}\left( \sum_j w_{j,k} * o_j \right) \right) * o_j $$

You should also notice that the 2 at the beginning of the equation has been dropped. It was only a scaling factor and not important in determining the direction of the error function slope, which is the main key to the gradient descent algorithm. I do wish to congratulate my readers who have made it this far. Many folks have a very difficult time with the mathematics required to get to this stage.

It would now be very helpful to put a physical interpretation on this complex equation. The first part, $$ \left( t_k - o_k \right) $$, is just the error, which is easy to see. The sum expressions $$ \sum_j w_{j,k} * o_j $$ inside the sigmoid functions are the inputs into the kth final-layer node. And the very last term, oj, is the output from the jth node in the hidden layer. Knowing this physical interpretation should make the creation of the other layer-to-layer error slope expressions much easier.

I state the input-to-hidden-layer error slope equation without subjecting you to the rigorous mathematical derivation. This expression relies on the physical interpretation just presented.
$$ \frac{\partial e}{\partial w_{i,j}} = -\left( e_j \right) * \mathrm{sigmoid}\left( \sum_i w_{i,j} * o_i \right) * \left( 1 - \mathrm{sigmoid}\left( \sum_i w_{i,j} * o_i \right) \right) * o_i $$

The next step is to demonstrate how new weights are calculated using the preceding error slope expressions. It is actually quite simple, as shown in the following equation:
$$ \mathrm{new}\ w_{j,k} = \mathrm{old}\ w_{j,k} - \alpha * \frac{\partial e}{\partial w_{j,k}} $$
α = learning rate

Yes, that is exactly the same learning rate discussed in Chapter 2 where I introduced it as part of the linear predictor discussion. The learning rate is important because setting it too high may cause the gradient descent to miss the minimum, and setting it too low would cause many extra iterations and lessen the efficiency of the gradient descent algorithm.

Matrix Multiplications for Weight Change Determination

It would be very helpful to express all the preceding expressions in terms of matrices, which is the practical way real weight changes are calculated. Let the following expression represent one matrix element for the error slope expression between the hidden and the output layers: $$ \mathrm{gd}\left( w_{j,k} \right) = \alpha * e_k * \mathrm{sigmoid}\left( o_k \right) * \left( 1 - \mathrm{sigmoid}\left( o_k \right) \right) * o_j^T $$
ojT is the transpose of the hidden layer output matrix.

The following are the matrices for the three-layer, six-node example ANN: $$ \left\{\begin{array}{cc} \mathrm{gd}\left( w_{1,1} \right) & \mathrm{gd}\left( w_{2,1} \right) \\ \mathrm{gd}\left( w_{1,2} \right) & \mathrm{gd}\left( w_{2,2} \right) \end{array}\right\} = \left\{\begin{array}{c} e_1 * \mathrm{sigmoid}_1 * \left( 1 - \mathrm{sigmoid}_1 \right) \\ e_2 * \mathrm{sigmoid}_2 * \left( 1 - \mathrm{sigmoid}_2 \right) \end{array}\right\} * \left\{\begin{array}{cc} o_1 & o_2 \end{array}\right\} $$
o1 and o2 are the outputs from the hidden layer.

This completes all the necessary preparatory background in order to start updating the weights.

Worked-through Example

It’s important to go through a manual example before showing the Python approach, so that you truly understand the process when you run it as a Python script. Figure 8-23 is a slightly modified version of Figure 8-11, on which I have inserted arbitrary hidden node output values to have sufficient data to complete the example.

Figure 8-23. Example ANN used for manual calculations

Let’s start by updating w1,1, which is the weight connecting node 1 in the hidden layer to node 1 in the output layer. Currently, it has a value of 2.0. This is the error slope equation used for these layer links:
$$ \frac{\partial e}{\partial w_{j,k}} = -\left( t_k - o_k \right) * \mathrm{sigmoid}\left( \sum_j w_{j,k} * o_j \right) * \left( 1 - \mathrm{sigmoid}\left( \sum_j w_{j,k} * o_j \right) \right) * o_j $$

Substituting the values shown in Figure 8-23 yields:

$$ \left( t_k - o_k \right) $$ = e1 = 0.96

$$ \left( \sum_j w_{j,k} * o_j \right) $$ = (2.0 * 0.6) + (3.0 * 0.4) = 2.4

sigmoid = $$ \frac{1}{\left( 1 + e^{-2.4} \right)} $$ = 0.9168

1 – sigmoid = 0.0832

o1 = 0.6

Multiplying the applicable values together, along with the leading negative sign, yields:

  • –0.96 * 0.9168 * 0.0832 * 0.6 = –0.04394

Let’s assume a learning rate of 0.15, which is not too aggressive. The following is the new weight:

  • 2.0 – 0.15 * (–0.04394) = 2.0 + 0.0066 = 2.0066

This is not a large change from the original value, but you must keep in mind that there are hundreds, if not thousands, of iterations performed before the global minimum is reached. Small changes rapidly accumulate into some rather large changes in the weights.

The other weights in the network can be adjusted in the same manner as demonstrated.
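For readers who prefer to check the arithmetic in code, here is a short sketch of the same w1,1 update. It assumes, as in Figure 8-23, that the hidden layer outputs are 0.6 and 0.4 and that the weights into output node 1 are 2.0 and 3.0.

import numpy as np

def sigmoid(x):
    return 1/(1 + np.exp(-x))

e1 = 0.96                          # error at output node 1
o_hidden = np.array([0.6, 0.4])    # hidden layer outputs
w_into_out1 = np.array([2.0, 3.0]) # w1,1 and w2,1
lr = 0.15                          # learning rate

s = sigmoid(np.dot(w_into_out1, o_hidden))   # sigmoid(2.4) = 0.9168
slope = -e1 * s * (1 - s) * o_hidden[0]      # approx -0.0439

new_w11 = 2.0 - lr * slope                   # approx 2.0066
print(new_w11)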

There are some important issues regarding how well an ANN can learn, which I discuss next.

Issues with ANN Learning

You should realize that not all ANNs learn well, just as not all people learn the same way. Fortunately for ANNs, this has nothing to do with intelligence but rather with more mundane issues directly related to the sigmoid activation function. Figure 8-24 is a modified version of Figure 2-12 showing the input and output ranges for the sigmoid function.

Figure 8-24. Annotated sigmoid function

Looking at Figure 8-24, you should see that if the x inputs are larger than about 2.5, the y output changes very little. This is because the sigmoid function asymptotically approaches 1.0 around that x value. Small output changes for large input changes imply that only very small gradient changes happen. ANN learning becomes suppressed in this situation because the gradient descent algorithm depends upon a reasonable slope being present. Thus, ANN training sets should limit the input x values to what might be called a pseudo-linear range of roughly –3 to 3. Values of x outside this range cause a saturation effect for ANN learning, and no effective weight updates happen.

In a similar fashion, the sigmoid function cannot output values greater than one or less than zero. The weights must be appropriately scaled so that the allowable output range is always maintained. In reality, the target output range should be 0.01 to 0.99 because of the asymptotic behavior described earlier.
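The saturation effect is easy to see numerically. The following sketch prints the sigmoid output and its slope, sigmoid(x) * (1 – sigmoid(x)), for a few input values; the slope collapses from 0.25 at x = 0 to well under 0.01 beyond the pseudo-linear range, which is what starves gradient descent.

import numpy as np

def sigmoid(x):
    return 1/(1 + np.exp(-x))

for x in [0.0, 1.0, 2.5, 5.0]:
    slope = sigmoid(x) * (1 - sigmoid(x))   # derivative of the sigmoid
    print("%.1f  %.4f  %.4f" % (x, sigmoid(x), slope))

# prints roughly:
# 0.0  0.5000  0.2500
# 1.0  0.7311  0.1966
# 2.5  0.9241  0.0701
# 5.0  0.9933  0.0066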

Initial Weight Selection

Based on the issues just discussed, you can probably see that it is very important to select a good initial set of ANN weights so that learning can take effect without running into input saturation or output limit problems. The obvious choice is to constrain weight selection to the pseudo-linear range I mentioned earlier (i.e., ±3). Often, weights are further constrained to ±1 to be a bit more conservative.

There has been a “rule of thumb” developed over the years by AI researchers and mathematicians that roughly states:

The weights should be initially allocated using a normal distribution set at a mean value equal to the inverse of the square root of the number of nodes in the ANN.

For a 36-node, three-layer ANN, which I have previously used, the mean is $$ \frac{1}{\sqrt{36}} $$, or 0.16667. Figure 8-25 shows a normal probability distribution with this mean, with approximately ±2 standard deviations also clearly indicated.

Figure 8-25. Normal distribution of initial weights for a 36-node ANN

A random selection of weights in the range of approximately –0.5 to 0.8333 would provide an excellent starting point for ANN learning in a 36-node network.

Finally, you should avoid setting all weights to the same value because ANN learning depends upon an unequal weight distribution. And obviously, do not set all the weights to 0 because that completely disables the ANN.

This last section completes all of my background discussion on ANN. It is finally time to generate a full-fledged ANN on the Raspberry Pi using Python.

Demo 8-1: ANN Python Scripts

This first demonstration shows you how to create an untrained ANN using Python. I start by discussing the modules that constitute the ANN. Once that is done, all the modules are put into an operative package and the script is run. The first module to discuss is the one that creates and initializes the ANN.

Initialization

This module’s structure depends largely on the type of ANN to be built. I am building a three-layer, nine-node ANN for this demonstration, which means there must be objects representing each layer. In addition, the inputs, outputs, and weights must be created and appropriately labeled. Table 8-3 details the objects and references that are needed for this module.

Table 8-3. Initialization Module Objects and References

Name       Description
inode      Number of nodes in the input layer
hnode      Number of nodes in the hidden layer
onode      Number of nodes in the output layer
wtgih      Weight matrix between input and hidden layers
wtgho      Weight matrix between hidden and output layers
wij        Individual weight matrix element
input      Array for inputs
output     Array for outputs
ohidden    Array for hidden layer outputs
lr         Learning rate

The basic initialization module structure begins as follows:

def __init__ (self, inode, hnode, onode, lr):
    # Set local variables
    self.inode = inode
    self.hnode = hnode
    self.onode = onode
    self.lr = lr

You need to call the init module with the proper values for the ANN to be created. For a three-layer, nine-node network with a moderate learning rate, the values are as follows:

  • inode = 3

  • hnode = 3

  • onode = 3

  • lr = 0.25
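Assuming the init module above ends up as the constructor of the ANN class assembled later in this demonstration (in the file ANN.py), creating the network with these values would look like the following sketch:

from ANN import ANN

# three-layer, nine-node network with a moderate learning rate
ann = ANN(inode=3, hnode=3, onode=3, lr=0.25)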

The next item to discuss is how to create and initialize the key weighting matrices based on all the previous background discussions. I use a normal distribution for the weight generation with a mean of 0.1667 and a standard deviation of 0.3333. Fortunately, numpy has a very nice function that automates this process. The first matrix to create is the wtgih, whose dimensions are inode × hnode, or 3 × 3, for our example.

This next Python statement generates this matrix:

self.wtgih = np.random.normal(0.1667, 0.3333, [self.hnode, self.inode])

The following is a sample output from an interactive session that was generated by the preceding statement:

>>>import numpy as np
>>>wtgih = np.random.normal(0.1667, 0.3333, [3, 3])
>>>wtgih
array([[ 0.44602141,  0.58021837,  0.00499487],
       [ 0.40433922, -0.31695922, -0.40410581],
       [ 0.63401073, -0.37218566,  0.14726115]])

The resulting matrix wtgih is well formed with excellent starting values.

At this point, the init module can be completed using the matrix generating statements shown earlier.

def __init__ (self, inode, hnode, onode, lr):
    # Set local variables
    self.inode = inode
    self.hnode = hnode
    self.onode = onode
    self.lr = lr


    # mean is the reciprocal of the sq root total nodes
    mean = 1/(pow((inode + hnode + onode), 0.5))


    # standard deviation (sd) is approximately 1/6 total weight range
    # total range = 2
    sd = 0.3333


    # generate both weighting matrices
    # input to hidden layer matrix
    self.wtgih = np.random.normal(mean, sd, [hnode, inode])


    # hidden to output layer matrix
    self.wtgho = np.random.normal(mean, sd, [onode, hnode])

At this point, I introduce a second module that allows some simple tests to be run on the network created by the init module. This new module is named testNet to reflect its purpose. It takes an input data set, or tuple in Python terms, and returns an output set. The following process runs in the module:

  1. The input data tuple is converted to an array.

  2. The array is multiplied by the wtgih weighting matrix. This is now the input to the hidden layer.

  3. This new array is then adjusted by the sigmoid function.

  4. The adjusted array from the hidden layer is multiplied by the wtgho matrix. This is now the input to the output layer.

  5. This new array is then adjusted by the sigmoid function, yielding the final output array.

The module listing follows:

def testNet(self, input):
    # convert input tuple to an array
    input = np.array(input, ndmin=2).T


    # multiply input by wtgih
    hInput = np.dot(self.wtgih, input)


    # sigmoid adjustment
    hOutput = 1/(1 + np.exp(-hInput))


    # multiply hidden layer output by wtgho
    oInput = np.dot(self.wtgho, hOutput)


    # sigmoid adjustment
    oOutput = 1/(1 + np.exp(-oInput))


    return oOutput

Test Run

Figure 8-26 shows an interactive Python session that I ran on a Raspberry Pi 3 to test this preliminary code.

Figure 8-26. Interactive Python session

The init and testNet modules are both part of a class named ANN, which in turn is in a file named ANN.py. I first started Python and imported the class from the file so that the interpreter would recognize the class name. I next instantiated an object named ann with all three layer sizes set to 3 and a learning rate equal to 0.3. The learning rate is not needed yet, but it must be present or you cannot instantiate an object. The act of instantiating an object automatically causes the init module to run; it expects size values for all three layers and a learning rate value.

I next ran the testNet module with the three input values. These are shown in Table 8-4 along with the respective calculated output values. I also included the manually calculated error values.

Table 8-4. Initial Test

Input    Output        Error          Percent error
0.8      0.74993428    –0.05006572    6.3
0.5      0.52509703    0.02509703     5.0
0.6      0.60488966    0.00488966     0.8

The errors are not too large considering that this is a totally untrained ANN. The next section discusses how to train an ANN to greatly improve its accuracy.
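For reference, recreating the Figure 8-26 session as a short script looks something like the sketch below. Your output values will differ from Table 8-4 because the starting weights are drawn randomly each time the class is instantiated.

from ANN import ANN

# three-layer, nine-node network; the 0.3 learning rate is required
# by the constructor but is not used by testNet
ann = ANN(3, 3, 3, 0.3)

# same input values used in Table 8-4
print(ann.testNet([0.8, 0.5, 0.6]))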

Demo 8-2: Training an ANN

In this demonstration, I show you how to train an ANN using a third module named trainNet, which has been added to the ANN class definition. This module functions in a very similar fashion to testNet by calculating an output set based on an input data set. However, the trainNet module input data is a predetermined training set instead of an arbitrary data tuple, as I just demonstrated. This new module also calculates an error set by comparing the ANN outputs with the training values and uses the differences to train the network. The outputs are calculated in exactly the same manner as in the testNet module. The arguments to trainNet now include both an input list and a training list. The following statements create the arrays from the list arguments:

def trainNet(self, inputT, train):
    # This module depends on the values, arrays and matrices
    # created when the init module is run.


    # create the arrays from the list arguments
    self.inputT = np.array(inputT, ndmin=2).T
    self.train = np.array(train, ndmin=2).T

The error, as stated before, is the difference between the training set outputs and the actual outputs. The error equation for the kth output node, as previously stated, is:
$$ {e}_k = {t}_k - {o}_k $$

The matrix notation for the output errors is

self.eOutput = self.train - self.oOutput

The hidden layer error array in matrix notation for this example ANN is $$ \mathrm{hError} = {\left\{\begin{array}{ccc} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \\ w_{3,1} & w_{3,2} & w_{3,3} \end{array}\right\}}^T * \left\{\begin{array}{c} e_1 \\ e_2 \\ e_3 \end{array}\right\} $$

The following is the Python statement to generate this array:

self.hError = np.dot(self.wtgho.T, self.eOutput)

The following is the update equation for adjusting a link between the jth and kth layers, as previously shown:
$$ \mathrm{gd}\left( w_{j,k} \right) = \alpha * e_k * \mathrm{sigmoid}\left( o_k \right) * \left( 1 - \mathrm{sigmoid}\left( o_k \right) \right) * o_j^T $$

The new gd(wj,k) array must be added to the existing weights because these values are adjustments, not replacements. The preceding equations can be neatly packaged into this single Python statement:

self.wtgho += self.lr * np.dot((self.eOutput * self.oOutputT * (1 - self.oOutputT)), self.hOutputT.T)

Writing the code for the weight updates between the input and hidden layers uses precisely the same format.

self.wtgih += self.lr * np.dot((self.hError * self.hOutputT * (1 - self.hOutputT)), self.inputT.T)

Putting all the preceding code segments together along with the previous modules produces the ANN.py listing. Note that I have included comments regarding the functions for each segment, along with additional debug statements.

import numpy as np

class ANN:

    def __init__ (self, inode, hnode, onode, lr):
        # set local variables
        self.inode = inode
        self.hnode = hnode
        self.onode = onode
        self.lr = lr


        # mean is the reciprocal of the sq root of the total nodes
        mean = 1/(pow((inode + hnode + onode), 0.5))


        # standard deviation is approximately 1/6 of total range
        # range = 2
        stdev = 0.3333


        # generate both weighting matrices
        # input to hidden layer matrix
        self.wtgih = np.random.normal(mean, stdev, [hnode, inode])
        print 'wtgih'
        print self.wtgih
        print


        # hidden to output layer matrix
        self.wtgho = np.random.normal(mean, stdev, [onode, hnode])
        print 'wtgho'
        print self.wtgho
        print


    def testNet(self, input):
        # convert input tuple to an array
        input = np.array(input, ndmin=2).T


        # multiply input by wtgih
        hInput = np.dot(self.wtgih, input)


        # sigmoid adjustment
        hOutput = 1/(1 + np.exp(-hInput))


        # multiply hidden layer output by wtgho
        oInput = np.dot(self.wtgho, hOutput)


        # sigmoid adjustment
        oOutput = 1/(1 + np.exp(-oInput))


        return oOutput

    def trainNet(self, inputT, train):

        # This module depends on the values, arrays and matrices
        # created when the init module is run.


        # create the arrays from the list arguments
        self.inputT = np.array(inputT, ndmin=2).T
        self.train = np.array(train, ndmin=2).T


        # multiply inputT array by wtgih
        self.hInputT = np.dot(self.wtgih, self.inputT)


        # sigmoid adjustment
        self.hOutputT = 1/(1 + np.exp(-self.hInputT))


        # multiply hidden layer output by wtgho
        self.oInputT = np.dot(self.wtgho, self.hOutputT)
        # sigmoid adjustment
        self.oOutputT = 1/(1 + np.exp(-self.oInputT))


        # calculate output errors
        self.eOutput = self.train - self.oOutputT


        # calculate hidden layer error array
        self.hError = np.dot(self.wtgho.T, self.eOutput)


        # update weight matrix wtgho
        self.wtgho += self.lr * np.dot((self.eOutput * self.oOutputT * (1 - self.oOutputT)), self.hOutputT.T)


        # update weight matrix wtgih
        self.wtgih += self.lr * np.dot((self.hError * self.hOutputT * (1 - self.hOutputT)), self.inputT.T)
        print 'updated wtgih'
        print self.wtgih
        print
        print 'updated wtgho'
        print self.wtgho
        print

Test Run

Figure 8-27 shows an interactive Python session where I instantiated a three-layer, nine-node ANN with a learning rate equal to 0.20.

Figure 8-27. Interactive Python session

I included several debug print statements in the ANN script, which allows for a direct comparison between the initial weight matrices and the updated ones. You should see that there are only small changes between them, which is what was expected and desired. It is an important fact that gradient descent works well using small increments to avoid missing the global minimum. I could only do one iteration in this session because the code was not set up for multiple iterations.
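Extending the session to multiple iterations only requires a loop around trainNet. The following is a minimal sketch, assuming the debug print statements have been removed from ANN.py (otherwise each pass floods the console); it trains the network to reproduce its own input, as in the chapter's examples.

from ANN import ANN

ann = ANN(3, 3, 3, 0.2)

data = [0.8, 0.5, 0.6]    # input doubles as the training target here

for i in range(1000):     # many small weight adjustments
    ann.trainNet(data, data)

print(ann.testNet(data))  # should now be much closer to the target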

Congratulations for staying with me to this point! I covered a lot of topics concerning both ANN fundamentals and implementations. You should now be fully prepared to understand and appreciate the interesting practical ANN demonstrations presented in the next chapter.

Summary

This was the third of four chapters concerning artificial neural networks (ANNs). In this chapter, I focused on deep learning, which is really nothing more than the fundamentals and concepts behind multilayer ANNs.

After a brief review of some fundamentals, I completed a step-by-step manual calculation for a two-layer, four-node ANN. I subsequently redid the calculations using Python.

Next were the calculations for a larger, three-layer, nine-node ANN. Those calculations were done entirely with Python and matrix algorithms imported from the numpy library.

I discussed the errors that exist within an untrained ANN and how they are propagated. This was important to understand because it serves as the basis for the back propagation technique used to adjust weights in an effort to optimize the ANN.

Next, I went through a complete back propagation example, which illustrated how weights could be updated to reduce overall network errors. A sidebar followed wherein the gradient descent (GD) technique was introduced using a linear regression example.

A discussion involving the application of GD to an example ANN followed. The GD algorithm uses the slope of the error function in an effort to locate a global minimum, which is necessary to achieve a well-performing ANN.

I provided a complete example illustrating how to update weights using the GD algorithm. Issues with ANN learning and initial weight selection were discussed.

The chapter concluded with a thorough example of a Python script that initializes and trains any sized ANN.
