12

Deep Convolutional Q-Learning

Now that you understand how Artificial Neural Networks (ANNs) work, you're ready to tackle an incredibly useful tool, mostly used when dealing with images—Convolutional Neural Networks (CNNs). To put it simply, CNNs allow your AI to see images in real time as if it had eyes.

We will tackle them in the following steps:

  1. What are CNNs used for?
  2. How do CNNs work?
  3. Convolution
  4. Max pooling
  5. Flattening
  6. Full connection

Once you've understood those steps, you'll understand CNNs, and how they can be used in deep convolutional Q-learning.

What are CNNs used for?

CNNs are mostly used with images or videos, and sometimes with text to tackle Natural Language Processing (NLP) problems. They are often used in object recognition, for example, predicting whether there is a cat or a dog in a picture or video. They are also often used with deep Q-learning (which we will discuss later on), when the environment returns 2D states of itself, for example, when we are trying to build a self-driving car that reads outputs from cameras around it.

Remember the example in Chapter 9, Going Pro with Artificial Brains – Deep Q-Learning, where we were predicting house prices. As inputs, we had all of the values that define a house (area, age, number of bedrooms, and so on), and as output, we had the price of that house. In the case of CNNs, things are very similar. For example, if we wanted to solve the same problem using a CNN, we would have images of houses as inputs and the price of a house as the output.

This diagram should illustrate what I mean:

Figure 1: Input Image – CNN – Output Label

As you can see, the input is an image, which flows through a CNN to produce an output. In the case of this diagram, the output is the class to which the image corresponds. What do I mean by a class? For example, if we wanted to predict whether the input image is a smiling face or a sad face, then one class would be smiling face, and the other would be sad face. Our output should then correctly tell us to which class the input image belongs.

Speaking of happy and sad faces, here's a diagram that represents it in more detail:

Figure 2: Two different classes to predict (Happy or Sad)

In the preceding example, we've run two images through a CNN. The first one is a smiling face and the other one is a sad face. As I mentioned before, our network predicts whether the image is a happy or a sad face.

I can imagine what you're thinking right now: how does it all work? What's inside this black box we call a CNN? I'll answer these questions in the following sections.

How do CNNs work?

Before we can go deep into the structure of CNNs, we need to understand a couple of points. I will introduce you to the first point with a question: how many dimensions does a colored RGB image have?

The answer may surprise you: it's 3!

Why? Because every RGB image is, in fact, represented by three 2D images, each one corresponding to one of the colors in the RGB color model. So, there is one image corresponding to red, one corresponding to green, and one to blue. Grayscale images are only 2D, because they are represented by a single intensity scale with no color information. The following diagram should make it clearer:

Figure 3: RGB versus black and white images

As you can see, a colored image is represented by a 3D array. Each color has its own layer in the picture, and this layer is called a channel. A grayscale (black and white) image only has one channel and is, therefore, a 2D array.

As you probably know, images are made out of pixels. In each channel, a pixel is represented by a value that ranges from 0 to 255, where 0 is a pixel turned off and 255 is a fully bright pixel. It's important to understand that when we say that a pixel has the value (255, 255, 0), that means this pixel is fully bright on the red and green channels and turned off on the blue channel.
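
If you'd like to see what this looks like in code, here's a minimal sketch with numpy (the tiny 2 x 2 image is made up purely for illustration):

```python
import numpy as np

# A tiny 2 x 2 RGB image: the array has shape (height, width, channels),
# so a colored image really is a 3D array with three channels.
image = np.zeros((2, 2, 3), dtype=np.uint8)

# Set the top-left pixel to (255, 255, 0): fully bright on the red and
# green channels, turned off on the blue channel.
image[0, 0] = [255, 255, 0]

print(image.shape)   # (2, 2, 3) -> red, green and blue channels
print(image[0, 0])   # [255 255   0]

# A grayscale image has only one channel, so a 2D array is enough.
gray = np.zeros((2, 2), dtype=np.uint8)
print(gray.shape)    # (2, 2)
```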

From now on, to understand everything better, we'll be dealing with very simple images. In fact, our images will be grayscale (1 channel, 2D) and the pixels will either be fully bright or turned off. In order to make pictures easier to read, we'll assign 1 to a turned off pixel (black) and 0 to a fully bright one (white).

Going back to the case of happy and sad faces, this is what our 2D array representing a happy face would look like:

Figure 4: The pixel representation

As you can see, we have an array where 0 corresponds to a white pixel and 1 corresponds to a black pixel. The picture on the right is our smiling face represented by an array.
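
In code, such a simplified image is nothing more than a small 2D array of 0s and 1s. Here's a minimal numpy sketch using the same convention (the exact pattern below is illustrative, not a copy of Figure 4):

```python
import numpy as np

# A simplified 7 x 7 grayscale face using the convention from the text:
# 1 = black (turned off) pixel, 0 = white (fully bright) pixel.
happy_face = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],   # eyes
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],   # corners of the mouth
    [0, 0, 1, 1, 1, 0, 0],   # smiling mouth
    [0, 0, 0, 0, 0, 0, 0],
])

print(happy_face.shape)   # (7, 7): one channel, so a 2D array is enough
```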

Now that we understand the foundations and that we've simplified the problem, we're ready to tackle CNNs. In order to fully understand them, we need to split our learning into the four steps that make up a CNN:

  1. Convolution
  2. Max pooling
  3. Flattening
  4. Full connection

Now we'll get to know each of these four steps one by one.

Step 1 – Convolution

This is the first crucial step of every CNN. In convolution, we apply something called feature detectors to the inputted image. Why do we have to do so? This is because all images contain certain features that define what is in the picture. For example, to recognize which face is sad and which one is happy, we need to understand the meaning of the shape of the mouth, which is a feature of this image. It's easier to understand this from a diagram:

Figure 5: Step 1 – Convolution (1/5)

In the preceding diagram, we applied a feature detector, also known as a filter, to the smiling face we had as input. As you can see, a filter is a 2D array with some values inside. When we apply this feature detector to the part of the image it covers (in this case, a 3 x 3 grid), we count how many pixels in this part of the image match the filter's pixels. Then we put this number into a new 2D array called a feature map. In other words, the more a part of the picture matches the feature detector, the higher the number we put into the feature map.

Next, we slide the feature detector across the entire image. In the next iteration, this is what will happen:

Figure 6: Step 1 – Convolution (2/5)

As you can see, we slide the filter one place to the right. This time, one pixel matches in both the filter and in this part of the image. That's why we put 1 in the feature map.
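
Here's a minimal numpy sketch of what happens at a single position of the sliding window. With 0/1 pixels, multiplying the covered patch by the filter element-wise and summing the result counts the matching bright pixels (the patch and filter values below are illustrative, not taken from the figures):

```python
import numpy as np

# The 3 x 3 part of the image currently covered by the sliding window.
patch = np.array([[0, 1, 0],
                  [0, 1, 0],
                  [0, 0, 0]])

# The feature detector (filter) we're comparing the patch against.
feature_detector = np.array([[0, 1, 0],
                             [0, 1, 0],
                             [0, 1, 0]])

# Count the positions where both the patch and the filter have a 1.
match_count = int(np.sum(patch * feature_detector))
print(match_count)   # 2 -> this is the value written into the feature map
```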

What do you think happens when we hit the boundary of this image? What would you do? I'll show you what happens with these two diagrams:

Figure 7: Step 1 – Convolution (3/5)

Figure 8: Step 1 – Convolution (4/5)

Here, we had this exact situation: in the first image, our filter hits the boundary. It turns out that our feature detector simply moves down to the next row and continues sliding from there.

The whole magic of the convolution wouldn't work if we had only one filter. In reality, we use many filters, which produce many different feature maps. This set of feature maps is called a convolution layer, or convolutional layer. Here's a diagram to recap:

Figure 9: Step 1 – Convolution (5/5)

Here, we can see an input image to which many filters were applied. All together, they create a convolutional layer from many feature maps. This is the first step when building a CNN.
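
To make the whole step concrete, here's a minimal numpy sketch that slides a filter over an image one pixel at a time and builds one feature map per filter. The function names and the example values are mine, chosen purely for illustration:

```python
import numpy as np

def convolve2d(image, feature_detector):
    """Slide the filter over the image one pixel at a time (stride 1, no
    padding). At each position, the sum of element-wise products counts the
    matching bright pixels for 0/1 images."""
    ih, iw = image.shape
    fh, fw = feature_detector.shape
    feature_map = np.zeros((ih - fh + 1, iw - fw + 1))
    for i in range(feature_map.shape[0]):
        for j in range(feature_map.shape[1]):
            patch = image[i:i + fh, j:j + fw]
            feature_map[i, j] = np.sum(patch * feature_detector)
    return feature_map

def convolutional_layer(image, feature_detectors):
    """Apply many filters to the same image: one feature map per filter."""
    return [convolve2d(image, f) for f in feature_detectors]

# An illustrative 5 x 5 image and two illustrative 3 x 3 filters.
image = np.array([[0, 0, 0, 0, 0],
                  [0, 1, 0, 1, 0],
                  [0, 0, 0, 0, 0],
                  [1, 0, 0, 0, 1],
                  [0, 1, 1, 1, 0]])
filters = [np.eye(3), np.fliplr(np.eye(3))]

layer = convolutional_layer(image, filters)
print(len(layer), layer[0].shape)   # 2 feature maps, each of shape (3, 3)
```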

Now that we understand convolution, we can proceed to another important step—max pooling.

Step 2 – Max pooling

This step in a CNN is responsible for reducing the size of each feature map. When dealing with neural networks, we don't want too many inputs; otherwise, our network wouldn't be able to learn properly because of the high complexity. Therefore, we introduce a size-reduction method called max pooling. It lets us reduce the size without losing the important features, and it makes those features partially invariant to small shifts and distortions of the input (such as slight translations or rotations).

Technically, a max pooling algorithm is also based on an array sliding across the entire feature map. In this case, we are not searching for any features but, rather, for the maximum value in a specific area of a feature map.

Let me show you what I mean with this graphic:

Figure 10: Step 2 – Max pooling (1/5)

In this example, we're taking the feature map obtained after the convolution step and running it through max pooling. As you can see, we have a 2 x 2 window looking for the highest value in the part of the feature map it covers. In this case, it's 1.

Can you tell what will happen in the next iteration?

As you may have suspected, this window will slide to the right, although in a slightly different way than before. It moves like this:

Figure 11: Step 2 – Max Pooling (2/5)

The window jumps to the right by its own size, two cells at a time, which, I hope you remember, is different from the convolution step, where the feature detector slid one cell at a time. In this case, the highest value is again 1, and therefore we write 1 in the pooled feature map.

What happens this time when we hit the boundary of the feature map? Things look slightly different from before once again. This is what happens:

Figure 12: Step 2 – Max pooling (3/5)

The window crosses the boundary and searches for the highest value in the part of the feature map that is still inside the max pooling window. Yet again, the highest value is 1.

But what happens now? After all, there's no space left to go to the right, and only a single row is left at the bottom for max pooling. This is what the algorithm does:

Figure 13: Step 2 – Max pooling (4/5)

As we can see, it once again crosses the boundary and searches for the highest value in what is inside the window. In this case, it is 0. This process is repeated until the window hits the bottom right corner of the feature map. To recap what our CNN looks like for now, have a look at the following diagram:

Figure 14: Step 2 – Max pooling (5/5)

We had a smiling face as input, then we ran it through convolution to obtain many feature maps, called the convolutional layer. Now we've run all the feature maps through max pooling and obtained many pooled feature maps, all together called the pooling layer.
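
Here's a minimal numpy sketch of 2 x 2 max pooling as described above: the window jumps by its own size and, at the boundary, simply takes the maximum of whatever cells are left inside it. The example feature map is illustrative:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: the window jumps by its own size and,
    at the boundary, takes the maximum of the cells that remain."""
    h, w = feature_map.shape
    out_h = int(np.ceil(h / size))
    out_w = int(np.ceil(w / size))
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * size:(i + 1) * size,
                                 j * size:(j + 1) * size]
            pooled[i, j] = window.max()
    return pooled

# An illustrative 3 x 3 feature map, like one produced by the convolution step.
feature_map = np.array([[0, 1, 0],
                        [1, 0, 1],
                        [0, 0, 0]])

print(max_pool(feature_map))
# [[1. 1.]
#  [0. 0.]]
```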

Now we can continue to the next step, which will let us input the pooling layer into a neural network. This step is called flattening.

Step 3 – Flattening

This is a very short step. As the name may suggest, we change all the pooled feature maps from 2D arrays to 1D ones. As I mentioned before, this will let us input the image into a neural network with ease. So, how exactly will we achieve this? The following diagram should help you understand:

Figure 15: Step 3 – Flattening (1/3)

Here we go back to the pooled feature map we obtained before. To flatten it, we take the values row by row, starting from the top left and finishing at the bottom right. This operation returns a 1D array containing the same values as the 2D array we started with.

But remember, we don't have one pooled feature map, we have an entire layer of them. What do you think we should do with that?

The answer is simple: we put this entire layer into a single 1D flattened array, one pooled feature map after another. Why does it have to be 1D? This is because ANNs only accept 1D arrays as their inputs. All the layers in a traditional neural network are 1D, which means that the input has to be 1D as well. Therefore, we flatten all the pooled feature maps, like so:

Figure 16: Step 3 – Flattening (2/3)

We've taken the entire layer and transformed it into a single flattened 1D array. We'll soon use this array as the input of a traditional neural network.
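
In code, flattening the whole pooling layer is a one-liner with numpy. Here's a minimal sketch with two illustrative pooled feature maps:

```python
import numpy as np

# An illustrative pooling layer made of two small pooled feature maps.
pooled_feature_maps = [np.array([[1, 1],
                                 [0, 0]]),
                       np.array([[0, 1],
                                 [1, 0]])]

# Flatten each map row by row (top left to bottom right), then put them
# one after another into a single 1D vector.
flattened = np.concatenate([m.ravel() for m in pooled_feature_maps])

print(flattened)         # [1 1 0 0 0 1 1 0]
print(flattened.shape)   # (8,) -> ready to be used as the input of an ANN
```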

First, let's remind ourselves of what our model looks like now:

Figure 17: Step 3 – Flattening (3/3)

So, we have a Convolutional Layer, Pooling Layer, and a freshly added, flattened 1D layer. Now we can go back to a classic ANN, that is, a fully connected neural network, and treat this last layer as an input for this network. This leads us to the final step, full connection.

Step 4 – Full connection

The final step of creating a CNN is to connect it to a classic fully-connected neural network. Remember that we already have a 1D array telling us in a compressed way what the image looks like, so why not just use it as an input to a fully-connected neural network? After all, it's the latter that's able to make predictions.

That's exactly what we do next, just like this:

Figure 18: Step 4 – Full connection

After flattening, we input those returned values straight into the fully-connected neural network, which then yields the prediction—the output value.

You might be wondering how the back-propagation phase works now. In a CNN, back-propagation not only updates the weights in the fully-connected neural network, but also the filters used in the convolution step. The max pooling and flattening steps will remain the same, as there is nothing to update there.
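
If you want to convince yourself of this, here's a small sketch with TensorFlow/Keras (assuming TensorFlow 2 is installed; the tiny architecture and the random data are purely illustrative). It shows that gradients are computed for the convolution filters as well as for the fully connected weights, while the max pooling and flattening layers have no trainable parameters at all:

```python
import numpy as np
import tensorflow as tf

# A tiny CNN: convolution -> max pooling -> flattening -> full connection.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(7, 7, 1)),
    tf.keras.layers.Conv2D(2, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2, activation='softmax'),
])

# One random 7 x 7 grayscale image and one fake label, just for illustration.
x = np.random.rand(1, 7, 7, 1).astype('float32')
y = np.array([[1.0, 0.0]], dtype='float32')

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y, model(x)))
grads = tape.gradient(loss, model.trainable_variables)

# Gradients exist for the Conv2D kernel/bias and the Dense kernel/bias only;
# MaxPooling2D and Flatten contribute no trainable variables.
for var, grad in zip(model.trainable_variables, grads):
    print(var.name, grad.shape)
```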

In conclusion, CNNs look for specific features. This is why they're mostly used when we are dealing with images, where searching for features is crucial. For example, when trying to distinguish a sad face from a happy one, a CNN needs to understand which mouth shape means a sad face and which means a happy face. In order to obtain an output, a CNN has to run these steps:

  1. Convolution – Applying filters to the input image. This operation finds the features our CNN is looking for and saves them in feature maps.
  2. Max pooling – Reducing the size of each feature map by taking the maximum value in each area the window covers and saving these values in a new array called a pooled feature map.
  3. Flattening – Changing the entire pooling layer (all pooled feature maps) into a single 1D vector, so that we can input this vector into a neural network.
  4. Full connection – Creating a neural network that takes the flattened pooling layer as input and returns the value we would like to predict. This last step lets us make predictions.
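
Putting the four steps together, here's what a minimal CNN for the happy/sad example could look like in Keras (assuming TensorFlow 2 is installed; the 28 x 28 input size and the layer sizes are illustrative choices, not values from this chapter):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),                  # grayscale input image
    # Step 1 - Convolution: 32 feature detectors of size 3 x 3.
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    # Step 2 - Max pooling: a 2 x 2 window reduces each feature map.
    tf.keras.layers.MaxPooling2D((2, 2)),
    # Step 3 - Flattening: all pooled feature maps become one 1D vector.
    tf.keras.layers.Flatten(),
    # Step 4 - Full connection: a classic fully connected network.
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax'),     # happy vs. sad
])

model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```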

Deep convolutional Q-learning

In the chapter on deep Q-learning (Chapter 9, Going Pro with Artificial Brains – Deep Q-Learning), our inputs were vectors of encoded values defining the states of the environment. When working with images or videos, encoded vectors aren't the best inputs to describe a state (the input frame), simply because an encoded vector doesn't preserve the spatial structure of an image. The spatial structure is important because it gives us more information to help predict the next state, and predicting the next state is essential for our AI to learn the correct next move.

Therefore, we need to preserve the spatial structure. To do that, our inputs must be 3D images (2D for the array of pixels plus one additional dimension for the colors, as illustrated at the beginning of this chapter). For example, if we train an AI to play a video game, the inputs are simply the images of the screen itself, exactly what a human sees when playing the game.

Following this analogy, the AI acts as if it had human eyes: it observes the input images of the screen while playing the game. Those input images go into a CNN (the eyes, for a human), whose convolutional layers detect the features that describe the state in each image. The resulting feature maps are then passed through max pooling, and the pooled feature maps are flattened into a 1D vector, which becomes the input of our deep Q-learning network (the exact same one as in Chapter 9, Going Pro with Artificial Brains – Deep Q-Learning). In the end, the same deep Q-learning process is run.

The following graph illustrates deep convolutional Q-learning applied to the famous game of Doom:

Figure 19: Deep convolutional Q-learning for Doom

In summary, deep convolutional Q-learning is the same as deep Q-learning, with the only differences being that the inputs are now images, and a CNN is added at the beginning of the fully-connected deep Q-learning network to detect the states of those images.
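
As a rough illustration of that idea, here's a minimal Keras sketch of a CNN-based Q-network (assuming TensorFlow 2; the 64 x 64 grayscale frame size, the layer sizes, and the number of actions are all illustrative choices, not the architecture of the next chapter):

```python
import tensorflow as tf

n_actions = 4   # illustrative number of possible moves in the game

# The CNN plays the role of the eyes; the fully connected part is the deep
# Q-learning network, outputting one Q-value per possible action.
q_network = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 1)),                  # the game screen (state)
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(n_actions),                   # one Q-value per action
])

# Q-values are regressed with a mean squared error loss, as in deep Q-learning,
# so the last layer has no softmax activation.
q_network.compile(optimizer='adam', loss='mse')
q_network.summary()
```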

Summary

You've learned about another type of neural network—a Convolutional Neural Network.

We established that this network is used mostly with images and searches for certain features in these pictures. It uses three additional steps that ANNs don't have: convolution, where we search for features; max pooling, where we shrink the image in size; and flattening, where we flatten 2D images to a 1D vector so that we can input it into a neural network.

In the next chapter, you’ll build a deep convolutional Q-learning model to solve a classic gaming problem: Snake.
