Chapter 13

AI for Games – Become the Master at Snake

This is the last practical chapter; congratulations on finishing the previous ones! I hope you really enjoyed them. Now, let's leave aside business problems and self-driving cars. Let's have some fun by playing a popular game called Snake and making an AI that teaches itself to play this game!

That's exactly what we'll do in this chapter. The model we'll implement is called deep convolutional Q-learning, using a Convolutional Neural Network (CNN).

Our AI won't be perfect, and it won't fill in the entire map, but after some training it will start playing at a level comparable with humans.

Let's start tackling this problem by looking at what the game looks like and what the target is.

Problem to solve

First, let's have a look at the game itself:

Figure 1: The Snake game

Does that look somewhat familiar to you?

I'm pretty convinced that it will; everyone's played Snake at least once in their life.

The game is pretty simple; it consists of a snake and an apple. We control the snake and our aim is to eat as many apples as possible.

Sounds easy? Well, there's a small catch. Every time our snake eats an apple, our snake gets larger by one tile. This means that the game is unbelievably simple at the beginning, but it gets gradually harder, to the point where it becomes a strategic game.

Also, when controlling our snake, we can't hit ourselves, nor the borders of the board. This rather predictably results in us losing.

Now that we understand the problem, we can progress to the first step when creating an AI – building the environment!

Building the environment

This time, as opposed to some of the other practical sections in this book, we don't have to specify any variables or make any assumptions. We can just go straight to the three crucial steps present in every deep Q-learning project:

  1. Defining the states
  2. Defining the actions
  3. Defining the rewards

Let's begin!

Defining the states

In every previous example, our states were a 1D vector that represented some values that define the environment. For example, for our self-driving car we had the information gathered from the three sensors around the car and the car's position. All of these were put into a single 1D array.

But what if we want to make something slightly more realistic? What if we want the AI to see and gather information from the same source as we do? Well, that's what we'll do in this chapter. Our AI will see exactly the same board as we see when playing Snake!

The state of the game should be a 2D array representing the board of the game, exactly the same thing that we can see.

There's just one problem with this solution. Take a look at the following image, and see if you can answer the question: which way is our snake moving right now?

Figure 2: The Snake game

If you said "I don't know," then you're exactly right.

Based on a single frame, we can't tell which way our snake is going. Therefore, we'll need to stack multiple images, and then input all of them at once to a Convolutional Neural Network. This will result in us having 3D states rather than 2D ones.

So, just to recap:

Figure 3: The AI vision

We'll have a 3D array containing consecutive game frames stacked on top of each other, where the top one is the latest frame obtained from our game. Now, we can clearly see which way our snake is moving; in this case, it's going up, toward the apple.

Now that we have defined the states, we can move on to the next step: defining the actions!

Defining the actions

When we play Snake on a phone or a website, there are four actions available for us to take:

  1. Go up
  2. Go down
  3. Go right
  4. Go left

However, if the action we take would require the snake to make a 180° turn directly back on itself, then the game blocks this action and the snake continues going in its current direction.

In the preceding example, if we were to select action 2 – go down – our snake would still continue going up, because going down is impossible as the snake can't make a 180° turn directly back on itself.

It's worth noting that all of these actions are relative to the board, not the snake; they're not affected by the current movement of the snake. Going up, down, right, or left always means going up, down, right, or left with respect to the board, not to the snake's current direction of movement.

Alright, so right now you might be in one of these two groups when it comes to deciding what actions we model in our AI:

  1. We can use these four same actions for our AI.
  2. We can't use these same actions, because blocking certain moves will be confusing for our AI. Instead, we should invent a way to tell the snake to go left, go right, or keep going.

We actually can use these same actions for our AI!

Why won't it be confusing for our agent? As long as our AI gets its reward for the action it chose, and not for the action ultimately performed by the snake, deep Q-learning will work; our AI will understand that, in the example above, choosing either go up or go down leads to the same outcome.

For example, let's say that the AI-controlled snake is currently going left. It chooses action 3, go right; and because that would cause the snake to make a 180° turn back on itself, instead the snake continues going left. Let's say that action means the snake crashes into the wall and, as a result, dies. In order for this not to be confusing for our agent, all we need to do is tell it that the action of go right caused it to crash, even though the snake kept moving left.

Think of it as teaching an AI to play with the actual buttons on a phone. If you keep trying to make your snake double back on itself when it's moving left, by pressing the go right button over and over again, the game will keep ignoring the impossible move you keep telling it to do, keep going left, and eventually crash. That's all the AI needs to learn.

This is because, remember, in deep Q-learning we only update the Q-values of the action that the AI takes. If our snake is going left, and the AI decides to go right and the snake dies, it needs to understand that the action of go right caused it to get the negative reward, not the fact that the snake moved left; even though choosing the action go left would cause the same outcome.
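
To make this concrete, here is a minimal sketch, in plain NumPy and with made-up numbers, of how the learning target is built for a single transition: only the Q-value of the chosen action is overwritten, while the other three stay at whatever the network currently predicts.

import numpy as np

# Hypothetical Q-values predicted by the network for the current state
q_current = np.array([0.10, 0.05, -0.20, 0.15])   # up, down, right, left
chosen_action = 2       # the AI chose go right
reward = -1.            # the snake died as a result
gamma = 0.9             # discount factor
q_next_max = 0.0        # best Q-value of the next state (unused here: game over)
game_over = True

# The target starts as the network's current prediction...
target = q_current.copy()
# ...and only the chosen action's entry is overwritten
if game_over:
    target[chosen_action] = reward
else:
    target[chosen_action] = reward + gamma * q_next_max

print(target)           # [ 0.1   0.05 -1.    0.15]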

I hope you understand that the AI can use the same actions as we use when we play. We can continue to the next, final step – defining the rewards!

Defining the rewards

This last step is pretty simple; we just need three rewards:

  1. Reward for eating an apple
  2. Reward for dying
  3. The living penalty

The first two are hopefully easy to understand. After all, we want to encourage our agent to eat as many apples as possible and therefore we will set its reward to be positive. To be precise: eating an apple = +2

Meanwhile, we want to discourage our snake from dying. That's why we set that reward to be a negative one. To be precise: dying = -1

Then comes the final reward: the living penalty.

What is that, and why is it necessary? We have to convince our agent that collecting apples as quickly as possible, without dying, is a good idea. If we were to only have the two rewards we've already defined, our agent would simply travel around the entire map, hoping that at some point it finds an apple. It wouldn't understand that it needs to collect apples as quickly as it can.

That's why we introduce the living penalty. It will slightly punish our AI for every action it takes, unless this action leads to dying or collecting an apple. This will show our agent that it needs to collect apples quickly, as only moves that collect an apple lead to a positive reward. So, how big should this penalty be? Well, we don't want to punish the agent too much. To be precise: living penalty = -0.03

If you want to tinker with these rewards, keep the absolute value of the living penalty relatively small compared to the other two rewards, for dying (-1) and collecting an apple (+2).
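
If you want to experiment with these values before touching the environment code, a minimal sketch of the three constants chosen above could look like this (the names here are illustrative; the environment built later in this chapter calls them posReward, negReward, and defReward):

# Reward settings used in this chapter (feel free to tinker, but keep
# the living penalty small relative to the other two rewards)
APPLE_REWARD   =  2.0    # eating an apple
DEATH_REWARD   = -1.0    # hitting a wall or the snake's own body
LIVING_PENALTY = -0.03   # any other move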

AI solution

As always, the AI solution for deep Q-learning consists of two parts:

  1. Brain – the neural network that will learn and take actions
  2. Experience replay memory – the memory that will store our experience; the neural network will learn from this memory

Let's tackle those now!

The brain

This part of the AI solution will be responsible for teaching, storing, and evaluating our neural network. To build it, we're going to use a CNN!

Why a CNN? When explaining the theory behind them, I mentioned that they're often used when the environment returns images as states, and that's exactly what we're dealing with here. We've already established that the game state is going to be a stacked 3D array containing the last few game frames.

In the previous chapter, we discussed that a CNN takes a 2D image as input, not a stacked 3D array of images; but do you remember this graphic?


Figure 4: RGB images

Here, I explained that RGB images are represented by 3D arrays containing each 2D color channel of the image. Does that sound familiar? We can use the very same method for our problem. Just like each color channel in an RGB image, we'll simply input every game frame as a new channel, which gives us a 3D array that we can input into a CNN.

In fact, CNN implementations usually expect 3D arrays as inputs. To feed in a single 2D image, you have to add a fake, single channel that turns the 2D array into a 3D one.
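
As a quick illustration of that trick, here is how you could add a dummy channel to a single 2D frame with NumPy; this is just a sketch, and the training code later in this chapter achieves the same thing with np.reshape:

import numpy as np

frame = np.zeros((10, 10))             # one 2D game frame
frame_3d = frame[:, :, np.newaxis]     # add a fake, single channel
print(frame.shape, frame_3d.shape)     # (10, 10) (10, 10, 1)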

When it comes to the CNN architecture, we'll have two convolution layers separated by a pooling layer. One convolution layer will have 32 3x3 filters, and the other one will have 64 2x2 filters. The pooling layer will shrink each feature map by a factor of 2, as the pooling window size will be 2x2. Why this architecture? It's a classic one, found in many research papers; I chose it as common practice, and it turned out to work brilliantly.

Our neural network will have one hidden layer with 256 neurons, and an output layer with 4 neurons, one for each of the possible actions.

We also need to set two last parameters for our CNN – learning rate and input shape.

The learning rate, which we used in the previous examples, is a parameter that specifies by how much we update the weights of the neural network. Too small and the network learns too slowly; too big and it won't learn at all, because the weight updates overshoot any optimum. I found through experimentation that a good learning rate for this example is 0.0001.

We've already agreed that the input should be a 3D array containing the last frames obtained from our game. To be exact, we won't be reading pixels from the screen. Instead, we'll read directly the 2D array that represents the game's board at a particular time.

As you've probably noticed, our game is built on a grid; in the example we're using, the grid is 10x10. Inside the environment, there's an array of the same size (10x10) that tells us mathematically what the board looks like. For example, if part of the snake occupies a cell, we place the value 0.5 in the corresponding cell of this 2D array, which we will read; an apple is represented by the value 1.

Now that we know how we'll see one frame, we need to decide how many previous frames we'll use to describe the current game state. Two should be enough, since we can already discern from that which way the snake is going, but to be safe, we'll use four.

Can you tell me exactly what shape our input to the CNN will be?

It'll be 10x10x4, which gives us a 3D array!
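
Here is a minimal NumPy sketch of how four such 10x10 frames could be stacked into one 10x10x4 state; the actual training code builds this incrementally, by appending the newest frame and dropping the oldest:

import numpy as np

nRows, nColumns, nLastStates = 10, 10, 4

# Four consecutive frames; 0.5 marks the snake, 1 marks the apple
frames = [np.zeros((nRows, nColumns)) for _ in range(nLastStates)]
frames[-1][5][5] = 0.5          # e.g. part of the snake in the latest frame
frames[-1][2][5] = 1.0          # and the apple

state = np.stack(frames, axis = -1)   # stack the frames along a channel axis
print(state.shape)                    # (10, 10, 4)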

The experience replay memory

As defined in the theoretical chapter of deep Q-learning, we need to have a memory that stores experience gathered during training.

We'll store the following data:

  • Current state – The game state the AI was in when it performed an action (what we inputted to our CNN)
  • Action – Which action was undertaken
  • Reward – The reward gained by performing this action on the current state
  • Next state – What happened (how the state looked) after performing the action
  • Game over – Information about whether we have lost or not

Also, we always have to specify two parameters for every experience replay memory:

  • Memory size – The maximum size of our memory
  • Gamma – The discount factor, existent in the Bellman equation

We'll set the memory size to 60,000 and the gamma parameter to 0.9.

There's one last thing to specify here.

I told you that our AI will learn from this memory, and that's true; but the AI won't be learning from the entire memory. Rather, it will learn from a small batch taken from it. The parameter that specifies this size will be called batch size, and in this example, we'll set its value to 32. That means that our AI will learn every iteration from a batch of this size taken from experience replay memory.

Now that you understand everything you have to code, you can get started!

Implementation

You'll implement the entire AI code and the Snake game in five files:

  1. environment.py file – The file containing the environment (Snake game)
  2. brain.py file – The file in which we build our CNN
  3. DQN.py – The file that builds the Experience Replay Memory
  4. train.py – The file where we will train our AI to play Snake
  5. test.py – The file where we will test our AI to see how well it performs

You can find all of them on the GitHub page, along with a pre-trained model. To get there, select the Chapter 13 folder on the main page.

We'll go through each file in the same order. Let's start building the environment!

Step 1 – Building the environment

Start this first, important step by importing the libraries you'll need. Like this:

# Importing the libraries   #4
import numpy as np   #5
import pygame as pg   #6

You'll only use two libraries: NumPy and PyGame. The former is really useful when dealing with lists or arrays, and the latter will be used to build the entire game – to draw the snake and the apple, and update the screen.

Now, let's create the Environment class which will contain all the information, variables and methods that you need for your game. Why a class? This is because it makes things easier for you later on. You'll be able to call specific methods or variables from the object of this class.

The first method that you always have to have is the __init__ method, always called when a new object of this class is created in the main code. To create this class along with this __init__ method, you need to write:

# Initializing the Environment class   #8
class Environment():   #9
    #10
    def __init__(self, waitTime):   #11
        #12
        # Defining the parameters   #13
        self.width = 880            # width of the game window   #14
        self.height = 880           # height of the game window   #15
        self.nRows = 10             # number of rows in our board   #16
        self.nColumns = 10          # number of columns in our board   #17
        self.initSnakeLen = 2       # initial length of the snake   #18
        self.defReward = -0.03      # reward for taking an action - The Living Penalty   #19
        self.negReward = -1.        # reward for dying   #20
        self.posReward = 2.         # reward for collecting an apple   #21
        self.waitTime = waitTime    # slowdown after taking an action   #22
        #23
        if self.initSnakeLen > self.nRows / 2:   #24
            self.initSnakeLen = int(self.nRows / 2)   #25
        #26
        self.screen = pg.display.set_mode((self.width, self.height))   #27
        #28
        self.snakePos = list()   #29
        #30
        # Creating the array that contains mathematical representation of the game's board   #31
        self.screenMap = np.zeros((self.nRows, self.nColumns))   #32
        #33
        for i in range(self.initSnakeLen):   #34
            self.snakePos.append((int(self.nRows / 2) + i, int(self.nColumns / 2)))   #35
            self.screenMap[int(self.nRows / 2) + i][int(self.nColumns / 2)] = 0.5   #36
            #37
        self.applePos = self.placeApple()   #38
        #39
        self.drawScreen()   #40
        #41
        self.collected = False   #42
        self.lastMove = 0   #43

You create a new class, the Environment() class, along with its __init__ method. This method takes only one argument, waitTime. Then, after defining the method, you define a list of parameters, each of which is explained in the inline comments. After that, you perform some initialization. On lines 24 and 25 you make sure the initial snake length is at most half the board height, and on line 27 you set up the screen. One important thing to note is that on line 32 you create the screenMap array, which is the mathematical representation of the board: 0.5 in a cell means that the cell is taken by the snake, and 1 means that the cell is taken by the apple.

On lines 34 to 36, you place the snake in the middle of the board, facing upward. Then, in the remaining lines, you place an apple using the placeApple() method (which we're about to define), draw the screen, mark that the apple hasn't been collected, and set the last move to 0 (up).

That's the very first method completed. Now you can proceed to the next one:

    # Building a method that gets new, random position of an apple
    def placeApple(self):
        posx = np.random.randint(0, self.nColumns)
        posy = np.random.randint(0, self.nRows)
        while self.screenMap[posy][posx] == 0.5:
            posx = np.random.randint(0, self.nColumns)
            posy = np.random.randint(0, self.nRows)
        
        self.screenMap[posy][posx] = 1
        
        return (posy, posx)

This short method places an apple at a new, random spot in your screenMap array, re-drawing the position until it lands on a cell that isn't occupied by the snake. You'll need it whenever the snake collects an apple and a new one has to be placed. It also returns the (row, column) position of the new apple.

Then, you'll need a function that draws everything for you to see:

    # Making a function that draws everything for us to see
    def drawScreen(self):
        
        self.screen.fill((0, 0, 0))
        
        cellWidth = self.width / self.nColumns
        cellHeight = self.height / self.nRows
        
        for i in range(self.nRows):
            for j in range(self.nColumns):
                if self.screenMap[i][j] == 0.5:
                    pg.draw.rect(self.screen, (255, 255, 255), (j*cellWidth + 1, i*cellHeight + 1, cellWidth - 2, cellHeight - 2))
                elif self.screenMap[i][j] == 1:
                    pg.draw.rect(self.screen, (255, 0, 0), (j*cellWidth + 1, i*cellHeight + 1, cellWidth - 2, cellHeight - 2))
                    
        pg.display.flip()

As you can see, the name of this method is drawScreen and it doesn't take any arguments. Here you simply empty the entire screen, then fill it in with white tiles where the snake is and with a red tile where the apple is. At the end, you update the screen with pg.display.flip().

Now, you need a function that will update the snake's position and not the entire environment:

    # A method that updates the snake's position
    def moveSnake(self, nextPos, col):
        
        self.snakePos.insert(0, nextPos)
        
        if not col:
            self.snakePos.pop(len(self.snakePos) - 1)
        
        self.screenMap = np.zeros((self.nRows, self.nColumns))
        
        for i in range(len(self.snakePos)):
            self.screenMap[self.snakePos[i][0]][self.snakePos[i][1]] = 0.5
        
        if col:
            self.applePos = self.placeApple()
            self.collected = True
            
        self.screenMap[self.applePos[0]][self.applePos[1]] = 1

You can see that this new method takes two arguments: nextPos and col. The former tells you where the head of the snake will be after performing a certain action. The latter tells you whether the snake has collected an apple by taking this action. Remember that if the snake has collected an apple, its length increases by 1; that's why the new head position is inserted at the front of snakePos and the tail is only popped when no apple was collected. You can also see that if the snake has collected an apple, a new one is spawned in a new spot.

Now, let's move on to the most important part of this code. You define a function that will update the entire environment. It will move your snake, calculate the reward, check if you lost, and return a new game frame. This is how it starts:

    # The main method that updates the environment
    def step(self, action):
        # action = 0 -> up
        # action = 1 -> down
        # action = 2 -> right
        # action = 3 -> left
        
        # Resetting these parameters and setting the reward to the living penalty
        gameOver = False
        reward = self.defReward
        self.collected = False
        
        for event in pg.event.get():
            if event.type == pg.QUIT:
                return
        
        snakeX = self.snakePos[0][1]
        snakeY = self.snakePos[0][0]
        
        # Checking whether the action is playable; if not, it is replaced with the snake's current direction
        if action == 1 and self.lastMove == 0:
            action = 0
        if action == 0 and self.lastMove == 1:
            action = 1
        if action == 3 and self.lastMove == 2:
            action = 2
        if action == 2 and self.lastMove == 3:
            action = 3

As you can see, this method is called step and it takes one argument: the action that tells you which way you want the snake to be going. Just beneath the method's definition, in the comments, you can see which action means which direction.

Then you reset some variables. You set gameOver to False, as this bool will tell you whether you lost after performing this action; you set reward to defReward, the living penalty, which may change later if the snake collects an apple or dies; and you reset self.collected to False.

Then there's a for loop over PyGame's event queue. It's there to make sure the PyGame window doesn't freeze; processing pending events is a requirement of the PyGame library. It just has to be there.

snakeX and snakeY hold the current position of the snake's head. They'll be used later to determine what happens after the head moves.

In the last few lines, you can see the algorithm that blocks impossible actions. Just to recap, an impossible action is one that would require the snake to make a 180° turn in place. lastMove tells you which way the snake is going right now and is compared with action. If the two contradict each other, action is set to lastMove.

Still inside this method, you update the snake position, check for game over, and calculate the reward, like so:

        # Checking what happens when we take this action
        if action == 0:
            if snakeY > 0:
                if self.screenMap[snakeY - 1][snakeX] == 0.5:
                    gameOver = True
                    reward = self.negReward
                elif self.screenMap[snakeY - 1][snakeX] == 1:
                    reward = self.posReward
                    self.moveSnake((snakeY - 1, snakeX), True)
                elif self.screenMap[snakeY - 1][snakeX] == 0:
                    self.moveSnake((snakeY - 1, snakeX), False)
            else:
                gameOver = True
                reward = self.negReward

Here you check what happens if the snake goes up. If the head of the snake is already in the top row (row no. 0) then you've obviously lost, since the snake hits the wall. So, reward is set to negReward and gameOver is set to True. Otherwise, you check what lies ahead of the snake.

If the cell ahead already contains part of the snake's body, then you've lost. You check that in the first if statement, then set gameOver to True and reward to negReward.

Else if the cell ahead is an apple, then you set reward to posReward. You also update the snake's position by calling the method you created just before this one.

Else if the cell ahead is empty, then you don't update reward in any way. You call the same method again, but this time with the col argument set to False, since the snake hasn't collected an apple. You go through the same process for every other action. I won't go through every line, but have a look at the code:

        elif action == 1:
            if snakeY < self.nRows - 1:
                if self.screenMap[snakeY + 1][snakeX] == 0.5:
                    gameOver = True
                    reward = self.negReward
                elif self.screenMap[snakeY + 1][snakeX] == 1:
                    reward = self.posReward
                    self.moveSnake((snakeY + 1, snakeX), True)
                elif self.screenMap[snakeY + 1][snakeX] == 0:
                    self.moveSnake((snakeY + 1, snakeX), False)
            else:
                gameOver = True
                reward = self.negReward
                
        elif action == 2:
            if snakeX < self.nColumns - 1:
                if self.screenMap[snakeY][snakeX + 1] == 0.5:
                    gameOver = True
                    reward = self.negReward
                elif self.screenMap[snakeY][snakeX + 1] == 1:
                    reward = self.posReward
                    self.moveSnake((snakeY, snakeX + 1), True)
                elif self.screenMap[snakeY][snakeX + 1] == 0:
                    self.moveSnake((snakeY, snakeX + 1), False)
            else:
                gameOver = True
                reward = self.negReward 
        
        elif action == 3:
            if snakeX > 0:
                if self.screenMap[snakeY][snakeX - 1] == 0.5:
                    gameOver = True
                    reward = self.negReward
                elif self.screenMap[snakeY][snakeX - 1] == 1:
                    reward = self.posReward
                    self.moveSnake((snakeY, snakeX - 1), True)
                elif self.screenMap[snakeY][snakeX - 1] == 0:
                    self.moveSnake((snakeY, snakeX - 1), False)
            else:
                gameOver = True
                reward = self.negReward

Simply handle every single action in the same way you did with the action of going up. Check if the snake didn't hit the walls, check what lies ahead of the snake and update the snake's position, reward, and gameOver accordingly.

There are two more steps in this method; let's jump straight into the first one:

        # Drawing the screen, updating last move and waiting the wait time specified
        self.drawScreen()
        
        self.lastMove = action
        
        pg.time.wait(self.waitTime)

You update the screen by drawing the snake and the apple on it, then set lastMove to action, since the snake has now moved and is heading in the action direction. Finally, you wait for waitTime milliseconds, the slowdown specified when the environment was created.

The last step in this method is to return what the game looks like now, what the reward is that was obtained, and whether you've lost, like this:

        # Returning the new frame of the game, the reward obtained and whether the game has ended or not
        return self.screenMap, reward, gameOver

screenMap gives you the information you need about what the game looks like after performing an action, reward gives you the collected reward from taking this action, and gameOver tells you whether you lost or not.

That's it for this method! To have a complete Environment class, you only need to make a function that will reset the environment, like this reset method:

    # Making a function that resets the environment
    def reset(self):
        self.screenMap  = np.zeros((self.nRows, self.nColumns))
        self.snakePos = list()
        
        for i in range(self.initSnakeLen):
            self.snakePos.append((int(self.nRows / 2) + i, int(self.nColumns / 2)))
            self.screenMap[int(self.nRows / 2) + i][int(self.nColumns / 2)] = 0.5
        
        self.screenMap[self.applePos[0]][self.applePos[1]] = 1
        
        self.lastMove = 0

It simply resets the game board (screenMap) and the snake's position to the default, which is the middle of the board. It also keeps the apple's position the same as it was in the last round, and resets lastMove to 0 (up).
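
Once the whole class is saved as environment.py, a quick sanity check like the following lets you watch a few random moves and inspect what step returns. This is just a sketch for experimentation, not part of the chapter's files:

# Sanity-checking the environment with a few random moves (illustrative only)
import numpy as np
from environment import Environment

env = Environment(100)      # 100 ms slowdown so you can watch the moves
for _ in range(20):
    action = np.random.randint(0, 4)
    screenMap, reward, gameOver = env.step(action)
    print(reward, gameOver)
    if gameOver:
        env.reset()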

Congratulations! You've just finished building the environment. Now, we'll proceed to the second step, building the brain.

Step 2 – Building the brain

This is where you'll build our brain with a Convolutional Neural Network. You'll also set some parameters for its training and define a method that loads a pre-trained model for testing.

Let's begin!

As always, you start by importing the libraries that you'll use, like this:

# Importing the libraries
import keras
from keras.models import Sequential, load_model
from keras.layers import Dense, Dropout, Conv2D, MaxPooling2D, Flatten
from keras.optimizers import Adam

As you've probably noticed, all of the classes are a part of the Keras library, which is the one you're going to use in this chapter. Keras is actually the only library that you'll use in this file. Let's go through each of these classes and methods right now:

  1. Sequential – A class that allows you to initialize a neural network, and defines the general structure of this network.
  2. load_model – A function that loads a model from a file.
  3. Dense – A class to create fully connected layers in an Artificial Neural Network (ANN).
  4. Dropout – A class that adds dropout to our network. You've seen it used already, in Chapter 8, AI for Logistics – Robots in a Warehouse.
  5. Conv2D – A class that builds convolution layers.
  6. MaxPooling2D – A class that builds max pooling layers.
  7. Flatten – A class that performs flattening, so that you'll have an input for a classic ANN.
  8. Adam – An optimizer, which will optimize your neural network. It's used when training the CNN.

Now that you've imported your libraries, you can continue by creating a class called Brain, where all of these classes and methods are used. Start by defining the class and its __init__ method, like this:

# Creating the Brain class
class Brain():
    
    def __init__(self, iS = (100,100,3), lr = 0.0005):
        
        self.learningRate = lr
        self.inputShape = iS
        self.numOutputs = 4
        self.model = Sequential()

You can see that the __init__ method takes two arguments: iS (the input shape) and lr (the learning rate). Then you define some variables associated with this class: learningRate, inputShape, and numOutputs. You set numOutputs to 4, as this is how many actions your AI can take. Then, in the last line, you create an empty model using the Sequential class, which you imported earlier.

Doing this will allow you to add all the layers that you need to the model. That's exactly what you do with these lines:

        # Adding layers to the model   #20
        self.model.add(Conv2D(32, (3,3), activation = 'relu', input_shape = self.inputShape))   #21
        #22
        self.model.add(MaxPooling2D((2,2)))   #23
        #24
        self.model.add(Conv2D(64, (2,2), activation = 'relu'))   #25
        #26
        self.model.add(Flatten())   #27
        #28
        self.model.add(Dense(units = 256, activation = 'relu'))   #29
        #30
        self.model.add(Dense(units = self.numOutputs))   #31

Let's break this code down into lines:

Line 21: You add a new convolution layer to your model. It has 32 3x3 filters with the ReLU activation function. You need to specify the input shape here as well. Remember that the input shape is one of the arguments of this function, and is saved under the inputShape variable.

Line 23: You add a max pooling layer. The window's size is 2x2, which will shrink our feature maps in size by 2.

Line 25: You add the second convolution layer. This time it has 64 2x2 filters, with the same ReLU activation function. Why ReLU this time? I tried some other activation functions experimentally, and it turned out that for this AI ReLU worked the best.

Line 27: Having applied the convolutions, you receive new feature maps, which you flatten to a 1D vector. That's exactly what this line does – it flattens all the 2D feature maps into a single 1D vector, which you'll then be able to use as the input to the fully connected part of your neural network.

Line 29: Now, you're in the full connection step – you're building the traditional ANN. This specific line adds a new hidden layer with 256 neurons and the ReLU activation function to our model.

Line 31: You create the last layer in your neural network – the output layer. How big is it? Well, it has to have as many neurons as there are actions that you can take. You put that value under the numOutputs variable earlier, and the value is equal to 4. You don't specify the activation function here, which means that the activation function will be linear as a default. It turns out that in this case, during training, using a linear output works better than a Softmax output; it makes the training more efficient.

You also have to compile your model. This will tell your code how to calculate the error, and which optimizer to use when training your model. You can do it with this single line:

        # Compiling the model
        self.model.compile(loss = 'mean_squared_error', optimizer = Adam(lr = self.learningRate))

Here, you use a method that's a part of the Sequential class (that's why you can use your model to call it) to do just that. The method is called compile and, in this case, takes two arguments. loss is a function that tells the AI how to calculate the error of your neural network; you'll use mean_squared_error. The second parameter is the optimizer. You've already imported the Adam optimizer, and you use it here. The learning rate for this optimizer was one of the arguments of the __init__ method of this class, and its value is represented by the learningRate variable.

There's only one step left to do in this class – make a function that will load a model from a file. You do it with this code:

    # Making a function that will load a model from a file
    def loadModel(self, filepath):
        self.model = load_model(filepath)
        return self.model

You can see that you've created a new function called loadModel, which takes one argument – filepath. This parameter is the file path to the pre-trained model. Once you've defined the function, you can actually load the model from this file path. To do so, you use the load_model function, which you imported earlier; it takes the same argument, filepath. Then, in the final line, you return the loaded model.
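
With brain.py complete, you can quickly verify the architecture with a couple of lines like these (a sketch; model.summary() is a standard Keras method that prints each layer, its output shape, and its parameter count):

# Quick check of the CNN architecture (illustrative)
from brain import Brain

brain = Brain((10, 10, 4), 0.0001)   # the input shape and learning rate used in this chapter
brain.model.summary()                # prints the layers, output shapes and parameter counts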

Congratulations! You've just finished building the brain.

Let's advance on our path, and build the experience replay memory.

Step 3 – Building the experience replay memory

You'll build this memory now, and later, you'll train your model from small batches of this memory. The memory will contain information about the game state before taking the action, the action that was taken, the reward gained, and the game state after performing the action.

I have some excellent news for you – do you remember this code?

# AI for Games - Beat the Snake game
# Implementing Deep Q-Learning with Experience Replay

# Importing the libraries
import numpy as np

# IMPLEMENTING DEEP Q-LEARNING WITH EXPERIENCE REPLAY

class Dqn(object):
    
    # INTRODUCING AND INITIALIZING ALL THE PARAMETERS AND VARIABLES OF THE DQN
    def __init__(self, max_memory = 100, discount = 0.9):
        self.memory = list()
        self.max_memory = max_memory
        self.discount = discount

    # MAKING A METHOD THAT BUILDS THE MEMORY IN EXPERIENCE REPLAY
    def remember(self, transition, game_over):
        self.memory.append([transition, game_over])
        if len(self.memory) > self.max_memory:
            del self.memory[0]

    # MAKING A METHOD THAT BUILDS TWO BATCHES OF INPUTS AND TARGETS BY EXTRACTING TRANSITIONS FROM THE MEMORY
    def get_batch(self, model, batch_size = 10):
        len_memory = len(self.memory)
        num_inputs = self.memory[0][0][0].shape[1]
        num_outputs = model.output_shape[-1]
        inputs = np.zeros((min(len_memory, batch_size), num_inputs))
        targets = np.zeros((min(len_memory, batch_size), num_outputs))
        for i, idx in enumerate(np.random.randint(0, len_memory, size = min(len_memory, batch_size))):
            current_state, action, reward, next_state = self.memory[idx][0]
            game_over = self.memory[idx][1]
            inputs[i] = current_state
            targets[i] = model.predict(current_state)[0]
            Q_sa = np.max(model.predict(next_state)[0])
            if game_over:
                targets[i, action] = reward
            else:
                targets[i, action] = reward + self.discount * Q_sa
        return inputs, targets

You'll use almost the same code, with only two small changes.

First, you get rid of this line:

        num_inputs = self.memory[0][0][0].shape[1]

And then change this line:

        inputs = np.zeros((min(len_memory, batch_size), num_inputs))

To this one:

        inputs = np.zeros((min(len_memory, batch_size), self.memory[0][0][0].shape[1],self.memory[0][0][0].shape[2],self.memory[0][0][0].shape[3]))

Why did you have to do this? Well, you got rid of the first line since you no longer have a 1D vector of inputs. Now you have a 3D array.

Then, if you look closely, you'll see that you didn't really change what inputs represents. Before, it was a 2D array: one dimension was the batch size and the other was the number of inputs. Things are very similar now; the first dimension is still the batch size, and the remaining three correspond to the size of a single input.

Since our input is now a 3D array, you wrote .shape[1], .shape[2], and .shape[3]. What exactly are those shapes?

.shape[1] is the number of rows in the game (in your case 10). .shape[2] is the number of columns in the game (in your case 10). .shape[3] is the number of last frames stacked onto each other (in your case 4).

As you can see, you didn't really change anything. You just made the code work for our 3D inputs.

I also renamed this dqn.py file to DQN.py and renamed the class DQN to Dqn.
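
As a quick illustration of how the modified memory works with the new 4D states, here is a sketch with dummy data; the real training loop in train.py does exactly this, but with states coming from the environment:

# Illustrative use of the Dqn memory with 4D states (batch, rows, columns, frames)
import numpy as np
from DQN import Dqn
from brain import Brain

dqn = Dqn(max_memory = 60000, discount = 0.9)
model = Brain((10, 10, 4), 0.0001).model

currentState = np.zeros((1, 10, 10, 4))
nextState = np.zeros((1, 10, 10, 4))

# Remembering one transition: action 2 (go right), living penalty, game not over
dqn.remember([currentState, 2, -0.03, nextState], False)

inputs, targets = dqn.get_batch(model, batch_size = 32)
print(inputs.shape, targets.shape)    # (1, 10, 10, 4) (1, 4)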

That's that! That was probably much simpler than most of you expected it to be.

You can finally start training your model. We'll do that in the next section – training the AI.

Step 4 – Training the AI

This is, by far, the most important step. Here we finally teach our AI to play Snake!

As always, start by importing the libraries you need:

# Importing the libraries
from environment import Environment
from brain import Brain
from DQN import Dqn
import numpy as np
import matplotlib.pyplot as plt

In the first three lines you import the tools that you created earlier, including the Brain, the Environment, and the experience replay memory.

Then, in the following two lines, you import the libraries that you'll use. These include NumPy and Matplotlib. You'll already recognize the former; the latter will be used to display your model's performance. To be specific, it will help you display a graph that, every 100 games, will show you the average number of apples collected.

That's all for this step. Now, define some hyperparameters for your code:

# Defining the parameters
memSize = 60000
batchSize = 32
learningRate = 0.0001
gamma = 0.9
nLastStates = 4

epsilon = 1.
epsilonDecayRate = 0.0002
minEpsilon = 0.05

filepathToSave = 'model2.h5'

I'll explain them in this list:

  1. memSize – The maximum size of your experience replay memory.
  2. batchSize – The size of the batch of inputs and targets that you get at each iteration from your experience replay memory for your model to train on.
  3. learningRate – The learning rate for your Adam optimizer in the Brain.
  4. gamma – The discount factor for your experience replay memory.
  5. nLastStates – How many last frames you save as your current state of the game. Remember, you'll input a 3D array of size nRows x nColumns x nLastStates to your CNN in the Brain.
  6. epsilon – The initial epsilon, the chance of taking a random action.
  7. epsilonDecayRate – By how much you decrease epsilon after every single game/epoch.
  8. minEpsilon – The lowest possible epsilon, after which it can't be adjusted any lower.
  9. filepathToSave – Where you want to save your model.

There you go – you've defined the hyperparameters. You'll use them later when you write the rest of the code. Now, you have to create an environment, a brain, and an experience replay memory:

# Creating the Environment, the Brain and the Experience Replay Memory
env = Environment(0)
brain = Brain((env.nRows, env.nColumns, nLastStates), learningRate)
model = brain.model
dqn = Dqn(memSize, gamma)

You can see that in the first line you create an object of the Environment class. You need to specify one variable here, which is the slowdown of your environment (wait time between moves). You don't want any slowdown during the training, so you input 0 here.

In the next line you create an object of the Brain class. It takes two arguments – the input shape and the learning rate. As I've mentioned multiple times, the input shape will be a 3D array of size nRows x nColumns x nLastStates, so that's what you type in here. The second argument is the learning rate, and since you've created a variable for that, you simply pass the name of this variable – learningRate. After this line, you take the model attribute of this Brain object and store it in a variable of its own. Keep things simple, and call it model.

In the last line you create an object of the Dqn class. It takes two arguments – the maximum size of the memory, and the discount factor for the memory. You've specified two variables, memSize and gamma, for just that, so you use them here.

Now, you need to write a function that will reset the states for your AI. You need it because the states are quite complicated, and resetting them in the main code would mess it up a lot. Here's what it looks like:

# Making a function that will initialize game states   #30
def resetStates():   #31
    currentState = np.zeros((1, env.nRows, env.nColumns, nLastStates))   #32
    #33
    for i in range(nLastStates):   #34
        currentState[:,:,:,i] = env.screenMap   #35
    #36
    return currentState, currentState   #37

Let's break it down into separate lines:

Line 31: You define a new function called resetStates. It doesn't take any arguments.

Line 32: You create a new array called currentState. It's full of zeros, but you may ask why it's 4D; shouldn't the input be 3D as we said? You're absolutely right, and it will be. The first dimension is called batch size and simply says how many inputs you input to your neural network at once. You'll only input one array at a time, so the first size is 1. The next three sizes correspond to the size of the input.

Lines 34-35: In a for loop, which will be executed nLastStates times, you set the board for each layer in your 3D state to the current, initial look of the game board from your environment. Every frame in your state will look the same initially, the same way the board of the game looks when you start a game.

Line 37: This function will return two currentStates. Why? This is because you need two game state arrays. One to represent the board before you've taken an action, and one to represent the board after you've taken an action.

Now you can start writing the code for the entire training. First, create a couple of useful variables, like this:

# Starting the main loop
epoch = 0
scores = list()
maxNCollected = 0
nCollected = 0.
totNCollected = 0

epoch tells you which epoch/game you're in right now. scores is a list in which you save the average score per game after every 100 games/epochs. maxNCollected holds the highest score obtained so far in the training, while nCollected is the score in the current game/epoch. The last variable, totNCollected, counts how many apples you've collected over the last 100 epochs/games.

Now you start an important, infinite while loop, like this:

while True:
    # Resetting the environment and game states
    env.reset()
    currentState, nextState = resetStates()
    epoch += 1
    gameOver = False

Here, you iterate through every game, every epoch. That's why you restart the environment in the first line, create new currentState and nextState in the next line, increase epoch by one, and set gameOver to False as you obviously haven't lost yet.

Note that this loop doesn't end; therefore, the training never stops. We do it this way because we don't have a set goal for when to stop the training, since we haven't defined what a satisfactory result for our AI would be. We could compute the average result, or a similar metric, and stop on that, but then training might take too long. I prefer to keep the training going so that you can simply stop it whenever you want. A good time to stop is when the AI reaches an average of six apples per game, or you can even go up to 12 apples per game if you want better performance.
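
If you'd rather have the script stop on its own, one option (my own addition, not part of the chapter's code) is to break out of the main loop once the latest 100-game average in scores crosses a threshold you're happy with:

    # Optional: stop automatically once the average score is good enough.
    # Place this at the end of the main while loop, after scores is updated
    # (the block that appends to scores every 100 games is shown later).
    if len(scores) > 0 and scores[-1] >= 6.0:
        print('Reached an average of 6 apples per game - stopping the training.')
        break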

You've started the first loop that will iterate through every epoch. Now you need to create the second loop, where the AI performs actions, updates the environment, and trains your CNN. Start it with these lines:

    # Starting the second loop in which we play the game and teach our AI
    while not gameOver: 
        
        # Choosing an action to play
        if np.random.rand() < epsilon:
            action = np.random.randint(0, 4)
        else:
            qvalues = model.predict(currentState)[0]
            action = np.argmax(qvalues)

As I mentioned, this is the loop in which your AI makes decisions, moves, and updates the environment. You start off by initializing a while loop that will be executed as long as you haven't lost; that is, as long as gameOver is set to False.

Then, you can see an if/else condition. This is where your AI makes its decision. If a random value from the range (0, 1) is lower than epsilon, a random action is performed. Otherwise, you predict the Q-values based on the current state of the game and take the index of the highest one; that index is the action performed by your AI.

Then, you have to update your environment:

        # Updating the environment
        state, reward, gameOver = env.step(action)

You use the step method from your Environment class object. It takes one argument, which is the action that you perform. It also returns the new frame obtained from your game after performing this action along with the reward obtained and the game over information. You'll use these variables soon.

Keep in mind that this method returns a single 2D frame from your game. This means you have to add this new frame to your nextState and remove the oldest one. You do this with these lines:

        # Adding new game frame to the next state and deleting the oldest frame from next state
        state = np.reshape(state, (1, env.nRows, env.nColumns, 1))
        nextState = np.append(nextState, state, axis = 3)
        nextState = np.delete(nextState, 0, axis = 3)

As you can see, first you reshape state because it is 2D, while both currentState and nextState are 4D. Then you append this new, reshaped frame to nextState along axis 3. Why 3? Because index 3 refers to the fourth dimension of this array, the one that holds the stacked 2D frames. In the last line you simply delete the oldest frame from nextState, which sits at index 0 (the oldest frames are kept at the lowest indexes).

Now, you can remember this transition in your experience replay memory, and train your model from a random batch of this memory. You do that with these lines:

        # Remembering the transition and training our AI
        dqn.remember([currentState, action, reward, nextState], gameOver)
        inputs, targets = dqn.get_batch(model, batchSize)
        model.train_on_batch(inputs, targets)

In the first line, you append this transition to the memory. It contains information about the game state before taking the action (currentState), the action that was taken (action), the reward gained (reward), and the game state after taking this action (nextState). You also remember the gameOver status. In the following two lines, you take a random batch of inputs and targets from your memory, and train your model on them.

Having done that, you can check if your snake has collected an apple and update currentState. You can do that with these lines:

        # Checking whether we have collected an apple and updating the current state
        if env.collected:
            nCollected += 1
        
        currentState = nextState

In the first two lines, you check whether the snake has collected an apple and if it has, you increase nCollected. Then you update currentState by setting its values to the ones of nextState.

Now, you can quit this loop. You still have a couple of things to do:

    # Checking if the record of apples eaten in a round was beaten and, if so, saving the model
    if nCollected > maxNCollected and nCollected > 2:
        maxNCollected = nCollected
        model.save(filepathToSave)
    
    totNCollected += nCollected
    nCollected = 0

You check whether you've beaten the record for the number of apples eaten in a round (this number also has to be bigger than 2) and, if you have, you update the record and save your current model to the file path you specified before. You also add nCollected to totNCollected and reset nCollected to 0 for the next game.

Then, after 100 games, you show the average score, like this:

    # Showing the results each 100 games
    if epoch % 100 == 0 and epoch != 0:
        scores.append(totNCollected / 100)
        totNCollected = 0
        plt.plot(scores)
        plt.xlabel('Epoch / 100')
        plt.ylabel('Average Score')
        plt.savefig('stats.png')
        plt.close()

You have a list called scores, where you store the average score over each block of 100 games. You append the new average (totNCollected / 100) to it and then reset totNCollected. Then you plot scores on a graph, using the Matplotlib library that you imported before. This graph is saved to stats.png every 100 games/epochs.

Then you lower the epsilon, like so:

    # Lowering the epsilon
    if epsilon > minEpsilon:
        epsilon -= epsilonDecayRate

With the if condition, you make sure that the epsilon doesn't go lower than the minimum threshold.

In the last line, you display some additional information about every single game, like this:

    # Showing the results each game
    print('Epoch: ' + str(epoch) + ' Current Best: ' + str(maxNCollected) + ' Epsilon: {:.5f}'.format(epsilon))

You display the current epoch (game), the current record for the number of apples collected in one game, and the current epsilon.

That's it! Congratulations! You've just built a function that will train your model. Remember that this training goes on infinitely until you decide it's finished. When you're satisfied with it, you'll want to test it. For that, you need a short file to test your model. Let's do it!

Step 5 – Testing the AI

This will be a very short section, so don't worry. You'll be running this code in just a moment!

As always, you start by importing the libraries you need:

# Importing the libraries
from environment import Environment
from brain import Brain
import numpy as np

This time you won't be using the DQN memory or the Matplotlib library, so you don't import them.

You also need to specify some hyperparameters, like this:

# Defining the parameters
nLastStates = 4
filepathToOpen = 'model.h5'
slowdown = 75

You'll need nLastStates later in this code. You also specify the file path of the model that you'll test. Finally, there's a variable that specifies the wait time after every move, so that you can clearly see how your AI performs.

Once again, you create some useful objects, like an Environment and a Brain:

# Creating the Environment and the Brain
env = Environment(slowdown)
brain = Brain((env.nRows, env.nColumns, nLastStates))
model = brain.loadModel(filepathToOpen)

You pass slowdown to the Environment, because that's the argument this class takes. You also create an object of the Brain class, but this time you don't specify the learning rate, since you won't be training the model. In the final line you load a pre-trained model using the loadModel method of the Brain class; it takes one argument, the file path from which to load the model.

Once again, you need a function to reset states. You can use the same one as before, so just copy and paste these lines:

# Making a function that will reset game states
def resetStates():
    currentState = np.zeros((1, env.nRows, env.nColumns, nLastStates))
    
    for i in range(nLastStates):
        currentState[:,:,:,i] = env.screenMap
   
    return currentState, currentState

Now, you can enter the main while loop like before. This time, however, you won't define any variables, since you don't need any:

# Starting the main loop
while True:
    # Resetting the game and the game states
    env.reset()
    currentState, nextState = resetStates()
    gameOver = False

As you can see, you've started this infinite while loop. Once again, you have to restart the environment, the states, and the game over, every iteration.

Now, you can enter the game's while loop, where you take actions, update the environment, and so on:

    # Playing the game
    while not gameOver: 
        
        # Choosing an action to play
        qvalues = model.predict(currentState)[0]
        action = np.argmax(qvalues)

This time, you don't need any if statements. After all, you're testing your AI, so you mustn't have any random actions here.

Once again, you update the environment:

        # Updating the environment
        state, _, gameOver = env.step(action)

You don't really care about the reward, so just place "_" instead of reward. The environment still returns the frame after taking an action, along with the information about game over.

Due to this fact, you need to reshape your state and update nextState in the same way as before:

        # Adding new game frame to next state and deleting the oldest one from next state
        state = np.reshape(state, (1, env.nRows, env.nColumns, 1))
        nextState = np.append(nextState, state, axis = 3)
        nextState = np.delete(nextState, 0, axis = 3)

In the final line, you need to update currentState as you did in the other file:

        # Updating current state
        currentState = nextState

That's the end of coding for this section! This isn't, however, the end of this chapter. You still have to run the code.

The demo

Unfortunately, due to PyGame not being supported by Google Colab, you'll need to use Anaconda.

Thankfully, you should have it installed after Chapter 10, AI for Autonomous Vehicles – Build a Self-Driving Car, so it'll be easier to install the required packages and libraries.

Installation

First, create a new virtual environment inside Anaconda. This time, I'll walk you through the installation on the Anaconda Prompt from a PC, so that you can all see how it's done from any system.

Windows users, please open the Anaconda Prompt on your PC, and Mac/Linux users, please open your Terminal on Mac/Linux. Then type:

conda create -n snake python=3.6

Just like so:

Then, hit Enter on your keyboard. You should get something more or less like this:

Type y on your keyboard and hit Enter once again. After everything gets installed, type this in your Anaconda Prompt:

conda activate snake

And hit Enter once again. Now on the left, you should see snake written instead of base. This means that you're in the newly created Anaconda environment.

Now you need to install the required libraries. The first one is Keras:

conda install -c conda-forge keras

After writing that, hit Enter. When you get this:

Type y once again and hit Enter once again. Once you have it installed, you need to install PyGame and Matplotlib.

The first one can be installed by entering pip install pygame, while the second one can be installed by entering pip install matplotlib. The installation follows the same procedure as you just took to install Keras.
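
For reference, those two commands are:

pip install pygame
pip install matplotlib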

Ok, now you can run your code!

If you've accidentally closed your Anaconda Prompt/Terminal for any reason, re-open it and type in this to activate the snake environment that we have just created:

conda activate snake

And then hit Enter. I got a bunch of warnings after doing this, and you may see similar warnings as well, but don't worry about them:

Now, you need to navigate this console to the folder that contains the file you want to run, in this case train.py. I recommend that you put all the code of Chapter 13 in one folder called Snake on your desktop. Then you'll be able to follow the exact instructions that I'll give you now. To navigate to this folder, you'll need to use cd commands.

First, navigate to the desktop by running cd Desktop, like this:

And then enter the Snake folder that you created. Just as with the previous command, run cd Snake, like this:

You're getting super close. To train a new model, you need to type:

python train.py

And hit Enter. This is more or less what you should see:

You have both a window on the left with the game, and one on the right with the terminal informing you about every game (every epoch).

Congratulations! You just smashed the code of this chapter and built an AI for Snake. Be patient with it though! Training it may take up to a couple of hours.

So, what kind of results can you expect?

The results

First, make sure to follow the results in your Anaconda Prompt/Terminal as well, epoch by epoch. An epoch is one game played. After thousands of games (epochs), you'll see the score increase, along with the size of the snake.

After thousands of epochs of training, while the snake doesn't fill in the entire map, your AI plays on a level comparable with humans. Here are some pictures after 25,000 epochs.

Figure 5: Results example 1

Figure 6: Results example 2

You'll also get a graph created in the folder (stats.png) showing the average score over the epochs. Here is the graph I got when training our AI over 25,000 epochs:

Figure 7: Average score over 25,000 epochs

You can see that our AI reached an average score of 10-11 per game. This isn't bad considering that before training it knew absolutely nothing about the game.

You can also see the same results if you run the test.py file using the pre-trained model model.h5 attached to this chapter in GitHub. To do this, you simply need to enter in your Anaconda Prompt/Terminal (still in the same Snake folder on your desktop that contains all the code of Chapter 13, and still inside the snake virtual environment):

python test.py

If you want to test your model after training, you simply need to replace model.h5 with model2.h5 in the test.py file. That's because during the training the weights of your AI's neural network will be saved into a file named model2.h5. Then re-enter python test.py in your Anaconda Prompt/Terminal, and enjoy your own results.
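
In other words, the only line you need to change in test.py is the one defining the file path, so that it points at the file saved by train.py:

filepathToOpen = 'model2.h5'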

Summary

In this last practical chapter of the book, we built a deep convolutional Q-Learning model for Snake. Before we built anything, we had to define what our AI would see. We established that we needed to stack multiple frames, so that our AI would see the continuity of its moves. This was the input to our Convolutional Neural Network. The outputs were the Q-values corresponding to each of the four possible moves: going up, going down, going left, and going right. We rewarded our AI for eating an apple, punished it for losing, and punished it slightly for performing any action (the living penalty). Having run 25,000 games, we can see that our AI is able to eat 10-11 apples per game.

I hope you enjoyed it!
