Q-learning using a neural network

Now, we want to test the Q-learning algorithm using a smaller checkerboard environment and a neural network (with Keras). The main difference from the previous examples is that now, the state is represented by a screenshot of the current configuration; hence, the model has to learn how to associate a value with each input image and action. This isn't actual deep Q-learning (which is based on Deep Convolutional Networks, and requires more complex environments that we cannot discuss in this book), but it shows how such a model can learn an optimal policy with the same input provided to a human being. In order to reduce the training time, we are considering a square checkerboard environment, with four negative absorbing states and a positive final one:

import numpy as np

width = 5
height = 5
nb_actions = 4

y_final = width - 1
x_final = height - 1

y_wells = [0, 1, 3, 4]
x_wells = [3, 1, 2, 0]

standard_reward = -0.1
tunnel_rewards = np.ones(shape=(height, width)) * standard_reward

for x_well, y_well in zip(x_wells, y_wells):
    tunnel_rewards[x_well, y_well] = -5.0

tunnel_rewards[x_final, y_final] = 5.0

A graphical representation of the rewards is shown in the following figure:

Rewards in the smaller checkerboard environment
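
A minimal sketch to reproduce such a visualization is shown below (assuming Matplotlib is available; the colormap and the numeric annotations are arbitrary choices):

import matplotlib.pyplot as plt

# Show the reward matrix as a heatmap (colormap is an arbitrary choice)
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.matshow(tunnel_rewards, cmap='coolwarm')

for x in range(height):
    for y in range(width):
        ax.text(y, x, '{:.1f}'.format(tunnel_rewards[x, y]),
                ha='center', va='center')

fig.colorbar(im, ax=ax)
plt.show()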

As we want to provide the network with a graphical input, we need to define a function to create a matrix representing the tunnel:

import numpy as np

def reset_tunnel():
    tunnel = np.zeros(shape=(height, width), dtype=np.float32)

    for x_well, y_well in zip(x_wells, y_wells):
        tunnel[x_well, y_well] = -1.0

    tunnel[x_final, y_final] = 0.5

    return tunnel

The reset_tunnel() function sets all values equal to 0, except for the wells (which are marked with -1) and the final state (marked with 0.5). The position of the agent (denoted by the value 1) is managed directly by the training function. At this point, we can create and compile our neural network. As the problem is not very complex, we are employing an MLP:

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()

model.add(Dense(8, input_dim=width * height))
model.add(Activation('tanh'))

model.add(Dense(4))
model.add(Activation('tanh'))

model.add(Dense(nb_actions))
model.add(Activation('linear'))

model.compile(optimizer='rmsprop',
              loss='mse')

The input is a flattened array, while the output is the Q function (one value for each action). The network is trained with RMSprop and a mean squared error loss function (our goal is to reduce the MSE between the actual values and the predictions). In order to train and query the network, it's helpful to create a few dedicated functions:

import numpy as np

def train(state, q_value):
    model.train_on_batch(np.expand_dims(state.flatten(), axis=0),
                         np.expand_dims(q_value, axis=0))

def get_Q_value(state):
    return model.predict(np.expand_dims(state.flatten(), axis=0))[0]

def select_action_neural_network(epsilon, state):
    Q_value = get_Q_value(state)

    if np.random.uniform(0.0, 1.0) < epsilon:
        return Q_value, np.random.randint(0, nb_actions)

    return Q_value, np.argmax(Q_value)

The behavior of these functions is straightforward. The only element that may be new to the reader is the use of the train_on_batch() method. Contrary to fit(), this method allows us to perform a single training step, given a batch of input-output couples (in our case, we always have a single couple). As our goal is to find an optimal path to the final state starting from every possible cell, we are going to employ random starts:

import numpy as np

xy_grid = np.meshgrid(np.arange(0, height), np.arange(0, width), sparse=False)
xy_grid = np.array(xy_grid).T.reshape(-1, 2)

# The final states (wells and goal) are excluded from the possible starting points;
# the goal is appended as a tuple so the membership check below matches correctly
xy_final = list(zip(x_wells, y_wells))
xy_final.append((x_final, y_final))

xy_start = []

for x, y in xy_grid:
    if (x, y) not in xy_final:
        xy_start.append([x, y])

xy_start = np.array(xy_start)

def starting_point():
    xy = np.squeeze(xy_start[np.random.randint(0, xy_start.shape[0], size=1)])
    return xy[0], xy[1]

Now, we can define the functions needed to perform a single training step:

import numpy as np

def is_final(x, y):
    if (x, y) in zip(x_wells, y_wells) or (x, y) == (x_final, y_final):
        return True
    return False

def q_step_neural_network(epsilon, initial_state):
    # gamma and max_steps are assumed to be defined as in the previous example
    e = 0
    total_reward = 0.0

    (i, j) = starting_point()

    prev_value = 0.0
    tunnel = initial_state.copy()
    tunnel[i, j] = 1.0

    while e < max_steps:
        e += 1

        q_value, action = select_action_neural_network(epsilon, tunnel)

        if action == 0:
            if i == 0:
                x = 0
            else:
                x = i - 1
            y = j

        elif action == 1:
            if j == width - 1:
                y = width - 1
            else:
                y = j + 1
            x = i

        elif action == 2:
            if i == height - 1:
                x = height - 1
            else:
                x = i + 1
            y = j

        else:
            if j == 0:
                y = 0
            else:
                y = j - 1
            x = i

        reward = tunnel_rewards[x, y]
        total_reward += reward

        # Move the agent marker (1.0) to the new cell and restore the old one
        tunnel_n = tunnel.copy()
        tunnel_n[i, j] = prev_value
        tunnel_n[x, y] = 1.0

        prev_value = tunnel[x, y]

        if is_final(x, y):
            q_value[action] = reward
            train(tunnel, q_value)
            break

        else:
            q_value[action] = reward + (gamma * np.max(get_Q_value(tunnel_n)))
            train(tunnel, q_value)

            i = x
            j = y

            tunnel = tunnel_n.copy()

    return total_reward

The q_step_neural_network() function is very similar to the one defined in the previous example. The main difference is the management of the visual state. Every time there's a transition, the value 1.0 (denoting the agent) is moved from the old position to the new one, and the previous cell is reset to its default value (saved in the prev_value variable). Another secondary difference is the absence of α, because the optimizer already has its own learning rate, and it doesn't make sense to add another parameter to the model. We can now train the model for 10,000 iterations, with 7,500 explorative ones:

n_episodes = 10000
n_exploration = 7500

total_rewards = []

for t in range(n_episodes):
    tunnel = reset_tunnel()

    epsilon = 0.0

    if t <= n_exploration:
        epsilon = 1.0 - (float(t) / float(n_exploration))

    t_reward = q_step_neural_network(epsilon, tunnel)
    total_rewards.append(t_reward)

When the training process has finished, we can analyze the total rewards, in order to understand whether the network has successfully learned the Q functions:

Total rewards obtained by the neural network Q-learning algorithm
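
A minimal sketch to produce such a plot (assuming Matplotlib is available; the 50-episode window for the moving average is an arbitrary choice):

import matplotlib.pyplot as plt

# Smooth the episode rewards with a simple moving average for readability
window = 50
smoothed = np.convolve(total_rewards, np.ones(window) / window, mode='valid')

plt.figure(figsize=(14, 6))
plt.plot(total_rewards, alpha=0.3, label='Total reward')
plt.plot(np.arange(window - 1, len(total_rewards)), smoothed,
         label='Moving average (50 episodes)')
plt.xlabel('Episode')
plt.ylabel('Total reward')
plt.legend()
plt.show()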

It's clear that the model is working well, because after the exploration period, the total reward becomes stationary at around 4, with small oscillations due to the different path lengths (however, the final plot can differ because of the internal random state employed by Keras). As a confirmation, let's generate the trajectories for all of the possible initial states, using the greedy policy (equivalent to ε = 0):

import numpy as np

trajectories = []
tunnels_c = []

for i, j in xy_start:
    tunnel = reset_tunnel()

    prev_value = 0.0

    trajectory = [[i, j, -1]]

    tunnel_c = tunnel.copy()
    tunnel[i, j] = 1.0
    tunnel_c[i, j] = 1.0

    final = False
    e = 0

    while not final and e < max_steps:
        e += 1

        q_value = get_Q_value(tunnel)
        action = np.argmax(q_value)

        if action == 0:
            if i == 0:
                x = 0
            else:
                x = i - 1
            y = j

        elif action == 1:
            if j == width - 1:
                y = width - 1
            else:
                y = j + 1
            x = i

        elif action == 2:
            if i == height - 1:
                x = height - 1
            else:
                x = i + 1
            y = j

        else:
            if j == 0:
                y = 0
            else:
                y = j - 1
            x = i

        trajectory[e - 1][2] = action
        trajectory.append([x, y, -1])

        tunnel[i, j] = prev_value

        prev_value = tunnel[x, y]

        tunnel[x, y] = 1.0
        tunnel_c[x, y] = 1.0

        i = x
        j = y

        final = is_final(x, y)

    trajectories.append(np.array(trajectory))
    tunnels_c.append(tunnel_c)

# The trajectories have different lengths, so an object array is used
trajectories = np.array(trajectories, dtype=object)

Twelve random trajectories are shown in the following figure:

Twelve trajectories generated using the greedy policy
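
A possible way to obtain such a figure is sketched below (assuming Matplotlib; the selection of 12 random trajectories and the marker style are arbitrary choices):

import matplotlib.pyplot as plt

# Pick 12 random trajectories and overlay them on the reward heatmap
fig, axes = plt.subplots(3, 4, figsize=(14, 10))
indices = np.random.choice(len(trajectories), size=12, replace=False)

for ax, idx in zip(axes.ravel(), indices):
    trajectory = trajectories[idx]
    ax.matshow(tunnel_rewards, cmap='coolwarm')
    # Column index on the horizontal axis, row index on the vertical axis
    ax.plot(trajectory[:, 1], trajectory[:, 0], 'o-', color='black')
    ax.set_xticks([])
    ax.set_yticks([])

plt.show()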

The agent always follows the optimal policy, independently of the initial state, and never ends up in a well. Even though the example is quite simple, it's helpful for introducing the reader to the concept of deep Q-learning (for further details, the reader can check the introductory paper Deep Reinforcement Learning: An Overview, Li Y., arXiv:1701.07274 [cs.LG]).

In a general case, the environment can be a more complex game (such as an Atari or Sega title), and the number of possible actions is very limited. Moreover, there's no possibility of employing random starts, but it's generally good practice to skip a number of initial frames in order to avoid biasing the estimator. Clearly, the network must be more complex (involving convolutions to better learn the geometric dependencies), and the number of iterations must be extremely large. Many other tricks and specific algorithms can be employed in order to speed up the convergence, but due to a lack of space, they are beyond the scope of this book.
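
As a purely illustrative sketch (not part of the original example), a convolutional Q-network for frame-based inputs could look like the following; the 84 × 84 input resolution, the four stacked frames, and all layer sizes are assumptions in the spirit of common deep Q-learning setups:

from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

# Hypothetical convolutional Q-network for stacked grayscale frames
# (input shape and layer sizes are assumptions, not the book's model)
deep_q_model = Sequential()

deep_q_model.add(Conv2D(32, kernel_size=(8, 8), strides=(4, 4),
                        activation='relu', input_shape=(84, 84, 4)))
deep_q_model.add(Conv2D(64, kernel_size=(4, 4), strides=(2, 2),
                        activation='relu'))
deep_q_model.add(Conv2D(64, kernel_size=(3, 3), strides=(1, 1),
                        activation='relu'))

deep_q_model.add(Flatten())
deep_q_model.add(Dense(512, activation='relu'))
deep_q_model.add(Dense(nb_actions, activation='linear'))

deep_q_model.compile(optimizer='rmsprop', loss='mse')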

However, the general process and its logic are almost the same, and it's not difficult to understand why some strategies are preferable and how the accuracy can be improved. As an exercise, I invite the reader to create more complex environments, with or without checkpoints and stochastic rewards. It's not surprising to see how easily the model can learn the dynamics with a sufficiently large number of episodes. Moreover, as suggested in the Actor-Critic section, it's a good idea to use TensorFlow to implement such a model, comparing its performance with Q-learning.
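
For the stochastic-reward part of the exercise, one hypothetical option is to perturb the deterministic rewards with Gaussian noise (the noise scale is an arbitrary assumption):

def stochastic_reward(x, y, noise_std=0.2):
    # Deterministic base reward plus Gaussian noise (scale is arbitrary)
    return tunnel_rewards[x, y] + np.random.normal(0.0, noise_std)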
