SARSA in the checkerboard environment

We can now test the SARSA algorithm in the original tunnel environment (all of the elements that are not redefined here are the same as in the previous chapter). The first step is defining the Q(s, a) array and the constants employed in the training process:

import numpy as np

# Number of available actions (up, right, down, and left)
nb_actions = 4

# Q(s, a) array (height and width are defined in the previous chapter)
Q = np.zeros(shape=(height, width, nb_actions))

# Starting point of every episode
x_start = 0
y_start = 0

# Maximum number of steps per episode and learning rate
max_steps = 2000
alpha = 0.25
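If you want to run this section standalone, the elements inherited from the previous chapter (height, width, x_wells, y_wells, x_final, y_final, tunnel_rewards, and gamma) must be defined before the previous snippet. The following sketch shows one possible definition; the sizes, well positions, and rewards are purely illustrative placeholders, not the values used in the original example:

import numpy as np

# Illustrative placeholder environment: a height x width tunnel with a few
# negative absorbing wells, a positive final state, and a small negative
# reward for all the other cells (all values are hypothetical)
height = 5
width = 15

x_final = height - 1
y_final = width - 1

# Hypothetical positions of the wells (negative absorbing states)
x_wells = [1, 3, 0, 4, 2, 1, 3]
y_wells = [2, 4, 6, 7, 9, 11, 13]

tunnel_rewards = np.ones(shape=(height, width)) * -0.1

for x_w, y_w in zip(x_wells, y_wells):
    tunnel_rewards[x_w, y_w] = -5.0

tunnel_rewards[x_final, y_final] = 5.0

# Discount factor employed in the SARSA update
gamma = 0.9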

As we want to employ an ε-greedy policy, we can set the starting point to (0, 0), forcing the agent to reach the positive final state. We can now define the functions needed to perform a training step:

import numpy as np


def is_final(x, y):
    # A state is terminal if it's a well or the positive final state
    if (x, y) in zip(x_wells, y_wells) or (x, y) == (x_final, y_final):
        return True
    return False


def select_action(epsilon, i, j):
    # ε-greedy selection: random action with probability ε, greedy otherwise
    if np.random.uniform(0.0, 1.0) < epsilon:
        return np.random.randint(0, nb_actions)
    return np.argmax(Q[i, j])


def sarsa_step(epsilon):
    e = 0

    i = x_start
    j = y_start

    while e < max_steps:
        e += 1

        action = select_action(epsilon, i, j)

        if action == 0:
            # Move up
            if i == 0:
                x = 0
            else:
                x = i - 1
            y = j

        elif action == 1:
            # Move right
            if j == width - 1:
                y = width - 1
            else:
                y = j + 1
            x = i

        elif action == 2:
            # Move down
            if i == height - 1:
                x = height - 1
            else:
                x = i + 1
            y = j

        else:
            # Move left
            if j == 0:
                y = 0
            else:
                y = j - 1
            x = i

        action_n = select_action(epsilon, x, y)
        reward = tunnel_rewards[x, y]

        if is_final(x, y):
            Q[i, j, action] += alpha * (reward - Q[i, j, action])
            break

        else:
            Q[i, j, action] += alpha * (reward + (gamma * Q[x, y, action_n]) - Q[i, j, action])

        i = x
        j = y

The select_action() function has been designed to select a random action with probability ε and a greedy one (with respect to Q(s, a)) with probability 1 - ε. The sarsa_step() function is straightforward: it executes a complete episode, updating Q(s, a) after every transition (that's why this is an online algorithm). At this point, it's possible to train the model for 20,000 episodes, employing a linear decay for ε during the first 15,000 episodes (when t > 15,000, ε is set to 0 in order to employ a purely greedy policy):

n_episodes = 20000
n_exploration = 15000

for t in range(n_episodes):
    epsilon = 0.0

    if t <= n_exploration:
        epsilon = 1.0 - (float(t) / float(n_exploration))

    sarsa_step(epsilon)
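Before looking at the learned values, a quick sanity check is running a purely greedy rollout from the start state and counting the steps. The helper below is a hypothetical addition (not part of the original code) that reuses the same clipped transitions as sarsa_step():

def greedy_rollout(max_length=100):
    # Follow the greedy policy from the start state and record the path
    i, j = x_start, y_start
    path = [(i, j)]

    for _ in range(max_length):
        action = np.argmax(Q[i, j])

        # Same clipped transitions as in sarsa_step()
        if action == 0:
            i = max(i - 1, 0)
        elif action == 1:
            j = min(j + 1, width - 1)
        elif action == 2:
            i = min(i + 1, height - 1)
        else:
            j = max(j - 1, 0)

        path.append((i, j))

        if is_final(i, j):
            break

    return path

print('Greedy path length: {}'.format(len(greedy_rollout()) - 1))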

As usual, let's check the learned values (considering that the policy is greedy, we're going to plot V(s) = maxₐ Q(s, a)):
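A minimal plotting sketch is shown below (assuming matplotlib is available; the figure size and color map are purely illustrative choices):

import matplotlib.pyplot as plt

# V(s) = max_a Q(s, a) for the greedy policy
V = np.max(Q, axis=2)

fig, ax = plt.subplots(figsize=(14, 4))
im = ax.matshow(V, cmap='viridis')
fig.colorbar(im, ax=ax)
ax.set_title('V(s) = max_a Q(s, a)')
plt.show()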

Final value matrix (as V(s) = maxₐ Q(s, a))

As expected, the Q function has been learned in a consistent way, and we can get confirmation by plotting the resulting policy:
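One possible way to derive and display the greedy policy is sketched below (the arrow glyphs and the loop are illustrative additions, not the original plotting code):

# Greedy action in each non-terminal state, shown as an arrow
policy_symbols = {0: '↑', 1: '→', 2: '↓', 3: '←'}

for i in range(height):
    row = ''
    for j in range(width):
        if is_final(i, j):
            row += ' * '
        else:
            row += ' {} '.format(policy_symbols[int(np.argmax(Q[i, j]))])
    print(row)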

Final policy

The policy is coherent with the initial objective: the agent avoids all the negative absorbing states and always tries to move towards the final positive state. However, some paths seem longer than expected. As an exercise, I invite the reader to retrain the model for a larger number of iterations, adjusting the exploration period. Moreover, is it possible to improve the model by increasing (or decreasing) the discount factor γ? Remember that γ → 0 leads to a short-sighted agent that selects actions considering only the immediate reward, while γ → 1 forces the agent to take into account a larger number of future rewards. This particular example is based on a long environment, because the agent always starts from (0, 0) and must reach the farthest point; therefore, all the intermediate states have less importance, and it's helpful to look far into the future in order to pick the optimal actions. Using random starts (as sketched below) can surely improve the policy for all the initial states, but it's also interesting to investigate how different γ values affect the decisions; hence, I suggest repeating the experiment in order to evaluate the various configurations and gain a better understanding of the different factors involved in a TD algorithm.
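A possible random-start variant (a hypothetical modification, not part of the original code) only requires sampling a non-terminal initial state at the beginning of each episode, for example by replacing the two assignments i = x_start and j = y_start inside sarsa_step() with a call to a helper such as:

def random_start():
    # Sample a random non-terminal cell as the initial state of an episode
    while True:
        i = np.random.randint(0, height)
        j = np.random.randint(0, width)

        if not is_final(i, j):
            return i, j

In the same spirit, gamma could be turned into a parameter of sarsa_step(), so that different discount factors can be compared after resetting Q to zero.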
