We can now test the SARSA algorithm in the original tunnel environment (all of the elements that are not redefined here are the same as in the previous chapter). The first step is defining the Q(s, a) array and the constants employed in the training process:
import numpy as np

nb_actions = 4

# Q(s, a) array (initialized to zero)
Q = np.zeros(shape=(height, width, nb_actions))

# Starting point and training constants
x_start = 0
y_start = 0
max_steps = 2000   # Maximum length of an episode
alpha = 0.25       # Learning rate of the TD update
As we want to employ an ε-greedy policy, we can set the starting point to (0, 0), forcing the agent to reach the positive final state. We can now define the functions needed to perform a training step:
import numpy as np

def is_final(x, y):
    # A state is terminal if it's a well or the positive final state
    if (x, y) in zip(x_wells, y_wells) or (x, y) == (x_final, y_final):
        return True
    return False

def select_action(epsilon, i, j):
    # ε-greedy policy: explore with probability ε, otherwise act greedily
    if np.random.uniform(0.0, 1.0) < epsilon:
        return np.random.randint(0, nb_actions)
    return np.argmax(Q[i, j])
def sarsa_step(epsilon):
    e = 0
    i = x_start
    j = y_start

    # In SARSA, the action evaluated in the TD update is the same one
    # executed in the next state, so it's selected once and carried forward
    action = select_action(epsilon, i, j)

    while e < max_steps:
        e += 1

        # Compute the next cell (x, y), clipping the movement at the borders
        if action == 0:
            if i == 0:
                x = 0
            else:
                x = i - 1
            y = j
        elif action == 1:
            if j == width - 1:
                y = width - 1
            else:
                y = j + 1
            x = i
        elif action == 2:
            if i == height - 1:
                x = height - 1
            else:
                x = i + 1
            y = j
        else:
            if j == 0:
                y = 0
            else:
                y = j - 1
            x = i

        reward = tunnel_rewards[x, y]

        if is_final(x, y):
            # Terminal states have no successor, so the TD target is the reward alone
            Q[i, j, action] += alpha * (reward - Q[i, j, action])
            break
        else:
            action_n = select_action(epsilon, x, y)
            Q[i, j, action] += alpha * (reward + (gamma * Q[x, y, action_n]) - Q[i, j, action])
            i = x
            j = y
            action = action_n
The select_action() function has been designed to select a random action with probability ε, and a greedy one with respect to Q(s, a) with probability 1 - ε. The sarsa_step() function is straightforward: it executes a complete episode, updating Q(s, a) after every transition (that's why SARSA is an online algorithm). At this point, it's possible to train the model for 20,000 episodes, employing a linear decay for ε during the first 15,000 episodes (when t > 15,000, ε is set to 0 in order to employ a purely greedy policy):
n_episodes = 20000
n_exploration = 15000

for t in range(n_episodes):
    # Linear decay of ε during the exploration period, then ε = 0 (greedy)
    epsilon = 0.0
    if t <= n_exploration:
        epsilon = 1.0 - (float(t) / float(n_exploration))
    sarsa_step(epsilon)
As usual, let's check the learned values (considering that the policy is greedy, we're going to plot V(s) = max_a Q(s, a)):
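V(s) can be obtained by collapsing the action axis of the Q array with a max. A minimal sketch, using a hypothetical 3×3 array instead of the real one (which has shape (height, width, nb_actions)):

```python
import numpy as np

# Hypothetical Q array for a 3x3 grid with 4 actions (illustrative values only)
Q_demo = np.zeros((3, 3, 4))
Q_demo[0, 0] = [0.1, 0.5, -0.2, 0.0]
Q_demo[2, 2] = [1.0, 0.3, 0.3, 0.3]

# V(s) = max over the action axis (axis=2)
V = np.max(Q_demo, axis=2)

print(V[0, 0])  # 0.5
print(V[2, 2])  # 1.0
```

The resulting (height, width) matrix can then be visualized with any heatmap utility (for example, Matplotlib's matshow()).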
As expected, the Q function has been learned in a consistent way, and we can get a confirmation plotting the resulting policy:
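The greedy policy itself can be extracted with an argmax over the action axis. A sketch with a hypothetical 2×2 Q array, using the same action encoding as sarsa_step() (0 = up, 1 = right, 2 = down, 3 = left):

```python
import numpy as np

# Hypothetical Q array (illustrative values only)
Q_demo = np.zeros((2, 2, 4))
Q_demo[0, 0, 1] = 0.8   # "right" is the best action in (0, 0)
Q_demo[1, 1, 0] = 0.6   # "up" is the best action in (1, 1)

# Greedy policy: the best action in every cell
policy = np.argmax(Q_demo, axis=2)

labels = {0: 'U', 1: 'R', 2: 'D', 3: 'L'}
print(labels[policy[0, 0]])  # R
print(labels[policy[1, 1]])  # U
```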
The policy is coherent with the initial objective, and the agent avoids all negative absorbing states, always trying to move towards the final positive state. However, some paths seem longer than expected. As an exercise, I invite the reader to retrain the model for a larger number of iterations, adjusting the exploration period. Moreover, is it possible to improve the model by increasing (or decreasing) the discount factor γ? Remember that γ → 0 leads to a short-sighted agent, which is able to select actions only considering the immediate reward, while γ → 1 forces the agent to take into account a larger number of future rewards. This particular example is based on a long environment, because the agent always starts from (0, 0) and must reach the farthest point; therefore, all intermediate states have less importance, and it's helpful to look at the future to pick the optimal actions. Using random starts can surely improve the policy for all initial states, but it's interesting to investigate how different γ values can affect the decisions; hence, I suggest repeating the experiment in order to evaluate the various configurations and increase awareness about the different factors that are involved in a TD algorithm.
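The effect of γ can be seen directly in the discounted return. A self-contained illustration (the reward sequence below is made up, not taken from the tunnel environment): four small step penalties followed by a distant +5.0, valued with two different discount factors:

```python
import numpy as np

# Illustrative reward sequence: small step penalties, then a distant +5.0
rewards = np.array([-0.1, -0.1, -0.1, -0.1, 5.0])

returns = {}
for gamma in (0.1, 0.9):
    # Discounted return: sum_k gamma^k * r_k
    discounts = gamma ** np.arange(len(rewards))
    returns[gamma] = np.sum(discounts * rewards)
    print(gamma, returns[gamma])
```

With γ = 0.1, the distant reward is almost invisible and the return stays negative, while with γ = 0.9 the same trajectory looks clearly profitable; this is exactly why a long environment benefits from a far-sighted agent.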