Value iteration in the checkerboard environment

To test this algorithm, we need to set an initial value matrix with all values equal to 0 (they could also be chosen randomly but, as we don't have any prior information about the final configuration, every initial choice is probabilistically equivalent):

import numpy as np

# Initial value matrix: one value per state, all set to 0
tunnel_values = np.zeros(shape=(height, width))
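In this snippet (and in the ones that follow), height, width, nb_actions, gamma, tolerance, nb_max_epochs, tunnel_rewards, and is_final() are assumed to be the ones defined in the previous example. For readers who want to run this section in isolation, a hypothetical stand-in (to be placed before the initialization above) could look like the following sketch; the shape, well positions, and reward values are illustrative assumptions, not necessarily those of the original environment:

import numpy as np

# Illustrative stand-in for the environment of the previous example.
# Every specific value below (shape, well positions, rewards) is an
# assumption chosen only to make the snippets in this section runnable.
height, width = 5, 15
nb_actions = 4            # 0 = up, 1 = right, 2 = down, 3 = left
gamma = 0.9               # discount factor used in the text
tolerance = 1e-5          # convergence threshold (assumed)
nb_max_epochs = 100000    # safety limit on the number of sweeps (assumed)

tunnel_rewards = np.ones(shape=(height, width)) * -0.1   # standard reward
x_final, y_final = 0, width - 1                          # ending state (assumed position)
tunnel_rewards[x_final, y_final] = 5.0                   # positive final reward (assumed)
wells = [(3, 5), (1, 7), (4, 10), (2, 12)]               # wells (assumed positions)
for x, y in wells:
    tunnel_rewards[x, y] = -5.0                          # strongly negative states (assumed)

def is_final(x, y):
    # Assumed definition: the ending state and the wells are absorbing
    return (x, y) == (x_final, y_final) or (x, y) in wells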

At this point, we can define the two functions to perform the value evaluation and the final policy selection (the function is_final() is the one defined in the previous example):

import numpy as np

def value_evaluation():
    old_tunnel_values = tunnel_values.copy()

    for i in range(height):
        for j in range(width):
            rewards = np.zeros(shape=(nb_actions, ))
            old_values = np.zeros(shape=(nb_actions, ))

            # For each action (0 = up, 1 = right, 2 = down, 3 = left), collect
            # the reward and the previous value of the successor state,
            # clamping the movement at the borders
            for k in range(nb_actions):
                if k == 0:
                    if i == 0:
                        x = 0
                    else:
                        x = i - 1
                    y = j

                elif k == 1:
                    if j == width - 1:
                        y = width - 1
                    else:
                        y = j + 1
                    x = i

                elif k == 2:
                    if i == height - 1:
                        x = height - 1
                    else:
                        x = i + 1
                    y = j

                else:
                    if j == 0:
                        y = 0
                    else:
                        y = j - 1
                    x = i

                rewards[k] = tunnel_rewards[x, y]
                old_values[k] = old_tunnel_values[x, y]

            new_values = np.zeros(shape=(nb_actions, ))

            for k in range(nb_actions):
                new_values[k] = rewards[k] + (gamma * old_values[k])

            # Bellman optimality backup: keep the value of the best action
            tunnel_values[i, j] = np.max(new_values)

def policy_selection():
    policy = np.zeros(shape=(height, width)).astype(np.uint8)

    for i in range(height):
        for j in range(width):
            if is_final(i, j):
                continue

            # Evaluate the four moves (up, right, down, left), excluding
            # the ones that would leave the grid
            values = np.zeros(shape=(nb_actions, ))

            values[0] = (tunnel_rewards[i - 1, j] + (gamma * tunnel_values[i - 1, j])) if i > 0 else -np.inf
            values[1] = (tunnel_rewards[i, j + 1] + (gamma * tunnel_values[i, j + 1])) if j < width - 1 else -np.inf
            values[2] = (tunnel_rewards[i + 1, j] + (gamma * tunnel_values[i + 1, j])) if i < height - 1 else -np.inf
            values[3] = (tunnel_rewards[i, j - 1] + (gamma * tunnel_values[i, j - 1])) if j > 0 else -np.inf

            # Pick the greedy action with respect to the current value matrix
            policy[i, j] = np.argmax(values).astype(np.uint8)

    return policy

The main difference is in the value_evaluation() function, which now has to consider all the possible successor states and select the value corresponding to the action that leads to the state with the highest value. The policy_selection() function, instead, is equivalent to policy_improvement(), but, as it is invoked only once, it directly outputs the final optimal policy.
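Since the update is just the Bellman optimality backup, V(i, j) ← max over the four moves of (reward of the successor cell + γ · previous value of the successor cell), the same sweep can also be written in a vectorized form. The following is only an equivalent sketch, not the implementation used in the text; the shift() helper is hypothetical and reproduces the border clamping of the loops above:

import numpy as np

def shift(A, direction):
    # Return, for every cell, the entry of its successor in the given
    # direction, clamping at the borders (hypothetical helper)
    if direction == 'up':
        return np.vstack([A[:1, :], A[:-1, :]])
    elif direction == 'right':
        return np.hstack([A[:, 1:], A[:, -1:]])
    elif direction == 'down':
        return np.vstack([A[1:, :], A[-1:, :]])
    else:  # 'left'
        return np.hstack([A[:, :1], A[:, :-1]])

def value_evaluation_vectorized():
    global tunnel_values
    # One candidate matrix per action: successor reward + discounted successor value
    candidates = [shift(tunnel_rewards, d) + (gamma * shift(tunnel_values, d))
                  for d in ('up', 'right', 'down', 'left')]
    # Element-wise maximum over the four actions
    tunnel_values = np.max(np.stack(candidates, axis=0), axis=0)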

At this point, we can run a training cycle (assuming the same constants as before):

import numpy as np

e = 0

policy = None

while e < nb_max_epochs:
    e += 1
    old_tunnel_values = tunnel_values.copy()
    value_evaluation()

    # Stop as soon as the mean absolute change falls below the tolerance
    if np.mean(np.abs(tunnel_values - old_tunnel_values)) < tolerance:
        policy = policy_selection()
        break
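One small caveat, not part of the original listing: the cycle assigns policy only when the mean absolute change drops below the tolerance, so it's worth verifying that convergence was actually reached before using the result:

# Sanity check (an addition to the listing above): policy is still None
# if the tolerance was never reached within nb_max_epochs sweeps
if policy is None:
    raise RuntimeError('Value iteration did not converge')

print('Converged after {} sweeps'.format(e))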

The final value configuration (after 127 iterations) is shown in the following chart:

Final value matrix

As in the previous example, the final value configuration is a function of the distance between each state and the ending one, but, in this case, the choice of γ = 0.9 isn't optimal. In fact, the wells close to the final state aren't considered very dangerous anymore. Plotting the final policy can help us understand the behavior:

Final policy
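A possible way to produce charts like the two above is sketched below (the original figures may have been generated differently; the arrow offsets are an assumption based on the action encoding 0 = up, 1 = right, 2 = down, 3 = left):

import matplotlib.pyplot as plt

# Sketch: value matrix as a heat map with the greedy policy overlaid as arrows
fig, ax = plt.subplots(figsize=(14, 5))
ax.matshow(tunnel_values, cmap='viridis')

arrows = {0: (0.0, -0.4), 1: (0.4, 0.0), 2: (0.0, 0.4), 3: (-0.4, 0.0)}
for i in range(height):
    for j in range(width):
        if is_final(i, j):
            continue
        dx, dy = arrows[policy[i, j]]
        ax.arrow(j, i, dx, dy, head_width=0.15, color='white')

ax.set_xlabel('y')
ax.set_ylabel('x')
plt.show()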

As expected, the wells that are far from the target are avoided, but the two that are close to the final state are accepted as reasonable penalties. This happens because the value iteration algorithm is very greedy with respect to the value and, with a discount factor γ < 1.0, the effect of negative states can be compensated for by the final reward. In many scenarios, these states are absorbing, so their implicit reward is +∞ or -∞, meaning that no other action can change the final value. I invite the reader to repeat the example with different discount factors (remember that an agent with γ → 0 is very short-sighted and will avoid any obstacle, even at the cost of reducing the efficiency of the policy) and to change the values of the final states. Moreover, the reader should be able to answer the question: What is the agent's behavior when the standard reward (whose default value is -0.1) is increased or decreased?
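As a starting point for these experiments, the training cycle can be wrapped in a small helper that accepts a different discount factor and reward matrix. This is only a sketch and the helper is hypothetical; it relies on the module-level globals used by value_evaluation() and policy_selection() above:

import numpy as np

# Hypothetical helper for repeating the experiment with different settings
def run_value_iteration(gamma_value, rewards):
    global gamma, tunnel_rewards, tunnel_values
    gamma = gamma_value
    tunnel_rewards = rewards
    tunnel_values = np.zeros(shape=(height, width))

    for _ in range(nb_max_epochs):
        previous_values = tunnel_values.copy()
        value_evaluation()
        if np.mean(np.abs(tunnel_values - previous_values)) < tolerance:
            break

    return policy_selection()

# Example (assumed values): a more short-sighted agent with a higher standard cost
# new_rewards = np.where(tunnel_rewards == -0.1, -0.5, tunnel_rewards)
# short_sighted_policy = run_value_iteration(0.5, new_rewards)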
