
Advanced Policy Estimation Algorithms

In this chapter, we'll complete our exploration of the world of Reinforcement Learning (RL), focusing our attention on complex algorithms that can be employed to solve difficult problems. The topic of RL is extremely large, and we couldn't cover it in its entirety even if we dedicated an entire book to it; this chapter is instead based on many practical examples that you can use as a basis to work on more complex scenarios.

The topics that will be discussed in this chapter are:

  • The TD(λ) algorithm
  • Actor-Critic TD(0)
  • SARSA
  • Q-learning, including a simple visual input and a neural network
  • Direct policy search through policy gradient

We can now start analyzing TD(λ), the natural extension of the TD(0) algorithm, which takes into account a longer sequence of transitions, obtaining a more accurate estimation of the value function.

The TD(λ) algorithm

In the previous chapter, we introduced the temporal difference strategy, and we discussed a simple example called TD(0). In the case of TD(0), the discounted reward is approximated by using a one-step backup.

Hence, if the agent performs an action a_t in the state s_t, and the transition to the state s_{t+1} is observed, the approximation becomes the following:

R_t ≈ r_{t+1} + γV(s_{t+1})

If the task is episodic (as in many real-life scenarios) and has T(e_i) steps, the complete backup for the episode e_i is as follows:

R_t = r_{t+1} + γr_{t+2} + γ²r_{t+3} + … + γ^(T(e_i)−t−1) r_{T(e_i)}

The previous expression ends when the MDP process reaches an absorbing state; therefore, R_t is the actual value of the discounted reward. The difference between TD(0) and this choice is clear: in the first case, we can update the value function after each transition, whereas with a complete backup, we need to wait for the end of the episode. We can say that this method (which is called Monte Carlo, because it's based on the idea of averaging the overall reward of an entire sequence) is exactly the opposite of TD(0); therefore, it's reasonable to think about an intermediate solution, based on k-step backups.
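To make the complete backup concrete, here is a minimal sketch (the rewards are made up for illustration and don't refer to any of the chapter's environments) that computes the discounted return of a finished episode:

def discounted_return(rewards, gamma=0.95):
    # Complete (Monte Carlo) backup:
    # R_t = r_(t+1) + gamma*r_(t+2) + ... + gamma^(T-t-1)*r_T
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Hypothetical episode: three standard step rewards followed by a final positive one
episode_rewards = [-0.1, -0.1, -0.1, 5.0]
print(discounted_return(episode_rewards))  # about 4.0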

In particular, our goal is to find an online algorithm that can exploit the backups once they are available. Let's imagine a sequence of four steps. The agent is in the first state and observes a transition; at this point, only a one-step backup is possible, and it's a good idea to update the value function in order to improve the convergence speed. After the second transition, the agent can use a two-step backup; however, it can also consider the first one-step backup in addition to the newer, longer one. So, we have two approximations:

R_t ≈ r_{t+1} + γV(s_{t+1})
R_t ≈ r_{t+1} + γr_{t+2} + γ²V(s_{t+2})

Which of the preceding is the most reliable? Obviously, the second one depends on the first one (particularly when the value function is almost stabilized), and so on until the end of the episode. Hence, the most common strategy is to employ a weighted average that assigns a different level of importance to each backup (assuming the longest backup has k steps):

R_t^(avg) = w_1 R_t^(1) + w_2 R_t^(2) + … + w_k R_t^(k), with w_1 + w_2 + … + w_k = 1

Watkins (in Watkins C.I.C.H., Learning from Delayed Rewards, Ph.D. Thesis, University of Cambridge, 1989) proved that this approach (with or without averaging) has the fundamental property of reducing the absolute error of the expected return with respect to the optimal value function V*(s). In fact, he proved that the following inequality holds:

max_s |E[R_t^(k) | s_t = s] − V*(s)| ≤ γ^k max_s |V(s) − V*(s)|

As γ is bounded between 0 and 1, the right-hand side is always smaller than the maximum absolute error max_s |V(s) − V*(s)|, where V(s) is the value of a state during an episode. Therefore, the expected discounted return of a k-step backup (or of a combination of different backups) yields a more accurate estimation of the optimal value function if the policy is chosen to be greedy with respect to it. This is not surprising, as a longer backup incorporates more actual returns, but the importance of this theorem resides in its validity when an average of different k-step backups is employed.

In other words, it provides us with the mathematical proof that an intuitive approach actually converges, and it can also effectively improve both the convergence speed and the final accuracy. However, managing k coefficients is generally problematic, and in many cases, useless. The main idea behind TD(λ) is to employ a single factor, λ, that can be tuned in order to meet specific requirements. The theoretical analysis (or forward view, as referred to by Sutton and Barto) is based, in a general case, on an exponentially decaying average. If we consider a geometric series with λ bounded between 0 and 1 (exclusive), we get:

1 + λ + λ² + λ³ + … = 1 / (1 − λ)

Hence, we can consider the averaged discounted return with infinite backups as:

R_t^(λ) = (1 − λ) [R_t^(1) + λR_t^(2) + λ²R_t^(3) + …]

Before defining the finite case, it's helpful to understand how R_t^(λ) was built. As λ is bounded between 0 and 1, the factors decay proportionally to λ^i, so the first backup has the highest impact, and all of the subsequent ones have smaller and smaller influences on the estimation. This means that, in general, we are assuming that the immediate backups (which become more and more precise) have a greater importance in the estimation of R_t, and we exploit the longer ones only to improve the estimated value.

Now, it should be clear that λ = 0 is equivalent to TD(0) (discussed in the previous chapter), because only the one-step backup remains in the sum (remember that 0⁰ = 1), while larger values involve all of the remaining backups. Let's now consider an episode e_i whose length is T(e_i). Conventionally, if the agent reached an absorbing state at t = T(e_i), all of the remaining backups R_t^(k) are equal to R_t (this is straightforward, as all of the possible rewards have already been collected); therefore, we can truncate R_t^(λ):

R_t^(λ) = (1 − λ) Σ_{k=1}^{T(e_i)−t−1} λ^(k−1) R_t^(k) + λ^(T(e_i)−t−1) R_t

The first term of the previous expression involves all of the non-terminal states, while the second is equal to R_t, weighted proportionally to the distance between the time step t and the final state. Again, if λ = 0, we obtain TD(0), but we are now also authorized to consider λ = 1 (because the sum is always extended to a finite number of elements). When λ = 1, we obtain R_t^(λ) = R_t, which means that we need to wait until the end of the episode to get the actual discounted reward.
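The following is a minimal NumPy sketch (with hypothetical rewards and value estimates, not tied to any of the chapter's environments) that computes the k-step backups and combines them into the truncated λ-return defined above:

import numpy as np

def k_step_backup(rewards, values, t, k, gamma=0.95):
    # R_t^(k): k discounted rewards plus the bootstrapped value of the state
    # reached after k steps (no bootstrap if the episode has already ended)
    T = len(rewards)
    k = min(k, T - t)
    backup = sum((gamma ** i) * rewards[t + i] for i in range(k))
    if t + k < T:
        backup += (gamma ** k) * values[t + k]
    return backup

def lambda_return(rewards, values, t, lambd=0.6, gamma=0.95):
    # Truncated lambda-return: (1 - lambda)-weighted k-step backups plus the
    # complete backup weighted by the residual lambda^(T(e_i) - t - 1)
    T = len(rewards)
    result = 0.0
    for k in range(1, T - t):
        result += (1.0 - lambd) * (lambd ** (k - 1)) * \
                  k_step_backup(rewards, values, t, k, gamma)
    result += (lambd ** (T - t - 1)) * \
              k_step_backup(rewards, values, t, T - t, gamma)
    return result

# rewards[i] is the reward observed after step i; values[i] is the current
# estimate of V for the state visited at step i (both are made up)
rewards = np.array([-0.1, -0.1, 1.0, -0.1, 5.0])
values = np.array([0.2, 0.3, 0.8, 0.5, 0.1])
print(lambda_return(rewards, values, t=0))              # lambda = 0.6
print(lambda_return(rewards, values, t=0, lambd=0.0))   # equal to the one-step backup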

As explained previously, this method is normally not a first-choice solution, because when the episodes are very long, the agent selects the actions with a value function that is not up to date in the majority of cases. Therefore, TD(λ) is normally employed with λ values less than 1 in order to obtain the advantage of an online update, together with a correction based on the new states. To achieve this goal without looking at the future (we want to update V(s) as soon as new pieces of information are available), we need to introduce the concept of eligibility trace e(s) (sometimes, in the context of computational neuroscience, e(s) is also called stimulus trace).

An eligibility trace for a state s is a function of time that returns the weight (greater than 0) of the specific state. Let's imagine a sequence, s1, s2, …, sn, and consider a state, si. After a backup V(si) is updated, the agent continues its exploration. When is a new update of si (given longer backups) important? If si is not visited anymore, the effect of longer backups must be smaller and smaller, and si is said to not be eligible for changes in V(s). This is a consequence of the previous assumption that shorter backups must generally have higher importance. So, if si is an initial state (or is immediately after the initial state) and the agent moves to other states, the effect of si must decay. Conversely, if si is revisited, it means that the previous estimation of V(si) is probably wrong, and hence si is eligible for a change.

To better understand this concept, imagine a sequence s1, s2, s1, …. It's clear that when the agent is in s1, as well as in s2, it cannot select the right action; therefore, it's necessary to reevaluate V(s) until the agent is able to move forward.

The most common strategy (which is also discussed in Sutton R. S., Barto A. G., Reinforcement Learning, The MIT Press, 1998) is to define the eligibility traces in a recursive fashion. After each time step, e_t(s) decays by a factor equal to γλ (to meet the requirement imposed by the forward view); but when the state s is revisited, e_t(s) is also increased by 1 (e_t(s) = γλe_{t−1}(s) + 1). In this way, we impose a jump in the trend of e(s) whenever we desire to emphasize its impact. However, as e(s) decays independently of the jumps, the states that are visited and revisited later have a lower impact than the ones that are revisited very soon.

The reason for this choice is very intuitive: the importance of a state revisited after a long sequence is clearly lower than the importance of a state that is revisited after a few steps. In fact, the estimation of Rt is obviously wrong if the agent moves back and forth between two states at the beginning of the episode, but the error becomes less significant when the agent revisits a state after having explored other areas. For example, a policy can allow an initial phase in order to reach a partial goal, and then it can force the agent to move back to reach a terminal state.
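The following standalone sketch (with an arbitrary visit sequence over five hypothetical states) makes the decay-and-jump behavior explicit; the same two operations (a γλ decay and a unit increment on the visited state) appear later inside the episode() function:

import numpy as np

gamma, lambd = 0.95, 0.6
e = np.zeros(5)              # eligibility traces for 5 hypothetical states
visits = [0, 1, 0, 2, 3, 4]  # the agent revisits state 0 after one step

for s in visits:
    e *= gamma * lambd       # every trace decays at each time step
    e[s] += 1.0              # the visited state gets a unit jump
    print(np.round(e, 3))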

By exploiting the eligibility traces, TD(λ) can achieve very fast convergence in more complex environments, with a trade-off between a one-step TD method and a Monte Carlo one (which is normally avoided). At this point, the reader might wonder if we are sure about the convergence, and luckily, the answer is positive. Dayan proved (in Dayan P., The Convergence of TD(λ) for General λ, Machine Learning 8, 3–4/1992) that TD(λ) converges for a generic λ with only a few specific assumptions and the fundamental condition that the policy is Greedy in the Limit with Infinite Exploration (GLIE). The proof is very technical, and it's beyond the scope of this book; however, the most important assumptions (which are generally met) are:

  • The Markov Decision Process (MDP) has absorbing states (in other words, all of the episodes end in a finite number of steps).
  • All of the transition probabilities are not null (all states can be visited an infinite number of times).

The first condition is obvious. The absence of absorbing states yields infinite explorations, which are not compatible with a TD method (sometimes it's possible to prematurely end an episode, but this can be either unacceptable in some contexts or a sub-optimal choice in many others). Moreover, Sutton and Barto (in the aforementioned book) proved that TD(λ) is equivalent to employing the weighted average of discounted return approximations, but without the constraint of looking ahead in the future (which is clearly impossible).

The complete TD(λ) algorithm (with an optional forced termination of the episode) is:

  1. Set an initial deterministic random policy π(s).
  2. Set the initial value array V(s) = 0.
  3. Set the initial eligibility trace array e(s) = 0.
  4. Set the number of episodes Nepisodes.
  5. Set a maximum number of steps per episode Nmax.
  6. Set a constant α (for example, α = 0.25).
  7. Set a constant γ (for example, γ = 0.95).
  8. Set a constant λ (for example, λ = 0.6).
  9. Set a counter e = 0.
  10. For i = 1 to Nepisodes:
    1. Create an empty state list L.
    2. Observe the initial state s_i and append s_i to L.
    3. While s_i is non-terminal and e < Nmax:
      1. e = e + 1.
      2. Select the action a_t = π(s_i).
      3. Observe the transition (s_i, a_t) → (s_j, r_ij).
      4. Compute the TD error as TD_error = r_ij + γV(s_j) − V(s_i).
      5. Increment the eligibility trace e(s_i) = e(s_i) + 1.
      6. For each state s in L:
        1. Update the value V(s) = V(s) + α · TD_error · e(s).
        2. Update the eligibility trace e(s) = γλ · e(s).
      7. Set s_i = s_j and append s_j to L.

The reader can better understand the logic of this algorithm by considering the TD error and its backpropagation. Even if this is only a comparison, it's possible to imagine the behavior of TD(λ) as similar to the stochastic gradient descent (SGD) algorithms employed to train a neural network.

In fact, the error is propagated to the previous states (analogous to the lower layers of an MLP) and affects them proportionally to their importance, which is defined by their eligibility traces. Hence, a state with a higher eligibility trace can be considered more responsible for the error; therefore, the corresponding value must be corrected proportionally. This isn't a formal explanation, but it can simplify comprehension of the dynamics without an excessive loss of rigor.

TD(λ) in a more complex checkerboard environment

At this point, we want to test the TD(λ) algorithm with a slightly more complex checkerboard environment – a tunnel environment, where the initial state is on one side, and the final state is on the other.

To add a little complexity, along with the absorbing states we'll also consider some intermediate positive states, which can be imagined as checkpoints. An agent should learn the optimal path from any cell to the final state, trying to pass through the highest number of checkpoints possible. Let's start by defining the new structure:

import numpy as np
width = 15
height = 5
y_final = width - 1
x_final = height - 1
y_wells = [0, 1, 3, 5, 5, 6, 7, 9, 10, 11, 12, 14]
x_wells = [3, 1, 2, 0, 4, 3, 1, 3, 1, 2, 4, 1]
y_prizes = [0, 3, 4, 6, 7, 8, 9, 12]
x_prizes = [2, 4, 3, 2, 1, 4, 0, 2]
standard_reward = -0.1
tunnel_rewards = np.ones(shape=(height, width)) * \
                 standard_reward
def init_tunnel_rewards():
    for x_well, y_well in zip(x_wells, y_wells):
        tunnel_rewards[x_well, y_well] = -5.0
    for x_prize, y_prize in zip(x_prizes, y_prizes):
        tunnel_rewards[x_prize, y_prize] = 1.0
    tunnel_rewards[x_final, y_final] = 5.0
init_tunnel_rewards()

The reward structure is shown in the following diagram:

Reward schema in the new tunnel environment

At this point, we can proceed to initialize all of the constants (in particular, we have chosen λ = 0.6, which is an intermediate solution that guarantees an accuracy close to a Monte Carlo method, without compromising the learning speed):

nb_actions = 4
max_steps = 1000
alpha = 0.25
lambd = 0.6
gamma = 0.95
tunnel_values = np.zeros(shape=(height, width))
eligibility_traces = np.zeros(shape=(height, width))
policy = np.random.randint(0, nb_actions,
                           size=(height, width)).astype(np.uint8)

In Python, the keyword lambda is reserved; we used the truncated expression lambd to declare the constant. As we want to start from a random cell, we need to repeat the same procedure presented in the previous chapter; but in this case, we are also including the checkpoint states:

xy_grid = np.meshgrid(np.arange(0, height), 
                      np.arange(0, width), 
                      sparse=False)
xy_grid = np.array(xy_grid).T.reshape(-1, 2)
xy_final = list(zip(x_wells, y_wells)) + \
           list(zip(x_prizes, y_prizes))
xy_final.append((x_final, y_final))
xy_start = []
for x, y in xy_grid:
    if (x, y) not in xy_final:
        xy_start.append([x, y])
xy_start = np.array(xy_start)
def starting_point():
    xy = np.squeeze(xy_start[
                        np.random.randint(
                            0, xy_start.shape[0], 
                            size=1)])
    return xy[0], xy[1]

We can now define the episode() function, which implements a complete TD() cycle. As we don't want the agent to roam around trying to pass through the checkpoints an infinite number of times, we have decided to reduce the reward during the exploration, to incentivize the agent to pass through only the necessary checkpoints—trying, at the same time, to reach the final state as soon as possible:

def is_final(x, y):
    if (x, y) in zip(x_wells, y_wells) or \
            (x, y) == (x_final, y_final):
        return True
    return False
def episode():
    (i, j) = starting_point()
    x = y = 0
    e = 0
    state_history = [(i, j)]
    init_tunnel_rewards()
    total_reward = 0.0
    while e < max_steps:
        e += 1
        action = policy[i, j]
        if action == 0:
            if i == 0:
                x = 0
            else:
                x = i - 1
            y = j
        elif action == 1:
            if j == width - 1:
                y = width - 1
            else:
                y = j + 1
            x = i
        elif action == 2:
            if i == height - 1:
                x = height - 1
            else:
                x = i + 1
            y = j
        else:
            if j == 0:
                y = 0
            else:
                y = j - 1
            x = i
        reward = tunnel_rewards[x, y]
        total_reward += reward
        td_error = reward + \
                   (gamma * tunnel_values[x, y]) - \
                   tunnel_values[i, j]
        eligibility_traces[i, j] += 1.0
        for sx, sy in state_history:
            tunnel_values[sx, sy] += \
                (alpha * td_error *
                 eligibility_traces[sx, sy])
            eligibility_traces[sx, sy] *= \
                (gamma * lambd)
        if is_final(x, y):
            break
        else:
            i = x
            j = y
            state_history.append([x, y])
            tunnel_rewards[x_prizes, y_prizes] *= 0.85
    return total_reward

It's also necessary to create a function to perform the policy selection:

def policy_selection():
    for i in range(height):
        for j in range(width):
            if is_final(i, j):
                continue
            values = np.zeros(shape=(nb_actions,))
            values[0] = (tunnel_rewards[i - 1, j] +
                         (gamma *
                          tunnel_values[i - 1, j])) \
                if i > 0 else -np.inf
            values[1] = (tunnel_rewards[i, j + 1] +
                         (gamma *
                          tunnel_values[i, j + 1])) \
                if j < width - 1 else -np.inf
            values[2] = (tunnel_rewards[i + 1, j] +
                         (gamma *
                          tunnel_values[i + 1, j])) \
                if i < height - 1 else -np.inf
            values[3] = (tunnel_rewards[i, j - 1] +
                         (gamma *
                          tunnel_values[i, j - 1])) \
                if j > 0 else -np.inf
            policy[i, j] = np.argmax(values).astype(np.uint8)

The is_final() and policy_selection() functions are the same ones defined in the previous chapter, and need no explanation. Even if it's not really necessary, we have decided to implement a forced termination after a number of steps equal to max_steps. This is helpful at the beginning because, as the policy is not ε-greedy, the agent can remain stuck in a looping exploration that never ends. We can now train the model for a fixed number of episodes (alternatively, it's possible to stop the process when the value array doesn't change anymore):

n_episodes = 5000
total_rewards = []
for _ in range(n_episodes): 
    e_reward = episode()
    total_rewards.append(e_reward)
    policy_selection()

The episode() function returns the total rewards; therefore, it's useful to check how the agent learning process evolved:

Total rewards achieved by the agent

At the beginning (for about 500 episodes), the agent employs an unacceptable policy that yields very negative total rewards. However, in under 1,000 iterations, the algorithm reaches an optimal policy that is only slightly improved by the following episodes. The oscillations are due to the different starting points; however, the total rewards are never negative, and as the checkpoint weights decay, this is a positive signal, indicating that the agent reaches the final positive state. To confirm this hypothesis, we can plot the learned value function:

Final value matrix

The values are coherent with our initial analysis; in fact, they tend to be higher when the cell is close to a checkpoint, but at the same time, the global configuration (considering a policy that's greedy with respect to V(s)) forces the agent to reach the ending state whose surrounding values are the highest. The last step is checking the actual policy, with a particular focus on the checkpoints:

Final policy

As it's possible to observe, the agent tries to pass through the checkpoints, but when it's close to the final state, it (correctly) prefers to end the episode as soon as possible. I invite the reader to repeat the experiment using different values for the constant λ, and changing the environment dynamics for the checkpoints.

What happens if their values remain the same? Is it possible to improve the policy with a larger λ?

It's important to remember that, as we are extensively using random values, successive experiments can yield different results due to different initial conditions. However, the algorithm should always converge to an optimal policy when the number of episodes is high enough.

Actor-Critic TD(0) in the checkerboard environment

In this example, we want to employ an alternative algorithm called Actor-Critic, together with TD(0). In this method, the agent is split into two components, a critic, which is responsible for evaluating the quality of the value estimation, and an actor, which selects and performs an action. As pointed out by Dayan (in Dayan P., Abbott L. F., Theoretical Neuroscience, The MIT Press, 2005), the dynamics of the Actor-Critic approach are similar to the interleaving of policy evaluation and policy improvement steps. In fact, the knowledge of the critic is obtained through an iterative process, and its initial evaluations are normally sub-optimal. The structural schema is shown in the following diagram:

Actor-Critic schema

In this particular case, it's preferable to employ an ε-greedy soft policy based on the softmax function. The model stores a matrix (or an approximating function) called policy importance, where each entry p(s, a) is a value representing the preference for a specific action in a certain state. The actual stochastic policy is obtained by applying the softmax with a simple trick to increase the numerical stability when the exponentials become very large:

π(s, a) = exp(p(s, a) − max_a p(s, a)) / Σ_a exp(p(s, a) − max_a p(s, a))

After performing the action a in the state s_i and observing the transition to the state s_j with a reward r_ij, the critic evaluates the TD error:

TD_error = r_ij + γV(s_j) − V(s_i)

If TD_error > 0, the transition is considered positive, because the value is increasing. Conversely, when TD_error < 0, the critic evaluates the action as negative, because the previous value was higher than the new estimation. A more general approach is based on the concept of advantage, which is defined as:

A(s, a) = Q(s, a) − V(s)

Normally, one of the terms from the previous expression can be approximated because we often don't have enough knowledge to compute, for example, the effect of each action in each state. In our case, we cannot compute the Q function directly; hence, we approximate it with the term r_ij + γV(s_j). It's clear that the role of the advantage is analogous to that of the TD error (which is its approximation) and must represent the confirmation that an action in a certain state is a good or a bad choice. An analysis of Advantage Actor-Critic (A2C) and Asynchronous Advantage Actor-Critic (A3C) algorithms (in other words, improvements of the standard policy gradient algorithm) is beyond the scope of this book. However, the reader can find some helpful pieces of information in Schulman J., Moritz P., Levine S., Jordan M. I., Abbeel P., High-Dimensional Continuous Control Using Generalized Advantage Estimation, ICLR 2016.

Of course, an Actor-Critic correction is not enough. To improve the policy, it's necessary to employ a standard algorithm (such as TD(0), TD(λ), or least squares regression, which can be implemented using a neural network) in order to learn the correct value function, V(s). As for many other algorithms, this process can converge only after a sufficiently high number of iterations, which must be exploited to visit the states many times, experimenting with all possible actions. Hence, with a TD(0) approach, the first step after evaluating the TD error is updating V(s) using the rule defined in the previous chapter:

V(s_i) = V(s_i) + α · TD_error

The second step is more pragmatic; in fact, the main role of the critic is actually to criticize every action, deciding when it's better to increase or decrease the probability of selecting it again in a certain state. This goal can be achieved by simply updating the policy importance:

p(s_i, a) = p(s_i, a) + ρ · TD_error

The role of the learning rate ρ is extremely important; in fact, incorrect values (in other words, values that are too high) can yield initial wrong corrections that may compromise the convergence. It's essential not to forget that the value function is almost completely unknown at the beginning, and therefore the critic has no chance to increase the right probability with awareness.

For this reason, I always suggest starting with a very small value (such as ρ = 0.001) and increasing it only if the convergence speed of the algorithm is effectively improved. As the policy is based on the softmax function, after a critic update, the values will always be renormalized, resulting in an actual probability distribution. After an adequately large number of iterations, with the right choice of both α and ρ, the model is able to learn both a stochastic policy and a value function. Therefore, it's possible to employ the trained agent by always selecting the action with the highest probability (which corresponds to an implicitly greedy behavior):

a = argmax_a π(s, a)

Let's now apply this algorithm to the tunnel environment. The first step is defining the constants (as we are looking for a long-sighted agent, we are setting the discount factor γ = 0.99):

import numpy as np
tunnel_values = np.zeros(shape=(height, width))
gamma = 0.99
alpha = 0.25
rho = 0.001

At this point, we need to define the policy importance array, and a function to generate the softmax policy:

nb_actions = 4
policy_importances = np.zeros(
    shape=(height, width, nb_actions))
def get_softmax_policy():
    softmax_policy = policy_importances - \
                     np.amax(policy_importances,
                             axis=2, keepdims=True)
    return np.exp(softmax_policy) / \
           np.sum(np.exp(softmax_policy),
                  axis=2, keepdims=True)

The functions needed to implement a single training step are very straightforward, and the reader should already be familiar with their structure:

def select_action(epsilon, i, j):
    if np.random.uniform(0.0, 1.0) < epsilon:
        return np.random.randint(0, nb_actions)
    
    policy = get_softmax_policy()
    return np.argmax(policy[i, j])
def action_critic_episode(epsilon):
    (i, j) = starting_point()
    x = y = 0
    e = 0
    while e < max_steps:
        e += 1
        action = select_action(epsilon, i, j)
        if action == 0:
            if i == 0:
                x = 0
            else:
                x = i - 1
            y = j
        elif action == 1:
            if j == width - 1:
                y = width - 1
            else:
                y = j + 1
            x = i
        elif action == 2:
            if i == height - 1:
                x = height - 1
            else:
                x = i + 1
            y = j
        else:
            if j == 0:
                y = 0
            else:
                y = j - 1
            x = i
        reward = tunnel_rewards[x, y]
        td_error = reward + \
                   (gamma * tunnel_values[x, y]) - \
                   tunnel_values[i, j]
        tunnel_values[i, j] += (alpha * td_error)
        policy_importances[i, j, action] += \
            (rho * td_error)
        if is_final(x, y):
            break
        else:
            i = x
            j = y

At this point, we can train the model with 50,000 iterations, and 30,000 explorative ones (with a linear decay of the exploration factor):

n_episodes = 50000
n_exploration = 30000
for t in range(n_episodes):
    epsilon = 0.0
    
    if t <= n_exploration:
        epsilon = 1.0 - (float(t) / 
                         float(n_exploration))
        
    action_critic_episode(epsilon)

The resulting greedy policy is shown in the following figure:

Final greedy policy

The final greedy policy is consistent with the objective, and the agent always reaches the final positive state by avoiding the wells. This kind of algorithm can appear more complex than necessary; however, in complex situations, it turns out to be extremely effective. In fact, the learning process can be dramatically improved thanks to the fast corrections performed by the critic. Moreover, the author has noticed that the Actor-Critic is more robust to wrong (or noisy) evaluations with respect to algorithms like SARSA or Q-Learning, which can suffer from instabilities that are, instead, reduced by the estimation of the advantage term.

As the policy is learned separately, the effect of small variations in V(s) cannot easily change the probabilities (in particular, when an action is generally much stronger than the others). On the other hand, as discussed previously, it's necessary to avoid premature convergence in order to let the algorithm modify the importance/probabilities, without an excessive number of iterations. The right trade-off can be found only after a complete analysis of each specific scenario, and unfortunately, there are no general rules that work in every case. My suggestion is to test various configurations, starting with small values (and, for example, a discount factor γ = 0.99), evaluating the total reward achieved after the same exploration period.

Complex deep learning models (such as A3C, Asynchronous Advantage Actor-Critic; see Mnih V., Puigdomènech Badia A., Mirza M., Graves A., Lillicrap T. P., Harley T., Silver D., Kavukcuoglu K., Asynchronous Methods for Deep Reinforcement Learning, arXiv:1602.01783 [cs.LG] for further information) are based on a single network that outputs both the softmax policy (from which actions are generally sampled according to their probabilities) and the value.

Instead of employing an explicitly ε-greedy soft policy, it's possible to add a maximum-entropy constraint to the global cost function:

H(π(s)) = −Σ_a π(s, a) log π(s, a)

As the entropy is at its maximum when all of the actions have the same probability, this constraint (with an appropriate weight) forces the algorithm to keep a high exploration probability until an action becomes dominant and there's no more need to avoid a greedy selection. This is a sound and easy way to employ an adaptive ε-greedy policy, because, as the model works with each state separately, the states where the uncertainty is very low can become greedy, while a high entropy is automatically kept whenever it's necessary to continue the exploration in order to maximize the reward.
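As a rough illustration (the weight beta and the NumPy formulation are assumptions of this sketch, not the exact term used in any specific A3C implementation), the entropy bonus can be computed as follows:

import numpy as np

def entropy_bonus(softmax_policy, beta=0.01):
    # softmax_policy: array of shape (n_states, n_actions) with rows summing to 1.
    # H(pi(s)) = -sum_a pi(s, a) * log pi(s, a); adding beta*H to the objective
    # (or subtracting it from the loss) discourages premature greedy policies.
    h = -np.sum(softmax_policy * np.log(softmax_policy + 1e-12), axis=1)
    return beta * np.mean(h)

# Uniform rows have maximum entropy, peaked rows almost none
uniform = np.ones((3, 4)) / 4.0
peaked = np.array([[0.97, 0.01, 0.01, 0.01]] * 3)
print(entropy_bonus(uniform), entropy_bonus(peaked))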

The effect of double correction, together with a maximum-entropy constraint, improves the convergence speed of the model, encourages the exploration during the initial iterations, and yields very high final accuracy. I invite the reader to implement this variant with other scenarios and algorithms. In particular, at the end of this chapter, we are going to experiment with an algorithm based on a neural network. As the example is pretty simple, I suggest using TensorFlow to create a small network based on the Actor-Critic approach (a good starting point is the article at https://blog.tensorflow.org/2018/07/deep-reinforcement-learning-keras-eager-execution.html). The reader can employ a Mean Squared Error (MSE) loss for the value and softmax cross-entropy for the policy. Once the models work successfully with our toy examples, it will be possible to start working with more complex scenarios (like the ones proposed in OpenAI Gym at https://gym.openai.com).
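As a possible starting point for that exercise, the following is a minimal, illustrative sketch of a two-headed Keras model (the layer sizes, the input dimension, and the loss weights are arbitrary assumptions, and a real Actor-Critic loop would also weight the policy targets by the advantage estimates, a detail omitted here):

import tensorflow as tf

n_inputs = 4      # hypothetical state dimension
n_actions = 2     # hypothetical number of actions

state_in = tf.keras.layers.Input(shape=(n_inputs,))
hidden = tf.keras.layers.Dense(32, activation='relu')(state_in)
# Actor head: softmax policy over the actions
policy_out = tf.keras.layers.Dense(n_actions, activation='softmax',
                                   name='policy')(hidden)
# Critic head: scalar value V(s)
value_out = tf.keras.layers.Dense(1, activation='linear',
                                  name='value')(hidden)

model = tf.keras.Model(inputs=state_in, outputs=[policy_out, value_out])
model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
              loss={'policy': 'sparse_categorical_crossentropy',
                    'value': 'mse'},
              loss_weights={'policy': 1.0, 'value': 0.5})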

SARSA algorithm

SARSA (the name is derived from the sequence state-action-reward-state-action) is a natural extension of TD(0) to the estimation of the Q function. Its standard formulation (which is sometimes called one-step SARSA, or SARSA(0), for the same reasons explained in the previous chapter) is based on a single next reward r_{t+1}, which is obtained by executing the action a_t in the state s_t. The temporal difference computation is based on the following update rule:

Q(s_t, a_t) = Q(s_t, a_t) + α(r_{t+1} + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t))

The equation is equivalent to TD(0), and if the policy is chosen to be GLIE, it has been proven (in Singh S., Jaakkola T., Littman M. L., Szepesvári C., Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms, Machine Learning, 39/2000) that SARSA converges to an optimal policy with a probability of 1 when all (state, action) couples are experienced an infinite number of times. This means that if the policy is updated to be greedy with respect to the current value function induced by Q, it holds that:

V(s) → V*(s) with probability 1, for every state s

The same result is valid for the Q function. In particular, the most important conditions required by the proof are:

  • The learning rate α(t) must satisfy the constraints Σ_t α(t) = ∞ and Σ_t α²(t) < ∞.
  • The variance of the rewards must be finite.

The first condition is particularly important when α is a function of the state and the time step; however, in many cases, it is a constant bounded between 0 and 1, and hence the second condition cannot be satisfied (Σ_t α²(t) diverges). A common way to solve this problem (above all, when a large number of iterations are required) is to let the learning rate decay (for example, exponentially) during the training process. Instead, to mitigate the effect of very large rewards, it's possible to clip them in a suitable range (for example, (-1, 1)).
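Both tricks require only a couple of lines; the decay rate and the clipping range below are purely illustrative choices:

import numpy as np

alpha_0 = 0.25
decay = 0.999

def learning_rate(t):
    # Exponentially decaying learning rate alpha(t)
    return alpha_0 * (decay ** t)

def clip_reward(r):
    # Clip the raw reward into (-1, 1) to bound its variance
    return float(np.clip(r, -1.0, 1.0))

print(learning_rate(0), learning_rate(5000))
print(clip_reward(-5.0), clip_reward(0.3))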

In many cases, it's not necessary to employ these strategies, but in more complex scenarios, they can become crucial in order to ensure the convergence of the algorithm. Moreover, as pointed out in the previous chapter, these kinds of algorithms need a long exploration phase before starting to stabilize the policy. The most common strategy is to employ an ε-greedy policy, with a temporal decay of the exploration factor. During the first iterations, the agent must explore without caring about the returns of the actions. In this way, it's possible to assess the actual values before the beginning of a final refining phase characterized by a purely greedy exploration, based on a more precise approximation of V(s).

The complete SARSA(0) algorithm (with an optional forced termination of the episode) is:

  1. Set an initial deterministic random policy π(s).
  2. Set the initial Q(s, a) array (for example, all values equal to 0).
  3. Set the number of episodes Nepisodes.
  4. Set a maximum number of steps per episode Nmax.
  5. Set a constant α (for example, α = 0.25).
  6. Set a constant γ (for example, γ = 0.95).
  7. Set an initial exploration factor ε (for example, ε = 1.0).
  8. Define a policy to let the exploration factor decay (linear or exponential).
  9. Set a counter e = 0.
  10. For i = 1 to Nepisodes:
    1. Observe the initial state s_i.
    2. While s_i is non-terminal and e < Nmax:
      1. e = e + 1.
      2. Select the action a_t = π(s_i) with an exploration factor ε.
      3. Observe the transition (s_i, a_t) → (s_j, r_ij).
      4. Select the action a_{t+1} = π(s_j) with an exploration factor ε.
      5. Update the Q(s_t, a_t) function (if s_j is terminal, set Q(s_{t+1}, a_{t+1}) = 0).
      6. Set s_i = s_j.

The concept of eligibility trace can also be extended to SARSA (and other TD methods); however, that is beyond the scope of this book. A reader who is interested can find all of the algorithms (together with their mathematical formulations) in Sutton R. S., Barto A. G., Reinforcement Learning, The MIT Press, 1998.

SARSA in the checkerboard environment

We can now test the SARSA algorithm in the original tunnel environment (all of the elements that are not redefined are the same as the previous chapter). The first step is defining the Q(s, a) array and the constants employed in the training process:

import numpy as np
nb_actions = 4
Q = np.zeros(shape=(height, width, nb_actions))
x_start = 0
y_start = 0
max_steps = 2000
alpha = 0.25

As we want to employ an ε-greedy policy, we can set the starting point to (0, 0), forcing the agent to reach the positive final state. We can now define the functions needed to perform a training step:

def is_final(x, y):
    if (x, y) in zip(x_wells, y_wells) or \
            (x, y) == (x_final, y_final):
        return True
    return False
def select_action(epsilon, i, j):
    if np.random.uniform(0.0, 1.0) < epsilon:
        return np.random.randint(0, nb_actions)
    return np.argmax(Q[i, j])
def sarsa_step(epsilon):
    e = 0
    i = x_start
    j = y_start
    while e < max_steps:
        e += 1
        action = select_action(epsilon, i, j)
        if action == 0:
            if i == 0:
                x = 0
            else:
                x = i - 1
            y = j
        elif action == 1:
            if j == width - 1:
                y = width - 1
            else:
                y = j + 1
            x = i
        elif action == 2:
            if i == height - 1:
                x = height - 1
            else:
                x = i + 1
            y = j
        else:
            if j == 0:
                y = 0
            else:
                y = j - 1
            x = i
        action_n = select_action(epsilon, x, y)
        reward = tunnel_rewards[x, y]
        if is_final(x, y):
            Q[i, j, action] += alpha * \
                               (reward -
                                Q[i, j, action])
            break
        else:
            Q[i, j, action] += alpha * \
                               (reward +
                                (gamma *
                                 Q[x, y, action_n]) -
                                Q[i, j, action])
            i = x
            j = y

The select_action() function has been designed to select a random action with the probability ε, and a greedy one with respect to Q(s, a) with the probability 1 − ε. The sarsa_step() function is straightforward and executes a complete episode, updating Q(s, a) after each step (that's why this is an online algorithm). At this point, it's possible to train the model for 20,000 episodes (I invite the reader to also test smaller values to learn to evaluate the convergence speed) and employ a linear decay for ε during the first 15,000 episodes (when t > 15,000, ε is set equal to 0 in order to employ a purely greedy policy):

n_episodes = 20000
n_exploration = 15000
for t in range(n_episodes):
    epsilon = 0.0
    
    if t <= n_exploration:
        epsilon = 1.0 - (float(t) / 
                         float(n_exploration))
        
    sarsa_step(epsilon)

As usual, let's check the learned values (considering that the policy is greedy, we're going to plot V(s) = max_a Q(s, a)):

Final value matrix (as V(s) = max_a Q(s, a))

As expected, the Q function has been learned in a consistent way, and we can get confirmation by plotting the resulting policy:

Final policy

The policy is coherent with the initial objective, and the agent avoids all negative absorbing states, always trying to move toward the final positive state. However, some paths seem longer than expected. As an exercise, I invite the reader to retrain the model for a larger number of iterations, adjusting the exploration period. Moreover, is it possible to improve the model by increasing (or decreasing) the discount factor γ? Remember that γ → 0 leads to a short-sighted agent, which is able to select actions only considering the immediate reward, while γ → 1 forces the agent to take into account a larger number of future rewards.

This particular example is based on a long environment because the agent always starts from (0, 0) and must reach the farthest point; therefore, all intermediate states have less importance, and it's helpful to look at the future to pick the optimal actions. Using random starts can surely improve the policy for all initial states, but it's interesting to investigate how different γ values can affect the decisions; hence, I suggest repeating the experiment in order to evaluate the various configurations and increase awareness about the different factors that are involved in a TD algorithm.

Q-learning

This algorithm was proposed by Watkins (in Watkins C.I.C.H., Learning from Delayed Rewards, Ph.D. Thesis, University of Cambridge, 1989, and further analyzed in Watkins C.I.C.H., Dayan P., Technical Note: Q-Learning, Machine Learning 8, 1992) as a more efficient alternative to SARSA. The main feature of Q-learning is that the TD update rule is immediately greedy with respect to the Q(s_{t+1}, a) function (assuming that the agent received the reward r_t after performing the action a_t while in the state s_t):

Q(s_t, a_t) = Q(s_t, a_t) + α(r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t))

The key idea is to compare the current Q(s_t, a_t) value with the maximum Q value achievable when the agent is in the successor state. The previous equation can indeed be rewritten in terms of a TD error:

TD_error = r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)

The first term is the current reward, the second is the discounted maximum reward that the agent can theoretically achieve using its current knowledge and the last one is the estimation of the Q function. As the policy must be GLIE, the convergence speed can be increased by avoiding wrong estimations due to the selection of a Q value that won't be associated with the final action.

Instead, by choosing the maximum Q value (with the current knowledge), the algorithm will move toward the optimal solution faster than SARSA, and moreover, the convergence proof is less restrictive.

In fact, Watkins and Dayan (in the aforementioned papers) proved that if the rewards are bounded and the learning rate α (in the case of Q-learning, α must always be smaller than 1) satisfies the same constraints imposed for SARSA (Σ_t α(t) = ∞ and Σ_t α²(t) < ∞), then the estimated Q function converges with probability 1 to the optimal one:

Q(s, a) → Q*(s, a) with probability 1, for every (state, action) couple

As discussed for SARSA, the conditions on the rewards and the learning rate can be managed by employing a clipping function and a temporal decay, respectively. In almost all deep Q-learning applications, these are extremely important factors to guarantee the convergence; therefore, I invite the reader to consider them whenever the training process isn't able to converge to an acceptable solution.

The complete Q-learning algorithm (with an optional forced termination of the episode) is:

  1. Set an initial deterministic random policy π(s).
  2. Set the initial Q(s, a) array (for example, all values equal to 0).
  3. Set the number of episodes Nepisodes.
  4. Set a maximum number of steps per episode Nmax.
  5. Set a constant α (for example, α = 0.25).
  6. Set a constant γ (for example, γ = 0.95).
  7. Set an initial exploration factor ε (for example, ε = 1.0).
  8. Define a policy to let the exploration factor decay (linear or exponential).
  9. Set a counter e = 0.
  10. For i = 1 to Nepisodes:
    1. Observe the initial state s_i.
    2. While s_i is non-terminal and e < Nmax:
      1. e = e + 1.
      2. Select the action a_t = π(s_i) with an exploration factor ε.
      3. Observe the transition (s_i, a_t) → (s_j, r_ij).
      4. Select the next action a_{t+1} = π(s_j) with an exploration factor ε.
      5. Update the Q(s_t, a_t) function using max_a Q(s_{t+1}, a) (if s_j is terminal, set Q(s_{t+1}, a_{t+1}) = 0).
      6. Set s_i = s_j.

Q-learning in the checkerboard environment

Let's repeat the previous experiment with the Q-learning algorithm. As all of the constants are the same (as well as the choice of an ε-greedy policy and the starting point set to (0, 0)), we can directly define the function that implements the training for a single episode:

import numpy as np
def q_step(epsilon):
    e = 0
    i = x_start
    j = y_start
    while e < max_steps:
        e += 1
        action = select_action(epsilon, i, j)
        if action == 0:
            if i == 0:
                x = 0
            else:
                x = i - 1
            y = j
        elif action == 1:
            if j == width - 1:
                y = width - 1
            else:
                y = j + 1
            x = i
        elif action == 2:
            if i == height - 1:
                x = height - 1
            else:
                x = i + 1
            y = j
        else:
            if j == 0:
                y = 0
            else:
                y = j - 1
            x = i
        reward = tunnel_rewards[x, y]
        if is_final(x, y):
            Q[i, j, action] += alpha * \
                               (reward -
                                Q[i, j, action])
            break
        else:
            Q[i, j, action] += alpha * \
                               (reward +
                                (gamma *
                                 np.max(Q[x, y])) -
                                Q[i, j, action])
            i = x
            j = y

We can now train the model for 5,000 iterations, with 3,500 explorative ones:

n_episodes = 5000
n_exploration = 3500
for t in range(n_episodes):
    epsilon = 0.0
    
    if t <= n_exploration:
        epsilon = 1.0 - (float(t) / 
                         float(n_exploration))
        
    q_step(epsilon)

The resulting value matrix (defined as in the SARSA experiment) is:

Final value matrix

Again, the learned Q function (and, obviously, also the greedy V(s)) is coherent with the initial objective (in particular, considering the starting point set to (0, 0)), and the resulting policy can immediately confirm this result:

Final policy

The behavior of Q-learning is not very different from SARSA (even if the convergence is faster), and some initial states are not perfectly managed. This is a consequence of our choice; therefore, I invite the reader to repeat the exercise using random starts and comparing the training speed of Q-learning and SARSA.

Q-learning modeling the policy with a neural network

Now, we want to test the Q-learning algorithm using a smaller checkerboard environment and a neural network (with TensorFlow/Keras). The main difference from the previous examples is that now, the state is represented by a screenshot of the current configuration; hence, the model must learn how to associate a value with each input image and action. This isn't actual deep Q-learning (which is based on deep convolutional networks and requires more complex environments than we have space to discuss in this book), but it shows how such a model can learn an optimal policy with the same input provided to a human being. In order to reduce the training time, we are considering a square checkerboard environment, with four negative absorbing states and a positive final one:

import numpy as np
width = 5
height = 5
nb_actions = 4
y_final = width - 1
x_final = height - 1
y_wells = [0, 1, 3, 4]
x_wells = [3, 1, 2, 0]
standard_reward = -0.1
tunnel_rewards = np.ones(shape=(height, width)) * \
                 standard_reward
for x_well, y_well in zip(x_wells, y_wells):
    tunnel_rewards[x_well, y_well] = -5.0
tunnel_rewards[x_final, y_final] = 5.0

A graphical representation of the rewards is shown in the following figure:

Rewards in the smaller checkerboard environment

As we want to provide the network with a graphical input, we need to define a function to create a matrix representing the tunnel:

def reset_tunnel():
    tunnel = np.zeros(shape=(height, width), 
                      dtype=np.float32)
    for x_well, y_well in \
            zip(x_wells, y_wells):
        tunnel[x_well, y_well] = -1.0
    tunnel[x_final, y_final] = 0.5
    return tunnel

The reset_tunnel() function sets all values equal to 0, except for the wells (which are marked with -1) and the final state (defined by 0.5). The position of the agent (defined with the value 1) is directly managed by the training function. At this point, we can create and compile our neural network. As the problem is not very complex, we are employing an MLP with the following structure:

  • Input layer
  • Hidden layer with six neurons with hyperbolic tangent activation
  • Hidden layer with four neurons with hyperbolic tangent activation
  • Output layer with nb_actions neurons with linear activation (as they represent the Q function values):
    import tensorflow as tf
    model = tf.keras.models.Sequential([
            tf.keras.layers.Dense(6, 
                                  input_dim=width * height,
                                  activation='tanh'),
            tf.keras.layers.Dense(4,
                                  activation='tanh'),
            tf.keras.layers.Dense(nb_actions,
                                  activation='linear')
    ])
    optimizer = tf.keras.optimizers.Adam(0.01)
    model.compile(optimizer, loss='mse')
    

The input is a flattened array, while the output is the Q function (all of the values corresponding to each action). The network is trained using the Adam optimizer and an MSE loss function (our goal is to reduce the MSE between the actual value and the prediction). In order to train and query the network, it's helpful to create two dedicated functions, together with the ε-greedy action selection helper:

def train(state, q_value):
    model.train_on_batch(
        np.expand_dims(state.flatten(), axis=0),
        np.expand_dims(q_value, axis=0))
def get_Q_value(state):
    return model.predict(
        np.expand_dims(state.flatten(), axis=0))[0]
def select_action_neural_network(epsilon, state):
    Q_value = get_Q_value(state)
    if np.random.uniform(0.0, 1.0) < epsilon:
        return Q_value, \
               np.random.randint(0, nb_actions)
    return Q_value, np.argmax(Q_value)

The behavior of these functions is straightforward. The only element that may be new to the reader is the use of the train_on_batch() method. Contrary to fit(), this function allows us to perform a single training step, given a batch of input-output couples (in our case, we always have a single couple). As our goal is finding an optimal path to the final state, starting from every possible cell, we are going to employ random starts:

xy_grid = np.meshgrid(np.arange(0, height),
                      np.arange(0, width), sparse=False)
xy_grid = np.array(xy_grid).T.reshape(-1, 2)
xy_final = list(zip(x_wells, y_wells))
xy_final.append((x_final, y_final))
xy_start = []
for x, y in xy_grid:
    if (x, y) not in xy_final:
        xy_start.append([x, y])
xy_start = np.array(xy_start)
def starting_point():
    xy = np.squeeze(xy_start[
            np.random.randint(0, 
                              xy_start.shape[0], 
                              size=1)])
    return xy[0], xy[1]

Now, we can define the functions needed to perform a single training step:

def is_final(x, y):
    if (x, y) in zip(x_wells, y_wells) or \
            (x, y) == (x_final, y_final):
        return True
    return False
def q_step_neural_network(epsilon, initial_state):
    e = 0
    total_reward = 0.0
    (i, j) = starting_point()
    prev_value = 0.0
    tunnel = initial_state.copy()
    tunnel[i, j] = 1.0
    while e < max_steps:
        e += 1
        q_value, action = \
            select_action_neural_network(epsilon, tunnel)
        if action == 0:
            if i == 0:
                x = 0
            else:
                x = i - 1
            y = j
        elif action == 1:
            if j == width - 1:
                y = width - 1
            else:
                y = j + 1
            x = i
        elif action == 2:
            if i == height - 1:
                x = height - 1
            else:
                x = i + 1
            y = j
        else:
            if j == 0:
                y = 0
            else:
                y = j - 1
            x = i
        reward = tunnel_rewards[x, y]
        total_reward += reward
        tunnel_n = tunnel.copy()
        tunnel_n[i, j] = prev_value
        tunnel_n[x, y] = 1.0
        prev_value = tunnel[x, y]
        if is_final(x, y):
            q_value[action] = reward
            train(tunnel, q_value)
            break
        else:
            q_value[action] = reward + \
                              (gamma *
                               np.max(
                                   get_Q_value(tunnel_n)))
            train(tunnel, q_value)
            i = x
            j = y
            tunnel = tunnel_n.copy()
    return total_reward

The q_step_neural_network() function is very similar to the one defined in the previous example. The only difference is the management of the visual state. Every time there's a transition, the value 1.0 (denoting the agent) is moved from the old position to the new one, and the value of the previous cell is reset to its default (saved in the prev_value variable).

Another secondary difference is the absence of α, because there's already a learning rate set in the SGD algorithm, and it doesn't make sense to add another parameter to the model. We can now train the model for 2,000 iterations, with 1,200 explorative ones:

n_episodes = 2000
n_exploration = 1200
total_rewards = []
for t in range(n_episodes):
    tunnel = reset_tunnel()
    
    epsilon = 0.0
    
    if t <= n_exploration:
        epsilon = 1.0 - (float(t) / 
                   float(n_exploration))
        
    t_reward= q_step_neural_network(epsilon, tunnel)
    total_rewards.append(t_reward)

When the training process has finished, we can analyze the total rewards, in order to understand whether the network has successfully learned the Q function (the bold line is a smoothed version using a Savitzky-Golay filter):

Total rewards obtained by the neural network Q-learning algorithm
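The smoothed curve can be obtained, for example, with SciPy's Savitzky-Golay filter (the window length and polynomial order below are arbitrary choices, and total_rewards is the list filled by the previous training loop):

from scipy.signal import savgol_filter
import matplotlib.pyplot as plt

# Smooth the noisy per-episode rewards to highlight the learning trend
smoothed_rewards = savgol_filter(total_rewards,
                                 window_length=51,
                                 polyorder=3)

plt.plot(total_rewards, alpha=0.4, label='Raw total rewards')
plt.plot(smoothed_rewards, linewidth=2, label='Smoothed (Savitzky-Golay)')
plt.xlabel('Episode')
plt.ylabel('Total reward')
plt.legend()
plt.show()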

It's clear that the model is working well, because after the exploration period, the total reward becomes stationary after about 1,000 episodes, with only small oscillations due to the different path lengths (however, the final plot can be different because of the internal random state employed by TensorFlow/Keras). To see a confirmation, let's generate the trajectories for all of the possible initial states, using the greedy policy (equivalent to ε = 0):

trajectories = []
tunnels_c = []
for i, j in xy_start:
    tunnel = reset_tunnel()
    prev_value = 0.0
    trajectory = [[i, j, -1]]
    tunnel_c = tunnel.copy()
    tunnel[i, j] = 1.0
    tunnel_c[i, j] = 1.0
    final = False
    e = 0
    while not final and e < max_steps:
        e += 1
        q_value = get_Q_value(tunnel)
        action = np.argmax(q_value)
        if action == 0:
            if i == 0:
                x = 0
            else:
                x = i - 1
            y = j
        elif action == 1:
            if j == width - 1:
                y = width - 1
            else:
                y = j + 1
            x = i
        elif action == 2:
            if i == height - 1:
                x = height - 1
            else:
                x = i + 1
            y = j
        else:
            if j == 0:
                y = 0
            else:
                y = j - 1
            x = i
        trajectory[e - 1][2] = action
        trajectory.append([x, y, -1])
        tunnel[i, j] = prev_value
        prev_value = tunnel[x, y]
        tunnel[x, y] = 1.0
        tunnel_c[x, y] = 1.0
        i = x
        j = y
        final = is_final(x, y)

    trajectories.append(np.array(trajectory))
    tunnels_c.append(tunnel_c)

trajectories = np.array(trajectories, dtype=object)

Twelve random trajectories are shown in the following figure:

Twelve trajectories generated using the greedy policy

The agent always follows the optimal policy, independent from the initial state, and never ends up in a well. Even if this example is quite simple, it's helpful to introduce the reader to the concept of deep Q-learning (for further details, the reader can check the introductory paper, Li Y., Deep Reinforcement Learning: An Overview, arXiv:1701.07274 [cs.LG]). In a general case, the environment can be a more complex game (like Atari or Sega), and the number of possible actions is very limited. Moreover, there's no possibility to employ random starts, but it's generally a good practice to skip a number of initial frames, in order to avoid a bias to the estimator.

Clearly, the network must be more complex (involving convolutions to better learn the geometric dependencies), and the number of iterations must be extremely large.

Many other tricks and specific algorithms can be employed in order to speed up the convergence, but they are beyond the scope of this book. However, the general process and its logic are almost the same, and it's not difficult to understand why some strategies are preferable and how the accuracy can be improved. As an exercise, I invite the reader to create more complex environments, with or without checkpoints and stochastic rewards. It's not surprising to see how the model will be able to easily learn the dynamics with a sufficiently large number of episodes. Moreover, as suggested in the Actor-Critic section, it's a good idea to use TensorFlow to implement such a model, comparing the performances with Q-learning.

Direct policy search through policy gradient

The last method we're going to discuss doesn't employ a proxy to find the optimal policy, but rather looks for it directly. In this context, we always assume we're working with a stochastic environment where the transitions are not fully deterministic. For example, a robot can assign a high probability to a transition but, in order to increase the robustness, it also has to include a noise term that can lead the transition to a different state. Therefore, we need to include a transition probability p(s_{t+1} | s_t, a_t).

Given this element, we can evaluate the overall probability of an entire sequence (that normally ends in a final state): S_k = (s_1, s_2, …, s_k). To do so, we need to define a parameterized policy π(a | s; θ); therefore, the probability of S_k can be expressed as a conditional one: p(S_k; θ). The full expression requires us to explicitly separate the state transition component from the policy and to introduce the actions (which were implicit in the previous expressions):

p(S_k; θ) = p(s_1) Π_{t=1}^{k−1} π(a_t | s_t; θ) p(s_{t+1} | s_t, a_t)

This term represents the actual probability that an agent whose knowledge is encoded in the parameter set $\theta$ can experience such a sequence. At this point, we can easily define a cost function by considering the expected discounted reward with respect to the distribution $p(S|\theta)$:

$$J(\theta) = E_{S \sim p(S|\theta)}\left[R(S)\right] = \int p(S|\theta)\,R(S)\,dS$$

In the previous expression, $R(S) = \sum_{t=1}^{T} \gamma^{t-1} r_t$ is the discounted reward of the whole sequence.

As the agent must maximize the expected future reward, $J(\theta)$ must be maximized. Of course, we cannot work with the integral directly; hence, we need to approximate it with a sample mean:

$$J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} R\left(S^{(i)}\right) \quad \text{with } S^{(i)} \sim p(S|\theta)$$

In the previous formula, N is the total number of evaluated sequences, but in practice, it's convenient to consider it as the batch size. In fact, our goal is to optimize $J(\theta)$ using a stochastic gradient ascent approach (such as RMSProp or Adam). Unfortunately, the sample mean is not immediately helpful, because the samples are drawn from $p(S|\theta)$, which we don't know explicitly. However, we can perform some straightforward manipulations to compute the gradient $\nabla_{\theta} J(\theta)$:

$$\nabla_{\theta} J(\theta) = \nabla_{\theta} \int p(S|\theta)\,R(S)\,dS = \int \nabla_{\theta}\, p(S|\theta)\,R(S)\,dS$$

To simplify the previous expression, we need to remember that:

$$\nabla_{\theta}\, p(S|\theta) = p(S|\theta)\,\nabla_{\theta} \log p(S|\theta)$$

Therefore, by applying this trick, we obtain:

$$\nabla_{\theta} J(\theta) = E_{S \sim p(S|\theta)}\left[\nabla_{\theta} \log p(S|\theta)\,R(S)\right] \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_{\theta} \log p\left(S^{(i)}|\theta\right) R\left(S^{(i)}\right)$$

In other words, we are approximating the true gradient with a sample mean where the only complexity is the computation of the gradient of $\log \pi(a_t|s_t;\theta)$ (the state-transition probabilities don't depend on $\theta$, so they disappear from the gradient of the logarithm). Once we have computed the gradient, we can update the parameter vector (with a learning rate $\eta$):

$$\theta \leftarrow \theta + \eta\,\nabla_{\theta} J(\theta)$$

This problem can be solved either by directly using the previous formula or, preferably, by employing a more complex optimization algorithm such as Adam. Of course, in practical implementations, it's important to remember that the goal is now a maximization; therefore, the sign of the logarithm must be inverted. In general, the expression is obtained using a dummy cross-entropy where the target probability of the selected action is p = 1:

$$L(\theta) = -\sum_{t} R_t \log \pi(a_t|s_t;\theta)$$

Therefore, using a standard deep learning framework, it's enough to minimize $L(\theta)$.
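
As a quick sanity check of this equivalence (this snippet is not part of the original example; the logits values and the selected action are arbitrary), the cross-entropy computed by TensorFlow with a one-hot target coincides with the negative log-probability of the sampled action:

import tensorflow as tf

# Arbitrary logits for a single state and a hypothetical sampled action
logits = tf.constant([[1.2, -0.3]])
action = 0

# Cross-entropy with a one-hot ("dummy") target where p = 1 on the sampled action...
target = tf.one_hot([action], depth=2)
ce = tf.nn.softmax_cross_entropy_with_logits(labels=target, logits=logits)

# ...coincides with the negative log-probability of that action under the policy
neg_log_pi = -tf.math.log(tf.nn.softmax(logits)[0, action])
print(float(ce[0]), float(neg_log_pi))  # the two printed values are equal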

Example of policy gradient with OpenAI Gym Cartpole

In this example, we want to test the policy gradient algorithm using a simple problem provided by OpenAI Gym (see https://gym.openai.com/envs/CartPole-v0/, where all installation instructions are also provided).

The goal is to find an optimal policy to balance a pole hinged on a cart (as initially proposed by Barto et al. in Barto A., Sutton R., Anderson C., Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems, IEEE Transactions on Systems, Man, and Cybernetics, 1983). The cart can be moved in both directions, and the problem is considered solved when the cart remains very close to the center (within about ±2.4 metric units in the Gym implementation) and the pole remains within ±15° of the vertical line.

It's interesting to note that this problem has been studied for a long time because its physical formulation is quite complex, and directly modeling a control system is consequently extremely difficult. Conversely, RL can solve this kind of problem using only a limited number of experiments, with a policy that can be easily parameterized, such as an MLP.

Screenshot of the CartPole-v0 problem

The first step is initializing the environment:

import gym
env = gym.make('CartPole-v0')
env.seed(1000)

The state is a four-dimensional vector containing the following pieces of information:

  • Position of the cart, $x$
  • Angle of the pole, $\theta$
  • Linear speed of the cart (the first derivative, $\dot{x}$)
  • Angular speed of the pole (the first derivative, $\dot{\theta}$)

There are only two possible actions (encoded either as {-1, 1} or as {0, 1}), indicating a movement at constant speed in one of the two directions. The policy is parameterized using a neural network employing ReLU units, which are very easy to optimize:

import tensorflow as tf
policy = tf.keras.models.Sequential([
    tf.keras.layers.Dense(32,
                          activation='relu',
                          input_dim=4),
    tf.keras.layers.Dense(32,
                          activation='relu'),
    tf.keras.layers.Dense(2,
                          activation='relu')
])
optimizer = tf.keras.optimizers.Adam()

As we want to employ the more robust functions provided by TensorFlow to compute the cross-entropy, the output is not directly a softmax, but rather two ReLU neurons, which can perfectly represent the two logits. The optimizer is Adam with the default learning rate (0.001). As we need to compute the gradients while selecting an action, we have created a single function:

def policy_step(s, grads=False):
    with tf.GradientTape() as tape:
        actions = policy(s, training=True)
        action = tf.random.categorical(
            actions, 1)
        action = tf.keras.utils.to_categorical(action, 2)
        loss = tf.squeeze(
            tf.nn.softmax_cross_entropy_with_logits(
                action, actions))
    if grads:
        gradients = tape.gradient(
            loss, policy.trainable_variables)
        return np.argmax(action), gradients
    return np.argmax(action)

If the parameter grads is set to True, the function computes and returns the gradients; otherwise, it only returns the action that has been selected. The structure of the function is very simple (a quick usage check is shown after the following list):

  • The policy is evaluated using the current state.
  • An action is selected with a random choice based on the probability vector (p0, p1) derived from the logits. If the policy has high entropy (for example, p0 ≈ p1 ≈ 0.5), the action is almost completely random. This behavior encourages exploration. However, when the policy becomes more and more stable, one probability tends towards 1.0 and, consequently, the other one towards 0.0. In this case, the agent starts exploiting the policy, with a very limited chance of performing further explorations.
  • The loss function is computed, as explained in the theoretical part, using a numerically stable cross-entropy.
  • The gradients are computed.
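
As a quick check (a sketch that assumes the environment env, the policy, and the function defined above are already available), a single call looks as follows:

import numpy as np

# Sample one action from the initial state and inspect the returned gradients
s0 = np.reshape(env.reset(), (1, 4)).astype(np.float32)
a0, g0 = policy_step(s0, grads=True)
print(a0, len(g0))  # selected action (0 or 1) and one gradient per trainable variable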

In order to perform the gradient ascent, it's also necessary to create a zero-initialized structure based on all the trainable variables, together with a utility function that computes the discounted reward according to the formula (with 1-based step indices and an episode of length T):

$$R_t = \sum_{k=0}^{T-t} \gamma^{k}\, r_{t+k}$$

The code for both functions is shown in the following snippet:

def create_gradients():
    gradients = policy.trainable_variables
    for i, g in enumerate(gradients):
        gradients[i] = 0.0
    # Return an object array so that the accumulation in the training loop
    # is performed element-wise (one entry per trainable variable)
    return np.array(gradients, dtype=object)


def discounted_rewards(r):
    # Backward pass: accumulate the discounted sum, then restore the original order
    dr = []
    da = 0.0
    for t in range(len(r) - 1, -1, -1):
        da *= gamma
        da += r[t]
        dr.append(da)
    return dr[::-1]

Since we need to evaluate all the possible sub-sequences, the discounted reward is computed considering every starting point from t = 1 to t = T, where T is the length of the sequence. It's also important to remember that the order must be inverted at the end, because the values are appended to the tail of the list starting from the most recent transition.
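
For instance (an illustrative check that is not part of the original code), with gamma = 0.99 and a hypothetical three-step reward sequence, the function returns the discounted sum of the rewards that follow each step:

gamma = 0.99
print(discounted_rewards([1.0, 1.0, -5.0]))
# Prints approximately [-2.9105, -3.95, -5.0]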

At this point, we can start the training loop, using $\gamma = 0.99$ to give more importance to the whole sequence. The environment provides a reward equal to 1.0 for every step in which the pole remains balanced; therefore, the immediate rewards alone don't tell us whether a sequence should be considered a success. A simple trick is to penalize the last action if the episode ends before its natural length (which is equal to 200 steps). In our case, every negative terminal state has a reward equal to -5.0, but this value can be tuned in order to maximize the convergence speed.

The batch size is equal to 5, and we're going to perform 2,000 episodes (of course, I invite the reader to test different values and observe the results):

nb_episodes = 2000
max_length = 200
batch_size = 5
gamma = 0.99

gradients = create_gradients()
global_rewards = []

for e in range(nb_episodes):
    state = env.reset()
    e_gradients = []
    e_rewards = []
    done = False
    total_reward = 0.0
    t = 0

    while not done and t < max_length:
        env.render()
        state = np.reshape(state, (1, 4)).astype(np.float32)
        action, grads = policy_step(state, grads=True)
        state, reward, done, _ = env.step(action)
        total_reward += reward

        # Penalize the terminal state of a failed episode
        e_rewards.append(reward if not done else -5)
        grads = np.array(grads, dtype=object)
        e_gradients.append(grads)
        t += 1

    global_rewards.append(total_reward)
    d_rewards = discounted_rewards(e_rewards)

    # Accumulate the gradients weighted by the discounted rewards
    for i, g in enumerate(e_gradients):
        gradients += g * d_rewards[i]

    # Apply the averaged update every batch_size episodes
    if e > 1 and e % batch_size == 0:
        optimizer.apply_gradients(
            zip(gradients / batch_size,
                policy.trainable_variables))
        gradients = create_gradients()

    print("Finished episode: {}. "
          "Total reward: {:.2f}".
          format(e + 1, total_reward))

env.close()

For each episode, the previous snippet performs a loop where it samples a state from the environment, performs an action (computing the gradients), observes the new state, and stores the immediate reward. At the end of each episode, the per-step gradients are weighted by the discounted rewards and accumulated; every five episodes, the averaged accumulation, which estimates $\nabla_{\theta} J(\theta)$, is applied using the Adam optimizer. The plot of the rewards is shown in the following figure. The highlighted line is a smoothed version:

Rewards of policy gradient applied to the CartPole-v0 problem

As we can see, the algorithm reaches the maximum score (equal to 200) very quickly, but it keeps oscillating (with substantial flattening after 1,500 episodes). This is a drawback of this method: while the estimation is unbiased, its variance is generally quite large. There are different approaches that mitigate this problem, but they might worsen the overall performance. For example, increasing the batch size reduces the number of corrections and yields a more stable policy. Another method, which is probably the best one, is called baseline subtraction, and consists of using a slightly different cost function:

$$L(\theta) = -\sum_{t} \left(R_t - b\right) \log \pi(a_t|s_t;\theta)$$

The baseline term $b$ doesn't alter the unbiasedness of the estimator, but it can be chosen so as to minimize the variance of the gradient. However, when using small batch sizes, this term must be recomputed before each update, and it has only a limited effect due to the small sample size. Therefore, it's often preferable to employ an optimization algorithm with a decaying learning rate, in order to limit the corrections once the policy has reached good stability, and to employ a slightly larger number of explorative episodes where the oscillation is large. As an exercise, I invite the reader to test these different approaches to learn more about the advantages and drawbacks of this method.
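
A minimal sketch of this idea, reusing the variables of the training loop above and choosing the batch-average discounted return as a simple baseline (only one of the possible choices), could replace the accumulation step as follows:

# Subtracting a baseline b from the discounted returns leaves the estimator
# unbiased while reducing the variance of the gradient
d_rewards = np.array(discounted_rewards(e_rewards))
baseline = np.mean(d_rewards)
for i, g in enumerate(e_gradients):
    gradients += g * (d_rewards[i] - baseline)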

Summary

In this chapter, we presented the natural evolution of TD(0), based on an average of backups with different lengths. The algorithm, called TD(λ), is extremely powerful, and it ensures a faster convergence than TD(0) under only a few (non-restrictive) conditions. We also showed how to implement the Actor-Critic method with TD(0) in order to learn both a stochastic policy and a value function.

In later sections, we discussed two methods based on the estimation of the Q function: SARSA and Q-learning. They are very similar, but the latter has a greedy approach, and its performance (in particular, the training speed) is superior to SARSA's. The Q-learning algorithm is one of the most important models behind the latest developments in the field. In fact, it was the first RL approach employed together with a deep convolutional network to solve complex environments (like Atari games). For this reason, we also presented a simple example based on an MLP that processes visual input and outputs the Q values for each action. Moreover, we introduced the concept of direct policy search, with an example based on OpenAI Gym, a free set of environments in which it's possible to experiment with RL algorithms.

The world of RL is extremely fascinating, and hundreds of researchers work every day to improve algorithms and solve more and more complex problems. I invite the reader to check the references in order to find useful resources that can be exploited to obtain a deeper understanding of the models and their developments. Moreover, I suggest reading the blog posts written by the Google DeepMind team, which is a pioneer in the field of deep RL. I also suggest searching for the papers freely available on arXiv (for example, Mnih V., Kavukcuoglu K., Silver D., Graves A., Antonoglou I., Wierstra D., Riedmiller M., Playing Atari with Deep Reinforcement Learning, arXiv:1312.5602v1 [cs.LG], or Mnih V., Kavukcuoglu K., Silver D. et al., Human-level control through deep reinforcement learning, Nature 518, 529–533, 2015).

I'm happy to end this book on this topic, because I believe that RL can provide new and powerful tools that will dramatically change our lives.

Further reading

  • Sutton R. S., Barto A. G., Reinforcement Learning, The MIT Press, 1998
  • Watkins C.I.C.H., Learning from Delayed Rewards, Ph.D. Thesis, University of Cambridge, 1989
  • Dayan P., The Convergence of TD(λ) for General λ, Machine Learning 8, 3–4, 1992
  • Dayan P., Abbott L. F., Theoretical Neuroscience, The MIT Press, 2005
  • Schulman J., Moritz P., Levine S., Jordan M. I., Abbeel P., High-Dimensional Continuous Control Using Generalized Advantage Estimation, ICLR 2016
  • Singh S., Jaakkola T., Littman M. L., Szepesvári C., Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms, Machine Learning, 39/2000
  • Watkins C.I.C.H., Dayan P., Technical Note Q-Learning, Machine Learning 8, 1992
  • Liu R., Zou J., The Effects of Memory Replay in Reinforcement Learning, Workshop on Principled Approaches to Deep Learning, ICML 2017
  • Li Y., Deep Reinforcement Learning: An Overview, arXiv:1701.07274 [cs.LG]
  • Mnih V., Kavukcuoglu K., Silver D., Graves A., Antonoglou I., Wierstra D., Riedmiller M., Playing Atari with Deep Reinforcement Learning, arXiv:1312.5602v1 [cs.LG], 2013
  • Mnih V., Puigdomènech Badia A., Mirza M., Graves A., Lillicrap T. P., Harley T., Silver D., Kavukcuoglu K., Asynchronous Methods for Deep Reinforcement Learning, arXiv:1602.01783 [cs.LG]
  • Barto A., Sutton R., Anderson C., Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems, IEEE Transactions on Systems, Man, and Cybernetics, 1983
  • Mnih V., Kavukcuoglu K., Silver D. et al., Human-level control through deep reinforcement learning, Nature 518, 529–533, 2015

This book took a lot of hard work, and I hope to have provided the reader with some really helpful insights into important machine learning algorithms and applications. I understand perfectly that the complexity of some topics can make them difficult to explain in an easy-to-understand way, and I apologize to the reader if some points have not been made clear enough. However, I hope to have given a small contribution to the dissemination of these wonderful concepts; through the bibliography, the reader eager to learn more can find very helpful resources and original papers. In each of them, she will find other references, because this process is destined to continue for a long time. Learning is changing! Learning also requires the ability to move on, taking what we need with us and leaving behind what we don't need anymore. Finally, I wish all my readers a fantastic career in data science, and I hope you maintain a constant desire to learn new ideas and to improve existing knowledge. The road is always ahead; travel it with wonder, driven by AI!
