
Introduction to Reinforcement Learning

In this chapter, we're going to introduce the fundamental concepts of Reinforcement Learning (RL), which is a set of approaches that allows an agent to learn how to behave in an unknown environment thanks to rewards that are provided after each possible action. RL has been studied for decades, but it has matured into a powerful approach in the last few years, with advances making it possible to employ deep learning models together with standard (and often simple) algorithms in order to solve extremely complex problems (such as learning how to play an Atari game perfectly).

In particular, we will discuss:

  • The concept of the Markov Decision Process (MDP)
  • The concepts of environment, agent, policy, and reward
  • The policy iteration algorithm
  • The value iteration algorithm
  • The TD(0) algorithm

We can now introduce the main concepts that characterize a reinforcement learning scenario, focusing on the features of each element and how they interact to reach a global objective.

Fundamental concepts of RL

Imagine that you want to learn to ride a bike, and ask a friend for advice. They explain how the gears work, how to release the brake and a few other technical details. In the end, you ask the secret to keeping your balance.

What kind of answer do you expect? In an imaginary supervised world, you should be able to perfectly quantify your actions and correct errors by comparing the outcomes with precise reference values. In the real world, you have no idea about the quantities underlying your actions and, above all, you will never know what the right value is.

Increasing the level of abstraction, the scenario we're considering can be described as follows: a generic agent performs actions inside an environment and receives feedback that is somehow proportional to the competence of its actions. According to this feedback, the agent can correct its actions in order to reach a specific goal. This basic schema is represented in the following diagram:

Basic RL schema

Returning to our initial example, when you ride a bike for the first time and try to keep your balance, you will notice that the wrong movement causes an increase in the slope, which in turn increases the horizontal component of the gravitational force, pushing the bike laterally. As the vertical component is compensated, the result is a rotation that ends when the bike falls to the ground completely. However, as you can use your legs to control your balance, when the bike starts falling, thanks to Newton's third law, the force on your leg increases and your brain understands that it's necessary to make a movement in the opposite direction.

Even though this problem can be easily expressed in terms of physical laws, nobody learns to ride a bike by computing forces and momentums. This is one of the main concepts of RL: an agent must always make its choices considering a piece of information, usually defined as a reward that represents the response provided by the environment. If the action is correct, the reward will be positive, otherwise, it will be negative. After receiving a reward, an agent can fine-tune the strategy, called the policy, in order to maximize the expected future reward.

For example, after a few rides, you will be able to slightly move your body so as to keep your balance while turning, but in the beginning, you'll probably need to extend your leg to avoid falling down.

Hence, your initial policy suggested an incorrect action, which received repeated negative rewards, so your brain corrected it by increasing the probability of choosing another action. The implicit hypothesis that underlies this approach is that an agent is always rational, meaning that its goal is to maximize the expected return of its actions (nobody decides to fall off their bike just to find out how it feels).

Before discussing the single components of an RL system, it's necessary to add a couple of fundamental assumptions. The first one is that an agent can repeat experiences an infinite number of times. In other words, we assume that it's possible to learn a valid policy (possibly the optimal one) only if we have enough time. Clearly, this is unacceptable in the animal world, and we all know that many experiences are extremely dangerous; however, this assumption is necessary to prove the convergence of some algorithms. Indeed, sub-optimal policies can sometimes be learned very quickly, but it's necessary to iterate many times to reach the optimal one.

In real artificial systems, we always stop the learning process after a finite number of iterations, but it's almost impossible to find valid solutions if some experiences prevent the agent from continuing to interact with the environment. As many tasks have final states (either positive or negative), we assume that the agent can play any number of episodes (somewhat analogous to the epochs of supervised learning), exploiting the experience previously learned.

The second assumption is a little bit more technical and it's usually known as the Markov property.

The Markov Decision Process

When the agent interacts with the environment, it observes a sequence of states. Even though it might seem like an oxymoron, we assume that each state is stateful. We can explain this concept with a simple example; suppose that you're filling a tank and every 5 seconds you measure the level. Imagine that at t = 0, the level L = 10 and the water is flowing in. What do you expect at t = 1? Obviously, L > 10. In other words, without external unknown causes, we assume that a state contains the previous history, so that the sequence, even if discretized, represents a continuous evolution where no jumps are allowed.

When an RL task satisfies this property, it's called a Markov Decision Process (MDP) and it's very easy to employ simple algorithms to evaluate the actions. Luckily, the majority of natural events can be modeled as MDPs (when you're walking toward a door, every step in the right direction must decrease the distance), but there are some games that are implicitly stateless.

For example, if you want to employ an RL algorithm to learn how to guess the outcome of a probabilistic sequence of independent events (such as tossing a coin), the result could be dramatically wrong. The reason is clear: any state is independent of the previous ones and every attempt to build up a history is a failure. Therefore, if you observe a sequence of 0, 0, 0, 0... you are not justified in increasing the value of betting on 0 unless, after considering the likelihood of the events, you suppose that the coin is loaded.

However, if there's no reason to do so, the process isn't an MDP and every episode (event) is completely independent. All the assumptions that we, either implicitly or explicitly, make are based on this fundamental concept, so pay attention when evaluating new, unusual scenarios because you may discover that the employment of a specific algorithm isn't theoretically justified.
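To make the point concrete, the following snippet (not part of the original text; the strategy and constants are only illustrative) simulates a fair coin and an agent that always bets on the outcome observed most often so far. Since the process is memoryless, the history provides no advantage and the success rate stays around 0.5.

import numpy as np

np.random.seed(1000)

n_tosses = 100000
tosses = np.random.randint(0, 2, size=n_tosses)

# Bet on the most frequent outcome observed so far (a "history-based" policy)
count_ones = int(tosses[0])
wins = 0

for t in range(1, n_tosses):
    bet = 1 if 2 * count_ones > t else 0
    if bet == tosses[t]:
        wins += 1
    count_ones += int(tosses[t])

# The success rate remains close to 0.5: the past tosses carry no information
print('Success rate: {:.3f}'.format(wins / float(n_tosses - 1)))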

Environment

The environment is the entity within which the agent has to reach its goals. For our purposes, a generic environment is a system that receives an input action, a_t (we use the index t because this is a natural time process), and outputs a tuple composed of a state, s_{t+1}, and a reward, r_{t+1}. These two elements are the only pieces of information provided to the agent to make its next decision. If we are working with an MDP and the sets of possible actions, A, and states, S, are discrete and finite, the problem is defined as a finite MDP (in many continuous cases, it's possible to treat the problem as a finite MDP by discretizing the spaces). If there are final states, the task is called episodic and, in general, the goal is to reach a positive final state in the shortest amount of time or to maximize a score. The schema of the cyclic interaction between an agent and an environment is shown in the following diagram:

Agent-environment interaction schema

A very important feature of an environment is its internal nature. It can be either deterministic or stochastic. A deterministic environment is characterized by a function that associates each possible action a_t in a specific state s_t with a well-defined successor s_{t+1} and a precise reward r_{t+1}:

f(s_t, a_t) = s_{t+1} \quad \text{with reward} \quad r(s_t, a_t) = r_{t+1}
Conversely, a stochastic environment is characterized by a transition probability between the current state s_t and a set of possible successors, given an action a_t:

p(\bar{s}_{t+1} | s_t, a_t) \quad \text{with} \quad \sum_{\bar{s}_{t+1} \in S} p(\bar{s}_{t+1} | s_t, a_t) = 1
If a state s_i has a transition probability p(s_i | s_i, a_t) = 1 for every action a_t, the state is defined as absorbing. In general, all ending states in episodic tasks are modeled as absorbing ones, to avoid any further transition. When an episode is not limited to a fixed number of steps, the only criterion to determine its end is to check whether the agent has reached an absorbing state. As we don't know which state will be the successor, it's necessary to consider the expected value of all possible rewards given the initial state s_t and the action a_t:

E[r_{t+1} | s_t, a_t] = \sum_{\bar{s}_{t+1} \in S} p(\bar{s}_{t+1} | s_t, a_t) \, r(s_t, a_t, \bar{s}_{t+1})
In general, it's easier to work with the stochastic formulation, because a deterministic environment is immediately recovered from it by setting all transition probabilities to zero except the one corresponding to the actual successor (for example, p(\bar{s}_{t+1} | s_t, a_t) = 1). In the same way, the expected return can be set equal to r_{t+1}. Knowledge of p(\bar{s}_{t+1} | s_t, a_t), as well as of E[r_{t+1} | s_t, a_t], is necessary to employ some specific algorithms, but it can become problematic when finding a suitable model for the environment requires extremely complex analysis. In all those cases, model-free methods can be employed and, therefore, the environment is considered as a black box, whose output at time t (subsequent to an action a_{t-1} performed by the agent) is the only available piece of information for the evaluation of a policy.
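As a minimal illustration (the dictionaries, states, and rewards below are hypothetical and not part of the checkerboard example), a small stochastic model can be stored as transition probabilities p(s_j | s_i, a) and rewards r(s_i, a, s_j), from which the expected reward follows directly; a deterministic transition is just the special case p = 1:

# p[(state, action)] = {successor: probability}
p = {
    ('s0', 'a0'): {'s1': 0.7, 's2': 0.3},
    ('s0', 'a1'): {'s2': 1.0},   # deterministic special case: p = 1
}

# r[(state, action, successor)] = reward
r = {
    ('s0', 'a0', 's1'): 1.0,
    ('s0', 'a0', 's2'): -1.0,
    ('s0', 'a1', 's2'): 0.5,
}

def expected_reward(state, action):
    # E[r | s, a] = sum over successors of p(s'|s, a) * r(s, a, s')
    return sum(prob * r[(state, action, successor)]
               for successor, prob in p[(state, action)].items())

print(expected_reward('s0', 'a0'))   # 0.7 * 1.0 + 0.3 * (-1.0) = 0.4
print(expected_reward('s0', 'a1'))   # 1.0 * 0.5 = 0.5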

Rewards

We have seen that rewards (sometimes negative rewards are called penalties, but it's preferable to use a standardized notation) are the only feedback provided by the environment after each action. However, there are two different approaches to the use of rewards. The first one is the strategy of a very short-sighted agent and consists of taking into account only the reward just received.

The main problem with this approach is clearly the inability to consider longer sequences that can lead to a very high reward. For example, an agent may have to traverse a few states with a negative reward (for example, -0.1) before arriving at a state with a very large positive reward (for example, +5.0). A short-sighted agent couldn't find the best policy because it would simply try to avoid the immediate negative rewards. On the other hand, it's better to suppose that a single reward contains a part of the future rewards that will be obtained following the same policy. This concept can be expressed by introducing a discounted reward, which is defined as:

R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

In the previous expression, we are assuming an infinite horizon with a discount factor γ, which is a real number bounded between 0 and 1 (1 excluded). When γ = 0, the agent is extremely short-sighted, because R_t = r_{t+1}, but when γ → 1, the current reward takes into account the future contributions discounted in a way that is inversely proportional to the time step. In this way, very close rewards will have a higher weight than very distant ones. If the absolute value of all rewards is limited by a maximum immediate absolute reward r_{max}, the previous expression will always be bounded. In fact, considering the properties of a geometric series, we get:

R_t \le \sum_{k=0}^{\infty} \gamma^k r_{max} = \frac{r_{max}}{1 - \gamma}

Clearly, the right choice of γ is a crucial factor in many problems and cannot be easily generalized. As in many other similar cases, I suggest testing different values, picking the one that minimizes the convergence time while yielding a quasi-optimal policy. Of course, if the tasks are episodic with length T(e_i), the discounted reward becomes:

R_t = \sum_{k=0}^{T(e_i)-t-1} \gamma^k r_{t+k+1}
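Before moving on, here's a minimal sketch (not part of the original text; the reward sequence is arbitrary) showing how the discount factor changes the weight of a distant positive reward:

def discounted_return(rewards, gamma):
    # R_t = sum_k gamma^k * r_{t+k+1}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [-0.1, -0.1, -0.1, 5.0]

# A short-sighted agent (small gamma) almost ignores the final +5.0 reward,
# while a far-sighted one (gamma close to 1) is dominated by it
for gamma in (0.1, 0.5, 0.9, 0.99):
    print('gamma = {:.2f} -> R = {:.3f}'.format(
        gamma, discounted_return(rewards, gamma)))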
A checkerboard environment in Python

We are going to consider an example based on a checkerboard environment representing a tunnel. The goal of the agent is to reach the ending state (the lower-right corner), avoiding 10 wells that are negative absorbing states. The rewards are:

  • Ending state: +5.0
  • Wells: -5.0
  • All other states: -0.1

Selecting a small negative reward for all non-terminal states is helpful to force the agent to move forward until the maximum (final) reward has been achieved. Let's start modeling an environment that has a 5 × 15 matrix:

import numpy as np

width = 15
height = 5

y_final = width - 1
x_final = height - 1

y_wells = [0, 1, 3, 5, 5, 7, 9, 11, 12, 14]
x_wells = [3, 1, 2, 0, 4, 1, 3, 2, 4, 1]

standard_reward = -0.1
tunnel_rewards = np.ones(shape=(height, width)) * \
                 standard_reward

for x_well, y_well in zip(x_wells, y_wells):
    tunnel_rewards[x_well, y_well] = -5.0

tunnel_rewards[x_final, y_final] = 5.0

The graphical representation of the environment (in terms of rewards) is shown in the following chart:

Rewards in the tunnel environment

The agent can move in four directions: up, down, left, and right. Clearly, in this case, the environment is deterministic because every action moves the agent to a predefined cell. We assume that whenever an action is forbidden (such as trying to move to the left when the agent is in the first column), the successor state is the same one (with the corresponding reward).
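The same movement logic can be summarized in a small helper (a hypothetical function, not part of the book's listings, which inline the equivalent if/elif blocks), assuming the width and height constants defined previously and the action coding 0 = up, 1 = right, 2 = down, 3 = left:

def move(i, j, action):
    # Clip forbidden moves so that the agent stays in the same cell
    if action == 0:
        return max(i - 1, 0), j
    elif action == 1:
        return i, min(j + 1, width - 1)
    elif action == 2:
        return min(i + 1, height - 1), j
    return i, max(j - 1, 0)

# Moving left from the first column leaves the agent where it is
print(move(2, 0, 3))   # (2, 0)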

Policy

Formally, a policy is a deterministic or stochastic law that the agent follows in order to maximize its return. Conventionally, all policies are denoted by the letter π. A deterministic policy is usually a function of the current state that outputs a precise action:

a_t = \pi(s_t)

A stochastic policy, analogously to stochastic environments, outputs the probability of each action (in this case, we are assuming we work with a finite MDP):

\pi(a_t | s_t) = P(a_t | s_t) \quad \text{with} \quad \sum_{a_t \in A} \pi(a_t | s_t) = 1
However, contrary to the environment, an agent must always pick a specific action, transforming any stochastic policy into a deterministic sequence of choices. In general, a policy where every action has a non-null probability (π(a_t | s_t) > 0 ∀ a_t ∈ A) is called soft, and it's often very useful during the training process because it allows more flexible modeling without the premature selection of a suboptimal action. Instead, when π(a_t | s_t) = 1 for a single action and 0 for all the others, the policy is defined as hard. This transformation can be performed in many ways, but the most common one is to define a policy that is greedy with respect to a value (we're going to discuss this concept in the next section). This means that, at every step, the policy will select the action that maximizes the value of the successor state. Obviously, this is a very rational approach, which could be too pragmatic. In fact, when the values of some states don't change, a greedy policy will always force the agent to perform the same actions.

Such a problem is known as the exploration-exploitation dilemma and arises when it would be better to allow the agent to evaluate alternative strategies that could initially appear to be suboptimal. In other words, we want the agent to explore the environment before starting to exploit the policy, to know whether the policy is really the best one or whether there are hidden alternatives. To solve this problem, it's possible to employ an ε-greedy policy, where the value ε is called the exploration factor and represents a probability.

In this case, the policy will pick a random action with probability ε and a greedy one with probability 1 - ε. In general, at the beginning of the training process, ε is kept very close to 1.0 to incentivize exploration, and it's progressively decreased as the policy becomes more stable. In many Deep RL (DRL) applications, this approach is fundamental, in particular when there are no models of the environment. The reason is that greedy policies can be initially wrong and it's necessary to allow the agent to explore many possible state and action sequences before forcing a deterministic decision.
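A minimal sketch of this selection rule and a possible decay schedule is shown below (the function, the decay factor, and the sample values are only illustrative and not part of the book's listings):

import numpy as np

def epsilon_greedy_action(values, epsilon):
    # Explore with probability epsilon, otherwise exploit the best action
    if np.random.uniform(0.0, 1.0) < epsilon:
        return np.random.randint(0, values.shape[0])
    return int(np.argmax(values))

# Start with pure exploration and decay toward a small residual value
epsilon = 1.0
for episode in range(10):
    action = epsilon_greedy_action(np.array([0.1, 0.5, -0.2, 0.0]), epsilon)
    epsilon = max(0.05, epsilon * 0.9)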

Policy iteration

In this section, we are going to analyze a strategy to find an optimal policy based on complete knowledge of the environment (in terms of transition probabilities and expected returns). The first step is to define a method that can be employed to build a greedy policy. Let's suppose we're working with a finite MDP and a generic policy π; we can define the intrinsic value of a state s_t as the expected discounted return obtained by the agent starting from s_t and following the stochastic policy π:

V^{\pi}(s_t) = E_{\pi}[R_t | s_t] = E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t\right]

In this case, we are assuming that, as the agent will follow π, the state s_a is more useful than s_b if the expected return starting from s_a is greater than the one obtained starting from s_b. Unfortunately, trying to directly find the value of each state using the previous definition is almost impossible in practice, because the expectation involves all possible future trajectories. However, this is a problem that can be solved using dynamic programming (for further information, please refer to R. A. Howard, Dynamic Programming and Markov Process, The MIT Press, 1960), which allows us to solve the problem iteratively.

In particular, we need to turn the previous formula into a Bellman equation:

V^{\pi}(s_t) = E_{\pi}\left[r_{t+1} + \gamma V^{\pi}(s_{t+1}) \,|\, s_t\right]

The first term on the right-hand side can be expressed as:

E_{\pi}\left[r_{t+1} \,|\, s_t\right] = \sum_{a_t \in A} \pi(a_t | s_t) \sum_{s_{t+1} \in S} p(s_{t+1} | s_t, a_t) \, r_{t+1}

In other words, it is the weighted average of all expected rewards considering that the agent is in the state s_t and evaluates all possible actions and the consequent state transitions. For the second term, we need a small trick. Let's suppose we start from s_{t+1}, so that the expected value corresponds to V^{\pi}(s_{t+1}); however, as the sum starts from s_t, we need to consider all possible transitions starting from s_t. In this case, we can rewrite the term as:

E_{\pi}\left[\gamma V^{\pi}(s_{t+1}) \,|\, s_t\right] = \gamma \sum_{a_t \in A} \pi(a_t | s_t) \sum_{s_{t+1} \in S} p(s_{t+1} | s_t, a_t) \, V^{\pi}(s_{t+1})

Again, the first terms take into account all possible transitions starting from s_t (and ending in s_{t+1}), while the second one is the value of each ending state. Therefore, the complete expression becomes:

V^{\pi}(s_t) = \sum_{a_t \in A} \pi(a_t | s_t) \sum_{s_{t+1} \in S} p(s_{t+1} | s_t, a_t) \left[r_{t+1} + \gamma V^{\pi}(s_{t+1})\right]

For a deterministic policy π(s_t), instead, the formula is:

V^{\pi}(s_t) = \sum_{s_{t+1} \in S} p(s_{t+1} | s_t, \pi(s_t)) \left[r_{t+1} + \gamma V^{\pi}(s_{t+1})\right]
The previous equations are particular cases of a generic discrete Bellman equation for a finite MDP that can be expressed as a vectorial operator T^π applied to the value vector:

V^{(t+1)} = T^{\pi}\left[V^{(t)}\right]
It's easy to prove that there is a unique fixed point that corresponds to V^π, so T^π[V^π] = V^π. However, in order to solve the system, we need to consider all the equations at the same time because, both on the left-hand and on the right-hand side of the Bellman equation, there is the V^π term. Is it possible to transform the problem into an iterative procedure, so that a previous computation can be exploited for the following one? The answer is yes, and it's a consequence of an important property of T^π. Let's consider the infinity norm of the difference between two value vectors computed at time t and t+1:

\left\| T^{\pi}\left[V^{(t+1)}\right] - T^{\pi}\left[V^{(t)}\right] \right\|_{\infty} \le \gamma \left\| V^{(t+1)} - V^{(t)} \right\|_{\infty}
As the discount factor γ < 1, the Bellman operator is a γ-contraction that reduces the distance between its arguments by a factor γ (they get more and more similar). The Banach Fixed-Point Theorem states that a contraction on a complete metric space D admits a unique fixed point x*, which can be found by repeatedly applying the contraction to any starting element x0 ∈ D.

Hence, we know about the existence of a unique fixed point V^π, which is the goal of our research. If we now consider a generic starting point V^{(t)}, and we compute the norm of the difference with V^π, we obtain:

\left\| V^{(t+1)} - V^{\pi} \right\|_{\infty} = \left\| T^{\pi}\left[V^{(t)}\right] - T^{\pi}\left[V^{\pi}\right] \right\|_{\infty} \le \gamma \left\| V^{(t)} - V^{\pi} \right\|_{\infty}

Repeating this procedure iteratively until t = 0, we get:

\left\| V^{(t+1)} - V^{\pi} \right\|_{\infty} \le \gamma^{t+1} \left\| V^{(0)} - V^{\pi} \right\|_{\infty}

As γ < 1, the term γ^{t+1} → 0 while the iterations continue, so the distance between V^{(t)} and V^π gets smaller and smaller, authorizing us to employ the iterative approach instead of a one-shot, closed-form solution. Hence, the Bellman equation becomes:

V^{(t+1)}(s_t) = \sum_{a_t \in A} \pi(a_t | s_t) \sum_{s_{t+1} \in S} p(s_{t+1} | s_t, a_t) \left[r_{t+1} + \gamma V^{(t)}(s_{t+1})\right]
This formula allows us to find the value of each state (this step is formally called policy evaluation), but of course, it requires a policy. In the first step, we can randomly select the actions because we don't have any other piece of information, but after a complete evaluation cycle, we can start defining a greedy policy with respect to the values. In order to achieve this goal, we need to introduce a very important concept in RL, the Q function (which must not be confused with the Q function defined in the EM algorithm), which is defined as the expected discounted return obtained by an agent starting from the state s_t and selecting a specific action a_t:

Q^{\pi}(s_t, a_t) = E_{\pi}\left[R_t \,|\, s_t, a_t\right]
The definition is very similar to V^π(s_t), but in this case, we include the action, a_t, as a variable. Clearly, it's possible to define a Bellman equation for Q^π by simply removing the policy/action summation:

Q^{\pi}(s_t, a_t) = \sum_{s_{t+1} \in S} p(s_{t+1} | s_t, a_t) \left[r_{t+1} + \gamma V^{\pi}(s_{t+1})\right]
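As a quick illustration (the transition tensor P, the reward tensor R, the value vector V, and their sizes below are hypothetical, not taken from the chapter's examples), the Q function and the corresponding greedy policy can be computed in a few NumPy lines:

import numpy as np

n_states, n_actions = 3, 2
rs = np.random.RandomState(1000)

# P[s, a, s'] = p(s' | s, a), normalized so that each (s, a) slice sums to 1
P = rs.uniform(size=(n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)

# R[s, a, s'] = expected reward for the transition, V[s'] = current estimate
R = rs.uniform(-1.0, 1.0, size=(n_states, n_actions, n_states))
V = np.zeros(shape=(n_states,))
gamma = 0.9

# Q(s, a) = sum over s' of p(s'|s, a) * (r + gamma * V(s'))
Q = np.einsum('ijk,ijk->ij', P, R + gamma * V)

# A greedy policy picks argmax over the actions for each state
greedy_policy = np.argmax(Q, axis=1)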
Sutton and Barto (in Sutton R. S., Barto A. G., Reinforcement Learning, The MIT Press, 1998) proved a simple but very important theorem (called the Policy Improvement Theorem), which states that, given two deterministic policies π and π', if Q^π(s_t, π'(s_t)) ≥ V^π(s_t) for every state s_t ∈ S, then π' is better than or equal to π. The proof is very compact and can be found in their book; however, the result can be understood intuitively. If, in every state, the action selected by π' yields an expected return that is not lower than the value obtained by following π, then π' is at least equal to π, and it becomes strictly better if at least one inequality is strict. Conversely, if π' is better than or equal to π, then V^{π'}(s_t) ≥ V^π(s_t) for every state and, again, π' is strictly better if there's at least one state s_i where the inequality is strict.

Hence, after a complete policy evaluation cycle, we are authorized to define a new greedy policy as:

\pi'(s_t) = \operatorname{argmax}_{a_t \in A} Q^{\pi}(s_t, a_t)
This step is called policy improvement and its goal is to set the action associated with each state as the one that leads to the transition to the successor state with the maximum value. It's not difficult to understand that an optimal policy will remain stable once the improvement step stops changing it (that is, when π^{(t+1)} = π^{(t)}). In fact, when V^{(t)} → V^π, the Q function converges to a stable fixed point and the argmax term will always select the same actions. However, if we start with a random policy, in general, a single policy evaluation cycle isn't enough to assure convergence. Therefore, after a policy improvement step, it's often necessary to repeat the evaluation and continue alternating the two phases until the policy becomes stable (that's why the algorithm is called policy iteration). In general, convergence is quite fast, but the actual speed depends on the nature of the problem, the number of states and actions, and the consistency of the rewards.

The complete policy iteration algorithm (as proposed by Sutton and Barto) is:

  1. Set an initial deterministic random policy π(s).
  2. Set the initial value array V(s) = 0 ∀ s ∈ S.
  3. Set a tolerance threshold Thr (for example, Thr = 0.0001).
  4. Set a maximum number of iterations Niter.
  5. Set a counter e = 0.
  6. While e < Niter:
    1. e = e + 1.
    2. While Avg(|V(s) – Vold(s)|) > Thr:
      1. Set Vold(s) = V(s).
      2. Perform a policy evaluation step reading the current value from Vold(s) and updating V(s).
    3. Set πold(s) = π(s).
    4. Perform a policy improvement step.
    5. If π(s) = πold(s) ∀ s ∈ S:
      1. Break (Main while loop)
  7. Output the final deterministic policy π(s).

In this case, as we have full knowledge of the environment, there's no need for an exploration phase. The policy is always exploited, because it's built to be greedy with respect to the real value function (obtained when t → ∞).

Policy iteration in the checkerboard environment

We want to apply the policy iteration algorithm in order to find an optimal policy for the tunnel environment. Let's start by defining a random initial policy and a value matrix with all values (except the terminal states) equal to 0:

import numpy as np
nb_actions = 4
policy = np.random.randint(0, nb_actions,
                           size=(height, width)).astype(np.uint8)
tunnel_values = np.zeros(shape=(height, width))

The initial random policy (t = 0) is shown in the following chart:

Initial (t = 0) random policy

The states corresponding to the wells are marked explicitly in the chart, while the final positive one is represented by the capital letter E.

Hence, the initial value matrix (t = 0) is:

Initial (t = 0) value matrix

At this point, we need to define the functions to perform the policy evaluation and improvement steps. As the environment is deterministic, the processes are slightly simpler, because the generic transition probability collapses to:

p(\bar{s}_{t+1} | s_t, a_t) \in \{0, 1\}

In the same way, the policy is deterministic and only a single action is taken into account. The policy evaluation step is performed by freezing the current values and updating the whole matrix V^{(t+1)} from V^{(t)}; however, it's also possible to use the new values immediately. I invite the reader to test both strategies in order to find the fastest one. In this example, we are employing a discount factor γ = 0.85 (it goes without saying that an interesting exercise consists of testing different values and comparing the result of the evaluation process and the final behavior):

import numpy as np

def is_final(x, y):
    # A cell is terminal if it's a well or the positive ending state
    if (x, y) in zip(x_wells, y_wells) or \
            (x, y) == (x_final, y_final):
        return True
    return False

The code for policy evaluation is shown in the following snippet:

def policy_evaluation():
    old_tunnel_values = tunnel_values.copy()

    for i in range(height):
        for j in range(width):
            action = policy[i, j]

            # Compute the successor cell (0 = up, 1 = right,
            # 2 = down, 3 = left); forbidden moves leave the
            # agent in the same cell
            if action == 0:
                if i == 0:
                    x = 0
                else:
                    x = i - 1
                y = j
            elif action == 1:
                if j == width - 1:
                    y = width - 1
                else:
                    y = j + 1
                x = i
            elif action == 2:
                if i == height - 1:
                    x = height - 1
                else:
                    x = i + 1
                y = j
            else:
                if j == 0:
                    y = 0
                else:
                    y = j - 1
                x = i

            # Deterministic Bellman update using the frozen values
            reward = tunnel_rewards[x, y]
            tunnel_values[i, j] = reward + \
                (gamma * old_tunnel_values[x, y])

In an analogous way, we can define the code necessary to perform the policy improvement step:

def policy_improvement():
    for i in range(height):
        for j in range(width):
            if is_final(i, j):
                continue

            # Value of each action (forbidden moves are excluded with -inf)
            values = np.zeros(shape=(nb_actions,))

            values[0] = (tunnel_rewards[i - 1, j] +
                         (gamma * tunnel_values[i - 1, j])) \
                if i > 0 else -np.inf
            values[1] = (tunnel_rewards[i, j + 1] +
                         (gamma * tunnel_values[i, j + 1])) \
                if j < width - 1 else -np.inf
            values[2] = (tunnel_rewards[i + 1, j] +
                         (gamma * tunnel_values[i + 1, j])) \
                if i < height - 1 else -np.inf
            values[3] = (tunnel_rewards[i, j - 1] +
                         (gamma * tunnel_values[i, j - 1])) \
                if j > 0 else -np.inf

            # Greedy action with respect to the current values
            policy[i, j] = np.argmax(values).astype(np.uint8)

Once the functions have been defined, we start the policy iteration cycle (with a maximum number of epochs, Niter = 100,000, and a tolerance threshold equal to 10^-5):

nb_max_epochs = 100000
tolerance = 1e-5
e = 0
gamma = 0.85

old_policy = np.random.randint(0, nb_actions,
                               size=(height, width)).astype(np.uint8)

while e < nb_max_epochs:
    e += 1
    old_tunnel_values = tunnel_values.copy()
    policy_evaluation()

    # When the evaluation has converged, improve the policy and
    # stop as soon as it becomes stable
    if np.mean(np.abs(tunnel_values - old_tunnel_values)) < tolerance:
        old_policy = policy.copy()
        policy_improvement()
        if np.sum(policy - old_policy) == 0:
            break

At the end of the process (in this case, the algorithm converged after 182 iterations, but this value can change with different initial policies), the value matrix is:

Final value matrix

Analyzing the values, it's possible to see how the algorithm discovered that they are an implicit function of the distance between a cell and the ending state. Moreover, the policy always avoids the wells because the maximum value is always found in an adjacent state. It's easy to verify this behavior by plotting the final policy:

Final policy

Picking a random initial state, the agent will always reach the ending one, avoiding the wells and confirming the optimality of the policy iteration algorithm.
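This claim can be checked with a short rollout sketch (not part of the original listing), which reuses the policy, is_final(), and the grid constants defined above; starting from the upper-left corner, the agent should end in the positive terminal state:

# Follow the learned greedy policy from an arbitrary non-terminal cell
i, j = 0, 0
path = [(i, j)]

for _ in range(height * width):
    action = policy[i, j]
    if action == 0:
        i = max(i - 1, 0)
    elif action == 1:
        j = min(j + 1, width - 1)
    elif action == 2:
        i = min(i + 1, height - 1)
    else:
        j = max(j - 1, 0)
    path.append((i, j))
    if is_final(i, j):
        break

# True if the ending state (and not a well) was reached
print(path[-1] == (x_final, y_final))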

Value iteration

An alternative approach to policy iteration is provided by the value iteration algorithm. The main assumption is based on the empirical observation that the policy evaluation step converges rather quickly and it's reasonable to stop the process after a fixed number of steps (normally 1). In fact, policy iteration can be thought of as a game where the first player tries to find the correct values considering a stable policy, while the other player creates a new policy that is greedy with respect to the new values.

Clearly, the second step compromises the validity of the previous evaluation, forcing the first player to repeat the process. However, as the Bellman operator has a single fixed point, the algorithm converges to a solution characterized by the fact that the policy doesn't change anymore and, consequently, the evaluation becomes stable. This process can be simplified by removing the policy improvement step and continuing the evaluation in a greedy fashion. Formally, each step is based on the following update rule:

V^{(t+1)}(s_t) = \max_{a_t \in A} \sum_{s_{t+1} \in S} p(s_{t+1} | s_t, a_t) \left[r_{t+1} + \gamma V^{(t)}(s_{t+1})\right]

Now the iteration doesn't consider the policy anymore (implicitly assuming that it will be greedy with respect to the values) and selects V^{(t+1)}(s_t) as the maximum expected value over all possible actions. In other words, value iteration anticipates the choice made by the policy improvement step by directly selecting the value corresponding to the action that the greedy policy would eventually pick. It's not difficult to extend the convergence proof presented in the previous section to this case; therefore, V^{(t)} converges to the optimal value function, just as in policy iteration. However, the average number of iterations is normally smaller, because the process is not slowed down by an initially random policy. When the values become stable, the optimal greedy policy is simply obtained as:

\pi(s_t) = \operatorname{argmax}_{a_t \in A} \sum_{s_{t+1} \in S} p(s_{t+1} | s_t, a_t) \left[r_{t+1} + \gamma V(s_{t+1})\right]
This step is formally equivalent to a policy improvement iteration, which, however, is done only once, at the end of the process.

The complete value iteration algorithm (as proposed by Sutton and Barto) is:

  1. Set the initial value array V(s) = 0 ∀ s ∈ S.
  2. Set a tolerance threshold Thr (for example, Thr = 0.0001).
  3. Set a maximum number of iterations Niter.
  4. Set a counter e = 0.
  5. While e < Niter:
    1. e = e + 1.
    2. While Avg(|V(s) – Vold(s)|) > Thr:
      1. Set Vold(s) = V(s).
      2. Perform a value evaluation step reading the current value from Vold(s) and updating V(s).
  6. Output the final deterministic policy π(s), greedy with respect to V(s).

Value iteration in the checkerboard environment

To test this algorithm, we need to set an initial value matrix with all values equal to 0 (they can also be randomly chosen but, as we don't have any prior information on the final configuration, every initial choice is probabilistically equivalent):

import numpy as np
tunnel_values = np.zeros(shape=(height, width))

At this point, we can define the two functions to perform the value evaluation and the final policy selection (the function is_final() is the one defined in the previous example):

import numpy as np

def is_final(x, y):
    if (x, y) in zip(x_wells, y_wells) or \
            (x, y) == (x_final, y_final):
        return True
    return False

We can now define the value evaluation function:

def value_evaluation():
    old_tunnel_values = tunnel_values.copy()

    for i in range(height):
        for j in range(width):
            rewards = np.zeros(shape=(nb_actions,))
            old_values = np.zeros(shape=(nb_actions,))

            # Collect the reward and the frozen value of the successor
            # state for each action (0 = up, 1 = right, 2 = down, 3 = left)
            for k in range(nb_actions):
                if k == 0:
                    if i == 0:
                        x = 0
                    else:
                        x = i - 1
                    y = j
                elif k == 1:
                    if j == width - 1:
                        y = width - 1
                    else:
                        y = j + 1
                    x = i
                elif k == 2:
                    if i == height - 1:
                        x = height - 1
                    else:
                        x = i + 1
                    y = j
                else:
                    if j == 0:
                        y = 0
                    else:
                        y = j - 1
                    x = i

                rewards[k] = tunnel_rewards[x, y]
                old_values[k] = old_tunnel_values[x, y]

            # Greedy update: keep the maximum value over all actions
            new_values = np.zeros(shape=(nb_actions,))
            for k in range(nb_actions):
                new_values[k] = rewards[k] + \
                                (gamma * old_values[k])

            tunnel_values[i, j] = np.max(new_values)

The next function we need is the policy selection one, shown in the following snippet:

def policy_selection():
    policy = np.zeros(shape=(height, width)).astype(np.uint8)

    for i in range(height):
        for j in range(width):
            if is_final(i, j):
                continue

            # Value of each action (forbidden moves are excluded with -inf)
            values = np.zeros(shape=(nb_actions,))

            values[0] = (tunnel_rewards[i - 1, j] +
                         (gamma * tunnel_values[i - 1, j])) \
                if i > 0 else -np.inf
            values[1] = (tunnel_rewards[i, j + 1] +
                         (gamma * tunnel_values[i, j + 1])) \
                if j < width - 1 else -np.inf
            values[2] = (tunnel_rewards[i + 1, j] +
                         (gamma * tunnel_values[i + 1, j])) \
                if i < height - 1 else -np.inf
            values[3] = (tunnel_rewards[i, j - 1] +
                         (gamma * tunnel_values[i, j - 1])) \
                if j > 0 else -np.inf

            # Greedy action with respect to the final values
            policy[i, j] = np.argmax(values).astype(np.uint8)

    return policy

The main differences are in the value_evaluation() function, which now has to consider all possible successor states and keep the value corresponding to the action that leads to the state with the highest value. The policy_selection() function, instead, is equivalent to policy_improvement(), but as it is invoked only once, it directly outputs the final optimal policy. At this point, we can run a training cycle (assuming the same constants as before):

e = 0
policy = None

while e < nb_max_epochs:
    e += 1
    old_tunnel_values = tunnel_values.copy()
    value_evaluation()

    # Stop when the values are stable and select the greedy policy
    if np.mean(np.abs(tunnel_values - old_tunnel_values)) < tolerance:
        policy = policy_selection()
        break

The final value configuration (after 127 iterations) is shown in the following chart:

Final value matrix

As in the previous example, the final value configuration is a function of the distance between each state and the ending one, but in this case, the choice of the discount factor γ isn't optimal. In fact, the wells close to the final state aren't considered very dangerous anymore. Plotting the final policy can help us understand this behavior:

Final policy

As expected, the wells that are far from the target are avoided, but the two that are close to the final state are accepted as reasonable penalties. This happens because the value iteration algorithm is very greedy with respect to the values and, with the discount factor γ = 0.85, the effect of negative states can be compensated for by the final reward. In many scenarios, these states are absorbing; therefore, their implicit reward is the terminal one (-5.0 or +5.0 in our environment), meaning that no other action can change the final value.

I invite the reader to repeat the example with different discount factors (remember that an agent with γ close to 0 is very short-sighted and will avoid any obstacle, even at the cost of reducing the efficiency of the policy) and to change the values of the final states. Moreover, you should be able to answer the question: what is the agent's behavior when the standard reward (whose default value is -0.1) is increased or decreased? A possible starting point for the first experiment is sketched right after this paragraph.
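The following sketch (not part of the original listing; the set of discount factors is arbitrary) reruns value iteration with different γ values by reusing value_evaluation() and policy_selection() and resetting the global tunnel_values before each run:

policies = {}

for g in (0.6, 0.75, 0.85, 0.95):
    gamma = g
    tunnel_values = np.zeros(shape=(height, width))
    e = 0

    while e < nb_max_epochs:
        e += 1
        old_tunnel_values = tunnel_values.copy()
        value_evaluation()
        if np.mean(np.abs(tunnel_values - old_tunnel_values)) < tolerance:
            policies[g] = policy_selection()
            break

# Compare each policy with the one obtained with the default gamma = 0.85
for g in sorted(policies.keys()):
    n_diff = int(np.sum(policies[g] != policies[0.85]))
    print('gamma = {:.2f}: {} cells differ'.format(g, n_diff))

# Restore the default discount factor used in the rest of the chapter
gamma = 0.85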

The TD(0) algorithm

One of the problems with dynamic programming algorithms is the need for full knowledge of the environment in terms of states and transition probabilities. Unfortunately, there are many cases where these pieces of information are unknown before the direct experience. In particular, states can be discovered by letting the agent explore the environment, but transition probabilities require us to count the number of transitions to a certain state and this is often impossible. Moreover, an environment with absorbing states can prevent visiting many states if the agent has learned a good initial policy. For example, in a game, which can be described as an episodic MDP, the agent discovers the environment while learning how to move forward without ending in a negative absorbing state.

A general solution to these problems is provided by a different evaluation strategy, called Temporal Difference (TD) RL. In this case, we start with an empty value matrix and we let the agent follow a policy that is greedy with respect to the values (except for the initial one, which is generally random). Once the agent observes a transition s_i → s_j with reward r_{ij}, due to an action a_t, it updates the estimation of V(s_i). The process is structured in episodes (which is the most natural way) and ends when a maximum number of steps has been performed or a terminal state is met. In particular, the TD(0) algorithm updates the value according to the rule:

V^{(t+1)}(s_i) = V^{(t)}(s_i) + \alpha \left[r_{ij} + \gamma V^{(t)}(s_j) - V^{(t)}(s_i)\right]
The constant α is bounded between 0 and 1 and acts as a learning rate. Each update considers a variation with respect to the current value V^{(t)}(s_i), which is proportional to the difference between the actual return and the previous estimation. The term r_{ij} + γ V^{(t)}(s_j) is analogous to the one employed in the previous methods and represents the expected value given the current reward and the discounted value starting from the successor state. However, as V^{(t)}(s_j) is an estimation, the process is based on a bootstrap from the previous values. In other words, we start from an estimation to determine the next one, which should be closer to the stable fixed point. Indeed, TD(0) is the simplest example of a family of TD algorithms that are based on a sequence (usually called a backup) that can be generalized as (considering k steps):

R_t^{(k)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^{k} V^{(t)}(s_{t+k})
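A minimal sketch of this k-step backup target is shown below (the function and the sample numbers are only illustrative, not part of the book's listings); with k = 1 it reduces to the TD(0) target r_{t+1} + γV(s_{t+1}):

def k_step_target(rewards, v_last, gamma):
    # rewards = [r_{t+1}, ..., r_{t+k}], v_last = current estimate V(s_{t+k})
    target = 0.0
    for i, r in enumerate(rewards):
        target += (gamma ** i) * r
    # Bootstrap: the remaining tail is replaced by the current value estimate
    return target + (gamma ** len(rewards)) * v_last

print(k_step_target([-0.1], 2.0, 0.85))             # one-step (TD(0)) target
print(k_step_target([-0.1, -0.1, 5.0], 0.0, 0.85))  # three-step target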
As we're using a single reward to approximate the expected discounted return, TD(0) is usually called a one-step TD method (or one-step backup). More complex algorithms can be built considering more subsequent rewards or alternative strategies. We're going to analyze a generic variant called TD(λ) in the next chapter and explain why this algorithm corresponds to the choice λ = 0. TD(0) has been proven to converge, even if the proof (which can be found for a model-based approach in Van Hasselt H., Wiering M. A., Convergence of Model-Based Temporal Difference Learning for Control, Proceedings of the 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007)) is more complex, because it's necessary to consider the evolution of the Markov process. In fact, in this case, we are approximating the expected discounted return with both a truncated estimation and a bootstrap value V(s_j), which is initially (and for a large number of iterations) unstable. However, assuming convergence for t → ∞, we get:

V(s_i) = r_{ij} + \gamma V(s_j)
The last formula expresses the value of the state si, assuming that the greedy optimal policy forces the agent to perform the action that causes the transition to sj. Of course, at this point, it's natural to ask under which conditions the algorithm converges. In fact, we are considering episodic tasks and the estimation can be correct only if the agent performs a transition to si an infinite number of times, selecting all possible actions an infinite number of times. Such a condition is often expressed by saying that the policy must be Greedy in the Limit with Infinite Explorations (GLIE). In other words, the real greediness is achieved only as an asymptotic state when the agent is able to explore the environment without limitations for an unlimited number of episodes.

This is probably the most important limitation of TD Reinforcement Learning because, in real-life scenarios, some states can be very unlikely and, hence, the estimation can never accumulate the experience needed to converge to the actual value. We are going to analyze some methods to solve this problem in the next chapter, but in our example, we employ a random start. In other words, as the policy is greedy and could always avoid some states, we force the agent to start each episode in a random non-terminal cell. In this way, we allow deep exploration even with a greedy policy. Whenever this approach is not feasible (because, for example, the environment dynamics are not controllable), the exploration-exploitation dilemma can be solved only by employing an ε-greedy policy, which selects a fraction ε of suboptimal (or even wrong) actions. In this way, it's possible to observe a higher number of transitions, paying the price of slower convergence.

However, as pointed out by Sutton and Barto, TD(0) converges to the maximum-likelihood estimate of the value function determined by the MDP, implicitly finding the transition probabilities of the model. Therefore, if the number of observations is high enough, TD(0) can quickly find an optimal policy, but at the same time, it's also more sensitive to biased estimations if some state-action pairs are never experienced (or experienced very seldom). In our example, we don't know what the initial state will be; hence, selecting a fixed starting point yields a policy that is extremely rigid and almost completely unable to manage noisy situations.

For example, if the starting point is changed to an adjacent (but never explored) cell, the algorithm could fail to find the optimal path to the positive terminal state. On the other hand, if we know that the dynamics are well defined, TD(0) will force the agent to select the actions that are most likely to produce the optimal result given the current knowledge of the environment. If the dynamics are partially stochastic, the advantage of an ε-greedy policy can be understood by considering a sequence of episodes where the agent experiences the same transitions and the corresponding values are increased proportionally. If, for example, the environment changes one transition after many experiences, the agent has to face a brand-new situation when the policy is already almost stable.

The correction requires many episodes and, as this random change has a very low probability, it's possible that the agent will never learn the correct behavior. Instead, by selecting a few random actions, the probability of encountering a similar state (or even the same one) increases (think about a game where the state is represented by a screenshot) and the algorithm can become more robust with respect to very unlikely transitions.

The complete TD(0) algorithm is:

  1. Set an initial deterministic random policy π(s).
  2. Set the initial value array V(s) = 0 ∀ s ∈ S.
  3. Set the number of episodes Nepisodes.
  4. Set a maximum number of steps per episode Nmax.
  5. Set a learning rate constant α (for example, α = 0.25).
  6. Set a discount factor γ (for example, γ = 0.85).
  7. Set a counter e = 0.
  8. For i = 1 to Nepisodes:
    1. Observe the initial state si.
    2. While si is non-terminal and e < Nmax:
      1. e = e + 1.
      2. Select the action at = π(si).
      3. Observe the transition (si, at) → (sj, rij).
      4. Update the value function for the state si.
      5. Set si = sj.
  9. Update the policy π(s) to be greedy with respect to the value function V(s).

At this point, we can test the TD(0) algorithm in the checkerboard environment.

TD(0) in the checkerboard environment

The first step in testing TD(0) in the checkerboard environment is to define an initial random policy, and a value matrix with all elements equal to 0:

import numpy as np
policy = np.random.randint(0, nb_actions,
                           size=(height, width)).astype(np.uint8)
tunnel_values = np.zeros(shape=(height, width))

As we want to select a random starting point at the beginning of each episode, we need to define a helper function that must exclude the terminal states (all the constants are the same as previously defined):

import numpy as np

xy_grid = np.meshgrid(np.arange(0, height),
                      np.arange(0, width),
                      sparse=False)
xy_grid = np.array(xy_grid).T.reshape(-1, 2)

# All terminal cells (wells and ending state) are excluded from the starts
xy_final = list(zip(x_wells, y_wells))
xy_final.append((x_final, y_final))

xy_start = []

for x, y in xy_grid:
    if (x, y) not in xy_final:
        xy_start.append([x, y])

xy_start = np.array(xy_start)

def starting_point():
    xy = np.squeeze(xy_start[np.random.randint(0, xy_start.shape[0],
                                               size=1)])
    return xy[0], xy[1]

Now we can implement the function to evaluate a single episode (setting the maximum number of steps equal to 1,000 and the learning rate α to 0.25):

max_steps = 1000
alpha = 0.25

def episode():
    (i, j) = starting_point()
    x = y = 0
    e = 0

    while e < max_steps:
        e += 1

        # Follow the current greedy policy
        action = policy[i, j]

        # Compute the successor cell (0 = up, 1 = right,
        # 2 = down, 3 = left)
        if action == 0:
            if i == 0:
                x = 0
            else:
                x = i - 1
            y = j
        elif action == 1:
            if j == width - 1:
                y = width - 1
            else:
                y = j + 1
            x = i
        elif action == 2:
            if i == height - 1:
                x = height - 1
            else:
                x = i + 1
            y = j
        else:
            if j == 0:
                y = 0
            else:
                y = j - 1
            x = i

        # TD(0) update: V(s_i) += alpha * (r + gamma * V(s_j) - V(s_i))
        reward = tunnel_rewards[x, y]
        tunnel_values[i, j] += \
            alpha * (reward +
                     (gamma * tunnel_values[x, y]) -
                     tunnel_values[i, j])

        if is_final(x, y):
            break
        else:
            i = x
            j = y

The function to determine the greedy policy with respect to the values is the same as was already implemented in the previous examples; however, we report it to guarantee the consistency of the example:

def policy_selection():
    for i in range(height):
        for j in range(width):
            if is_final(i, j):
                continue

            values = np.zeros(shape=(nb_actions,))

            values[0] = (tunnel_rewards[i - 1, j] +
                         (gamma * tunnel_values[i - 1, j])) \
                if i > 0 else -np.inf
            values[1] = (tunnel_rewards[i, j + 1] +
                         (gamma * tunnel_values[i, j + 1])) \
                if j < width - 1 else -np.inf
            values[2] = (tunnel_rewards[i + 1, j] +
                         (gamma * tunnel_values[i + 1, j])) \
                if i < height - 1 else -np.inf
            values[3] = (tunnel_rewards[i, j - 1] +
                         (gamma * tunnel_values[i, j - 1])) \
                if j > 0 else -np.inf

            # Make the policy greedy with respect to the current values
            policy[i, j] = np.argmax(values).astype(np.uint8)

At this point, we can start a training cycle with 5,000 episodes:

n_episodes = 5000
for _ in range(n_episodes):
    episode()
    policy_selection()

The final value matrix is shown in the following chart:

Final value matrix with random starts

As in the previous examples, the final values are inversely proportional to the distance from the final positive state. Let's analyze the resulting policy to understand whether the algorithm converged to a consistent solution:

Final policy with random starts

As can be seen, the random choice of the starting state allowed the agent to find the best path independently of the initial condition. To better understand the advantage of this strategy, let's plot the final value matrix when the initial state is fixed to the cell (0, 0), corresponding to the upper-left corner:

Final value matrix with a fixed initial state (0, 0)

Without any further analysis, it's possible to see that many states have never been visited or have been visited only a few times, and the resulting policy is therefore extremely greedy with respect to the specific initial state. The blocks containing values equal to -1.0 indicate states where the agent often has to pick a random action because there's no difference in the values, hence it can be extremely difficult to solve the environment with a different initial state.

The resulting policy confirms this analysis:

Final policy with a fixed initial state (0, 0)

As it's possible to see, the agent is able to reach the final state only when the initial point allows it to cross the trajectory that starts from (0, 0). In all these cases, it's possible to recover the optimal policy, even if the paths are longer than the ones obtained in the previous example. Instead, states such as (0, 4) are clearly situations where the policy is lost. In other words, the agent acts without any knowledge or awareness and the probability of success converges to 0. As an exercise, I invite the reader to test this algorithm with different starting points (for example, a set of fixed ones) and higher α values. The goal is also to answer these questions: Is it possible to speed up the learning process? Is it necessary to start from all possible states in order to obtain a globally optimal policy? A small evaluation sketch that can support these experiments follows.
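As a starting point for the exercise (not part of the original listings; the helper simply reuses policy, is_final(), xy_start, and max_steps defined above), the fraction of non-terminal cells from which the learned greedy policy actually reaches the positive ending state can be measured as follows:

def success_rate():
    successes = 0

    for i0, j0 in xy_start:
        i, j = int(i0), int(j0)

        # Follow the greedy policy for at most max_steps moves
        for _ in range(max_steps):
            action = policy[i, j]
            if action == 0:
                i = max(i - 1, 0)
            elif action == 1:
                j = min(j + 1, width - 1)
            elif action == 2:
                i = min(i + 1, height - 1)
            else:
                j = max(j - 1, 0)
            if is_final(i, j):
                break

        if (i, j) == (x_final, y_final):
            successes += 1

    return successes / float(xy_start.shape[0])

# Retrain with random or fixed starts and compare the resulting success rates
print('Success rate: {:.2f}'.format(success_rate()))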

Summary

In this chapter, we introduced the most important RL concepts, focusing on the mathematical structure of an environment as an MDP, and on the different kinds of policy and how they can be derived from the expected reward obtained by an agent. In particular, we defined the value of a state as the expected future reward considering a sequence discounted by a factor, γ. In the same way, we introduced the concept of the Q function, which is the value of an action when the agent is in a specific state.

These concepts are directly employed in the policy iteration algorithm, which is based on a dynamic programming approach and assumes complete knowledge of the environment. The task is split into two stages: during the first one, the agent evaluates all the states given the current policy, while in the second one, the policy is updated in order to be greedy with respect to the new value function.

In this way, the agent is forced to always pick the action that leads to a transition that maximizes the obtained value.

We also analyzed a variant, called value iteration, that performs a single evaluation and selects the policy in a greedy manner. The main difference from the previous approach is that now the agent immediately selects the highest value, assuming that the result of this process is equivalent to a policy iteration. Indeed, it's easy to prove that, after infinite transitions, both algorithms converge on the optimal value function.

The last algorithm is called TD(0) and it's based on a model-free approach. In fact, in many cases, it's difficult to know all the transition probabilities and, sometimes, even all possible states are unknown. This method is based on the TD evaluation, which is performed directly while interacting with the environment. If the agent can visit all the states an infinite number of times (clearly, this is only a theoretical condition), the algorithm has been proven to converge to the optimal value function more quickly than other methods.

In the next chapter, we'll continue the discussion of RL algorithms, introducing some more advanced methods that can be immediately implemented using deep convolutional networks.

Further reading

  • R. A. Howard, Dynamic Programming and Markov Process, The MIT Press, 1960
  • Sutton R. S., Barto A. G., Reinforcement Learning, The MIT Press, 1998
  • Van Hasselt H., Wiering M. A., Convergence of Model-Based Temporal Difference Learning for Control, Proceedings of the 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007)