© Nimish Sanghi 2021
N. Sanghi, Deep Reinforcement Learning with Python, https://doi.org/10.1007/978-1-4842-6809-4_4

4. Model-Free Approaches

Nimish Sanghi, Bangalore, India

In the previous chapter, we looked at dynamic programming, where we knew the model dynamics p(s′, r | s, a) and used this knowledge to "plan" the optimal actions. This is also known as the planning problem. In this chapter, we will shift our focus and look at learning problems, i.e., a setup where the model dynamics are not known. We will learn value and action-value functions by sampling, i.e., by collecting experience while following some policy in the real world or by running the agent through a policy in simulation. There is another class of problems where the model-free approach is more applicable: problems where it is easier to sample than to calculate the transition dynamics. Consider, for example, finding the best policy to play a game like blackjack. There are many ways to reach a given score, depending on the cards seen so far and the cards still in the deck. It is practically impossible to calculate the exact transition probabilities from one state to another, but it is easy to sample states from the environment. To summarize, we use model-free methods either when we do not know the model dynamics or when we know the model but it is much more practical to sample than to calculate the transition dynamics.

In this chapter, we will look at two broad classes of model-free learning: Monte Carlo (MC) methods and temporal difference (TD) methods. We will first understand how policy evaluation works in a model-free setup and then extend our understanding to look at control, i.e., to find the optimal policy. We will also touch upon the important concept of bootstrapping and the exploration-versus-exploitation dilemma as well as off-policy versus on-policy learning. Initially, the focus will be to look at the MC and TD methods individually, after which we will look into additional concepts like n-step returns, importance sampling, and eligibility traces in a bid to combine the MC and TD methods into a common, more generic approach called TD(λ).

Estimation/Prediction with Monte Carlo

When we do not know the model dynamics, what do we do? Think back to a situation when you did not know something about a problem. What did you do? You experiment: you take some steps and observe how the situation responds. For example, say you want to find out whether a die or a coin is biased. You toss the coin or throw the die multiple times, observe the outcomes, and use those to form your opinion. In other words, you sample. The law of large numbers from statistics tells us that the average of samples is a good substitute for the expected value and that this average gets better as the number of samples increases. If you look back at the Bellman equations in the previous chapter, you will notice that they contained the expectation operator E[·]; e.g., the value of a state is v(s) = E[Gt | St = s]. Further, to calculate v(s), we used dynamic programming, which requires the transition dynamics p(s′, r | s, a). In the absence of knowledge of the model dynamics, what do we do? We just sample from the environment, observing the return starting from state S = s until the end of the episode. We then average the returns over all episode runs and use that average as an estimate of vπ(s) for the policy π that the agent is following. This, in a nutshell, is the approach of Monte Carlo methods: replace expected returns with the average of sample returns.

There are a few points to note. MC methods do not require knowledge of the model; the only thing required is that we be able to sample from it. We need to know the return from a state until termination, and hence we can use MC methods only on episodic MDPs in which every run eventually terminates. They will not work on nonterminating environments. The second point is that for a large MDP we can focus the sampling on only the part of the MDP that is relevant and avoid exploring irrelevant parts. Such an approach makes MC methods highly scalable for very large problems. A variant of the MC method called Monte Carlo tree search (MCTS) was famously used by DeepMind in training its Go game-playing agent, AlphaGo.

The third point is about the Markov assumption, i.e., that the past is fully encoded in the present state. In other words, knowing the present makes the future independent of the past. We talked about this property in Chapter 2. This Markov property forms the basis of the recursive nature of the Bellman equations, where a state's value depends only on the successor state values. However, under the MC approach, we observe the full return starting from a state S = s until termination. We are not depending on the values of successor states to calculate the current state value, so no Markov assumption is being made. The lack of a Markov assumption in MC methods makes them much more feasible for the class of problems known as POMDPs (partially observable MDPs). In a POMDP environment, the agent gets only partial state information, which is known as an observation.

Moving on, let's look at a formal way to estimate the state values for a given policy. We let the agent start a new episode and observe the return from the time the agent is in state S = s for the first time in that episode until the episode ends. Many episode runs are carried out, and the average of the returns across episodes is taken as an estimate of vπ(S = s). This is known as the first-visit MC method. Please note that, depending on the dynamics, an agent may visit the same state S = s again at a later step within the same episode before termination. However, in the first-visit MC method, we take the return only from the first visit in an episode until the end of the episode. There is another variant in which we average the returns from every visit to that state until the end of the episode. This is known as the every-visit MC method.
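To make the distinction concrete, here is a small illustrative helper (not part of the book's listings; the function and variable names are ours) that computes both kinds of return estimates for a single state from one recorded episode, assuming we have the list of visited states and the rewards observed after each step.

import numpy as np

def returns_for_state(states, rewards, s, gamma=1.0):
    # discounted return G_t from each time step t to the end of the episode
    G = 0.0
    returns_from_t = [0.0] * len(rewards)
    for t in range(len(rewards) - 1, -1, -1):
        G = gamma * G + rewards[t]
        returns_from_t[t] = G
    # time steps at which state s was visited
    visits = [t for t, st in enumerate(states) if st == s]
    first_visit_return = returns_from_t[visits[0]] if visits else None
    every_visit_returns = [returns_from_t[t] for t in visits]
    return first_visit_return, every_visit_returns

The first-visit method averages first_visit_return across episodes, while the every-visit method averages all entries of every_visit_returns across episodes.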

Figure 4-1 shows the backup diagram for the MC method. In DP we take a full sweep covering all the actions possible from a state as well as all the possible transitions from the state-action pair (S = s, A = a) to a new state. We go only one level deep, from state S = s to A = a and then to the next state S = s′. Compared to this, in the MC method the backup covers a full sample trajectory from the current state S = s to the terminal state. It does not cover all the branching possibilities; rather, it covers only the single path that got sampled from the start state S = s to the terminal state.
Figure 4-1

Backup diagram of MC methods as compared to Bellman equation–based DP backup

Let's now look at the pseudocode for the first-visit version, which is given in Figure 4-2. We input the policy π that the agent is currently following. We initialize two arrays: one to hold the current estimate V(s) and the other to hold the number of visits N(s) to each state S = s. Multiple episodes are executed, and we update V(s) as well as N(s) for the states the agent visits in each episode, updating only on the first visit in the "first visit" version and on every visit in the "every visit" version. The pseudocode covers only the "first visit" variant. The "every visit" variant is easy to implement: just drop the condition "Unless St appears in S0, S1, …." As per the law of large numbers, the statistical law on which Monte Carlo simulations are based, V(s) will converge to the true vπ(s) as the number of visits to each state goes to infinity.

First Visit MC Prediction
Figure 4-2

First-visit MC prediction for estimating vπ(s). The pseudocode uses the online version to update the values as samples are received

We store the cumulative total and the count of first visits to a state. The average is calculated by dividing the total by the count. Every time a state is visited, the following update is carried out:
$$ N(s) = N(s) + 1; \qquad S(s) = S(s) + G; \qquad V(s) = S(s)/N(s) $$
With a minor change, we could look at the update in another way, without needing the cumulative totals S(s). In this alternative formulation, we have a single array V(s) that gets updated on every visit directly, without needing to divide the total by the count. The derivation of this update rule is as follows:
$$ N(s)_{n+1} = N(s)_n + 1 $$
$$ V(s)_{n+1} = \left[ S(s)_n + G \right] / N(s)_{n+1} $$
$$ = \left[ V(s)_n \times N(s)_n + G \right] / N(s)_{n+1} $$
$$ = V(s)_n + \frac{1}{N(s)_{n+1}} \left[ G - V(s)_n \right] $$
(4.1)

The difference [G – V(s)n] can be viewed as an error between the latest sampled return G and the current estimate V(s)n. This error is then used to update the current estimate by adding 1/N ∗ error to it. As the number of visits increases, a new sample has less and less influence on the estimate of V(s), because the multiplication factor 1/N goes to zero as N becomes very large.

Sometimes, instead of using a diminishing factor 1/N, we can use a constant α as the multiplication factor in front of the difference [G – V(s)n].
$$ V_{n+1} = V_n + \alpha \left( G - V_n \right) $$
(4.2)

The constant multiplication factor approach is more appropriate for problems that are nonstationary or when we want to give a constant weight to all the errors. A situation like this can happen when the old estimate Vn may not be very accurate.
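As a minimal sketch (with hypothetical function names, not from the book's notebooks), the two update rules of equations (4.1) and (4.2) look like this in code; the first uses the diminishing 1/N step size, the second a constant step size α.

def running_average_update(V, N, G):
    # incremental sample average, equation (4.1)
    N = N + 1
    V = V + (1.0 / N) * (G - V)
    return V, N

def constant_alpha_update(V, G, alpha=0.1):
    # constant step size, equation (4.2); recent returns keep a fixed weight
    return V + alpha * (G - V)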

Let’s now look at the implementation of MC value prediction. Listing 4-1 shows the code, and the full code is available in file listing4_1.ipynb.
import numpy as np

def mc_policy_eval(policy, env, discount_factor=1.0, episode_count=100):
    # Start with (all 0) state value array and a visit count of zero
    V = np.zeros(env.nS)
    N = np.zeros(env.nS)
    i = 0
    # run multiple episodes
    while i < episode_count:
        #collect samples for one episode
        episode_states = []
        episode_returns = []
        state = env.reset()
        episode_states.append(state)
        while True:
            action = np.random.choice(env.nA, p=policy[state])
            (state, reward, done, _) = env.step(action)
            episode_returns.append(reward)
            if not done:
                episode_states.append(state)
            else:
                break
        #update state values
        G = 0
        count = len(episode_states)
        for t in range(count-1, -1, -1):
            s, r = episode_states[t], episode_returns[t]
            G = discount_factor * G + r
            if s not in episode_states[:t]:
                N[s] += 1
                V[s] = V[s] + 1/N[s] * (G-V[s])
        i = i+1
    return np.array(V)
Listing 4-1

MC Value Prediction Algorithm for Estimation

The code in Listing 4-1 is a straightforward implementation of the pseudocode in Figure 4-2. It implements the online version of the update explained in equation (4.1), i.e., N[s] += 1; V[s] = V[s] + 1/N[s]*(G-V[s]). The code implements the "first visit" version but can be converted to "every visit" with a very small tweak: we just need to drop the check "if s not in episode_states[:t]" so that the update is carried out at every step, as sketched below.
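For reference, the inner update loop of the "every visit" variant would look like the following sketch; it is identical to the loop in Listing 4-1 except that the membership check is gone.

# every-visit variant of the update loop in Listing 4-1
G = 0
count = len(episode_states)
for t in range(count-1, -1, -1):
    s, r = episode_states[t], episode_returns[t]
    G = discount_factor * G + r
    N[s] += 1
    V[s] = V[s] + 1/N[s] * (G-V[s])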

To ensure convergence to the true state values, we need each state to be visited enough times, infinitely often in the limit. As shown in the results at the end of the Python notebook, the state values do not converge very well with 100 episodes. However, with 10,000 episodes, the values converge well and match those produced by the DP method given in Listing 3-2.

Bias and Variance of MC Prediction Methods

Let's now look at the pros and cons of "first visit" versus "every visit." Do both of them converge to the true underlying V(s)? Do they fluctuate a lot while converging? Does one converge faster to the true value? Before we answer these questions, let's first review the basic concept of the bias-variance trade-off that we see in all statistical model estimation, e.g., in supervised learning.

Bias refers to the property of the model to converge to the true underlying value that we are trying to estimate, in our case vπ(s). Some estimators are biased, meaning they are not able to converge to the true value due to their inherent lack of flexibility, i.e., being too simple or restricted for a given true model. At the same time, in some other cases, models have bias that goes down to zero as the number of samples grows.

Variance refers to the model estimate being sensitive to the specific sample data being used. This means the estimate value may fluctuate a lot and hence may require a large data set or trials for the estimate average to converge to a stable value.

Models that are very flexible have low bias as they can fit any configuration of a data set. At the same time, due to that flexibility, they can overfit to the data, making the estimates vary a lot as the training data changes. On the other hand, simpler models have high bias. Such models, due to their inherent simplicity and restrictions, may not be able to represent the true underlying model. But they also have low variance as they do not overfit. This is known as the bias-variance trade-off and can be presented in a graph as shown in Figure 4-3.
Figure 4-3

Bias variance trade-off. Model complexity is increasing toward the right on the x-axis. Bias starts off high when the model is restricted and drops off as the model becomes flexible. Variance shows a reverse trend with model complexity

Comparing "first visit" to "every visit": first visit is unbiased but has high variance. Every visit has a bias that goes down to zero as the number of trials increases. In addition, every visit has lower variance and usually converges to the true value estimates faster than first visit.

Control with Monte Carlo

Let's now talk about control in a model-free setup, i.e., finding the optimal policy without knowing the model dynamics. As a refresher, recall the generalized policy iteration (GPI) introduced in Chapter 3. In GPI, we iterate between two steps: the first step is to find the state values for a given policy, and the second step is to improve the policy using greedy optimization. We will follow the same GPI approach for control under MC, with some tweaks to account for the fact that we are in a model-free world with no access to, or knowledge of, the transition dynamics.

In Chapter 3, we looked at state values, v(s). However, in the absence of the transition dynamics, state values alone are not sufficient. For the greedy improvement step, we need access to the action values q(s, a); i.e., we need to know q(S = s, a) for all possible actions a in state S = s. Only with that information can we apply greedy maximization to pick the best action, i.e., argmax_a q(s, a).

We have another complication compared to DP. The agent follows a policy while generating the samples, and such a policy may result in many state-action pairs never being visited, even more so if the policy is deterministic. If the agent never visits a state-action pair, it does not know all the q(s, a) for that state, and hence it cannot find the action with the maximum q-value. One way to solve this is to ensure enough exploration by exploring starts, i.e., ensuring that the agent starts an episode from a random state-action pair and, over the course of many episodes, covers each state-action pair enough times, infinitely often in the limit.

Figure 4-4 shows the GPI diagram with the change from v-values to q-values. The evaluation step is now the MC prediction step introduced in the previous section. Once the q-values stabilize, greedy maximization can be applied to obtain a new policy. The policy improvement theorem ensures that the new policy will be better than or at least as good as the old policy. This approach to GPI will be a recurring theme: based on the setup, the evaluation step changes, while the improvement step invariably continues to be greedy maximization.
Figure 4-4

Iteration between the two steps. The first step is evaluation to get the state-action values in sync with the policy being followed. The second step is that of policy improvement to do a greedy maximization for actions

The assumption of "exploring starts" is neither very practical nor efficient. It is not practical because, in many scenarios such as training a self-driving car, the agent does not get to choose the start condition. It is not efficient because it may be wasteful for the agent to visit each state-action pair infinitely often (in the limit). We still need continued exploration for the agent to visit all the actions in the states that are visited by the current policy. This is achieved by using an ε-greedy policy.

In an ε-greedy policy, the agent takes the greedy action (the one with the maximum q-value) with probability 1-ε, and with probability ε it picks an action uniformly at random, so that each action receives an additional probability of ε/|A|. This ensures that the agent continues to explore nonmaximizing actions. In other words, the agent exploits its knowledge with probability 1-ε and explores with probability ε.
$$ \pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A|} & \text{for } a = \operatorname{argmax}_a Q(s, a) \\ \dfrac{\varepsilon}{|A|} & \text{otherwise} \end{cases} $$
(4.3)

The action with the maximum q-value gets picked with probability 1-ε from the greedy max plus ε/|A| as part of random exploration, while all other actions are picked with probability ε/|A|. As the agent learns over multiple episodes, the value of ε can be reduced slowly toward zero to reach an optimal greedy policy.
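As a small illustration of equation (4.3) (the helper name here is ours, not from the book's code), the ε-greedy action probabilities for one state can be computed as follows.

import numpy as np

def epsilon_greedy_probs(q_values, eps):
    n = len(q_values)
    probs = np.ones(n) * eps / n           # every action gets eps/|A|
    probs[np.argmax(q_values)] += 1 - eps  # greedy action gets the remaining 1-eps
    return probs

# e.g., q_values = [1.0, 2.0, 0.5, 2.5] and eps = 0.2 gives [0.05, 0.05, 0.05, 0.85]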

Let's make one more refinement to the estimation/prediction step of the iteration. In the previous section, we saw that even for a simple 4×4 grid MDP, we needed on the order of 10,000 episodes for the values to converge. Refer to Figure 3-9, which shows the convergence of GPI: the MC prediction step brings the v-values or q-values in sync with the current policy. However, in the previous chapter, we also saw that there was no need for the agent to go all the way to convergence of the values. Similarly, we could run MC prediction followed by policy improvement on an episode-by-episode basis. This removes the need for a large number of iterations in the estimation/prediction step, thereby making the approach scalable for large MDPs. In contrast to the convergence diagram in Figure 3-9, such an approach produces convergence as shown in Figure 4-5.
Figure 4-5

Iteration between the two steps. The first step is the MC prediction/evaluation for a single step to move the q-values in the direction of the current policy. The second step is that of policy improvement to do a ε-greedy maximization for actions

Combining all the previous tweaks, we have a practical algorithm for Monte Carlo–based control for optimal policy learning. This is called greedy in the limit with infinite exploration (GLIE). We will use the every-visit version in the q-value prediction step, but it takes only a minor tweak, just like the one in MC prediction, to convert the algorithm in Figure 4-6 into the first-visit variant.

GLIE For Policy Optimization
Figure 4-6

Every-visit (GLIE) MC control for policy optimization

Let’s now look at the actual implementation given in listing4_2.ipynb. Listing 4-2 reproduces the relevant parts of the code. The implementation follows the pseudocode. We have some extra functions like argmax_a() that help us find the action with max Q(s,a) for a given state. Another function, get_action(state), returns a ε-greedy action for the current policy. You are encouraged to make tweaks and convert it into the “first visit” variant. You may also want to then compare the result of “first visit” versus “every visit” MC control.
import numpy as np

def GLIE(env, discount_factor=1.0, episode_count=100):
    """
    Find optimal policy given an environment.
    Returns:
        V: vector of length env.nS representing the value function.
        policy: vector of length env.nS giving the greedy action for each state.
    """
    # Start with (all 0) state value array and state-action matrix.
    # also initialize visit count to zero for the state-action visit count.
    V = np.zeros(env.nS)
    N = np.zeros((env.nS, env.nA))
    Q = np.zeros((env.nS, env.nA))
    #random policy
    policy = [np.random.randint(env.nA) for _ in range(env.nS)]
    k = 1
    eps = 1
    def argmax_a(arr):
        """
        Return idx of max element in an array.
        Break ties uniformly.
        """
        max_idx = []
        max_val = float('-inf')
        for idx, elem in enumerate(arr):
            if elem == max_val:
                max_idx.append(idx)
            elif elem > max_val:
                max_idx = [idx]
                max_val = elem
        return np.random.choice(max_idx)
    def get_action(state):
        if np.random.random() < eps:
            return np.random.choice(env.nA)
        else:
            return argmax_a(Q[state])
    # run multiple episodes
    while k <= episode_count:
        #collect samples for one episode
        episode_states = []
        episode_actions = []
        episode_returns = []
        state = env.reset()
        episode_states.append(state)
        while True:
            action = get_action(state)
            episode_actions.append(action)
            (state, reward, done, _) = env.step(action)
            episode_returns.append(reward)
            if not done:
                episode_states.append(state)
            else:
                break
        #update state-action values
        G = 0
        count = len(episode_states)
        for t in range(count-1, -1, -1):
            s, a, r = episode_states[t], episode_actions[t], episode_returns[t]
            G = discount_factor * G + r
            N[s, a] += 1
            Q[s, a] = Q[s, a] + 1/N[s, a] * (G-Q[s, a])
        #Update policy and optimal value
        k = k+1
        eps = 1/k
        #uncomment this to keep a fixed, higher exploration rate
        #for the first 100 episodes before letting epsilon decay as 1/k
        #if k <= 100:
        #    eps = 0.02
        for s in range(env.nS):
            best_action = argmax_a(Q[s])
            policy[s] = best_action
            V[s] = Q[s,best_action]
    return np.array(V), np.array(policy)
Listing 4-2

GLIE MC Control Algorithm

So far in this chapter we have studied on-policy algorithms for prediction and control. In other words, the same policy is used to generate samples as well as to improve the policy. Policy improvement is done with respect to the q-values of the same ε-greedy policy. We need exploration to find Q(s, a) for all state-action pairs, which is achieved by using ε-greedy policies. However, an ε-greedy policy with a fixed ε is, in the limit, suboptimal; we need to exploit more as our understanding of the environment grows with more and more episodes. We have also seen that while theoretically the policy will converge to an optimal policy, at times this requires careful control of the ε value. The other big disadvantage of MC methods is that we need episodes to finish before we can update the q-values and carry out policy optimization. Therefore, MC methods can only be applied to environments that are episodic, where episodes terminate. Starting in the next section, we will look at a different class of algorithms called temporal difference methods, which combine the advantages of DP (single time-step updates) and MC (no need to know the system dynamics). However, before we dive deep into TD methods, let's go through a short section to introduce and compare on-policy versus off-policy learning.

Off-Policy MC Control

In GLIE, we saw that to explore enough we needed to use ε-greedy policies so that all state-action pairs are visited often enough in the limit. The policy learned at the end of each loop is used to generate the episodes for the next iteration of the loop. We are using the same policy to explore as the one being maximized. Such an approach is called on-policy, where samples are generated from the same policy that is being optimized.

There is another approach in which the samples are generated using a policy that is more exploratory, with a higher ε, while the policy being optimized is one that has a lower ε or may even be fully deterministic. Such an approach of learning with a policy different from the one being optimized is called off-policy learning. The policy used to generate the samples is called the behavior policy, and the one being learned (maximized) is called the target policy. Figure 4-7 gives the pseudocode of the off-policy MC control algorithm.

Off Policy MC Control Optimization
Figure 4-7

Off-policy MC control for policy optimization

Temporal Difference Learning Methods

Refer to Figure 4-1 to study the backup diagrams of the DP and MC methods. In DP, we back up the values over only one step, using values from the successor states to estimate the current state value. We take an expectation over the action probabilities of the policy being followed and then, from the (s, a) pair, over all possible rewards and successor states:
$$ v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_\pi(s') \right] $$

The value of a state vπ(s) is estimated from the current estimates of the successor-state values vπ(s′). This is known as bootstrapping: the estimate is based on another set of estimates. The two sums are the branch-off nodes in the DP backup diagram in Figure 4-1. Compared to DP, MC is based on starting from a state and sampling the outcomes under the current policy the agent is following. The value estimates are averages over multiple runs. In other words, the sum over model transition probabilities is replaced by averages, and hence the backup diagram for MC is a single long path from one state to the terminal state. The MC approach allowed us to build a scalable learning approach while removing the need to know the exact model dynamics. However, it created two issues: the MC approach works only for episodic environments, and the updates happen only after an episode terminates. DP had the advantage of using the estimate of the successor state to update the current state value without waiting for an episode to finish.

Temporal difference learning is an approach that combines the benefits of both DP and MC, using bootstrapping from DP and the sample-based approach from MC. The update equation for TD is as follows:
$$ V(s) = V(s) + \alpha \left[ R + \gamma \, V(s') - V(s) \right] $$
(4.4)

The current estimate of the total return for state S = s, i.e., Gt, is now obtained by bootstrapping from the current estimate of the successor state s′ observed in the sample run. In other words, Gt in equation (4.2) is replaced by the estimate R + γ ∗ V(s′). Compared to this, in the MC method Gt was the discounted total return of the sample run.

The TD approach described is known as TD(0), and it is a one-step estimate. The reason for calling it TD(0) will become clearer toward the end of the chapter when we talk about TD(λ). For clarity’s sake, let’s look at the pseudocode for value estimation using TD(0) given in Figure 4-8.

TD(0) For Estimation
Figure 4-8

TD(0) policy estimation

Having seen all three approaches, DP, MC, and TD, let’s put the backup diagrams of all three approaches together, as shown in Figure 4-9. TD(0) is similar to DP as it uses bootstrapping, which is an estimate of the next state values to estimate the current state values. TD(0) is similar to MC as it samples the episodes and uses the observed reward and the next state to update the estimates.
Figure 4-9

Backup diagram of DP, MC, and TD(0) compared

Let's now introduce another quantity that you are likely to see again and again in the reinforcement learning literature. In equation (4.4), the quantity [R + γ ∗ V(s′)] is the revised estimate of the total return from state S = s, based on a one-step backup from the successor state. Further, the V(s) on the right side of the assignment in equation (4.4) is the current estimate. The update moves the estimate from V(s) to V(s) + α ∗ (difference), where the difference δt is given by the following expression:
$$ \delta_t = R_{t+1} + \gamma \, V(S_{t+1}) - V(S_t) $$
(4.5)

The difference δt is known as the TD error. It is the error (difference) between the revised estimate based on the reward Rt+1 plus the discounted next-state value V(St+1) and the current estimate V(St). The value of the state is moved by a proportion (the learning rate, or step size, α) of this error δt.

TD has an advantage over DP as it does not require knowledge of the model (the transition function). TD methods have an advantage over MC as they can update the state value at every step; i.e., they can be used in an online setup in which we learn as the situation unfolds, without waiting for the end of the episode.
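To make this concrete, here is a minimal TD(0) prediction sketch along the lines of Figure 4-8. It assumes the same tabular environments and older Gym API used in the chapter's listings (env.nS, env.nA, env.step returning a 4-tuple) and a stochastic policy stored as a matrix of action probabilities; the function name is ours.

import numpy as np

def td0_policy_eval(policy, env, alpha=0.1, discount_factor=1.0, episode_count=100):
    V = np.zeros(env.nS)
    for _ in range(episode_count):
        state = env.reset()
        while True:
            action = np.random.choice(env.nA, p=policy[state])
            next_state, reward, done, _ = env.step(action)
            # bootstrap from the current estimate of the successor state
            target = reward + (0 if done else discount_factor * V[next_state])
            V[state] = V[state] + alpha * (target - V[state])
            state = next_state
            if done:
                break
    return V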

Temporal Difference Control

This section will start taking you into the realm of the algorithms used in the real RL world. In the remaining sections of the chapter, we will look at various methods used in TD learning. We will start with a simple one-step on-policy learning method called SARSA. This will be followed by a powerful off-policy technique called Q-learning. We will study some foundational aspects of Q-learning in this chapter, and in the next chapter we will integrate deep learning with Q-learning, giving us a powerful approach called Deep Q Networks (DQN). Using DQN, you will be able to train game-playing agents on an Atari simulator. In this chapter, we will also cover a variant called expected SARSA, which can be run as on-policy or off-policy learning. We will then talk about the issue of maximization bias in Q-learning, taking us to double Q-learning. All the variants of Q-learning become very powerful when combined with deep learning to represent the state space, which will form the bulk of the next chapter. Toward the end of this chapter, we will cover additional concepts such as experience replay, which makes off-policy algorithms efficient with respect to the number of samples needed to learn an optimal policy. We will then talk about a powerful and somewhat involved approach called TD(λ) that combines MC and TD methods on a continuum. Finally, we will look at an environment with a continuous state space and see how we can discretize the state values and apply the previously mentioned TD methods. That exercise will demonstrate the need for the approaches we take up in the next chapter, covering function approximation and deep learning for state representation. After Chapters 5 and 6 on deep learning and DQN, we will show another approach called policy optimization, which revolves around directly learning the policy without needing to find the optimal state/action values.

We have been using the 4×4 grid world so far. We will now look at a few more environments that will be used in the rest of the chapter. We will write the agents in an encapsulated way so that the same agent/algorithm could be applied in various environments without any changes.

The first environment we will use is a variant of the grid world; it is part of the Gym library and is called the cliff-walking environment. In this environment, we have a 4×12 grid world, with the bottom-left cell being the start state S and the bottom-right cell being the goal state G. The rest of the bottom row forms a cliff; stepping on it earns a reward of -100, and the agent is put back at the start state. Every other step earns a reward of -1 until the agent reaches the goal state. Similar to the 4×4 grid world, the agent can take a step in any of the four directions [UP, RIGHT, DOWN, LEFT]. The episode terminates when the agent reaches the goal state. Figure 4-10 depicts the setup.
Figure 4-10

4×12 cliff world. There is a reward of -100 for stepping on the cliff and -1 for other transitions. The goal is to start from S and reach G with the best possible total reward

The next environment we will look at is the "taxi problem." There are four locations marked R, G, Y, and B. When an episode starts, a passenger is standing at one of these four locations chosen at random, and one of the four locations is chosen at random as the destination. There are 25 possible grid cells for the taxi to be in, the passenger can be in one of five places (one of the four marked locations or inside the taxi), and there are four destinations. Together these combinations give a state space of 25 × 5 × 4 = 500 states. The state is expressed as a tuple (taxi_row, taxi_col, passenger_location, destination). There are six deterministic actions: the taxi can move North, South, East, or West, and there are two further actions to pick up or drop off the passenger. The rewards are -1 per time step, +20 for a successful drop-off, and -10 for a pickup or drop-off at the wrong location. The episode ends when the passenger is picked up and dropped off successfully at the destination. Figure 4-11 shows a schematic of the environment.
Figure 4-11

5×5 taxi problem. There is a reward of -1 per time step, +20 for a successful drop-off, and -10 for an illegal pickup or drop-off. An episode ends when the passenger is dropped off at the destination

We will now look at the third environment, known as CartPole, which we covered in Chapter 2. Its state space is continuous and comprises four observations: [cart position, cart velocity, pole angle, pole angular velocity]. The agent has two discrete actions: push the cart to the left or push the cart to the right. The reward is +1 at every time step, and the agent wants to maximize the total reward by keeping the pole balanced for the longest possible time. The episode terminates once the pole angle exceeds 12 degrees in either direction, once the cart position is more than 2.4 from the center (i.e., either < -2.4 or > 2.4), or once the agent has balanced the pole for 200 time steps. Figure 4-12 shows a snapshot of the environment in action.
Figure 4-12

Cart pole problem. The objective is to balance the pole vertically for the maximum number of time steps with a reward of +1 for every time step
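Before moving on, the snippet below is one way to instantiate and inspect these three environments; the exact environment IDs and the reset/step API may differ slightly depending on the installed Gym version, so treat it as a quick sanity check rather than the book's setup code.

import gym

for env_id in ["CliffWalking-v0", "Taxi-v3", "CartPole-v1"]:
    env = gym.make(env_id)
    print(env_id, env.observation_space, env.action_space)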

Having explained the various environments, let’s now jump into solving these with TD algorithms, starting first with SARSA, an on-policy method.

On-Policy SARSA

Like the MC control methods, we will again leverage GPI. We will use a TD-driven approach for the policy evaluation/prediction step and will continue to use greedy maximization for the policy improvement step. Just like with MC, we need to explore enough and, in the limit, visit all the states an infinite number of times to find an optimal policy. As with MC, we can use an ε-greedy policy and slowly reduce ε toward zero, i.e., bring the exploration down to zero in the limit.

The TD setup is model-free; i.e., we have no prior knowledge of the transitions. At the same time, to be able to maximize the return by choosing the right actions, we need to know the state-action values Q(S, A). We can reformulate TD estimation from equation (4.4) to the one in (4.6), essentially replacing V(s) with Q(s, a). Both setups are Markov processes; equation (4.4) focuses on state-to-state transitions, while the focus now is on transitions from one state-action pair to the next state-action pair.
$$ Q(S_t, A_t) = Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right] $$
(4.6)
Similar to equation (4.5), the TD error is now given in the context of q-values.
$$ \delta_t = R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) $$
(4.7)

To carry out an update as per equation (4.6), we need the five values St, At, Rt+1, St+1, and At+1. This is the reason the approach is called SARSA (state, action, reward, state, action). We follow an ε-greedy policy to generate the samples, update the q-values using (4.6), and then, based on the updated q-values, create a new ε-greedy policy. The policy improvement theorem guarantees that the new policy will be better than the old policy unless the old policy was already optimal. Of course, for the guarantee to hold, we need to bring the exploration probability ε down to zero in the limit.

Please also note that for all episodic tasks, the terminal states have Q(S, A) equal to zero; i.e., once in a terminal state, the agent cannot transition anywhere and will keep getting a reward of zero. This is another way of saying that the episode ends, and Q(S, A) is zero for all terminal states. Therefore, when St+1 is a terminal state, equation (4.6) has Q(St+1, At+1) = 0, and the update equation becomes the following:
$$ Q(S_t, A_t) = Q(S_t, A_t) + \alpha \left[ R_{t+1} - Q(S_t, A_t) \right] $$
(4.8)

Let's now look at the pseudocode of the SARSA algorithm; see Figure 4-13.

SARSA On-Policy TD Control
Figure 4-13

SARSA, on-policy TD control

Let’s now walk through the code from listing4_3.ipynb, which implements SARSA. The code has a class named SARSAAgent that implements the SARSA learning agent. It has two key functions: update(), which takes the values of state, action, reward, next_state, and next_action, as well as the done flag. It updates the q-value using the TD update equations (4.6) and (4.8). The other function is get_action(state), which returns a random action with probability ε and argmax Q(S,A) with probability 1-ε. There is a generic function train_agent() outside the class that trains the agent in a given environment. There are two helper functions: plot_rewards() to plot the reward per episode as the training progresses and print_policy() to print the optimal policy learned. The listing shows only the code for SARSAAgent. For the rest of the code, refer to listing4_3.ipynb (Listing 4-3).
# SARSA Learning agent class
import numpy as np
from collections import defaultdict
class SARSAAgent:
    def __init__(self, alpha, epsilon, gamma, get_possible_actions):
        self.get_possible_actions = get_possible_actions
        self.alpha = alpha
        self.epsilon = epsilon
        self.gamma = gamma
        self._Q = defaultdict(lambda: defaultdict(lambda: 0))
    def get_Q(self, state, action):
        return self._Q[state][action]
    def set_Q(self, state, action, value):
        self._Q[state][action] = value
    # carryout SARSA updated based on the sample (S, A, R, S', A')
    def update(self, state, action, reward, next_state, next_action, done):
        if not done:
            td_error = reward + self.gamma * self.get_Q(next_state, next_action) \
                       - self.get_Q(state, action)
        else:
            td_error = reward - self.get_Q(state,action)
        new_value = self.get_Q(state,action) + self.alpha * td_error
        self.set_Q(state, action, new_value)
    # get argmax for q(s,a)
    def max_action(self, state):
        actions = self.get_possible_actions(state)
        best_action = []
        best_q_value = float("-inf")
        for action in actions:
            q_s_a = self.get_Q(state, action)
            if q_s_a > best_q_value:
                best_action = [action]
                best_q_value = q_s_a
            elif  q_s_a == best_q_value:
                best_action.append(action)
        return np.random.choice(np.array(best_action))
    # choose action as per ε-greedy policy
    def get_action(self, state):
        actions = self.get_possible_actions(state)
        if len(actions) == 0:
            return None
        if np.random.random() < self.epsilon:
            a = np.random.choice(actions)
            return a
        else:
            a = self.max_action(state)
            return a
Listing 4-3

SARSA On-Policy TD Control

Figure 4-14 shows the graph of per-episode reward as learning progresses, along with the optimal policy learned. We can see that the reward soon gets close to the optimal value. The policy learned is to avoid the cliff by first going all the way up and then taking a right turn to walk toward the goal. This is surprising, as we would have expected the agent to learn the policy of skirting along the cliff to reach the goal, which would have been shorter by four steps compared to the path the agent learned.
Figure 4-14

Reward graph during learning under SARSA and policy learned by agent

However, as our policy continues to explore using ε-greedy, there is always a small chance that, when the agent is in a cell next to the cliff, it takes a random action and falls off the cliff. This demonstrates the issue of continued exploration even after enough has been learned about the environment, i.e., when the same ε-greedy policy is used for sampling as well as for improvement. We will see how this issue is avoided in Q-learning, in which off-policy learning is carried out: an exploratory behavior policy generates the training samples, while a deterministic policy is learned as the optimal target policy.

Next, we will study our first off-policy TD algorithm called Q-learning.

Q-Learning: An Off-Policy TD Control

In SARSA, we used samples with the values S, A, R, S′, and A′ that were generated by the policy being followed. Action A′ from state S′ was produced using the ε-greedy policy, the same policy that was then improved in the "improvement" step of GPI. However, instead of generating A′ from the policy, what if we looked at all the values Q(S′, a) and chose the action a that maximizes Q(S′, a) across the actions available in state S′? We could continue to generate the samples (S, A, R, S′) (notice there is no A′ as a fifth value in this tuple) using an exploratory policy like ε-greedy, but improve the policy by choosing A′ to be argmax_a Q(S′, a). This small change in approach creates a new way to learn the optimal policy called Q-learning. It is no longer on-policy learning; rather, it is an off-policy control method in which the samples (S, A, R, S′) are generated by an exploratory policy, while we maximize over Q(S′, a) to find a deterministic optimal target policy.

We are using exploration with the ε-greedy policy to generate the samples (S, A, R, S′). At the same time, we are exploiting the existing knowledge by finding the Q-maximizing action argmax_a Q(S′, a) in state S′. We will have a lot more to say about this trade-off between exploration and exploitation in Chapter 9.

The update rule for q-values is now defined as follows:    
$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right] $$
(4.10)

Comparing the previous equation with the SARSA update in equation (4.6), you will notice the subtle difference between the two approaches and how it makes Q-learning an off-policy method. The off-policy behavior of Q-learning is handy, and it makes the method sample efficient. We will touch upon this in a later section when we talk about experience replay, or the replay buffer. Figure 4-15 gives the pseudocode of Q-learning.

Q-learning Off-Policy TD Control
Figure 4-15

Q-learning, off-policy TD control

The code in listing4_4.ipynb implements Q-learning. It has a Q-learning agent that is similar to the SARSA agent in Listing 4-3 except for the change in the update rule, which follows (4.10). The learning function also has a minor change from SARSA: we no longer need the fifth value, i.e., the next action A′ taken from S′. We now choose the best action a while in state S′ to improve the policy, similar to the Bellman optimality equation. The rest of the implementation carries over from SARSA. Listing 4-4 shows the implementation of QLearningAgent. Please note carefully the changes in the update rule as compared to that of the SARSA agent.
# Q- Learning agent class
import numpy as np
from collections import defaultdict
class QLearningAgent:
    def __init__(self, alpha, epsilon, gamma, get_possible_actions):
        self.get_possible_actions = get_possible_actions
        self.alpha = alpha
        self.epsilon = epsilon
        self.gamma = gamma
        self._Q = defaultdict(lambda: defaultdict(lambda: 0))
    def get_Q(self, state, action):
        return self._Q[state][action]
    def set_Q(self, state, action, value):
        self._Q[state][action] = value
    # Q learning update step
    def update(self, state, action, reward, next_state, done):
        if not done:
            best_next_action = self.max_action(next_state)
            td_error = reward + self.gamma * self.get_Q(next_state, best_next_action) - self.get_Q(state,action)
        else:
            td_error = reward - self.get_Q(state,action)
        new_value = self.get_Q(state,action) + self.alpha * td_error
        self.set_Q(state, action, new_value)
    # get best A for Q(S,A) which maximizes the Q(S,a) for actions in state S
    def max_action(self, state):
        actions = self.get_possible_actions(state)
        best_action = []
        best_q_value = float("-inf")
        for action in actions:
            q_s_a = self.get_Q(state, action)
            if q_s_a > best_q_value:
                best_action = [action]
                best_q_value = q_s_a
            elif  q_s_a == best_q_value:
                best_action.append(action)
        return np.random.choice(np.array(best_action))
    # choose action as per ε-greedy policy for exploration
    def get_action(self, state):
        actions = self.get_possible_actions(state)
        if len(actions) == 0:
            return None
        if np.random.random() < self.epsilon:
            a = np.random.choice(actions)
            return a
        else:
            a = self.max_action(state)
            return a
Listing 4-4

Q Learning Off-Policy TD Control

Let's look at Q-learning applied to the cliff world environment. The reward per episode improves with training and reaches the optimal value of -13, compared to -17 under SARSA. As shown in Figure 4-16, a better policy under Q-learning is evident: the agent learns to navigate to the goal through the cells in the second row, just above the cliff. Under Q-learning, the agent learns a deterministic policy, and our environment is also deterministic; in other words, if the agent takes an action to move right, it will definitely move right and has zero chance of moving in any other direction. Therefore, the agent learns the optimal policy of going toward the goal by just grazing along the cliff.
Figure 4-16

Reward graph during learning under Q-learning and policy learned by agent

The file listing4_4.ipynb also shows the learning curve when the Q-learning agent is applied to the taxi world environment. We will revisit Q-learning in the next chapter on the deep learning approach to state value approximation. Under that setup, Q-learning is called DQN. DQN and its variants will be used to train a game-playing agent on some Atari games.

To conclude the discussion on Q-learning, let’s look at a particular issue that Q-learning may introduce, that of maximization bias.

Maximization Bias and Double Learning

If you look back at equation (4.10), you will notice that we are maximizing over actions to get the maximum value of Q(S, A). Similarly, in SARSA, we find a new ε-greedy policy that also maximizes over Q to get the action with the highest q-value. Further, these q-values are themselves estimates of the true state-action values. In summary, we are using a max over q-estimates as an "estimate" of the maximum value. Such an approach of using the "max of estimates" as an "estimate of the max" introduces a positive bias.

To see this, consider a scenario where the reward on some transition takes three values, -5, 0, and +5, each with probability 1/3. The expected reward is zero, but the moment we see a +5, we take that as part of the maximization, and the estimate never comes back down. So +5 becomes the estimate of a reward whose true expectation is 0. This is a positive bias introduced by the maximization step.
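A quick numerical check of this effect (purely illustrative; the setup and names are ours) is shown below: even though every action has an expected reward of zero, the maximum over the sampled estimates is typically positive.

import numpy as np

num_actions, samples_per_action = 5, 10
# sample-mean estimate of each action's reward, drawn from {-5, 0, +5}
estimates = [np.mean(np.random.choice([-5, 0, 5], size=samples_per_action))
             for _ in range(num_actions)]
print(max(estimates))  # usually > 0, although the true maximum expected reward is 0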

One of the ways to remove the positive bias is to maintain two sets of q-values. One set is used to find the action that maximizes the q-value, and the other set is then used to get the q-value of that maximizing action. Mathematically, it can be represented as follows:

Replace max_a Q(S, a) with Q1(S, argmax_a Q2(S, a)).

We use Q2 to find the maximizing action and then Q1 to get the q-value of that action. It can be shown that such an approach removes the positive (maximization) bias. We will revisit this concept when we talk about DQN.
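The following is a rough tabular sketch of this double-learning update for Q-learning (our own naming, not the book's code): two q-tables are maintained and are assumed to be defaultdicts keyed by (state, action) with default value 0; one table is picked at random to be updated on each step, it chooses the action, and the other table supplies the value of that action.

import random

def double_q_update(Q1, Q2, state, action, reward, next_state, actions, alpha, gamma, done):
    if random.random() < 0.5:
        Q1, Q2 = Q2, Q1  # swap roles so each table is updated about half the time
    if done:
        target = reward
    else:
        best_next = max(actions, key=lambda a: Q1[(next_state, a)])  # argmax from one table
        target = reward + gamma * Q2[(next_state, best_next)]        # evaluated by the other
    Q1[(state, action)] += alpha * (target - Q1[(state, action)])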

Expected SARSA Control

Let's look at another method that is a kind of hybrid between Q-learning and SARSA; it is called expected SARSA. It is similar to Q-learning except that the "max" in (4.10) is replaced with an expectation, as shown here:
$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) \, Q(S_{t+1}, a) - Q(S_t, A_t) \right] $$
(4.11)

Expected SARSA has lower variance than SARSA because it removes the variance due to the random selection of At+1; instead of sampling the next action, we take an expectation over all possible actions.

In the cliff world problem, the actions are deterministic, and hence we can set the learning rate α = 1 without any major impact on the quality of learning. The pseudocode for the algorithm mirrors Q-learning except for the update logic, which takes an expectation instead of a maximization. We can run expected SARSA on-policy, which is what we will do while testing it against the cliff and taxi worlds. It can also be run off-policy, where the behavior policy is more exploratory and the target policy π is a deterministic greedy policy. See Figure 4-17.

Expected SARSA TD Control
Figure 4-17

Expected SARSA TD control

The file listing4_5.ipynb shows the code for the expected SARSA agent. It is similar to the Q-agent except for the replacement of the maximization by the expectation. This change increases the computational complexity of the algorithm a bit but offers faster convergence than SARSA and Q-learning. Listing 4-5 reproduces the code for the expected SARSA agent class.
# Expected SARSA Learning agent class
import numpy as np
from collections import defaultdict

class ExpectedSARSAAgent:
    def __init__(self, alpha, epsilon, gamma, get_possible_actions):
        self.get_possible_actions = get_possible_actions
        self.alpha = alpha
        self.epsilon = epsilon
        self.gamma = gamma
        self._Q = defaultdict(lambda: defaultdict(lambda: 0))
    def get_Q(self, state, action):
        return self._Q[state][action]
    def set_Q(self, state, action, value):
        self._Q[state][action] = value
    # Expected SARSA Update
    def update(self, state, action, reward, next_state, done):
        if not done:
            best_next_action = self.max_action(next_state)
            actions = self.get_possible_actions(next_state)
            next_q = 0
            for next_action in actions:
                if next_action == best_next_action:
                    next_q += (1-self.epsilon+self.epsilon/len(actions))* self.get_Q(next_state, next_action)
                else:
                    next_q += (self.epsilon/len(actions))* self.get_Q(next_state, next_action)
            td_error = reward + self.gamma * next_q - self.get_Q(state,action)
        else:
            td_error = reward - self.get_Q(state,action)
        new_value = self.get_Q(state,action) + self.alpha * td_error
        self.set_Q(state, action, new_value)
    # get best A for Q(S,A) which maximizes the Q(S,a) for actions in state S
    def max_action(self, state):
        actions = self.get_possible_actions(state)
        best_action = []
        best_q_value = float("-inf")
        for action in actions:
            q_s_a = self.get_Q(state, action)
            if q_s_a > best_q_value:
                best_action = [action]
                best_q_value = q_s_a
            elif  q_s_a == best_q_value:
                best_action.append(action)
        return np.random.choice(np.array(best_action))
    # choose action as per ε-greedy policy for exploration
    def get_action(self, state):
        actions = self.get_possible_actions(state)
        if len(actions) == 0:
            return None
        if np.random.random() < self.epsilon:
            a = np.random.choice(actions)
            return a
        else:
            a = self.max_action(state)
            return a
Listing 4-5

Expected SARSA TD Control

Figure 4-18 shows the result of training the expected SARSA agent on the cliff world. It has the fastest convergence to the optimal value compared to SARSA and Q-learning. You can experiment and see that changing the learning rate α has no major impact on convergence in the limit. It is also interesting to note that the policy learned by expected SARSA falls between those of Q-learning and SARSA: the agent under this policy goes through the middle row of the grid to reach the goal. In our case, we are running expected SARSA on-policy, i.e., using the same ε-greedy policy to explore and to improve. But, possibly due to the expectation operator, it improves on regular SARSA and finds it safe enough to go through the middle row. In the Python notebook, you can also see the result of running this algorithm against the taxi world.
Figure 4-18

Reward graph with expected SARSA in the cliff world and policy learned by the agent

Replay Buffer and Off-Policy Learning

Off-policy learning involves two separate policies: the behavior policy b(a|s), used to explore and generate samples, and the target policy π(a|s), which the agent is trying to learn as the optimal policy. Accordingly, we can use the samples generated by the behavior policy again and again to train the agent. This makes the process sample efficient, as a single transition observed by the agent can be used multiple times.

This is called experience replay. The agent collects experiences from the environment and replays those experiences multiple times as part of the learning process. In experience replay, we store the samples (s, a, r, s′, done) in a buffer. The samples are generated using an exploratory behavior policy while we improve a deterministic target policy using q-values. Therefore, we can always reuse older samples from the behavior policy. We keep the buffer at a fixed, predetermined size and keep deleting the oldest samples as we collect new ones. The process makes learning sample efficient by reusing each sample multiple times. The rest of the approach remains the same as for any off-policy agent.

Let's apply this approach to the Q-learning agent. This time we will skip the pseudocode, as there is hardly any change except for reusing samples from the replay buffer at each step. We store each new transition in the buffer and then sample batch_size samples from the buffer; these samples are used to train the Q-agent in the usual way. The agent then takes another step in the environment, and the cycle begins again. Listing4_6.ipynb gives the implementation of the replay buffer and shows how it is used in the learning algorithm. See Listing 4-6.
import numpy as np

class ReplayBuffer:
    def __init__(self, size):
        self.size = size #max number of items in buffer
        self.buffer =[] #array to hold buffer
    def __len__(self):
        return len(self.buffer)
    def add(self, state, action, reward, next_state, done):
        item = (state, action, reward, next_state, done)
        self.buffer = self.buffer[-self.size:] + [item]
    def sample(self, batch_size):
        idxs = np.random.choice(len(self.buffer), batch_size)
        samples = [self.buffer[i] for i in idxs]
        states, actions, rewards, next_states, done_flags = list(zip(*samples))
        return states, actions, rewards, next_states, done_flags
# training algorithm with replay buffer
def train_agent(env, agent, episode_cnt=10000, tmax=10000,
                anneal_eps=True, replay_buffer=None, batch_size=16):
    episode_rewards = []
    for i in range(episode_cnt):
        G = 0
        state = env.reset()
        for t in range(tmax):
            action = agent.get_action(state)
            next_state, reward, done, _ = env.step(action)
            if replay_buffer:
                replay_buffer.add(state, action, reward, next_state, done)
                # sample a batch of past transitions and replay them
                states, actions, rewards, next_states, done_flags = replay_buffer.sample(batch_size)
                for k in range(batch_size):
                    agent.update(states[k], actions[k], rewards[k], next_states[k], done_flags[k])
            else:
                agent.update(state, action, reward, next_state, done)
            G += reward
            if done:
                episode_rewards.append(G)
                # reduce the exploration probability epsilon over the
                # training period
                if anneal_eps:
                    agent.epsilon = agent.epsilon * 0.99
                break
            state = next_state
    return np.array(episode_rewards)
Listing 4-6

Q-Learning with Replay Buffer
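To make the buffer’s fixed-size behavior concrete, here is a minimal usage sketch (not from the book’s notebook); it relies only on the ReplayBuffer class from Listing 4-6, and the integer states and unit rewards are placeholder values.
# a minimal usage sketch of ReplayBuffer (placeholder transitions only)
buffer = ReplayBuffer(size=3)
for step in range(5):
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
print(len(buffer))  # 3 -- the two oldest transitions have been dropped
states, actions, rewards, next_states, done_flags = buffer.sample(batch_size=2)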

The Q-agent with the replay buffer is expected to improve the initial convergence by sampling repeatedly from the buffer. The sample efficiency will become more apparent when we look at DQN. Over the long run, there won’t be any significant difference between the optimal values learned with or without the replay buffer. The replay buffer has another advantage: it breaks the correlation between consecutive samples. This aspect will also become apparent when we combine deep learning with Q-learning, i.e., DQN.

Q-Learning for Continuous State Spaces

Until now all the examples we have looked at had discrete state spaces. All the methods studied so far could be categorized as tabular methods: the state-action values were represented as a table, with states along one dimension and actions along the other.

We will soon transition to continuous state spaces and make heavy use of deep learning to represent states through a neural network. However, we can still solve many continuous state problems with some simple approaches. In preparation for the next chapter, let’s look at the simplest approach: converting continuous values into discrete bins. The approach we will take is to round off the continuous floating-point numbers to some precision, e.g., a continuous state value between -1 and 1 is converted into one of the discrete values -1.0, -0.9, -0.8, …, 0.0, 0.1, 0.2, …, 1.0.

listing4_7.ipynb shows this approach in action. We will continue to use the Q-learning agent, experience replay, and the learning algorithm from listing4_6. However, this time we will apply the learning to a continuous environment, CartPole, which was described in detail at the beginning of the chapter. The key change is to receive the state values from the environment, discretize them, and pass them along to the agent as observations. The agent only gets to see the discrete values and uses them to learn the optimal policy with the Q-agent. We reproduce in Listing 4-7 the approach used for converting continuous state values into discrete ones. See Figure 4-19.
Figure 4-19

Reward graph with Q-learning on CartPole continuous state space environment

# We will use the ObservationWrapper class from gym to wrap our environment.
# We need to implement the observation() method, which receives the
# original state values from the underlying environment.
# In observation() we discretize the state values by rounding them,
# and the wrapped env then exposes only these discrete values.
# The agent uses these discrete state values
# to learn an effective policy using q-learning.
from gym.core import ObservationWrapper

class Discretizer(ObservationWrapper):
    def observation(self, state):
        # round each of the four CartPole state components to one decimal place
        discrete_x_pos = round(state[0], 1)
        discrete_x_vel = round(state[1], 1)
        discrete_pole_angle = round(state[2], 1)
        discrete_pole_ang_vel = round(state[3], 1)
        return (discrete_x_pos, discrete_x_vel,
                discrete_pole_angle, discrete_pole_ang_vel)
Listing 4-7

Q-Learning (Off-Policy) on Continuous State Environment
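As a quick usage sketch (not part of the book’s listing), the wrapper can be applied to gym’s standard CartPole environment so that every observation the agent receives is already rounded; the environment id and the random action below are just for illustration.
# a minimal usage sketch; gym's CartPole-v0 environment is assumed
import gym
env = Discretizer(gym.make("CartPole-v0"))
state = env.reset()
print(state)  # e.g., (0.0, 0.0, 0.0, 0.0) -- a tuple of rounded values
next_state, reward, done, info = env.step(env.action_space.sample())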

The Q-agent with discretized states is able to achieve a reward of about 50, compared to the maximum possible reward of 200. In subsequent chapters, we will study other, more powerful methods to obtain higher rewards.

n-Step Returns

In this section, we will unify the MC and TD approaches. MC methods sample the return from a state until the end of the episode and do not bootstrap; accordingly, MC methods cannot be applied to continuing tasks. TD, on the other hand, uses the one-step return to estimate the value of the remaining rewards. TD methods take a short view of the trajectory and bootstrap right after one step.

The two methods sit at two extremes, and there are many situations where a middle-of-the-road approach produces much better results. The idea in n-step methods is to use the rewards from the next n steps and then bootstrap from the (n+1)th step to estimate the value of the remaining rewards. Figure 4-20 shows the backup diagrams for various values of n. At one extreme is the one-step approach, which is the TD(0) method that we just saw in the context of SARSA, Q-learning, and related approaches. At the other extreme is the ∞-step TD, which is nothing but an MC method. The broad idea is that the TD and MC methods are two extremes of the same continuum.
Figure 4-20

Backup diagrams of n-step methods with TD(0) and MC at two extremes, n=1 and n=∞ (end of episode)
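To make the n-step idea concrete, here is a minimal sketch (not the book’s implementation) of forming an n-step target: sum the n observed discounted rewards and bootstrap with the current value estimate of the state (or state-action pair) reached after n steps. The function and argument names are illustrative.
# a minimal sketch of an n-step TD target; names are illustrative only
def n_step_td_target(rewards, bootstrap_value, gamma):
    # rewards: the n observed rewards R_{t+1}, ..., R_{t+n}
    # bootstrap_value: V(S_{t+n}) for prediction, or Q(S_{t+n}, A_{t+n}) for n-step SARSA
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r
    target += (gamma ** len(rewards)) * bootstrap_value
    return target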

The n-step approach can be used in an on-policy setting: we can have n-step SARSA and n-step expected SARSA, which are natural extensions of what we have learned so far. However, using an n-step approach in off-policy learning requires us to factor in one more concept: the relative likelihood of observing a specific n-step transition under the behavior policy versus under the target policy. To use the data from the behavior policy b(a| s) for optimizing the target policy π(a| s), we need to multiply the n-step returns observed under the behavior policy by a ratio called the importance sampling ratio.

Consider a starting state St and a trajectory of actions and states until the end of the episode: At, St+1, At+1, …, AT−1, ST. The probability of observing this sequence under a policy π is given by the following:
$$ \Pr\{\text{trajectory}\} = \pi(A_t \mid S_t)\, p(S_{t+1} \mid S_t, A_t)\, \pi(A_{t+1} \mid S_{t+1}) \cdots p(S_T \mid S_{T-1}, A_{T-1}) $$
The importance sampling ratio is the ratio of the probability of trajectory under the target policy π and the probability of trajectory under behavior policy b.
$$ \rho_{t:T-1} = \frac{\Pr_{\pi}\{\text{trajectory}\}}{\Pr_{b}\{\text{trajectory}\}} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)} $$
(4.12)

The importance sampling ratio ensures that the return of a trajectory observed under the behavior policy is adjusted up or down based on the relative chance of observing that trajectory under the target policy versus under the behavior policy.
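To make the ratio concrete, here is a minimal sketch with assumed inputs: pi_probs[k] and b_probs[k] hold π(Ak| Sk) and b(Ak| Sk) for each step k of a recorded trajectory, and the return observed under the behavior policy is reweighted by the product of their ratios, as in (4.12).
# a minimal sketch of the importance sampling ratio of (4.12); inputs are assumed
import numpy as np

def importance_sampling_ratio(pi_probs, b_probs):
    pi_probs = np.asarray(pi_probs, dtype=float)
    b_probs = np.asarray(b_probs, dtype=float)
    return float(np.prod(pi_probs / b_probs))

def weighted_return(G, pi_probs, b_probs):
    # scale the return observed under the behavior policy
    return importance_sampling_ratio(pi_probs, b_probs) * G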

Nothing comes for free, and that is the case with importance sampling: the importance sampling ratio can have very high variance, and the resulting estimators are not computationally efficient. There are advanced techniques, such as discount-aware importance sampling and per-decision importance sampling, that treat the importance weights and rewards in different ways to reduce the variance and to make these algorithms more efficient.

We will not go into the details of implementing these algorithms in this book. Our key focus here is to introduce them at a conceptual level and make you aware of these advanced techniques.

Eligibility Traces and TD(λ)

Eligibility traces unify the MC and TD methods in an algorithmically efficient way. TD methods combined with eligibility traces produce TD(λ); setting λ = 0 makes it equivalent to the one-step TD that we have studied so far, which is why one-step TD is also known as TD(0). Setting λ = 1 makes it similar to ∞-step TD, in other words, an MC method. Eligibility traces also make it possible to apply MC-like updates to nonepisodic tasks. We will cover only the high-level concepts of eligibility traces and TD(λ).

In the previous section, we looked at n-step returns, with n=1 taking us to the regular TD method and n=∞ taking us to MC. We also touched upon the fact that neither extreme is ideal; an algorithm typically performs best with some intermediate value of n. n-step returns offered one view of how to unify TD and MC. What eligibility traces do is offer an efficient way to combine the two without keeping track of the n-step transitions at each step. Until now we have looked at the approach of updating a state value based on the next n transitions in the future. This is called the forward view. However, you could also look backward, i.e., at each time step t, see the impact that the current reward would have on the preceding n states visited in the past. This is known as the backward view and forms the core of TD(λ). It allows an efficient implementation that integrates n-step returns into TD learning.

Look back at Figure 4-20. What if, instead of choosing a single value of n, we combined all the n-step returns with some weights? This is known as the λ-return, and the equation is as follows:
$$ G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t $$
(4.13)
Here, Gt:t+n is the n-step return, which uses the bootstrapped value estimate at the end of the nth step in place of the remaining rewards. It is defined as follows:
$$ G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n}) $$
(4.14)
If we put λ = 0 in (4.13), we get the following:
$$ G_t^{0} = G_{t:t+1} = R_{t+1} + \gamma V(S_{t+1}) $$

The previous expression is similar to the target value for state-action updates of TD(0) in (4.7).

On the other hand, putting λ = 1 in (4.13) makes $$ G_t^{1} $$ mimic MC and recover the full return Gt, as follows:
$$ G_t^{1} = G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T $$
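As a quick illustration (assuming the n-step returns of (4.14) have already been computed, for example with a helper like the n_step_td_target sketch earlier), the λ-return of (4.13) is just a weighted mixture of them:
# a minimal sketch of the forward-view λ-return of (4.13); names are illustrative only
def lambda_return(n_step_returns, full_return, lam):
    # n_step_returns[n-1] holds G_{t:t+n} for n = 1, ..., T-t-1
    # full_return is the complete Monte Carlo return G_t
    weighted = sum((1 - lam) * lam ** (n - 1) * G
                   for n, G in enumerate(n_step_returns, start=1))
    return weighted + lam ** len(n_step_returns) * full_return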

The TD(λ) algorithm uses the previous λ-return together with a trace vector, known as the eligibility trace, to obtain an efficient online TD method based on the “backward” view. The eligibility trace keeps a record of how far in the past a state was visited and by how much that state’s estimate should be impacted by the return currently being observed.
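As a rough illustration of the backward view, the following sketch shows tabular TD(λ) prediction with accumulating eligibility traces; the gym-style environment interface, the policy function, and the hyperparameter values are all assumptions for illustration, not the book’s code.
# a minimal sketch of backward-view TD(λ) prediction with accumulating traces
from collections import defaultdict

def td_lambda_prediction(env, policy, episodes=1000, alpha=0.1, gamma=0.99, lam=0.8):
    V = defaultdict(float)                    # state-value estimates
    for _ in range(episodes):
        E = defaultdict(float)                # eligibility traces, reset every episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            # TD error of the current transition (no bootstrap from terminal states)
            delta = reward + gamma * V[next_state] * (not done) - V[state]
            E[state] += 1.0                   # accumulate the trace for the visited state
            # every recently visited state gets a share of the TD error
            for s in list(E.keys()):
                V[s] += alpha * delta * E[s]
                E[s] *= gamma * lam           # decay traces toward zero
            state = next_state
    return V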

We will stop here with this basic introduction to λ-returns, eligibility traces, and TD(λ). For a detailed review of the mathematical derivations as well as the various algorithms based on them, refer to the book Reinforcement Learning: An Introduction by Sutton and Barto, 2nd edition.

Relationships Between DP, MC, and TD

At the beginning of the chapter, we talked about how the DP, MC, and TD methods compare. We introduced the MC methods followed by the TD methods, and then we used n-step returns and TD(λ) to combine MC and TD as two extremes of sample-based, model-free learning.

To conclude the chapter, Table 4-1 summarizes the comparison of the DP and TD methods.
Table 4-1

Comparison of DP and TD Methods in the Context of Bellman Equations

Bellman Equation                              Full Backup (DP)               Sample Backup (TD)

Bellman expectation equation for vπ(s)        Iterative policy evaluation    TD prediction

Bellman expectation equation for qπ(s, a)     Q-policy iteration             SARSA

Bellman optimality equation for q*(s, a)      Q-value iteration              Q-learning

Summary

In this chapter, we looked at the model-free approach to reinforcement learning. We started by estimating state values using the Monte Carlo approach, looking at both the “first visit” and “every visit” variants. We then looked at the bias-variance trade-off in general and specifically in the context of MC approaches. With the foundation of MC estimation in place, we looked at MC control methods, connecting them with the GPI framework for policy improvement that was introduced in Chapter 3. We saw how GPI could be applied by swapping the estimation step from a DP-based approach to an MC-based one. We looked in detail at the exploration-exploitation dilemma that needs to be balanced, especially in the model-free world where the transition probabilities are not known. We then briefly talked about the off-policy approach in the context of MC methods.

TD was the next model-free learning approach we looked into. We started by establishing the basics of TD learning with TD-based value estimation. This was followed by a deep dive into SARSA, an on-policy TD control method. We then looked at Q-learning, a powerful off-policy TD control approach, along with related methods such as expected SARSA.

In the context of TD learning, we also introduced the concept of state approximation to convert continuous state spaces into approximate discrete state values. The concept of state approximation will form the bulk of the next chapter and will allow us to combine deep learning with reinforcement learning.

Before concluding the chapter, we finally looked at n-step returns, eligibility traces, and TD(λ) as ways to combine TD and MC into a single framework.
