TD(λ) algorithm

In the previous chapter, we introduced the temporal difference strategy, and we discussed a simple example called TD(0). In the case of TD(0), the discounted reward is approximated by using a one-step backup. Hence, if the agent performs an action a_t in the state s_t, and the transition to the state s_{t+1} is observed, the approximation becomes the following:

R_t ≈ r_{t+1} + γV(s_{t+1})

If the task is episodic (as in many real-life scenarios) and has T(e_i) steps, the complete backup for the episode e_i is as follows:

R_t = r_{t+1} + γr_{t+2} + γ²r_{t+3} + … + γ^(T(e_i)−t−1) r_{T(e_i)}

The previous expression ends when the MDP process reaches an absorbing state; therefore, R_t is the actual value of the discounted reward. The difference between TD(0) and this choice is clear: in the first case, we can update the value function after each transition, whereas with a complete backup, we need to wait for the end of the episode. We can say that this method (which is called Monte Carlo, because it's based on the idea of averaging the overall reward of an entire sequence) is exactly the opposite of TD(0); therefore, it's reasonable to think about an intermediate solution, based on k-step backups. In particular, our goal is to find an online algorithm that can exploit the backups as soon as they are available.
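
To make the two extremes concrete, the following minimal Python sketch (all names, such as rewards, states, and V, are illustrative assumptions, not taken from the text) computes the one-step TD(0) target and the complete Monte Carlo return for a small recorded episode:

```python
import numpy as np

# Hypothetical recorded episode: visited states and the rewards observed after each transition
states = [0, 1, 2, 3]          # s_t, s_{t+1}, s_{t+2}, s_{t+3} (the last one is terminal)
rewards = [1.0, 0.5, 2.0]      # rewards[i] is received when moving from states[i] to states[i+1]
V = np.zeros(4)                # current value estimates
gamma = 0.9

# One-step TD(0) target: r_{t+1} + gamma * V(s_{t+1})
td0_target = rewards[0] + gamma * V[states[1]]

# Monte Carlo target: the complete discounted return of the episode
mc_target = sum((gamma ** i) * r for i, r in enumerate(rewards))

print(td0_target, mc_target)
```

The first target can be computed immediately after the first transition, while the second one requires the whole episode to be completed, which is exactly the trade-off discussed above.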

Let's imagine a sequence of four steps. The agent is in the first state and observes a transition; at this point, only a one-step backup is possible, and it's a good idea to update the value function in order to improve the convergence speed. After the second transition, the agent can use a two-step backup; however, it can also consider the first one-step backup in addition to the newer, longer one. So, we have two approximations:

R_t^(1) = r_{t+1} + γV(s_{t+1})
R_t^(2) = r_{t+1} + γr_{t+2} + γ²V(s_{t+2})

Which of the preceding is more reliable? Obviously, the second one depends on the first one (in particular, when the value function is almost stabilized), and so on until the end of the episode. Hence, the most common strategy is to employ a weighted average that assigns a different level of importance to each backup (assuming that the longest backup has k steps):
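
Continuing the same illustrative setup, a possible weighted average of the one-step and two-step backups could be computed as follows (the coefficients are arbitrary and only need to sum to one):

```python
import numpy as np

rewards = [1.0, 0.5, 2.0]            # illustrative rewards r_{t+1}, r_{t+2}, r_{t+3}
states = [0, 1, 2, 3]                # s_t, s_{t+1}, s_{t+2}, s_{t+3}
V = np.array([0.0, 0.2, 0.4, 0.0])   # current value estimates
gamma = 0.9

# One-step and two-step backups
R1 = rewards[0] + gamma * V[states[1]]
R2 = rewards[0] + gamma * rewards[1] + (gamma ** 2) * V[states[2]]

# Weighted average (the weights must sum to 1)
alphas = [0.7, 0.3]
R_avg = alphas[0] * R1 + alphas[1] * R2
print(R1, R2, R_avg)
```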

R_t^(avg) = α_1 R_t^(1) + α_2 R_t^(2) + … + α_k R_t^(k),   with   α_1 + α_2 + … + α_k = 1

Watkins (in Learning from Delayed Rewards, Watkins C. J. C. H., Ph.D. Thesis, University of Cambridge, 1989) proved that this approach (with or without averaging) has the fundamental property of reducing the absolute error of the expected R_t^(k), with respect to the optimal value function, V(s; π). In fact, he proved that the following inequality holds:

max_s |E[R_t^(k) | s_t = s] − V(s; π)| ≤ γ^k max_s |V(s) − V(s; π)|

As γ is bounded between 0 and 1, the right-hand side is always smaller than the maximum absolute error |V(s) − V(s; π)|, where V(s) is the value of a state during an episode. Therefore, the expected discounted return of a k-step backup (or of a combination of different backups) yields a more accurate estimation of the optimal value function if the policy is chosen to be greedy with respect to it. This is not surprising, as a longer backup incorporates more actual returns, but the importance of this theorem resides in its validity when an average of different k-step backups is employed. In other words, it provides us with the mathematical proof that an intuitive approach actually converges, and that it can also effectively improve both the convergence speed and the final accuracy.

However, managing k coefficients is generally problematic, and in many cases, useless. The main idea behind TD(λ) is to employ a single factor, λ, that can be tuned in order to meet specific requirements. The theoretical analysis (or forward view, as referred to by Sutton and Barto) is based, in a general case, on an exponentially decaying average. If we consider a geometric series with λ bounded between 0 and 1 (exclusive), we get:

Σ_{i=0}^{∞} λ^i = 1 / (1 − λ)

Hence, we can consider the averaged discounted return R_t^(λ), with infinite backups, as:

R_t^(λ) = (1 − λ) Σ_{i=1}^{∞} λ^(i−1) R_t^(i)

Before defining the finite case, it's helpful to understand how R_t^(λ) is built. As λ is bounded between 0 and 1, the weights decay proportionally to the powers of λ, so the first backup has the highest impact, and all of the subsequent ones have a smaller and smaller influence on the estimation. This means that, in general, we are assuming that, in the estimation of R_t, the immediate backups (which become more and more precise) have a higher importance, and we exploit the longer ones only to refine the estimated value. Now, it should be clear that λ = 0 is equivalent to TD(0), because only the one-step backup remains in the sum (remember that 0^0 = 1), while higher values involve all of the remaining backups. Let's now consider an episode e_i whose length is T(e_i).

Conventionally, if the agent reaches an absorbing state at t = T(e_i), all of the k-step returns that extend beyond the terminal state are equal to R_t (this is straightforward, as all of the possible rewards have already been collected); therefore, we can truncate R_t^(λ):

R_t^(λ) = (1 − λ) Σ_{i=1}^{T(e_i)−t−1} λ^(i−1) R_t^(i) + λ^(T(e_i)−t−1) R_t

The first term of the previous expression involves all of the non-terminal states, while the second is equal to R_t, discounted proportionally to the distance between the current time step and the final state. Again, if λ = 0, we obtain TD(0), but we are now also authorized to consider λ = 1 (because the sum is always extended to a finite number of elements). When λ = 1, we obtain R_t^(λ) = R_t, which means that we need to wait until the end of the episode to get the actual discounted reward. As explained previously, this method is normally not a first-choice solution, because when the episodes are very long, the agent selects the actions with a value function that is not up to date in the majority of cases. Therefore, TD(λ) is normally employed with λ values less than 1, in order to obtain the advantage of an online update, together with a correction based on the new states. To achieve this goal without looking into the future (we want to update V(s) as soon as new pieces of information are available), we need to introduce the concept of an eligibility trace, e(s) (sometimes, in the context of computational neuroscience, e(s) is also called a stimulus trace).
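
As a quick sanity check of the truncated expression, the following sketch (again with purely illustrative names; it assumes a recorded episode of rewards, the states reached after each step, and a value array V) computes the k-step returns and the truncated λ-return for the first time step. Setting lam=0.0 reproduces the one-step TD(0) target, while lam=1.0 reproduces the complete discounted return:

```python
import numpy as np

def k_step_return(rewards, next_states, V, k, gamma=0.9):
    # R_t^(k): k actual rewards plus the bootstrapped value of the state reached after k steps
    R = sum((gamma ** i) * rewards[i] for i in range(k))
    if k < len(rewards):
        R += (gamma ** k) * V[next_states[k - 1]]
    return R

def lambda_return(rewards, next_states, V, lam=0.5, gamma=0.9):
    T = len(rewards)  # number of steps remaining until the absorbing state
    # Complete (Monte Carlo) return
    R_full = sum((gamma ** i) * rewards[i] for i in range(T))
    # Weighted non-terminal backups plus the residual weight assigned to R_full
    partial = sum((1 - lam) * (lam ** (k - 1)) * k_step_return(rewards, next_states, V, k, gamma)
                  for k in range(1, T))
    return partial + (lam ** (T - 1)) * R_full

rewards = [1.0, 0.5, 2.0]
next_states = [1, 2, 3]
V = np.array([0.0, 0.2, 0.4, 0.0])

print(lambda_return(rewards, next_states, V, lam=0.0))  # equals the one-step backup
print(lambda_return(rewards, next_states, V, lam=1.0))  # equals the full discounted return
print(lambda_return(rewards, next_states, V, lam=0.5))  # intermediate estimate
```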

An eligibility trace for a state s is a function of time that returns the weight (greater than 0) of the specific state. Let's imagine a sequence, s_1, s_2, ..., s_n, and consider a state, s_i. After a backup, V(s_i) is updated and the agent continues its exploration. When is a new update of s_i (given longer backups) important? If s_i is not visited anymore, the effect of longer backups must be smaller and smaller, and s_i is said to not be eligible for changes in V(s). This is a consequence of the previous assumption that shorter backups must generally have a higher importance. So, if s_i is an initial state (or one visited immediately after the initial state) and the agent moves on to other states, the effect of s_i must decay. Conversely, if s_i is revisited, it means that the previous estimation of V(s_i) is probably wrong, and hence s_i is eligible for a change. (To better understand this concept, imagine a sequence, s_1, s_2, s_1, .... It's clear that when the agent is in s_1, as well as in s_2, it cannot select the right action; therefore, it's necessary to reevaluate V(s) until the agent is able to move forward.)

The most common strategy (which is also discussed in Reinforcement Learning: An Introduction, Sutton R. S., Barto A. G., The MIT Press) is to define the eligibility traces in a recursive fashion. After each time step, e_t(s) decays by a factor equal to γλ (to meet the requirement imposed by the forward view); but, when the state s is revisited, e_t(s) is also increased by 1 (e_t(s) = γλe_{t−1}(s) + 1). In this way, we impose a jump in the trend of e(s) whenever we want to emphasize its impact. However, as e(s) decays independently of the jumps, the states that are visited and revisited later have a lower impact than the ones that are revisited very soon. The reason for this choice is very intuitive: the importance of a state revisited after a long sequence is clearly lower than the importance of a state that is revisited after a few steps. In fact, the estimation of R_t is obviously wrong if the agent moves back and forth between two states at the beginning of the episode, but the error becomes less significant when the agent revisits a state after having explored other areas. For example, a policy can allow an initial phase in order to reach a partial goal, and then it can force the agent to move back to reach a terminal state.
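
The recursive definition translates into a couple of lines of code. The sketch below (with illustrative variable names) decays the trace of every state at each step and reinforces the state that has just been visited; note how a state revisited after a few steps keeps a higher trace than one revisited much later:

```python
import numpy as np

n_states = 5
e = np.zeros(n_states)   # one eligibility trace per state
gamma, lam = 0.9, 0.5

def update_traces(e, visited_state, gamma, lam):
    e *= gamma * lam         # every trace decays by a factor gamma * lambda ...
    e[visited_state] += 1.0  # ... and the state just visited is reinforced by 1
    return e

for s in [0, 1, 0, 2, 3, 0]:
    e = update_traces(e, s, gamma, lam)

print(e)  # state 0, revisited several times, ends up with the largest trace
```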

Exploiting the eligibility traces, TD(λ) can achieve very fast convergence in more complex environments, with a trade-off between a one-step TD method and a Monte Carlo one (which is normally avoided). At this point, the reader might wonder whether we are sure about the convergence, and luckily, the answer is positive. Dayan proved (in The Convergence of TD(λ) for General λ, Dayan P., Machine Learning 8, 3–4, 1992) that TD(λ) converges for a generic λ with only a few specific assumptions and the fundamental condition that the policy is GLIE. The proof is very technical and beyond the scope of this book; however, the most important assumptions (which are generally met) are:

  • The Markov Decision Process (MDP) has absorbing states (in other words, all of the episodes end in a finite number of steps).
  • All of the transition probabilities are non-null (all states can be visited an infinite number of times).

The first condition is obvious: the absence of absorbing states yields infinitely long explorations, which are not compatible with a TD method (sometimes it's possible to prematurely end an episode, but this can be either unacceptable (in some contexts) or a sub-optimal choice (in many others)). Moreover, Sutton and Barto (in the aforementioned book) proved that TD(λ) is equivalent to employing the weighted average of discounted return approximations, but without the constraint of looking ahead into the future (which is clearly impossible).

The complete TD(λ) algorithm (with an optional forced termination of the episode) is:

  1. Set an initial deterministic random policy, π(s)
  2. Set the initial value array, V(s) = 0 ∀ s ∈ S
  3. Set the initial eligibility trace array, e(s) = 0 ∀ s ∈ S
  4. Set the number of episodes, N_episodes
  5. Set a maximum number of steps per episode, N_max
  6. Set a constant, α (α = 0.1)
  7. Set a constant, γ (γ = 0.9)
  8. Set a constant, λ (λ = 0.5)
  9. Set a step counter, e = 0
  10. For i = 1 to N_episodes:
    1. Reset the step counter, e = 0, and the eligibility traces, e(s) = 0 ∀ s ∈ S; create an empty state list, L
    2. Observe the initial state, s_i, and append s_i to L
    3. While s_i is non-terminal and e < N_max:
      1. e += 1
      2. Select the action, a_t = π(s_i)
      3. Observe the transition, (a_t, s_i) → (s_j, r_ij)
      4. Compute the TD error as TD_error = r_ij + γV(s_j) - V(s_i)
      5. Increment the eligibility trace, e(s_i) += 1.0
      6. For s in L:
        1. Update the value, V(s) += α · TD_error · e(s)
        2. Update the eligibility trace, e(s) *= γλ
      7. Set s_i = s_j
      8. Append s_j to L
    4. Update the policy to be greedy with respect to the value function, π(s) = argmax_a Q(s, a)
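
What follows is a minimal Python transcription of the listing, run on a small synthetic MDP; every name (P, R, n_max, and so on) is an illustrative assumption, not part of the original pseudocode. The greedy step evaluates Q(s, a) with the known transition model, the step counter and the traces are reset at the beginning of every episode, and the per-state decay loop is vectorized over the whole trace array (which is equivalent, as non-visited states have a null trace):

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 6, 2
terminal = n_states - 1

# Random transition tensor P[s, a, s'] and reward tensor R[s, a, s']
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions, n_states))
P[terminal] = 0.0
P[terminal, :, terminal] = 1.0   # the last state is absorbing...
R[terminal] = 0.0                # ...and yields no further reward

alpha, gamma, lam = 0.1, 0.9, 0.5
n_episodes, n_max = 500, 200

V = np.zeros(n_states)
pi = rng.integers(0, n_actions, size=n_states)   # initial random deterministic policy

for _ in range(n_episodes):
    e = np.zeros(n_states)            # eligibility traces (reset at each episode)
    steps = 0
    s = rng.integers(0, terminal)     # random non-terminal initial state

    while s != terminal and steps < n_max:
        steps += 1
        a = pi[s]
        s_next = rng.choice(n_states, p=P[s, a])
        r = R[s, a, s_next]

        td_error = r + gamma * V[s_next] - V[s]
        e[s] += 1.0

        # Propagate the TD error to the visited states, proportionally to their traces
        V += alpha * td_error * e
        e *= gamma * lam

        s = s_next

    # Greedy policy improvement, evaluating Q(s, a) with the known model
    Q = np.einsum('san,san->sa', P, R + gamma * V[np.newaxis, np.newaxis, :])
    pi = np.argmax(Q, axis=1)

print(V)
print(pi)
```

In a real scenario, the purely greedy policy used here should be replaced with an ε-greedy one, so as to meet the GLIE condition discussed above.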

The reader can better understand the logic of this algorithm by considering the TD error and its back-propagation. Even though this is only an analogy, it's possible to imagine the behavior of TD(λ) as being similar to the Stochastic Gradient Descent (SGD) algorithms employed to train a neural network. In fact, the error is propagated to the previous states (analogous to the lower layers of an MLP) and affects them proportionally to their importance, which is defined by their eligibility traces. Hence, a state with a higher eligibility trace can be considered more responsible for the error; therefore, the corresponding value must be corrected proportionally. This isn't a formal explanation, but it can simplify comprehension of the dynamics without an excessive loss of rigor.
