n-step returns

n-step returns are a simple but very useful concept known to yield better performance for several reinforcement learning algorithms, not just the advantage actor-critic-based algorithm. For example, the best-performing algorithm to date on the Atari suite of 57 games, which significantly outperforms the second-best algorithm, uses n-step returns. We will discuss that agent algorithm, called Rainbow, in Chapter 10, Exploring the learning environment landscape: Roboschool, Gym-Retro, StarCraft-II, DMLab.

Let's first get an intuitive understanding of the n-step return process, using the following diagram to illustrate one step in the environment. Assume that the agent is in state $s_t$ at time t = 1 and decides to take action $a_t$, which results in the environment transitioning to state $s_{t+1}$ at time t + 1 = 1 + 1 = 2, with the agent receiving a reward of $r_{t+1}$:

[Figure: a single-step interaction between the agent and the environment]

We can calculate the 1-step TD return using the following formula:

$G_{t:t+1} = r_{t+1} + \gamma V(s_{t+1})$

Here, $V(s_{t+1})$ is the value estimate of state $s_{t+1}$ according to the value function (critic), and $\gamma$ is the discount factor. In essence, the agent takes a step and uses the received reward plus the discounted value estimate of the next/resulting state to calculate the return.
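To make the calculation concrete, here is a minimal sketch of the 1-step TD return in Python. It assumes a hypothetical critic network that maps a state tensor to a scalar value estimate; the names one_step_td_return, critic, and done are illustrative and not taken from the ch8 code:

```python
import torch

def one_step_td_return(reward, next_state, critic, gamma=0.99, done=False):
    """1-step TD return: r_(t+1) + gamma * V(s_(t+1))."""
    with torch.no_grad():
        # Bootstrap with the critic's estimate of the next state's value,
        # unless the episode terminated at this step.
        next_value = 0.0 if done else critic(next_state).item()
    return reward + gamma * next_value
```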

If we let the agent continue interacting with the environment for a few more steps, the trajectory of the agent can be simply represented using the following diagram:

[Figure: a 5-step interaction between the agent and the environment]
This diagram shows a 5-step interaction between the agent and the environment. Following a similar approach to the 1-step return calculation in the previous paragraph, we can calculate the 5-step return using the following formula:

$G_{t:t+5} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \gamma^4 r_{t+5} + \gamma^5 V(s_{t+5})$

We can then use this as the TD target in step 3 of the advantage actor-critic algorithm to improve the performance of the agent.
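As a rough illustration of the general n-step case, the following sketch accumulates the discounted rewards of a trajectory and bootstraps from the critic's value estimate of the final state reached. The function name n_step_return and the critic argument are assumptions for illustration, not part of the book's ch8 code:

```python
import torch

def n_step_return(rewards, final_state, critic, gamma=0.99, done=False):
    """n-step return: r_(t+1) + gamma*r_(t+2) + ... + gamma^n * V(s_(t+n))."""
    with torch.no_grad():
        # Bootstrap from the value of the last state reached, unless the
        # episode terminated within these n steps.
        g = 0.0 if done else critic(final_state).item()
    # Fold in the rewards backwards so that each one is discounted correctly.
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g
```

With a list of five rewards, this reduces to the 5-step formula above, with $\gamma^5$ applied to the bootstrapped value of $s_{t+5}$.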

You can see how the performance of the advantage actor-critic agent with the 1-step return compares to its performance with the n-step return by running the ch8/a2c_agent.py script in any of the Gym environments, with the learning_step_thresh parameter in the parameters.json file set to 1 (for the 1-step return) or to 5 or 10 (for the n-step return).
For example, you can run the following command with learning_step_thresh=1:

(rl_gym_book) praveen@ubuntu:~/HOIAWOG/ch8$ python a2c_agent.py --env Pendulum-v0

You can monitor its performance using TensorBoard with the following command:

(rl_gym_book) praveen@ubuntu:~/HOIAWOG/ch8/logs$ tensorboard --logdir=.

After a million or so steps, you can compare its performance to that of an agent trained with learning_step_thresh=10. Note that the trained agent model will be saved at ~/HOIAWOG/ch8/trained_models/A2_Pendulum-v0.ptm. You can rename it or move it to a different directory before you start the second run so that the training starts from scratch!

To make the concept more explicit, let's discuss how we will use this in step 3 and the subsequent steps of the advantage actor-critic algorithm. We will first use the n-step return as the TD target and calculate the TD error using the following formula (step 3 of the algorithm):

$\delta_t = G_{t:t+5} - V(s_t)$

We will then follow step 4 in the algorithm discussed in the previous subsection and update the actor. Then, in step 5, we will update the critic using the following update rule:

$\omega \leftarrow \omega + \alpha_{critic} \, \delta_t \, \nabla_\omega V_\omega(s_t)$

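As a rough sketch of that critic update in code, the n-step return is treated as a fixed target and the critic takes one gradient step to reduce the squared TD error. The names update_critic, critic_optimizer, and n_step_target are assumptions for illustration and do not come from the book's ch8 code:

```python
import torch
import torch.nn.functional as F

def update_critic(critic, critic_optimizer, state, n_step_target):
    """One gradient step on the critic towards the n-step TD target."""
    value = critic(state)                                   # V(s_t)
    target = torch.as_tensor([n_step_target], dtype=torch.float32)
    critic_loss = F.mse_loss(value, target)                 # squared TD error
    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()
    # Return the TD error (advantage estimate) for reuse in the actor update.
    return (target - value).detach()
```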
We will then move on to step 6 of the algorithm to continue with the next state, $s_{t+5}$, using the 5-step transitions from $s_{t+5}$ until $s_{t+10}$ and calculating the 5-step return, and then repeat the procedure for updating the actor and the critic.
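Putting the pieces together, the following rough sketch shows the overall n-step rollout-and-update loop under the same assumptions as the earlier snippets: the actor is assumed to return an action and its log-probability, the helpers n_step_return and update_critic are the ones sketched above, and the actor update uses the standard policy-gradient rule weighted by the TD error. None of these names come from the book's ch8 code; they are for illustration only:

```python
import torch

def train_n_step_a2c(env, actor, critic, actor_optimizer, critic_optimizer,
                     n=5, gamma=0.99, num_updates=1000):
    """Rough sketch: collect n-step trajectories, then update actor and critic."""
    state = env.reset()
    for _ in range(num_updates):
        start_state = torch.as_tensor(state, dtype=torch.float32)
        rewards, log_probs = [], []
        done = False
        for _ in range(n):  # steps 2 and 6: gather an n-step trajectory
            action, log_prob = actor(torch.as_tensor(state, dtype=torch.float32))
            # The action may need converting to the format the env expects.
            state, reward, done, _ = env.step(action)
            rewards.append(reward)
            log_probs.append(log_prob)
            if done:
                break
        # n-step TD target, bootstrapped from the last state reached
        target = n_step_return(rewards,
                               torch.as_tensor(state, dtype=torch.float32),
                               critic, gamma, done)
        # Step 3: TD error (advantage estimate) for the starting state
        with torch.no_grad():
            td_error = target - critic(start_state).item()
        # Step 4: policy-gradient update of the actor, weighted by the TD error
        actor_loss = -torch.stack(log_probs).sum() * td_error
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()
        # Step 5: critic update towards the n-step TD target
        update_critic(critic, critic_optimizer, start_state, target)
        if done:
            state = env.reset()
```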
