TD prediction

Now that we have an example, let's look at the mathematical expression of the TD prediction equation:

V(s) ← V(s) + α[R + γV(s') − V(s)]

The preceding equation states that the value of the current state, V(s), is updated to the sum of the value of the current state, V(s), and the learning rate α times the TD error. Wait, what is the TD error? Let's take a look:

δ = R + γV(s') − V(s)

It is the difference between the current state value, V(s), and the TD target, where the TD target is the sum of the reward R and the discount factor γ times the value of the next state, V(s'). The TD target, R + γV(s'), serves as our estimate of the return from the current state. The TD algorithm for evaluating the value function is given as follows:
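A tiny numeric sketch may help make the TD error concrete. All values below (reward, discount factor, value estimates, learning rate) are made up purely for illustration:

```python
# Hypothetical values for one transition s → s'
reward = 1.0        # R: reward observed after leaving state s
gamma = 0.9         # γ: discount factor
v_s = 0.5           # V(s): current estimate of the current state's value
v_s_next = 0.8      # V(s'): current estimate of the next state's value
alpha = 0.1         # α: learning rate

td_target = reward + gamma * v_s_next   # R + γV(s')
td_error = td_target - v_s              # δ = R + γV(s') − V(s)
v_s_updated = v_s + alpha * td_error    # V(s) ← V(s) + αδ

print(td_target, td_error, v_s_updated)
```

Note that the update moves V(s) only a fraction α of the way toward the TD target, which keeps the estimate stable when rewards are noisy.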

Input: the policy π to be evaluated

Initialize V(s) arbitrarily
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        A ← action given by π for S
        Take action A, observe R, S'
        V(S) ← V(S) + α[R + γV(S') − V(S)]
        S ← S'
    until S is terminal
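The pseudocode above can be sketched directly in Python. The environment below is a hypothetical two-state chain (state 0 → state 1 → terminal state 2, with a reward of 1 on the final step) invented purely to exercise the algorithm; the function names `td0_evaluate`, `step`, and `reset` are illustrative, not from any particular library:

```python
def td0_evaluate(step, reset, policy, states, alpha=0.1, gamma=0.9,
                 episodes=1000):
    """TD(0) policy evaluation, mirroring the pseudocode above."""
    V = {s: 0.0 for s in states}     # Initialize V(s) arbitrarily (here, to 0)
    for _ in range(episodes):        # Repeat (for each episode)
        s = reset()                  # Initialize S
        done = False
        while not done:              # Repeat (for each step of episode)
            a = policy(s)            # A ← action given by π for S
            r, s_next, done = step(s, a)   # Take action A, observe R, S'
            # V(S) ← V(S) + α[R + γV(S') − V(S)]
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next               # S ← S'
    return V

# A toy deterministic chain: 0 → 1 → 2 (terminal), reward 1 on the last step.
def reset():
    return 0

def step(s, a):
    # Returns (reward R, next state S', done flag)
    return (0.0, 1, False) if s == 0 else (1.0, 2, True)

policy = lambda s: 0   # the single available action

V = td0_evaluate(step, reset, policy, states=[0, 1, 2])
print(V)
```

With enough episodes the estimates approach the true values: V(1) ≈ 1 (the immediate reward) and V(0) ≈ γ · V(1) ≈ 0.9, since the terminal state's value stays at 0.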