Value iteration

An alternative approach to policy iteration is provided by the value iteration algorithm. The main assumption stems from the empirical observation that the policy evaluation step converges rather quickly, so it's reasonable to stop the process after a fixed number of steps (normally 1). In fact, policy iteration can be imagined as a game where the first player tries to find the correct values given a stable policy, while the second one creates a new policy that is greedy with respect to those values. Clearly, the second step invalidates the previous evaluation, forcing the first player to repeat the process. However, as the Bellman equation has a single fixed point, the algorithm converges to a solution characterized by the fact that the policy doesn't change anymore and, consequently, the evaluation becomes stable. This process can be simplified by removing the policy improvement step and carrying out the evaluation in a greedy fashion. Formally, each step is based on the following update rule:

V(t+1)(s) = maxₐ Σs' p(s'|s, a) [r(s, a, s') + γ V(t)(s')]

Now the iteration doesn't consider the policy anymore (implicitly assuming that it will be greedy with respect to the values) and selects V(t+1)(s) as the maximum expected value obtainable over all actions. In other words, value iteration anticipates the choice made by the policy improvement step by selecting the value corresponding to the action that is most likely (p → 1) to be chosen. It's not difficult to extend the convergence proof presented in the previous section to this case; therefore, V(∞) → V(opt), just as in policy iteration. However, the average number of iterations is normally smaller, because the process is not driven by an initial random policy that can slow down the convergence of the values.

When the values become stable, the optimal greedy policy is simply obtained as:

π(s) = argmaxₐ Σs' p(s'|s, a) [r(s, a, s') + γ V(s')]

This step is formally equivalent to a policy improvement iteration, which, however, is done only once at the end of the process.
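
As a concrete illustration, the following Python sketch shows how this extraction can be implemented once the values are stable. The tensors P and R (with P[s, a, s'] = p(s'|s, a) and R[s, a, s'] equal to the corresponding reward), the discount factor gamma, and the function name are hypothetical assumptions introduced here for illustration:

import numpy as np

def extract_greedy_policy(V, P, R, gamma):
    # Assumed (hypothetical) shapes:
    #   V: (nS,)          state values
    #   P: (nS, nA, nS)   P[s, a, s2] = p(s2 | s, a)
    #   R: (nS, nA, nS)   reward associated with the transition (s, a, s2)
    # One-step lookahead: Q(s, a) = sum over s' of p(s'|s, a) * (r + gamma * V(s'))
    Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[np.newaxis, np.newaxis, :])
    # pi(s) = argmax over a of Q(s, a), which is formally a single policy improvement step
    return np.argmax(Q, axis=1)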

The complete value iteration algorithm (as proposed by Sutton and Barto) is:

  1. Set the initial value array, V(s) = 0 ∀ s ∈ S
  2. Set a tolerance threshold, Thr, (for example, Thr = 0.0001)
  3. Set a maximum number of iterations, Niter
  4. Set a counter, e = 0
  5. While e < Niter:
    1. e += 1
    2. Do:
      1. Set Vold(s) = V(s) ∀ s ∈ S
      2. Perform a value update step (using the rule defined above), reading the current values from Vold(s) and updating V(s)
    3. While Avg(|V(s) - Vold(s)|) > Thr
  6. Output the final deterministic policy π(s) = argmaxₐ Q(s, a)
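
A minimal end-to-end NumPy sketch of this procedure is shown below. The toy MDP (two states, two actions), the tensors P and R, and the values of gamma, Thr, and Niter are illustrative assumptions, not taken from the text; the loop follows the steps above, sweeping the value array until the average absolute change falls below Thr (or Niter is reached) and then extracting the greedy policy:

import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions (illustrative values only)
nS, nA = 2, 2
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[s, a, s'] = p(s'|s, a)
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],    # R[s, a, s'] = reward of the transition
              [[0.0, 1.0], [0.0, 0.0]]])
gamma, Thr, Niter = 0.9, 1e-4, 1000

V = np.zeros(nS)                           # Step 1: initial value array V(s) = 0
for e in range(Niter):                     # Step 5: at most Niter sweeps
    V_old = V.copy()                       # Vold(s) = V(s)
    # Value update: V(s) = max over a of sum over s' of p(s'|s, a) * (r + gamma * Vold(s'))
    Q = np.einsum('ijk,ijk->ij', P, R + gamma * V_old[np.newaxis, np.newaxis, :])
    V = Q.max(axis=1)
    if np.mean(np.abs(V - V_old)) <= Thr:  # Stop when Avg(|V(s) - Vold(s)|) <= Thr
        break

pi = np.argmax(Q, axis=1)                  # Step 6: final deterministic policy
print('V:', V)
print('pi:', pi)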