Off-policy Monte Carlo control

Off-policy Monte Carlo is another interesting Monte Carlo control method. In this method, we have two policies: a behavior policy and a target policy. In the off-policy method, the agent follows one policy while, at the same time, trying to learn and improve a different policy. The policy the agent follows is called the behavior policy, and the policy the agent tries to evaluate and improve is called the target policy. The behavior policy and the target policy are kept entirely separate. The behavior policy explores all possible states and actions, which is why it is called a soft policy, whereas the target policy is a greedy policy (it always selects the action that has the maximal value).
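To make the distinction concrete, here is a minimal Python sketch (the `Q` table, `state` representation, and `n_actions` are assumed for illustration, not defined in this chapter): the target policy greedily picks the best-valued action, while the behavior policy stays soft by occasionally exploring.

```python
import numpy as np

def target_policy(Q, state):
    # Greedy target policy: always take the action with the highest Q value
    return int(np.argmax(Q[state]))

def behavior_policy(Q, state, n_actions, epsilon=0.1):
    # Epsilon-soft behavior policy: keeps exploring every action
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))
```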

Our goal is to estimate the Q function for the target policy π, but our agent behaves according to a completely different policy, the behavior policy. What can we do now? We can estimate the value of the target policy by using the episodes generated while following the behavior policy. How can we use episodes from one policy to evaluate another? We use a technique called importance sampling. It is a technique for estimating values under one distribution given samples drawn from another.
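As a rough sketch of the general idea (the standard importance-sampling identity, not a formula quoted from this chapter): to estimate an expectation under a distribution $p$ using samples drawn from a distribution $q$, we reweight each sample by the ratio of the two densities:

$$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)} f(x)\right] \approx \frac{1}{n}\sum_{i=1}^{n} \frac{p(x_i)}{q(x_i)} f(x_i), \qquad x_i \sim q$$

In our setting, $p$ corresponds to trajectories under the target policy and $q$ to trajectories under the behavior policy.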

Importance sampling is of two types:

  • Ordinary importance sampling
  • Weighted importance sampling

In ordinary importance sampling, we scale each return by the importance-sampling ratio and take a simple average over the scaled returns, whereas in weighted importance sampling we take a weighted average of the returns, normalizing by C, the cumulative sum of the weights.
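In the standard notation (a sketch of the usual definitions, not formulas quoted from this chapter), the importance-sampling ratio of a trajectory from time $t$ to the end of the episode $T$ is

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

where $\pi$ is the target policy and $b$ is the behavior policy. Ordinary importance sampling averages the scaled returns directly, while weighted importance sampling normalizes by the sum of the ratios:

$$V_{\text{ordinary}}(s) = \frac{1}{|\mathcal{T}(s)|}\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t, \qquad V_{\text{weighted}}(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$

Here $\mathcal{T}(s)$ is the set of time steps at which $s$ is visited and $G_t$ is the return from time $t$; the denominator of the weighted form is the cumulative sum of weights that C tracks incrementally.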

Let us just see this step by step (a full code sketch follows this list):

  1. First, we initialize Q(s,a) to random values, C(s,a) to 0, and the weight W to 1.
  2. Then we choose the target policy, which is a greedy policy. This means it will pick the action that has the maximum value in the Q table.
  3. We select our behavior policy. A behavior policy is not greedy and it can select any state-action pair.
  4. Then we begin our episode, perform an action a in the state s according to our behavior policy, and store the reward. We repeat this until the end of the episode.
  5. Now, for each step of the episode, iterating from the last step backward, we do the following:
    1. We calculate the return G. We know that the return is the sum of discounted rewards: G = discount_factor * G + reward.
    2. Then we update C(s,a) as C(s,a) = C(s,a) + W.
    3. We update Q(s,a) as Q(s,a) = Q(s,a) + (W / C(s,a)) * (G - Q(s,a)).
    4. We update the weight W. Since the target policy is greedy, we first check whether a is the greedy action for s; if it is not, we end the inner loop, otherwise we update W as W = W * (1 / probability of a under the behavior policy).
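Putting the steps together, here is a minimal Python sketch of the procedure (not the book's own code). It assumes a Gym-style environment with discrete, hashable states and `env.action_space.n` actions, where `env.step(action)` returns `(next_state, reward, done, info)`; the episode count, discount factor, and epsilon are illustrative defaults.

```python
from collections import defaultdict

import numpy as np


def off_policy_mc_control(env, n_episodes=50000, gamma=1.0, epsilon=0.1):
    """Off-policy Monte Carlo control with weighted importance sampling (sketch)."""
    n_actions = env.action_space.n

    # Step 1: initialize Q(s,a) and the cumulative weights C(s,a)
    Q = defaultdict(lambda: np.zeros(n_actions))
    C = defaultdict(lambda: np.zeros(n_actions))

    for _ in range(n_episodes):
        # Steps 3-4: generate an episode with the epsilon-soft behavior policy,
        # storing the behavior probability of each chosen action for step 5.4
        episode = []
        state = env.reset()
        done = False
        while not done:
            probs = np.full(n_actions, epsilon / n_actions)
            probs[int(np.argmax(Q[state]))] += 1.0 - epsilon
            action = int(np.random.choice(n_actions, p=probs))
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward, probs[action]))
            state = next_state

        # Step 5: loop over the episode in reverse order
        G, W = 0.0, 1.0
        for state, action, reward, behavior_prob in reversed(episode):
            G = gamma * G + reward                 # 5.1: discounted return
            C[state][action] += W                  # 5.2: C(s,a) = C(s,a) + W
            # 5.3: weighted importance-sampling update of Q(s,a)
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            # Step 2: the target policy is greedy with respect to Q;
            # if it would not take this action, the remaining ratio is zero
            if action != int(np.argmax(Q[state])):
                break
            # 5.4: W = W * pi(a|s) / b(a|s), and pi(a|s) = 1 for the greedy action
            W = W / behavior_prob

    # The learned target policy: greedy with respect to the final Q
    target_policy = {s: int(np.argmax(values)) for s, values in Q.items()}
    return Q, target_policy
```

Note that breaking out of the inner loop when the behavior action disagrees with the greedy target action is what makes the weights valid: the target policy assigns zero probability to that action, so all earlier returns in the episode would receive zero weight anyway.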