Double DQN

Deep Q learning is pretty cool, right? It generalizes its learning well enough to play a wide range of Atari games. But the problem with DQN is that it tends to overestimate Q values. This is because of the max operator in the Q learning equation: the max operator uses the same value for both selecting and evaluating an action. What do I mean by that? Let's suppose we are in a state s and we have five actions, a1 to a5, and that a3 is the best action. When we estimate Q values for all these actions in state s, the estimates will contain some noise and differ from the actual values. Due to this noise, action a2 might get a higher estimated value than the optimal action a3. If we then select the best action as the one with the maximum estimated value, we will end up selecting the suboptimal action a2 instead of the optimal action a3.
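To see how noise alone leads to overestimation, here is a minimal sketch (the Q values and noise scale are made up purely for illustration). It repeatedly adds zero-mean noise to a fixed set of true Q values and checks how often the max over the noisy estimates exceeds the true maximum:

```python
import numpy as np

# Hypothetical true Q values for five actions a1..a5 in state s;
# a3 (index 2) is the best action.
true_q = np.array([1.0, 1.2, 1.5, 0.8, 1.1])

np.random.seed(0)
overestimates = 0
for _ in range(1000):
    # Estimated Q values = true values + zero-mean noise
    estimated_q = true_q + np.random.normal(0.0, 0.5, size=5)
    # Taking the max over noisy estimates tends to exceed the true maximum
    if estimated_q.max() > true_q.max():
        overestimates += 1

print(f"Noisy max exceeded the true max in {overestimates}/1000 trials")
```

Even though the noise has zero mean, the max operator systematically picks whichever estimate happens to be inflated, so the resulting target is biased upward.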

We can solve this problem by having two separate Q functions, each learning independently. One Q function is used to select an action and the other Q function is used to evaluate an action. We can implement this by just tweaking the target function of DQN. Recall the target function of DQN:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta)$$

We can modify our target function as follows:

$$y = r + \gamma \, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta'\big)$$

In the preceding equation, we have two Q functions, each with different weights. The Q function with weights $\theta$ is used to select the action and the other Q function with weights $\theta'$ is used to evaluate the action. We can also switch the roles of these two Q functions.
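The following is a minimal sketch of how the two targets differ, assuming we already have the Q-value estimates for the next state s' from an online network (weights theta) and a separate target network (weights theta'); the function names and the example numbers are hypothetical:

```python
import numpy as np

def dqn_target(reward, next_q_online, gamma=0.99):
    # Standard DQN: the same Q values both select and evaluate the action
    return reward + gamma * np.max(next_q_online)

def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99):
    # Double DQN: the online network (weights theta) selects the action...
    best_action = np.argmax(next_q_online)
    # ...and the target network (weights theta') evaluates that action
    return reward + gamma * next_q_target[best_action]

# Hypothetical Q-value estimates for the next state s' from the two networks
next_q_online = np.array([1.1, 2.3, 1.8, 0.9, 1.5])   # Q(s', a; theta)
next_q_target = np.array([1.0, 1.7, 2.0, 1.1, 1.4])   # Q(s', a; theta')

print("DQN target:       ", dqn_target(1.0, next_q_online))
print("Double DQN target:", double_dqn_target(1.0, next_q_online, next_q_target))
```

Because the action is chosen by one set of weights and scored by the other, a value that is overestimated by the online network is unlikely to also be overestimated by the target network, which reduces the upward bias in the target.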
