How A3C works

First, each worker agent resets its local network to the parameters of the global network and then starts interacting with its own copy of the environment. Each worker follows a different exploration policy, so together the workers gather diverse experience. Each worker then computes the value loss and the policy loss, calculates the gradients of these losses, and sends the gradients to the global network, which uses them to update its parameters. The cycle continues: the worker resets its local network to the updated global network and repeats the same process.

Before looking at the value and policy loss functions, we will see how the advantage function is calculated. As we know, the advantage is the difference between the Q function and the value function:

$$A(s, a) = Q(s, a) - V(s)$$

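Since we don't actually calculate the Q value directly in A3C, we use the discounted return as an estimate of the Q value. The discounted return R can be written as follows:

$$R = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$$

Here, $r_t$ is the reward received at time step $t$ and $\gamma$ is the discount factor.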

We replace the Q function with the discounted return R as follows:

$$A(s, a) = R - V(s)$$
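To make the substitution concrete, the following minimal sketch computes the discounted return for every step of a short rollout and then subtracts the value estimates to get the advantage, that is, R - V(s). The function names, the `bootstrap_value` argument, and the sample numbers are illustrative assumptions, not part of the original text.

```python
def discounted_returns(rewards, gamma=0.99, bootstrap_value=0.0):
    """Compute the discounted return R for every step of a rollout.

    bootstrap_value is an assumed extra: the value estimate of the state
    that follows the last reward (0.0 if the episode ended there).
    """
    returns = []
    running = bootstrap_value
    # Walk backwards so each step reuses the return of the step after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage estimate: discounted return minus the value of the state."""
    return [R - v for R, v in zip(returns, values)]


# Hypothetical three-step rollout.
rewards = [1.0, 0.0, 2.0]
values = [1.5, 1.0, 1.0]   # V(s) predicted by the critic for each state

returns = discounted_returns(rewards, gamma=0.5)
advs = advantages(returns, values)

print(returns)  # [1.5, 1.0, 2.0]
print(advs)     # [0.0, 0.0, 1.0]
```

A positive advantage, as in the last step, means the return was better than the critic expected for that state.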

Now, we can write our value loss as the squared difference between the discounted return and the value of a state:

$$L_{\text{value}} = \sum \left(R - V(s)\right)^2$$

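And the policy loss can be defined as follows:

$$L_{\text{policy}} = -\log \pi(a \mid s)\, A(s, a) - \beta\, H(\pi)$$

Here, $\beta$ is a coefficient that controls how strongly the term $H(\pi)$ contributes to the loss.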

Okay, what is that new term H(π)? It is the entropy term, and it is used to ensure that the policy explores sufficiently. Entropy measures the spread of the action probabilities. When the entropy is high, the action probabilities are close to uniform, so the agent is unsure which action to perform and tries many of them. When the entropy is low, one action has a much higher probability than the others, and the agent picks that action almost every time. Thus, rewarding entropy in the loss function encourages the agent to keep exploring and helps it avoid getting stuck in a local optimum.
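The effect of the entropy term can be checked numerically. The sketch below computes the entropy of a near-uniform and a peaked action distribution, along with the value loss and a policy loss for a small batch. It assumes the common form in which the entropy is scaled by a coefficient `beta` and subtracted from the loss. The function names, the default `beta=0.01`, and the sample numbers are illustrative assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy H(pi) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def value_loss(returns, values):
    """Sum of squared differences between discounted return and V(s)."""
    return sum((R - v) ** 2 for R, v in zip(returns, values))


def policy_loss(action_probs, actions, advantages, beta=0.01):
    """Policy-gradient loss with an entropy bonus (assumed form).

    action_probs: one probability list per step
    actions:      index of the action taken at each step
    advantages:   advantage estimate per step, treated as a constant
    beta:         assumed entropy coefficient
    """
    loss = 0.0
    for probs, action, adv in zip(action_probs, actions, advantages):
        loss += -math.log(probs[action]) * adv - beta * entropy(probs)
    return loss


uniform = [0.25, 0.25, 0.25, 0.25]   # agent is unsure: high entropy
peaked = [0.97, 0.01, 0.01, 0.01]    # agent is confident: low entropy

print(entropy(uniform) > entropy(peaked))  # True

# Hypothetical one-step batch: action 0 taken with advantage 2.0.
step_probs = [[0.5, 0.5]]
without_bonus = policy_loss(step_probs, [0], [2.0], beta=0.0)
with_bonus = policy_loss(step_probs, [0], [2.0], beta=0.1)
print(with_bonus < without_bonus)  # True
```

Because the entropy is subtracted, a more spread-out policy lowers the loss, which is what nudges the optimizer toward continued exploration.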
