Proximal Policy Optimization

Now we will look at another policy optimization algorithm called Proximal Policy Optimization (PPO). It acts as an improvement to TRPO and has become the default RL algorithm of choice for solving many complex RL problems due to its performance. It was proposed by researchers at OpenAI to overcome the shortcomings of TRPO. Recall the surrogate objective function of TRPO: it is a constrained optimization problem where we impose a constraint that the average KL divergence between the old and new policy should be less than a threshold, $\delta$. But the problem with TRPO is that it requires a lot of computing power, since the conjugate gradient method is needed to perform the constrained optimization.

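To make the constrained quantity concrete, here is a minimal NumPy sketch (not the book's code) that computes the average KL divergence between an old and a new policy; the action-probability tables are made-up values for three sampled states with four discrete actions each:

```python
import numpy as np

# Hypothetical action probabilities of the old and new policies
# (three sampled states, four discrete actions; values are made up).
pi_old = np.array([[0.25, 0.25, 0.25, 0.25],
                   [0.10, 0.20, 0.30, 0.40],
                   [0.70, 0.10, 0.10, 0.10]])
pi_new = np.array([[0.30, 0.20, 0.25, 0.25],
                   [0.05, 0.25, 0.30, 0.40],
                   [0.60, 0.20, 0.10, 0.10]])

# KL(pi_old || pi_new) per state, averaged over the sampled states.
# TRPO constrains this average to stay below a small threshold delta.
kl_per_state = np.sum(pi_old * np.log(pi_old / pi_new), axis=1)
print("average KL divergence:", kl_per_state.mean())
```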
So, PPO modifies the objective function of TRPO by folding the constraint into the objective itself as a penalty on large policy updates, so that we don't have to compute conjugate gradients. Now let's see how PPO works. We define $r_t(\theta)$ as the probability ratio between the new policy and the old policy:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

So, we can write our objective function as:

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[ r_t(\theta)\, \hat{A}_t \right]$$

$L^{CPI}$ denotes the conservative policy iteration objective, and $\hat{A}_t$ is the estimated advantage at timestep $t$.

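As a rough illustration (a sketch, not the book's implementation), the ratio is usually computed from the log-probabilities of the sampled actions; the arrays below are made-up values for a batch of three timesteps:

```python
import numpy as np

# Made-up log-probabilities of the sampled actions under the new and old
# policies, plus advantage estimates, for a batch of three timesteps.
log_prob_new = np.array([-1.10, -0.60, -2.00])
log_prob_old = np.array([-1.20, -0.90, -1.50])
advantages   = np.array([ 0.50,  1.20, -0.70])

# Probability ratio r_t(theta) = pi_new(a_t|s_t) / pi_old(a_t|s_t),
# computed from the difference of log-probabilities for numerical stability.
ratio = np.exp(log_prob_new - log_prob_old)

# Unclipped surrogate L^CPI: the empirical mean of r_t * A_t over the batch.
L_cpi = np.mean(ratio * advantages)
print("ratios:", ratio, "L^CPI:", L_cpi)
```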
But maximizing $L^{CPI}$ without a constraint would lead to a huge policy update. So, we redefine our objective function by adding a term that penalizes large policy updates. Now the objective function becomes:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[ \min\!\left( r_t(\theta)\hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]$$

We have just added a new term, $\operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t$, to the actual equation. What does this mean? It clips the value of $r_t(\theta)$ to the interval $[1-\epsilon,\, 1+\epsilon]$; that is, if an increase in $r_t(\theta)$ would make the objective function grow too much, clipping the ratio to this interval limits its effect.

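A minimal NumPy sketch of the clipped surrogate, reusing the made-up batch from above and assuming $\epsilon = 0.2$ (the value suggested in the PPO paper):

```python
import numpy as np

def clipped_surrogate(log_prob_new, log_prob_old, advantages, epsilon=0.2):
    """L^CLIP: mean of min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)."""
    ratio = np.exp(log_prob_new - log_prob_old)              # r_t(theta)
    unclipped = ratio * advantages                           # r_t * A_t
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))

log_prob_new = np.array([-1.10, -0.60, -2.00])
log_prob_old = np.array([-1.20, -0.90, -1.50])
advantages   = np.array([ 0.50,  1.20, -0.70])
print("L^CLIP:", clipped_surrogate(log_prob_new, log_prob_old, advantages))
```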
We clip the probability ratio either at $1-\epsilon$ or at $1+\epsilon$, depending on the following two cases:

  • Case 1

When the advantage is positive, the corresponding action should be preferred over the average of all other actions, so we want to increase the value of $r_t(\theta)$ for that action to give it a greater chance of being selected. Because of the clipping, however, the effective value of $r_t(\theta)$ in the objective will not exceed $1+\epsilon$:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[ \min\!\left( r_t(\theta),\, 1+\epsilon \right) \hat{A}_t \right] \quad \text{when } \hat{A}_t > 0$$

  • Case 2

When the value of the advantage is negative, the action is worse than average and should not be adopted. So, in this case, we want to reduce the value of $r_t(\theta)$ for that action so that it has a lower chance of being selected. Similarly, because of the clipping, the effective value of $r_t(\theta)$ in the objective will not decrease below $1-\epsilon$ (the short numerical check after this list illustrates both cases):

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[ \max\!\left( r_t(\theta),\, 1-\epsilon \right) \hat{A}_t \right] \quad \text{when } \hat{A}_t < 0$$

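To make the two cases concrete, here is a quick numerical check of the per-timestep clipped term, using arbitrary ratio and advantage values and $\epsilon = 0.2$:

```python
import numpy as np

def per_step_objective(ratio, advantage, epsilon=0.2):
    # min( r * A, clip(r, 1 - eps, 1 + eps) * A ) for a single timestep
    return min(ratio * advantage,
               float(np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)) * advantage)

# Case 1: positive advantage. Above r = 1 + epsilon the objective flattens
# at (1 + eps) * A, so there is no incentive to push the ratio any higher.
print(per_step_objective(ratio=1.5, advantage=+1.0))   # 1.2 = (1 + 0.2) * 1.0

# Case 2: negative advantage. Below r = 1 - epsilon the objective flattens
# at (1 - eps) * A, so there is no incentive to push the ratio any lower.
print(per_step_objective(ratio=0.5, advantage=-1.0))   # -0.8 = (1 - 0.2) * -1.0
```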
When we are using neural network architectures, we must define a loss function that adds the value function error to our clipped objective. We will also add an entropy bonus to ensure enough exploration, as we did in A3C. So our final objective function becomes:

$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \right]$$

$c_1$ and $c_2$ are coefficients, $L_t^{VF}$ is the squared-error loss between the predicted and target value functions, that is, $\left(V_\theta(s_t) - V_t^{\text{targ}}\right)^2$, and $S$ is the entropy bonus.

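Putting everything together, here is a minimal NumPy sketch of the combined objective. The values $\epsilon = 0.2$, $c_1 = 0.5$, and $c_2 = 0.01$ are assumed, commonly used hyperparameters rather than values prescribed by the text, and the inputs are assumed to come from your rollouts and networks:

```python
import numpy as np

def ppo_objective(log_prob_new, log_prob_old, advantages,
                  values_pred, values_target, entropy,
                  epsilon=0.2, c1=0.5, c2=0.01):
    """Combined objective L^{CLIP+VF+S} = L^CLIP - c1 * L^VF + c2 * S.

    This is maximized; with a gradient-descent optimizer you would
    minimize its negative instead.
    """
    ratio = np.exp(log_prob_new - log_prob_old)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    l_clip = np.mean(np.minimum(ratio * advantages, clipped * advantages))

    # Value function loss: squared error between predicted and target values.
    l_vf = np.mean((values_pred - values_target) ** 2)

    # Entropy bonus S, averaged over the batch, encourages exploration.
    s = np.mean(entropy)

    return l_clip - c1 * l_vf + c2 * s

# Example call with made-up rollout statistics for three timesteps.
print(ppo_objective(
    log_prob_new=np.array([-1.10, -0.60, -2.00]),
    log_prob_old=np.array([-1.20, -0.90, -1.50]),
    advantages=np.array([0.50, 1.20, -0.70]),
    values_pred=np.array([1.00, 0.50, -0.20]),
    values_target=np.array([1.20, 0.40, -0.50]),
    entropy=np.array([1.05, 0.98, 1.10]),
))
```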