Deep deterministic policy gradient

In Chapter 8, Atari Games with Deep Q Network, we looked at how DQN works and applied DQNs to play Atari games. However, those are discrete environments with a finite set of actions. Now think of a continuous environment, such as training a robot to walk; there it is not feasible to apply Q learning, because finding a greedy policy requires an expensive optimization at each and every step. Even if we discretize such a continuous environment, we might lose important features and still end up with a huge action space, and it is difficult to attain convergence when the action space is huge.

So we use a new architecture called Actor Critic, with two networks: an Actor and a Critic. The Actor Critic architecture combines the policy gradient and the state-action value function. The role of the Actor network is to determine the best action in a given state by tuning the policy parameter $\theta$, and the role of the Critic is to evaluate the action produced by the Actor. The Critic evaluates the Actor's action by computing the temporal difference (TD) error. That is, we perform a policy gradient update on the Actor network to select actions, and the Critic network evaluates the actions produced by the Actor network using the TD error. The Actor Critic architecture is shown in the following diagram:

Similar to DQN, here we use an experience replay buffer, from which we sample mini-batches of experiences to train the Actor and Critic networks. We also use separate target Actor and Critic networks for computing the loss.
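As a rough illustration of such a buffer, here is a minimal Python sketch; the class name, capacity, and batch size are assumptions for illustration, not the book's own code:

```python
import random
from collections import deque

class ReplayBuffer:
    """A simple FIFO experience replay buffer (illustrative sketch)."""

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        # Each transition (s, a, r, s', done) is stored as a tuple
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Draw a random mini-batch of transitions for training
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```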

For example, in a Pong game we have features on different scales, such as position, velocity, and so on. So we scale the features so that they are all on the same scale, using a method called batch normalization, which normalizes the features to have zero mean and unit variance. How do we explore new actions? In a continuous environment there are infinitely many possible actions, so to explore new actions we add some noise $\mathcal{N}$ to the action produced by the Actor network. We generate this noise using the Ornstein-Uhlenbeck random process.
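The Ornstein-Uhlenbeck process can be simulated in a few lines of NumPy; the following is a minimal sketch, and the parameter values (theta, sigma, dt) are common defaults assumed for illustration rather than values prescribed here:

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated exploration noise for continuous actions."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu = mu * np.ones(action_dim)
        self.theta = theta      # strength of the pull back toward the mean
        self.sigma = sigma      # scale of the random fluctuations
        self.dt = dt
        self.reset()

    def reset(self):
        # Start the process at its mean value
        self.x = np.copy(self.mu)

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(len(self.x)))
        self.x = self.x + dx
        return self.x
```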

Now we will look at the DDPG algorithm in detail.

Let's say we have two networks: the Actor network and the Critic network. We represent the Actor network as $\mu(s|\theta^{\mu})$, which takes a state as input and returns the action, where $\theta^{\mu}$ denotes the Actor network weights. We represent the Critic network as $Q(s, a|\theta^{Q})$, which takes a state and an action as input and returns the Q value, where $\theta^{Q}$ denotes the Critic network weights.

Similarly, we define target networks for both the Actor and the Critic as $\mu'(s|\theta^{\mu'})$ and $Q'(s, a|\theta^{Q'})$ respectively, where $\theta^{\mu'}$ and $\theta^{Q'}$ are the weights of the target Actor and target Critic networks.
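To make the notation concrete, here is a minimal sketch of the Actor, the Critic, and their target copies in PyTorch; the layer sizes, the example dimensions, and the choice of PyTorch itself are illustrative assumptions and may differ from the book's own implementation:

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu(s | theta_mu): maps a state to a deterministic action."""

    def __init__(self, state_dim, action_dim, action_bound):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),   # squash to [-1, 1]
        )
        self.action_bound = action_bound

    def forward(self, state):
        # Scale the squashed output to the environment's action range
        return self.net(state) * self.action_bound

class Critic(nn.Module):
    """Q(s, a | theta_Q): maps a (state, action) pair to a scalar Q value."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Target networks start as exact copies of the online networks
actor, critic = Actor(3, 1, 2.0), Critic(3, 1)   # example dimensions only
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
```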

We update the Actor network weights with policy gradients and the Critic network weights with gradients calculated from the TD error.

First, we select an action by adding exploration noise $\mathcal{N}$ to the action produced by the Actor network, that is, $a = \mu(s|\theta^{\mu}) + \mathcal{N}$. We perform this action in a state $s$, receive a reward $r$, and move to a new state $s'$. We store this transition information in the experience replay buffer.
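Continuing the earlier sketches, a single interaction step could look like the following; the Gym-style env interface and the actor, noise, and buffer objects are assumptions carried over from the previous snippets:

```python
import torch

def select_action(actor, state, noise):
    # a = mu(s | theta_mu) + N
    with torch.no_grad():
        action = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
    return action + noise.sample()

# One step of interaction (Gym-style API assumed for illustration):
# action = select_action(actor, state, ou_noise)
# next_state, reward, done, info = env.step(action)
# buffer.store(state, action, reward, next_state, done)
```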

After some iterations, we sample transitions from the replay buffer and train the networks. We calculate the target Q value as $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$ and compute the TD error as the loss:

$L = \frac{1}{M}\sum_{i}\left(y_i - Q(s_i, a_i|\theta^{Q})\right)^2$

where $M$ is the number of samples from the replay buffer used for training. We update the Critic network weights with gradients calculated from this loss $L$.
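In code, the target value and the Critic loss from the equations above could be computed as in this sketch; it assumes the mini-batch has already been converted to tensors, with rewards and dones shaped as columns of size (batch, 1):

```python
import torch
import torch.nn.functional as F

def critic_loss(batch, critic, target_actor, target_critic, gamma=0.99):
    states, actions, rewards, next_states, dones = batch  # (batch, ...) tensors

    with torch.no_grad():
        # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
        next_actions = target_actor(next_states)
        y = rewards + gamma * (1.0 - dones) * target_critic(next_states, next_actions)

    # L = (1/M) * sum_i (y_i - Q(s_i, a_i))^2  -- mean squared TD error
    return F.mse_loss(critic(states, actions), y)
```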

Similarly, we update our policy network (Actor) weights using the policy gradient. Then we update the weights of the target Actor and Critic networks. We update the weights of the target networks slowly, which promotes greater stability; this is called soft replacement:

$\theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'}$

$\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'}$
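A minimal sketch of the soft replacement step, assuming the PyTorch networks from the earlier snippets and a small mixing coefficient tau:

```python
def soft_update(online_net, target_net, tau=0.001):
    # theta_target <- tau * theta_online + (1 - tau) * theta_target
    for online_param, target_param in zip(online_net.parameters(),
                                          target_net.parameters()):
        target_param.data.copy_(tau * online_param.data
                                + (1.0 - tau) * target_param.data)

# After each training step:
# soft_update(actor, target_actor)
# soft_update(critic, target_critic)
```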

 