Target network

In our loss function, we calculate the squared difference between a target and predicted value:

$$\text{Loss} = \left(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\right)^2$$

We are using the same Q function for calculating the target value and the predicted value. In the preceding equation, the same weights $\theta$ are used for both the target Q and the predicted Q. Since the same network calculates both values, every update to $\theta$ shifts the target as well as the prediction. The network ends up chasing a moving target, and training can oscillate or diverge.
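The moving-target effect can be seen in a small sketch. It assumes a linear Q function with one weight vector per action in place of a neural network, and all names and numbers (`q_values`, `td_target`, the transition, the learning rate) are hypothetical choices made for illustration only.

```python
def q_values(weights, state):
    """Linear Q function: one weight vector per action, Q(s, a) = w_a . s."""
    return [sum(w * x for w, x in zip(w_a, state)) for w_a in weights]


def td_target(weights, reward, next_state, gamma):
    """Target value r + gamma * max_a' Q(s', a'; weights)."""
    return reward + gamma * max(q_values(weights, next_state))


# Hypothetical setup: 2 actions, 2 state features, a single transition.
theta = [[0.0, 0.0], [0.0, 0.0]]
state, action, reward, next_state = [1.0, 0.0], 0, 1.0, [1.0, 1.0]
gamma, lr = 0.9, 0.5

# Target and prediction are both computed from the same weights theta.
target_before = td_target(theta, reward, next_state, gamma)
td_error = target_before - q_values(theta, state)[action]

# One gradient step on the squared error, treating the target as a constant
# (the factor 2 from the derivative is folded into the learning rate).
theta[action] = [w + lr * td_error * x for w, x in zip(theta[action], state)]

# The same step that improved the prediction has also moved the target.
target_after = td_target(theta, reward, next_state, gamma)
print(target_before, target_after)
```

In this sketch the target starts at 1.0 and moves to about 1.45 after a single update, even though the transition itself has not changed.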

To avoid this problem, we use a separate network called a target network for calculating the target value. So, our loss function becomes:

$$\text{Loss} = \left(r + \gamma \max_{a'} Q(s', a'; \theta') - Q(s, a; \theta)\right)^2$$
You may notice that the parameter of the target Q is $\theta'$ instead of $\theta$. Our actual Q network, which is used for predicting Q values, learns its weights $\theta$ by gradient descent. The target network is frozen for several time steps, and then its weights $\theta'$ are updated by copying the weights $\theta$ from the actual Q network. Freezing the target network for a while and then refreshing it with the actual Q network weights stabilizes the training, because the target stays fixed between updates.
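The freeze-and-copy schedule described above can be sketched as follows. This is a minimal illustration, not the book's implementation: it assumes a linear Q function instead of a neural network, and the names `train`, `sync_every`, `theta`, and `theta_target`, as well as the sample transitions, are hypothetical.

```python
import copy


def q_values(weights, state):
    """Linear Q function: one weight vector per action, Q(s, a) = w_a . s."""
    return [sum(w * x for w, x in zip(w_a, state)) for w_a in weights]


def train(transitions, n_actions, n_features, gamma=0.9, lr=0.1, sync_every=4):
    theta = [[0.0] * n_features for _ in range(n_actions)]  # actual Q network
    theta_target = copy.deepcopy(theta)                      # frozen target network
    sync_steps = []

    for step, (s, a, r, s_next) in enumerate(transitions, start=1):
        # The target value is computed with the frozen weights theta_target.
        target = r + gamma * max(q_values(theta_target, s_next))
        td_error = target - q_values(theta, s)[a]

        # Gradient step on (target - Q(s, a; theta))^2; only theta changes
        # (the factor 2 from the derivative is folded into the learning rate).
        theta[a] = [w + lr * td_error * x for w, x in zip(theta[a], s)]

        # Every sync_every steps, copy the actual weights into the target network.
        if step % sync_every == 0:
            theta_target = copy.deepcopy(theta)
            sync_steps.append(step)

    return theta, theta_target, sync_steps


# Hypothetical data: the same transition (s, a, r, s') repeated ten times.
transitions = [([1.0, 0.0], 0, 1.0, [1.0, 1.0])] * 10
theta, theta_target, sync_steps = train(transitions, n_actions=2, n_features=2)
print(sync_steps)
```

Between two copies, the target weights stay fixed while `theta` keeps learning, so the two sets of weights differ until the next copy. The copy interval `sync_every` is a tunable choice; the value used here is arbitrary.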
