Prioritized experience replay

In the DQN architecture, we use experience replay to remove correlations between the training samples. However, uniformly sampling transitions from the replay memory is not an optimal method. Instead, we can prioritize transitions and sample according to priority, which helps the network learn more quickly and effectively. How do we prioritize the transitions? We prioritize the transitions that have a high TD error. We know that the TD error specifies the difference between the estimated Q value and the target Q value. So, transitions with a high TD error are the transitions we have to focus on and learn from, because those are the transitions where our estimate is furthest off. Intuitively, let us say you try to solve a set of problems, but you fail to solve two of them. You then give priority to those two problems to work out what went wrong and fix it.

We use two types of prioritization: proportional prioritization and rank-based prioritization.

In proportional prioritization, we define the priority as:

$$p_i = (|\delta_i| + \epsilon)^{\alpha}$$

Here, $p_i$ is the priority of transition $i$, $\delta_i$ is the TD error of transition $i$, and $\epsilon$ is simply some small positive constant that makes sure that every transition has non-zero priority. When $\delta_i$ is zero, adding $\epsilon$ gives the transition a small non-zero priority instead of zero priority. However, the transition will still have lower priority than the transitions whose $\delta_i$ is not zero. The exponent $\alpha$ denotes the amount of prioritization being used. When $\alpha$ is zero, every priority equals 1, which is simply the uniform case.
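As a minimal sketch of the proportional priority described here, the following NumPy snippet computes a priority from each TD error. The function name `proportional_priorities` and the default values of `alpha` and `eps` are illustrative assumptions, not values prescribed by the text.

```python
import numpy as np

def proportional_priorities(td_errors, alpha=0.6, eps=1e-2):
    """Return (|delta_i| + eps) ** alpha for each transition.

    alpha and eps defaults are illustrative assumptions.
    """
    td_errors = np.asarray(td_errors, dtype=np.float64)
    # The magnitude of the TD error drives the priority; eps keeps it non-zero.
    return (np.abs(td_errors) + eps) ** alpha

td_errors = [2.0, -0.5, 0.0]

# Larger |TD error| gives a larger priority; a zero TD error still gets eps ** alpha.
priorities = proportional_priorities(td_errors)

# With alpha = 0 every priority becomes 1, which corresponds to uniform sampling.
uniform = proportional_priorities(td_errors, alpha=0.0)
```

Note that the transition with a TD error of zero still receives a small positive priority, so it can still be sampled occasionally.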

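Now, we can translate this priority into a probability using the following formula:

$$P(i) = \frac{p_i}{\sum_k p_k}$$

Here, $P(i)$ is the probability of sampling transition $i$, $p_i$ is its priority, and the sum runs over all transitions $k$ in the replay buffer.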
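The conversion from priorities to sampling probabilities can be sketched as follows. This is only an illustration under simple assumptions: the priorities are held in a plain NumPy array, and the helper names `sampling_probabilities` and `sample_indices` are hypothetical. It uses a direct normalization rather than any specialized data structure.

```python
import numpy as np

def sampling_probabilities(priorities):
    """Normalize priorities so that they sum to 1."""
    priorities = np.asarray(priorities, dtype=np.float64)
    return priorities / priorities.sum()

def sample_indices(priorities, batch_size, rng):
    """Draw transition indices with probability proportional to priority."""
    probs = sampling_probabilities(priorities)
    return rng.choice(len(probs), size=batch_size, p=probs)

rng = np.random.default_rng(0)
priorities = np.array([4.0, 1.0, 1.0, 2.0])  # example priorities, chosen by hand

probs = sampling_probabilities(priorities)
batch = sample_indices(priorities, batch_size=1000, rng=rng)
```

In this example, the first transition holds half of the total priority, so it should make up roughly half of the sampled batch.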

In rank-based prioritization, we define the priority as:

$$p_i = \left(\frac{1}{\text{rank}(i)}\right)^{\alpha}$$

Here, rank(i) specifies the position of transition i in the replay buffer when the transitions are sorted from high TD error to low TD error, so the transition with the highest TD error has rank 1. After calculating the priority, we can convert the priority into a probability using the same formula, $P(i) = p_i / \sum_k p_k$.
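Rank-based prioritization can be sketched in the same style. The snippet below is an illustrative assumption rather than a reference implementation: it ranks transitions by the magnitude of their TD error with a full sort, and the function name `rank_based_priorities` and the default `alpha` are hypothetical.

```python
import numpy as np

def rank_based_priorities(td_errors, alpha=0.7):
    """Return (1 / rank(i)) ** alpha, where rank 1 is the largest |TD error|."""
    abs_errors = np.abs(np.asarray(td_errors, dtype=np.float64))
    # Indices of the transitions ordered from highest to lowest |TD error|.
    order = np.argsort(-abs_errors)
    ranks = np.empty(len(abs_errors), dtype=np.int64)
    ranks[order] = np.arange(1, len(abs_errors) + 1)
    return (1.0 / ranks) ** alpha

td_errors = [0.1, -3.0, 0.5]

# With alpha = 1 the priorities are simply 1 / rank.
priorities = rank_based_priorities(td_errors, alpha=1.0)

# Same normalization as in the proportional case.
probs = priorities / priorities.sum()
```

Because only the ordering of the TD errors matters here, a single very large TD error does not dominate the sampling distribution the way it can in the proportional scheme.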

