In Chapter 2, Reinforcement Learning and Deep Reinforcement Learning, we discussed the SARSA and Q-learning algorithms. Both of these algorithms provide a systematic way to update the estimate of the action-value function, denoted by $Q(s, a)$. In particular, we saw that Q-learning is an off-policy learning algorithm, which updates the action-value estimate of the current state and action towards the maximum obtainable action-value in the subsequent state, $s'$, which the agent will end up in according to its policy. We also saw that the Q-learning update is given by the following formula:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Here, $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $r$ is the reward received for taking action $a$ in state $s$.
In the next section, we will write a Q_Learner class in Python that implements this update rule along with the other necessary functions and methods.
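Before moving on to the full class, it may help to see the update rule on its own. The following is a minimal sketch, not the Q_Learner class itself: it assumes a tabular Q stored as a NumPy array indexed by integer state and action, and the function name `q_learning_update`, the default values of `alpha` and `gamma`, and the `done` flag are illustrative choices rather than anything defined earlier in the text.

```python
import numpy as np


def q_learning_update(Q, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular Q-learning update to Q in place; return the TD error.

    Q is assumed to be a 2-D array of shape (num_states, num_actions).
    """
    # Best action-value obtainable from the next state. For a terminal
    # transition there is nothing to bootstrap from, so use 0.
    best_next = 0.0 if done else np.max(Q[next_state])
    # TD target: immediate reward plus discounted best next action-value.
    td_target = reward + gamma * best_next
    # TD error: how far the current estimate is from the target.
    td_error = td_target - Q[state, action]
    # Move the current estimate a fraction alpha towards the target.
    Q[state, action] += alpha * td_error
    return td_error


# Hypothetical toy table: 3 states, 2 actions.
Q = np.zeros((3, 2))
Q[1] = [0.5, 2.0]  # pretend state 1 already has some learned values
td_error = q_learning_update(Q, state=0, action=1, reward=1.0, next_state=1)
print(Q[0, 1], td_error)
```

Note that the target uses the maximum over the next state's action-values regardless of which action the agent actually takes next, which is what makes the update off-policy. With the toy values above, the target is 1.0 + 0.9 × 2.0 = 2.8, so the estimate for state 0 and action 1 moves from 0 to roughly 0.28.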