In this chapter, we presented the natural evolution of TD(0), based on a weighted average of backups of different lengths. The algorithm, called TD(λ), is extremely powerful, and it converges faster than TD(0) under a few non-restrictive conditions. We also showed how to implement the Actor-Critic method with TD(0), in order to learn both a stochastic policy and a value function.
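The backward view of TD(λ) can be sketched with a small tabular example. This is a minimal illustration, not the chapter's implementation: the number of states, the hyperparameters, and the short trajectory at the end are all assumptions chosen for clarity.

```python
import numpy as np

# Hedged sketch: tabular TD(lambda) with accumulating eligibility traces.
# States, rewards, and hyperparameters are illustrative assumptions.

n_states = 5
gamma, alpha, lam = 0.99, 0.1, 0.8

V = np.zeros(n_states)   # value estimates
E = np.zeros(n_states)   # eligibility traces

def td_lambda_step(s, r, s_next, done):
    """Apply one backward-view TD(lambda) update after observing (s, r, s_next)."""
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]        # the same TD error used by TD(0)
    E[s] += 1.0                  # accumulate the trace of the visited state
    V[:] += alpha * delta * E    # every state is updated in proportion to its trace
    E[:] *= gamma * lam          # traces decay; lambda = 0 recovers plain TD(0)
    if done:
        E[:] = 0.0

# A short illustrative trajectory: 0 -> 1 -> 2, with a reward of 1 at the end
td_lambda_step(0, 0.0, 1, False)
td_lambda_step(1, 0.0, 2, False)
td_lambda_step(2, 1.0, 2, True)
print(V)
```

After the final rewarded transition, the trace vector propagates the TD error back to the earlier states in a single sweep, with credit decaying geometrically with distance from the reward.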
In the following sections, we discussed two methods based on the estimation of the Q function: SARSA and Q-learning. They are very similar, but the latter employs a greedy target, which often makes it faster to train than SARSA. Q-learning is one of the most important models underlying the latest developments; in fact, it was the first RL approach employed with a deep convolutional network to solve complex environments (such as Atari games). For this reason, we also presented a simple example based on an MLP that processes a visual input and outputs the Q values for each action.
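The difference between the two updates can be made concrete with a toy comparison. This is a hedged sketch, not the chapter's code: the tiny MDP, the Q-value initialization, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

# Hedged sketch contrasting the tabular SARSA and Q-learning updates.
# The MDP, initial Q values, and hyperparameters are illustrative assumptions.

n_states, n_actions = 3, 2
gamma, alpha = 0.9, 0.5

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy: the target uses the action actually selected next."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next):
    """Off-policy (greedy): the target uses the best next action."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

Q1 = np.zeros((n_states, n_actions))
Q2 = np.zeros((n_states, n_actions))
Q1[1] = Q2[1] = [0.0, 2.0]   # state 1 already has one promising action

# Same transition for both, but the exploratory next action is the poor one:
sarsa_update(Q1, s=0, a=0, r=1.0, s_next=1, a_next=0)
q_learning_update(Q2, s=0, a=0, r=1.0, s_next=1)

print(Q1[0, 0])   # SARSA backs up Q[1, 0] = 0.0
print(Q2[0, 0])   # Q-learning backs up max(Q[1]) = 2.0
```

Because Q-learning always bootstraps from the greedy next action, its estimates grow toward the optimal Q function even while the behavior policy keeps exploring, which is the property exploited by deep Q-learning.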
The world of RL is extremely fascinating, and hundreds of researchers work every day to improve algorithms and solve increasingly complex problems. I invite the reader to check the references for resources that provide a deeper understanding of the models and their developments. Moreover, I suggest reading the blog posts written by the Google DeepMind team, which is one of the pioneers in the field of deep RL, and searching for the papers freely available on arXiv.
I'm happy to end this book with this topic, because I believe that RL can provide new and more powerful tools that will dramatically change our lives!