An example-rich guide for beginners to start their reinforcement learning and deep reinforcement learning journey with state-of-the-art algorithms

Key Features

  • Covers a vast spectrum of basic-to-advanced RL algorithms, with a mathematical explanation of each
  • Learn how to implement algorithms by following examples with line-by-line code explanations
  • Explore the latest RL methodologies such as DDPG, PPO, and the use of expert demonstrations

Book Description

With significant enhancements in the quality and quantity of algorithms in recent years, this second edition of Hands-On Reinforcement Learning with Python has been revamped into an example-rich guide to learning state-of-the-art reinforcement learning (RL) and deep RL algorithms with TensorFlow 2 and the OpenAI Gym toolkit.

In addition to exploring RL basics and foundational concepts such as the Bellman equation, Markov decision processes, and dynamic programming algorithms, this second edition dives deep into the full spectrum of value-based, policy-based, and actor-critic RL methods. It explores state-of-the-art algorithms such as DQN, TRPO, PPO, ACKTR, DDPG, TD3, and SAC in depth, demystifying the underlying math and demonstrating implementations through simple code examples.

The book has several new chapters dedicated to the latest RL techniques, including distributional RL, imitation learning, inverse RL, and meta RL. You will learn to leverage Stable Baselines, an improved implementation of OpenAI's Baselines library, to effortlessly implement popular RL algorithms. The book concludes with an overview of promising research directions, such as meta-learning and imagination-augmented agents.
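
To give a flavor of how compact that workflow can be, here is a minimal, hedged sketch of a Stable Baselines training run; the CartPole environment, the DQN algorithm choice, and the step count are illustrative placeholders rather than an example taken from the book:

    import gym
    from stable_baselines import DQN

    # Create a classic Gym environment (CartPole is a placeholder choice here).
    env = gym.make("CartPole-v0")

    # Build a DQN agent with the library's default multilayer perceptron policy.
    model = DQN("MlpPolicy", env, verbose=1)

    # Train for a fixed number of timesteps; real experiments typically use far more.
    model.learn(total_timesteps=10000)

    # Run the trained policy greedily for one episode.
    obs = env.reset()
    done = False
    while not done:
        action, _states = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)

Chapter 16 walks through this workflow in detail, including evaluating, storing, and loading trained agents.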

By the end, you will become skilled in effectively employing RL and deep RL in your real-world projects.

What you will learn

  • Understand core RL concepts including the methodologies, math, and code
  • Train an agent to solve Blackjack, FrozenLake, and many other problems using OpenAI Gym (see the short Gym sketch after this list)
  • Train an agent to play Ms Pac-Man using a Deep Q Network
  • Learn policy-based, value-based, and actor-critic methods
  • Master the math behind DDPG, TD3, TRPO, PPO, and many others
  • Explore new avenues such as distributional RL, meta RL, and inverse RL
  • Use Stable Baselines to train an agent to walk and play Atari games
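
To make the OpenAI Gym bullets above concrete, here is a minimal sketch of the agent-environment interaction loop, assuming the classic Gym API (where reset returns an observation and step returns four values), which is the interface this edition works with; the FrozenLake environment and the random action choice are illustrative placeholders:

    import gym

    # FrozenLake is one of the environments used throughout the book.
    env = gym.make("FrozenLake-v0")

    state = env.reset()
    done = False
    total_reward = 0

    while not done:
        # A learned policy would choose the action here;
        # sampling randomly is just a placeholder.
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)
        total_reward += reward

    print("Episode return:", total_reward)

Chapter 2, A Guide to the Gym Toolkit, covers this basic loop and the rest of the Gym interface in detail.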

Who this book is for

If you're a machine learning developer with little or no experience with neural networks who is interested in artificial intelligence and wants to learn about reinforcement learning from scratch, this book is for you.

Basic familiarity with linear algebra, calculus, and the Python programming language is required. Some experience with TensorFlow would be a plus.

Table of Contents

  1. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Get in touch
  2. Fundamentals of Reinforcement Learning
    1. Key elements of RL
    2. Agent
    3. Environment
    4. State and action
    5. Reward
    6. The basic idea of RL
    7. The RL algorithm
    8. RL agent in the grid world
    9. How RL differs from other ML paradigms
    10. Markov Decision Processes
    11. The Markov property and Markov chain
    12. The Markov Reward Process
    13. The Markov Decision Process
    14. Fundamental concepts of RL
    15. Math essentials
    16. Expectation
    17. Action space
    18. Policy
    19. Deterministic policy
    20. Stochastic policy
    21. Episode
    22. Episodic and continuous tasks
    23. Horizon
    24. Return and discount factor
    25. Small discount factor
    26. Large discount factor
    27. What happens when we set the discount factor to 0?
    28. What happens when we set the discount factor to 1?
    29. The value function
    30. Q function
    31. Model-based and model-free learning
    32. Different types of environments
    33. Deterministic and stochastic environments
    34. Discrete and continuous environments
    35. Episodic and non-episodic environments
    36. Single and multi-agent environments
    37. Applications of RL
    38. RL glossary
    39. Summary
    40. Questions
    41. Further reading
  3. A Guide to the Gym Toolkit
    1. Setting up our machine
    2. Installing Anaconda
    3. Installing the Gym toolkit
    4. Common error fixes
    5. Creating our first Gym environment
    6. Exploring the environment
    7. States
    8. Actions
    9. Transition probability and reward function
    10. Generating an episode in the Gym environment
    11. Action selection
    12. Generating an episode
    13. More Gym environments
    14. Classic control environments
    15. State space
    16. Action space
    17. Cart-Pole balancing with random policy
    18. Atari game environments
    19. General environment
    20. Deterministic environment
    21. No frame skipping
    22. State and action space
    23. An agent playing the Tennis game
    24. Recording the game
    25. Other environments
    26. Box2D
    27. MuJoCo
    28. Robotics
    29. Toy text
    30. Algorithms
    31. Environment synopsis
    32. Summary
    33. Questions
    34. Further reading
  4. The Bellman Equation and Dynamic Programming
    1. The Bellman equation
    2. The Bellman equation of the value function
    3. The Bellman equation of the Q function
    4. The Bellman optimality equation
    5. The relationship between the value and Q functions
    6. Dynamic programming
    7. Value iteration
    8. The value iteration algorithm
    9. Solving the Frozen Lake problem with value iteration
    10. Policy iteration
    11. Algorithm – policy iteration
    12. Solving the Frozen Lake problem with policy iteration
    13. Is DP applicable to all environments?
    14. Summary
    15. Questions
  5. Monte Carlo Methods
    1. Understanding the Monte Carlo method
    2. Prediction and control tasks
    3. Prediction task
    4. Control task
    5. Monte Carlo prediction
    6. MC prediction algorithm
    7. Types of MC prediction
    8. First-visit Monte Carlo
    9. Every-visit Monte Carlo
    10. Implementing the Monte Carlo prediction method
    11. Understanding the blackjack game
    12. The blackjack environment in the Gym library
    13. Every-visit MC prediction with the blackjack game
    14. First-visit MC prediction with the blackjack game
    15. Incremental mean updates
    16. MC prediction (Q function)
    17. Monte Carlo control
    18. MC control algorithm
    19. On-policy Monte Carlo control
    20. Monte Carlo exploring starts
    21. Monte Carlo with the epsilon-greedy policy
    22. Implementing on-policy MC control
    23. Off-policy Monte Carlo control
    24. Is the MC method applicable to all tasks?
    25. Summary
    26. Questions
  6. Understanding Temporal Difference Learning
    1. TD learning
    2. TD prediction
    3. TD prediction algorithm
    4. Predicting the value of states in the Frozen Lake environment
    5. TD control
    6. On-policy TD control – SARSA
    7. Computing the optimal policy using SARSA
    8. Off-policy TD control – Q learning
    9. Computing the optimal policy using Q learning
    10. The difference between Q learning and SARSA
    11. Comparing the DP, MC, and TD methods
    12. Summary
    13. Questions
    14. Further reading
  7. Case Study – The MAB Problem
    1. The MAB problem
    2. Creating a bandit in the Gym
    3. Exploration strategies
    4. Epsilon-greedy
    5. Softmax exploration
    6. Upper confidence bound
    7. Thompson sampling
    8. Applications of MAB
    9. Finding the best advertisement banner using bandits
    10. Creating a dataset
    11. Initialize the variables
    12. Define the epsilon-greedy method
    13. Run the bandit test
    14. Contextual bandits
    15. Summary
    16. Questions
    17. Further reading
  8. Deep Learning Foundations
    1. Biological and artificial neurons
    2. ANN and its layers
    3. Input layer
    4. Hidden layer
    5. Output layer
    6. Exploring activation functions
    7. The sigmoid function
    8. The tanh function
    9. The Rectified Linear Unit function
    10. The softmax function
    11. Forward propagation in ANNs
    12. How does an ANN learn?
    13. Putting it all together
    14. Building a neural network from scratch
    15. Recurrent Neural Networks
    16. The difference between feedforward networks and RNNs
    17. Forward propagation in RNNs
    18. Backpropagating through time
    19. LSTM to the rescue
    20. Understanding the LSTM cell
    21. What are CNNs?
    22. Convolutional layers
    23. Strides
    24. Padding
    25. Pooling layers
    26. Fully connected layers
    27. The architecture of CNNs
    28. Generative adversarial networks
    29. Breaking down the generator
    30. Breaking down the discriminator
    31. How do they learn, though?
    32. Architecture of a GAN
    33. Demystifying the loss function
    34. Discriminator loss
    35. Generator loss
    36. Total loss
    37. Summary
    38. Questions
    39. Further reading
  9. A Primer on TensorFlow
    1. What is TensorFlow?
    2. Understanding computational graphs and sessions
    3. Sessions
    4. Variables, constants, and placeholders
    5. Variables
    6. Constants
    7. Placeholders and feed dictionaries
    8. Introducing TensorBoard
    9. Creating a name scope
    10. Handwritten digit classification using TensorFlow
    11. Importing the required libraries
    12. Loading the dataset
    13. Defining the number of neurons in each layer
    14. Defining placeholders
    15. Forward propagation
    16. Computing loss and backpropagation
    17. Computing accuracy
    18. Creating a summary
    19. Training the model
    20. Visualizing graphs in TensorBoard
    21. Introducing eager execution
    22. Math operations in TensorFlow
    23. TensorFlow 2.0 and Keras
    24. Bonjour Keras
    25. Defining the model
    26. Compiling the model
    27. Training the model
    28. Evaluating the model
    29. MNIST digit classification using TensorFlow 2.0
    30. Summary
    31. Questions
    32. Further reading
  10. Deep Q Network and Its Variants
    1. What is DQN?
    2. Understanding DQN
    3. Replay buffer
    4. Loss function
    5. Target network
    6. Putting it all together
    7. The DQN algorithm
    8. Playing Atari games using DQN
    9. Architecture of the DQN
    10. Getting hands-on with the DQN
    11. Preprocess the game screen
    12. Defining the DQN class
    13. Training the DQN
    14. The double DQN
    15. The double DQN algorithm
    16. DQN with prioritized experience replay
    17. Types of prioritization
    18. Proportional prioritization
    19. Rank-based prioritization
    20. Correcting the bias
    21. The dueling DQN
    22. Understanding the dueling DQN
    23. The architecture of a dueling DQN
    24. The deep recurrent Q network
    25. The architecture of a DRQN
    26. Summary
    27. Questions
    28. Further reading
  11. Policy Gradient Method
    1. Why policy-based methods?
    2. Policy gradient intuition
    3. Understanding the policy gradient
    4. Deriving the policy gradient
    5. Algorithm – policy gradient
    6. Variance reduction methods
    7. Policy gradient with reward-to-go
    8. Algorithm – Reward-to-go policy gradient
    9. Cart pole balancing with policy gradient
    10. Computing discounted and normalized reward
    11. Building the policy network
    12. Training the network
    13. Policy gradient with baseline
    14. Algorithm – REINFORCE with baseline
    15. Summary
    16. Questions
    17. Further reading
  12. Actor-Critic Methods – A2C and A3C
    1. Overview of the actor-critic method
    2. Understanding the actor-critic method
    3. The actor-critic algorithm
    4. Advantage actor-critic (A2C)
    5. Asynchronous advantage actor-critic (A3C)
    6. The three As
    7. The architecture of A3C
    8. Mountain car climbing using A3C
    9. Creating the mountain car environment
    10. Defining the variables
    11. Defining the actor-critic class
    12. Defining the worker class
    13. Training the network
    14. Visualizing the computational graph
    15. A2C revisited
    16. Summary
    17. Questions
    18. Further reading
  13. Learning DDPG, TD3, and SAC
    1. Deep deterministic policy gradient
    2. An overview of DDPG
    3. Actor
    4. Critic
    5. DDPG components
    6. Critic network
    7. Actor network
    8. Putting it all together
    9. Algorithm – DDPG
    10. Swinging up a pendulum using DDPG
    11. Creating the Gym environment
    12. Defining the variables
    13. Defining the DDPG class
    14. Training the network
    15. Twin delayed DDPG
    16. Key features of TD3
    17. Clipped double Q learning
    18. Delayed policy updates
    19. Target policy smoothing
    20. Putting it all together
    21. Algorithm – TD3
    22. Soft actor-critic
    23. Understanding soft actor-critic
    24. V and Q functions with the entropy term
    25. Components of SAC
    26. Critic network
    27. Actor network
    28. Putting it all together
    29. Algorithm – SAC
    30. Summary
    31. Questions
    32. Further reading
  14. TRPO, PPO, and ACKTR Methods
    1. Trust region policy optimization
    2. Math essentials
    3. The Taylor series
    4. The trust region method
    5. The conjugate gradient method
    6. Lagrange multipliers
    7. Importance sampling
    8. Designing the TRPO objective function
    9. Parameterizing the policies
    10. Sample-based estimation
    11. Solving the TRPO objective function
    12. Computing the search direction
    13. Performing a line search in the search direction
    14. Algorithm – TRPO
    15. Proximal policy optimization
    16. PPO with a clipped objective
    17. Algorithm – PPO-clipped
    18. Implementing the PPO-clipped method
    19. Creating the Gym environment
    20. Defining the PPO class
    21. Training the network
    22. PPO with a penalized objective
    23. Algorithm – PPO-penalty
    24. Actor-critic using Kronecker-factored trust region
    25. Math essentials
    26. Block matrix
    27. Block diagonal matrix
    28. The Kronecker product
    29. The vec operator
    30. Properties of the Kronecker product
    31. Kronecker-Factored Approximate Curvature (K-FAC)
    32. K-FAC in actor-critic
    33. Incorporating the trust region
    34. Summary
    35. Questions
    36. Further reading
  15. Distributional Reinforcement Learning
    1. Why distributional reinforcement learning?
    2. Categorical DQN
    3. Predicting the value distribution
    4. Selecting an action based on the value distribution
    5. Training the categorical DQN
    6. Projection step
    7. Putting it all together
    8. Algorithm – categorical DQN
    9. Playing Atari games using a categorical DQN
    10. Defining the variables
    11. Defining the replay buffer
    12. Defining the categorical DQN class
    13. Quantile Regression DQN
    14. Math essentials
    15. Quantile
    16. Inverse CDF (quantile function)
    17. Understanding QR-DQN
    18. Action selection
    19. Loss function
    20. Distributed Distributional DDPG
    21. Critic network
    22. Actor network
    23. Algorithm – D4PG
    24. Summary
    25. Questions
    26. Further reading
  16. Imitation Learning and Inverse RL
    1. Supervised imitation learning
    2. DAgger
    3. Understanding DAgger
    4. Algorithm – DAgger
    5. Deep Q learning from demonstrations
    6. Phases of DQfD
    7. Pre-training phase
    8. Training phase
    9. Loss function of DQfD
    10. Algorithm – DQfD
    11. Inverse reinforcement learning
    12. Maximum entropy IRL
    13. Key terms
    14. Back to maximum entropy IRL
    15. Computing the gradient
    16. Algorithm – maximum entropy IRL
    17. Generative adversarial imitation learning
    18. Formulation of GAIL
    19. Summary
    20. Questions
    21. Further reading
  17. Deep Reinforcement Learning with Stable Baselines
    1. Installing Stable Baselines
    2. Creating our first agent with Stable Baselines
    3. Evaluating the trained agent
    4. Storing and loading the trained agent
    5. Viewing the trained agent
    6. Putting it all together
    7. Vectorized environments
    8. SubprocVecEnv
    9. DummyVecEnv
    10. Integrating custom environments
    11. Playing Atari games with a DQN and its variants
    12. Implementing DQN variants
    13. Lunar lander using A2C
    14. Creating a custom network
    15. Swinging up a pendulum using DDPG
    16. Viewing the computational graph in TensorBoard
    17. Training an agent to walk using TRPO
    18. Installing the MuJoCo environment
    19. Implementing TRPO
    20. Recording the video
    21. Training a cheetah bot to run using PPO
    22. Making a GIF of a trained agent
    23. Implementing GAIL
    24. Summary
    25. Questions
    26. Further reading
  18. Reinforcement Learning Frontiers
    1. Meta reinforcement learning
    2. Model-agnostic meta learning
    3. Understanding MAML
    4. MAML in a supervised learning setting
    5. MAML in a reinforcement learning setting
    6. Hierarchical reinforcement learning
    7. MAXQ value function decomposition
    8. Imagination augmented agents
    9. Summary
    10. Questions
    11. Further reading
  19. Appendix 1 – Reinforcement Learning Algorithms
    1. Reinforcement learning algorithm
    2. Value Iteration
    3. Policy Iteration
    4. First-Visit MC Prediction
    5. Every-Visit MC Prediction
    6. MC Prediction – the Q Function
    7. MC Control Method
    8. On-Policy MC Control – Exploring Starts
    9. On-Policy MC Control – Epsilon-Greedy
    10. Off-Policy MC Control
    11. TD Prediction
    12. On-Policy TD Control – SARSA
    13. Off-Policy TD Control – Q Learning
    14. Deep Q Learning
    15. Double DQN
    16. REINFORCE Policy Gradient
    17. Policy Gradient with Reward-To-Go
    18. REINFORCE with Baseline
    19. Advantage Actor Critic
    20. Asynchronous Advantage Actor-Critic
    21. Deep Deterministic Policy Gradient
    22. Twin Delayed DDPG
    23. Soft Actor-Critic
    24. Trust Region Policy Optimization
    25. PPO-Clipped
    26. PPO-Penalty
    27. Categorical DQN
    28. Distributed Distributional DDPG
    29. DAgger
    30. Deep Q learning from demonstrations
    31. MaxEnt Inverse Reinforcement Learning
    32. MAML in Reinforcement Learning
  20. Appendix 2 – Assessments
    1. Chapter 1 – Fundamentals of Reinforcement Learning
    2. Chapter 2 – A Guide to the Gym Toolkit
    3. Chapter 3 – The Bellman Equation and Dynamic Programming
    4. Chapter 4 – Monte Carlo Methods
    5. Chapter 5 – Understanding Temporal Difference Learning
    6. Chapter 6 – Case Study – The MAB Problem
    7. Chapter 7 – Deep Learning Foundations
    8. Chapter 8 – A Primer on TensorFlow
    9. Chapter 9 – Deep Q Network and Its Variants
    10. Chapter 10 – Policy Gradient Method
    11. Chapter 11 – Actor-Critic Methods – A2C and A3C
    12. Chapter 12 – Learning DDPG, TD3, and SAC
    13. Chapter 13 – TRPO, PPO, and ACKTR Methods
    14. Chapter 14 – Distributional Reinforcement Learning
    15. Chapter 15 – Imitation Learning and Inverse RL
    16. Chapter 16 – Deep Reinforcement Learning with Stable Baselines
    17. Chapter 17 – Reinforcement Learning Frontiers
  21. Other Books You May Enjoy
  22. Index