Grokking Deep Reinforcement Learning uses engaging exercises to teach you how to build deep reinforcement learning systems. This book combines annotated Python code with intuitive explanations to explore DRL techniques. You'll see how algorithms work and learn to develop your own DRL agents using evaluative feedback.

Table of Contents

  1. Grokking Deep Reinforcement Learning
  2. Copyright
  3. dedication
  4. contents
  5. front matter
    1. foreword
    2. preface
    3. acknowledgments
    4. about this book
    5. Who should read this book
    6. How this book is organized: a roadmap
    7. About the code
    8. liveBook discussion forum
    9. about the author
  6. 1 Introduction to deep reinforcement learning
    1. What is deep reinforcement learning?
    2. Deep reinforcement learning is a machine learning approach to artificial intelligence
    3. Deep reinforcement learning is concerned with creating computer programs
    4. Deep reinforcement learning agents can solve problems that require intelligence
    5. Deep reinforcement learning agents improve their behavior through trial-and-error learning
    6. Deep reinforcement learning agents learn from sequential feedback
    7. Deep reinforcement learning agents learn from evaluative feedback
    8. Deep reinforcement learning agents learn from sampled feedback
    9. Deep reinforcement learning agents use powerful non-linear function approximation
    10. The past, present, and future of deep reinforcement learning
    11. Recent history of artificial intelligence and deep reinforcement learning
    12. Artificial intelligence winters
    13. The current state of artificial intelligence
    14. Progress in deep reinforcement learning
    15. Opportunities ahead
    16. The suitability of deep reinforcement learning
    17. What are the pros and cons?
    18. Deep reinforcement learning’s strengths
    19. Deep reinforcement learning’s weaknesses
    20. Setting clear two-way expectations
    21. What to expect from the book?
    22. How to get the most out of this book
    23. Deep reinforcement learning development environment
    24. Summary
  7. 2 Mathematical foundations of reinforcement learning
    1. Components of reinforcement learning
    2. Examples of problems, agents, and environments
    3. The agent: The decision maker
    4. The environment: Everything else
    5. Agent-environment interaction cycle
    6. MDPs: The engine of the environment
    7. States: Specific configurations of the environment
    8. Actions: A mechanism to influence the environment
    9. Transition function: Consequences of agent actions
    10. Reward signal: Carrots and sticks
    11. Horizon: Time changes what’s optimal
    12. Discount: The future is uncertain, value it less
    13. Extensions to MDPs
    14. Putting it all together
    15. Summary
  8. 3 Balancing immediate and long-term goals
    1. The objective of a decision-making agent
    2. Policies: Per-state action prescriptions
    3. State-value function: What to expect from here?
    4. Action-value function: What should I expect from here if I do this?
    5. Action-advantage function: How much better if I do that?
    6. Optimality
    7. Planning optimal sequences of actions
    8. Policy evaluation: Rating policies
    9. Policy improvement: Using ratings to get better
    10. Policy iteration: Improving upon improved behaviors
    11. Value iteration: Improving behaviors early
    12. Summary
  9. 4 Balancing the gathering and use of information
    1. The challenge of interpreting evaluative feedback
    2. Bandits: Single-state decision problems
    3. Regret: The cost of exploration
    4. Approaches to solving MAB environments
    5. Greedy: Always exploit
    6. Random: Always explore
    7. Epsilon-greedy: Almost always greedy and sometimes random
    8. Decaying epsilon-greedy: First maximize exploration, then exploitation
    9. Optimistic initialization: Start off believing it’s a wonderful world
    10. Strategic exploration
    11. Softmax: Select actions randomly in proportion to their estimates
    12. UCB: It’s not about optimism, it’s about realistic optimism
    13. Thompson sampling: Balancing reward and risk
    14. Summary
  10. 5 Evaluating agents’ behaviors
    1. Learning to estimate the value of policies
    2. First-visit Monte Carlo: Improving estimates after each episode
    3. Every-visit Monte Carlo: A different way of handling state visits
    4. Temporal-difference learning: Improving estimates after each step
    5. Learning to estimate from multiple steps
    6. N-step TD learning: Improving estimates after a couple of steps
    7. Forward-view TD(λ): Improving estimates of all visited states
    8. TD(λ): Improving estimates of all visited states after each step
    9. Summary
  11. 6 Improving agents’ behaviors
    1. The anatomy of reinforcement learning agents
    2. Most agents gather experience samples
    3. Most agents estimate something
    4. Most agents improve a policy
    5. Generalized policy iteration
    6. Learning to improve policies of behavior
    7. Monte Carlo control: Improving policies after each episode
    8. SARSA: Improving policies after each step
    9. Decoupling behavior from learning
    10. Q-learning: Learning to act optimally, even if we choose not to
    11. Double Q-learning: A max of estimates for an estimate of a max
    12. Summary
  12. 7 Achieving goals more effectively and efficiently
    1. Learning to improve policies using robust targets
    2. SARSA(λ): Improving policies after each step based on multi-step estimates
    3. Watkins’s Q(λ): Decoupling behavior from learning, again
    4. Agents that interact, learn, and plan
    5. Dyna-Q: Learning sample models
    6. Trajectory sampling: Making plans for the immediate future
    7. Summary
  13. 8 Introduction to value-based deep reinforcement learning
    1. The kind of feedback deep reinforcement learning agents use
    2. Deep reinforcement learning agents deal with sequential feedback
    3. But, if it isn’t sequential, what is it?
    4. Deep reinforcement learning agents deal with evaluative feedback
    5. But, if it isn’t evaluative, what is it?
    6. Deep reinforcement learning agents deal with sampled feedback
    7. But, if it isn’t sampled, what is it?
    8. Introduction to function approximation for reinforcement learning
    9. Reinforcement learning problems can have high-dimensional state and action spaces
    10. Reinforcement learning problems can have continuous state and action spaces
    11. There are advantages when using function approximation
    12. NFQ: The first attempt at value-based deep reinforcement learning
    13. First decision point: Selecting a value function to approximate
    14. Second decision point: Selecting a neural network architecture
    15. Third decision point: Selecting what to optimize
    16. Fourth decision point: Selecting the targets for policy evaluation
    17. Fifth decision point: Selecting an exploration strategy
    18. Sixth decision point: Selecting a loss function
    19. Seventh decision point: Selecting an optimization method
    20. Things that could (and do) go wrong
    21. Summary
  14. 9 More stable value-based methods
    1. DQN: Making reinforcement learning more like supervised learning
    2. Common problems in value-based deep reinforcement learning
    3. Using target networks
    4. Using larger networks
    5. Using experience replay
    6. Using other exploration strategies
    7. Double DQN: Mitigating the overestimation of action-value functions
    8. The problem of overestimation, take two
    9. Separating action selection from action evaluation
    10. A solution
    11. A more practical solution
    12. A more forgiving loss function
    13. Things we can still improve on
    14. Summary
  15. 10 Sample-efficient value-based methods
    1. Dueling DDQN: A reinforcement-learning-aware neural network architecture
    2. Reinforcement learning isn’t a supervised learning problem
    3. Nuances of value-based deep reinforcement learning methods
    4. Advantage of using advantages
    5. A reinforcement-learning-aware architecture
    6. Building a dueling network
    7. Reconstructing the action-value function
    8. Continuously updating the target network
    9. What does the dueling network bring to the table?
    10. PER: Prioritizing the replay of meaningful experiences
    11. A smarter way to replay experiences
    12. Then, what’s a good measure of “important” experiences?
    13. Greedy prioritization by TD error
    14. Sampling prioritized experiences stochastically
    15. Proportional prioritization
    16. Rank-based prioritization
    17. Prioritization bias
    18. Summary
  16. 11 Policy-gradient and actor-critic methods
    1. REINFORCE: Outcome-based policy learning
    2. Introduction to policy-gradient methods
    3. Advantages of policy-gradient methods
    4. Learning policies directly
    5. Reducing the variance of the policy gradient
    6. VPG: Learning a value function
    7. Further reducing the variance of the policy gradient
    8. Learning a value function
    9. Encouraging exploration
    10. A3C: Parallel policy updates
    11. Using actor-workers
    12. Using n-step estimates
    13. Non-blocking model updates
    14. GAE: Robust advantage estimation
    15. Generalized advantage estimation
    16. A2C: Synchronous policy updates
    17. Weight-sharing model
    18. Restoring order in policy updates
    19. Summary
  17. 12 Advanced actor-critic methods
    1. DDPG: Approximating a deterministic policy
    2. DDPG uses many tricks from DQN
    3. Learning a deterministic policy
    4. Exploration with deterministic policies
    5. TD3: State-of-the-art improvements over DDPG
    6. Double learning in DDPG
    7. Smoothing the targets used for policy updates
    8. Delaying updates
    9. SAC: Maximizing the expected return and entropy
    10. Adding the entropy to the Bellman equations
    11. Learning the action-value function
    12. Learning the policy
    13. Automatically tuning the entropy coefficient
    14. PPO: Restricting optimization steps
    15. Using the same actor-critic architecture as A2C
    16. Batching experiences
    17. Clipping the policy updates
    18. Clipping the value function updates
    19. Summary
  18. 13 Toward artificial general intelligence
    1. What was covered and what notably wasn’t?
    2. Markov decision processes
    3. Planning methods
    4. Bandit methods
    5. Tabular reinforcement learning
    6. Value-based deep reinforcement learning
    7. Policy-based and actor-critic deep reinforcement learning
    8. Advanced actor-critic techniques
    9. Model-based deep reinforcement learning
    10. Derivative-free optimization methods
    11. More advanced concepts toward AGI
    12. What is AGI, again?
    13. Advanced exploration strategies
    14. Inverse reinforcement learning
    15. Transfer learning
    16. Multi-task learning
    17. Curriculum learning
    18. Meta learning
    19. Hierarchical reinforcement learning
    20. Multi-agent reinforcement learning
    21. Explainable AI, safety, fairness, and ethical standards
    22. What happens next?
    23. How to use DRL to solve custom problems
    24. Going forward
    25. Get yourself out there! Now!
    26. Summary
  19. index