The Reinforcement Learning Workshop

Copyright © 2020 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Authors: Alessandro Palmas, Emanuele Ghelfi, Dr. Alexandra Galina Petre, Mayur Kulkarni, Anand N.S., Quan Nguyen, Aritra Sen, Anthony So, and Saikat Basak

Reviewers: Alberto Boschetti, Richard Brooker, Alekhya Dronavalli, Harshil Jain, Sasikanth Kotti, Nimish Sanghi, Shanmuka Sreenivas, and Pritesh Tiwari

Managing Editors: Snehal Tambe, Aditya Shah, and Ashish James

Acquisitions Editors: Manuraj Nair, Kunal Sawant, Sneha Shinde, Anindya Sil, Archie Vankar, Karan Wadekar, and Alicia Wooding

Production Editor: Shantanu Zagade

Editorial Board: Megan Carlisle, Samuel Christa, Mahesh Dhyani, Heather Gopsill, Manasa Kumar, Alex Mazonowicz, Monesh Mirpuri, Bridget Neale, Dominic Pereira, Shiny Poojary, Abhishek Rane, Brendan Rodrigues, Erol Staveley, Ankita Thakur, Nitesh Thakur, and Jonathan Wray

First published: August 2020

Production reference: 1140820

ISBN: 978-1-80020-045-6

Published by Packt Publishing Ltd.

Livery Place, 35 Livery Street

Birmingham B3 2PB, UK

Table of Contents

Preface   i

1. Introduction to Reinforcement Learning   1

Introduction   2

Learning Paradigms   3

Introduction to Learning Paradigms   3

Supervised versus Unsupervised versus RL   5

Classifying Common Problems into Learning Scenarios   9

Predicting Whether an Image Contains a Dog or a Cat 9

Detecting and Classifying All Dogs and Cats in an Image 10

Playing Chess 11

Fundamentals of Reinforcement Learning   12

Elements of RL   13

Agent 13

Actions 13

Environment 14

Policy 14

An Example of an Autonomous Driving Environment 15

Exercise 1.01: Implementing a Toy Environment Using Python   16

The Agent-Environment Interface   21

What's the Agent? What's in the Environment? 22

Environment Types   23

Finite versus Continuous 23

Deterministic versus Stochastic 24

Fully Observable versus Partially Observable 25

POMDP versus MDP 26

Single Agents versus Multiple Agents 27

An Action and Its Types   28

Policy   29

Stochastic Policies 30

Policy Parameterizations 32

Exercise 1.02: Implementing a Linear Policy   37

Goals and Rewards   40

Why Discount? 43

Reinforcement Learning Frameworks   43

OpenAI Gym   44

Getting Started with Gym – CartPole 44

Gym Spaces 46

Exercise 1.03: Creating a Space for Image Observations   49

Rendering an Environment 52

Rendering CartPole 53

A Reinforcement Learning Loop with Gym 54

Exercise 1.04: Implementing the Reinforcement Learning Loop with Gym   54

Activity 1.01: Measuring the Performance of a Random Agent   57

OpenAI Baselines   59

Getting Started with Baselines – DQN on CartPole 59

Applications of Reinforcement Learning   63

Games   64

Go   65

Dota 2   67

StarCraft 68

Robot Control   68

Autonomous Driving   69

Summary   71

2. Markov Decision Processes and Bellman Equations   73

Introduction   74

Markov Processes   75

The Markov Property   75

Markov Chains   80

Markov Reward Processes   81

Value Functions and Bellman Equations for MRPs 84

Solving Linear Systems of Equations Using SciPy 87

Exercise 2.01: Finding the Value Function in an MRP   88

Markov Decision Processes   91

The State-Value Function and the Action-Value Function 94

Bellman Optimality Equation 113

Solving the Bellman Optimality Equation 116

Solving MDPs   116

Algorithm Categorization 116

Value-Based Algorithms 118

Policy Search Algorithms 118

Linear Programming 118

Exercise 2.02: Determining the Best Policy for an MDP Using Linear Programming   121

Gridworld   127

Activity 2.01: Solving Gridworld   128

Summary   129

3. Deep Learning in Practice with TensorFlow 2   131

Introduction   132

An Introduction to TensorFlow and Keras   133

TensorFlow   133

Keras   137

Exercise 3.01: Building a Sequential Model with the Keras High-Level API   140

How to Implement a Neural Network Using TensorFlow   145

Model Creation   145

Model Training   147

Loss Function Definition   148

Optimizer Choice   148

Learning Rate Scheduling   150

Feature Normalization   152

Model Validation   153

Performance Metrics   154

Model Improvement   154

Overfitting 154

Regularization 155

Early Stopping 156

Dropout 156

Data Augmentation 157

Batch Normalization 158

Model Testing and Inference 159

Standard Fully Connected Neural Networks   159

Exercise 3.02: Building a Fully Connected Neural Network Model with the Keras High-Level API   160

Convolutional Neural Networks   162

Exercise 3.03: Building a Convolutional Neural Network Model with the Keras High-Level API   163

Recurrent Neural Networks   165

Exercise 3.04: Building a Recurrent Neural Network Model with the Keras High-Level API   168

Simple Regression Using TensorFlow   170

Exercise 3.05: Creating a Deep Neural Network to Predict the Fuel Efficiency of Cars   171

Simple Classification Using TensorFlow   182

Exercise 3.06: Creating a Deep Neural Network to Classify Events Generated by the ATLAS Experiment in the Quest for the Higgs Boson   184

TensorBoard – How to Visualize Data Using TensorBoard   197

Exercise 3.07: Creating a Deep Neural Network to Classify Events Generated by the ATLAS Experiment in the Quest for the Higgs Boson Using TensorBoard for Visualization   201

Activity 3.01: Classifying Fashion Clothes Using a TensorFlow Dataset and TensorFlow 2   207

Summary   209

4. Getting Started with OpenAI and TensorFlow for Reinforcement Learning   211

Introduction   212

OpenAI Gym   213

How to Interact with a Gym Environment   220

Exercise 4.01: Interacting with the Gym Environment   222

Action and Observation Spaces   224

How to Implement a Custom Gym Environment   228

OpenAI Universe – Complex Environment   230

OpenAI Universe Infrastructure   231

Environments   232

Atari Games 232

Flash Games 233

Browser Tasks 233

Running an OpenAI Universe Environment   234

Validating the Universe Infrastructure   236

TensorFlow for Reinforcement Learning   236

Implementing a Policy Network Using TensorFlow   236

Exercise 4.02: Building a Policy Network with TensorFlow   237

Exercise 4.03: Feeding the Policy Network with Environment State Representation   240

How to Save a Policy Network   242

OpenAI Baselines   243

Proximal Policy Optimization   243

Command-Line Usage   244

Methods in OpenAI Baselines   245

Custom Policy Network Architecture   245

Training an RL Agent to Solve a Classic Control Problem   246

Exercise 4.04: Solving a CartPole Environment with the PPO Algorithm   246

Activity 4.01: Training a Reinforcement Learning Agent to Play a Classic Video Game   254

Summary   257

5. Dynamic Programming   259

Introduction   260

Solving Dynamic Programming Problems   261

Memoization   266

The Tabular Method   269

Exercise 5.01: Memoization in Practice   270

Exercise 5.02: The Tabular Method in Practice   273

Identifying Dynamic Programming Problems   277

Optimal Substructures    277

Overlapping Subproblems   277

The Coin-Change Problem   278

Exercise 5.03: Solving the Coin-Change Problem   279

Dynamic Programming in RL   282

Policy and Value Iteration   284

State-Value Functions   284

Action-Value Functions   285

OpenAI Gym: Taxi-v3 Environment   286

Policy Iteration 290

Value Iteration 300

The FrozenLake-v0 Environment   302

Activity 5.01: Implementing Policy and Value Iteration on the FrozenLake-v0 Environment   303

Summary   305

6. Monte Carlo Methods   307

Introduction   308

The Workings of Monte Carlo Methods   309

Understanding Monte Carlo with Blackjack   309

Exercise 6.01: Implementing Monte Carlo in Blackjack   312

Types of Monte Carlo Methods   315

First Visit Monte Carlo Prediction for Estimating the Value Function   316

Exercise 6.02: First Visit Monte Carlo Prediction for Estimating the Value Function in Blackjack   317

Every Visit Monte Carlo Prediction for Estimating the Value Function   321

Exercise 6.03: Every Visit Monte Carlo Prediction for Estimating the Value Function    322

Exploration versus Exploitation Trade-Off   326

Importance Sampling   327

The Pseudocode for Monte Carlo Off-Policy Evaluation   329

Exercise 6.04: Importance Sampling with Monte Carlo   330

Solving Frozen Lake Using Monte Carlo   335

Activity 6.01: Exploring the Frozen Lake Problem – the Reward Function   338

The Pseudocode for Every Visit Monte Carlo Control for Epsilon Soft   340

Activity 6.02: Solving Frozen Lake Using Monte Carlo Control Every Visit Epsilon Soft   341

Summary   343

7. Temporal Difference Learning   345

Introduction to TD Learning    346

TD(0) – SARSA and Q-Learning   347

SARSA – On-Policy Control   349

Exercise 7.01: Using TD(0) SARSA to Solve FrozenLake-v0 Deterministic Transitions   353

The Stochasticity Test   363

Exercise 7.02: Using TD(0) SARSA to Solve FrozenLake-v0 Stochastic Transitions   367

Q-Learning – Off-Policy Control   377

Exercise 7.03: Using TD(0) Q-Learning to Solve FrozenLake-v0 Deterministic Transitions   379

Expected SARSA   388

N-Step TD and TD(λ) Algorithms   389

N-Step TD   389

N-Step SARSA 391

N-Step Off-Policy Learning 393

TD(λ)   395

SARSA(λ) 398

Exercise 7.04: Using TD(λ) SARSA to Solve FrozenLake-v0 Deterministic Transitions   400

Exercise 7.05: Using TD(λ) SARSA to Solve FrozenLake-v0 Stochastic Transitions   409

The Relationship between DP, Monte Carlo, and TD Learning   418

Activity 7.01: Using TD(0) Q-Learning to Solve FrozenLake-v0 Stochastic Transitions   419

Summary   421

8. The Multi-Armed Bandit Problem   423

Introduction   424

Formulation of the MAB Problem   424

Applications of the MAB Problem   425

Background and Terminology   426

MAB Reward Distributions   428

The Python Interface   429

The Greedy Algorithm   434

Implementing the Greedy Algorithm   434

The Explore-then-Commit Algorithm   440

The ε-Greedy Algorithm   441

Exercise 8.01: Implementing the ε-Greedy Algorithm   442

The Softmax Algorithm   450

The UCB Algorithm   451

Optimism in the Face of Uncertainty   452

Other Properties of UCB   454

Exercise 8.02: Implementing the UCB Algorithm   454

Thompson Sampling   459

Introduction to Bayesian Probability   460

The Thompson Sampling Algorithm   464

Exercise 8.03: Implementing the Thompson Sampling Algorithm   467

Contextual Bandits   472

Context That Defines a Bandit Problem   472

Queueing Bandits   473

Working with the Queueing API   475

Activity 8.01: Queueing Bandits   476

Summary   481

9. What Is Deep Q-Learning?   483

Introduction   484

Basics of Deep Learning   484

Basics of PyTorch   489

Exercise 9.01: Building a Simple Deep Learning Model in PyTorch   490

PyTorch Utilities   495

The view Function 496

The squeeze Function 496

The unsqueeze Function 497

The max Function 497

The gather Function 499

The State-Value Function and the Bellman Equation   500

Expected Value 501

The Value Function 501

The Value Function for a Deterministic Environment 502

The Value Function for a Stochastic Environment 503

The Action-Value Function (Q Value Function)    503

Implementing Q Learning to Find Optimal Actions   504

Advantages of Q Learning 506

OpenAI Gym Review   507

Exercise 9.02: Implementing the Q Learning Tabular Method   508

Deep Q Learning   514

Exercise 9.03: Implementing a Working DQN Network with PyTorch in a CartPole-v0 Environment   518

Challenges in DQN   526

Correlation between Steps and the Convergence Issue   526

Experience Replay   526

The Challenge of a Non-Stationary Target   529

The Concept of a Target Network   530

Exercise 9.04: Implementing a Working DQN Network with Experience Replay and a Target Network in PyTorch    533

The Challenge of Overestimation in a DQN   542

Double Deep Q Network (DDQN)   543

Activity 9.01: Implementing a Double Deep Q Network in PyTorch for the CartPole Environment   546

Summary   549

10. Playing an Atari Game with Deep Recurrent Q-Networks   551

Introduction   552

Understanding the Breakout Environment   552

Exercise 10.01: Playing Breakout with a Random Agent    555

CNNs in TensorFlow   557

Exercise 10.02: Designing a CNN Model with TensorFlow   560

Combining a DQN with a CNN   563

Activity 10.01: Training a DQN with CNNs to Play Breakout   564

RNNs in TensorFlow   565

Exercise 10.03: Designing a Combination of CNN and RNN Models with TensorFlow   567

Building a DRQN   571

Activity 10.02: Training a DRQN to Play Breakout   571

Introduction to the Attention Mechanism and DARQN   573

Activity 10.03: Training a DARQN to Play Breakout   575

Summary   577

11. Policy-Based Methods for Reinforcement Learning   579

Introduction   580

Introduction to Value-Based and Model-Based RL   581

Introduction to Actor-Critic Model   583

Policy Gradients   584

Exercise 11.01: Landing a Spacecraft on the Lunar Surface Using Policy Gradients and the Actor-Critic Method   587

Deep Deterministic Policy Gradients   592

Ornstein-Uhlenbeck Noise   592

The ReplayBuffer Class   593

The Actor-Critic Model   595

Exercise 11.02: Creating a Learning Agent   599

Activity 11.01: Creating an Agent That Learns a Model Using DDPG   606

Improving Policy Gradients   607

Trust Region Policy Optimization   608

Proximal Policy Optimization   609

Exercise 11.03: Improving the Lunar Lander Example Using PPO   611

The Advantage Actor-Critic Method   617

Activity 11.02: Loading the Saved Policy to Run the Lunar Lander Simulation   619

Summary   620

12. Evolutionary Strategies for RL   623

Introduction   624

Problems with Gradient-Based Methods   624

Exercise 12.01: Optimization Using Stochastic Gradient Descent   626

Introduction to Genetic Algorithms   629

Exercise 12.02: Implementing Fixed-Value and Uniform Distribution Optimization Using GAs   631

Components: Population Creation   634

Exercise 12.03: Population Creation   636

Components: Parent Selection   638

Exercise 12.04: Implementing the Tournament and Roulette Wheel Techniques   641

Components: Crossover Application   645

Exercise 12.05: Crossover for a New Generation   647

Components: Population Mutation   650

Exercise 12.06: New Generation Development Using Mutation   651

Application to Hyperparameter Selection    655

Exercise 12.07: Implementing GA Hyperparameter Optimization for RNN Training   657

NEAT and Other Formulations   664

Exercise 12.08: XNOR Gate Functionality Using NEAT   666

Activity 12.01: Cart-Pole Activity   674

Summary   677

Appendix   679
