The Reinforcement Learning Workshop

Copyright © 2020 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Authors: Alessandro Palmas, Emanuele Ghelfi, Dr. Alexandra Galina Petre, Mayur Kulkarni, Anand N.S., Quan Nguyen, Aritra Sen, Anthony So, and Saikat Basak

Reviewers: Alberto Boschetti, Richard Brooker, Alekhya Dronavalli, Harshil Jain, Sasikanth Kotti, Nimish Sanghi, Shanmuka Sreenivas, and Pritesh Tiwari

Managing Editors: Snehal Tambe, Aditya Shah, and Ashish James

Acquisitions Editors: Manuraj Nair, Kunal Sawant, Sneha Shinde, Anindya Sil, Archie Vankar, Karan Wadekar, and Alicia Wooding

Production Editor: Shantanu Zagade

Editorial Board: Megan Carlisle, Samuel Christa, Mahesh Dhyani, Heather Gopsill, Manasa Kumar, Alex Mazonowicz, Monesh Mirpuri, Bridget Neale, Dominic Pereira, Shiny Poojary, Abhishek Rane, Brendan Rodrigues, Erol Staveley, Ankita Thakur, Nitesh Thakur, and Jonathan Wray

First published: August 2020

Production reference: 1140820

ISBN: 978-1-80020-045-6

Published by Packt Publishing Ltd.

Livery Place, 35 Livery Street

Birmingham B3 2PB, UK

Table of Contents

Preface   i

1. Introduction to Reinforcement Learning   1

Introduction   2

Learning Paradigms   3

Introduction to Learning Paradigms   3

Supervised versus Unsupervised versus RL   5

Classifying Common Problems into Learning Scenarios   9

Predicting Whether an Image Contains a Dog or a Cat 9

Detecting and Classifying All Dogs and Cats in an Image 10

Playing Chess 11

Fundamentals of Reinforcement Learning   12

Elements of RL   13

Agent 13

Actions 13

Environment 14

Policy 14

An Example of an Autonomous Driving Environment 15

Exercise 1.01: Implementing a Toy Environment Using Python   16

The Agent-Environment Interface   21

What's the Agent? What's in the Environment? 22

Environment Types   23

Finite versus Continuous 23

Deterministic versus Stochastic 24

Fully Observable versus Partially Observable 25

POMDP versus MDP 26

Single Agents versus Multiple Agents 27

An Action and Its Types   28

Policy   29

Stochastic Policies 30

Policy Parameterizations 32

Exercise 1.02: Implementing a Linear Policy   37

Goals and Rewards   40

Why Discount? 43

Reinforcement Learning Frameworks   43

OpenAI Gym   44

Getting Started with Gym – CartPole 44

Gym Spaces 46

Exercise 1.03: Creating a Space for Image Observations   49

Rendering an Environment 52

Rendering CartPole 53

A Reinforcement Learning Loop with Gym 54

Exercise 1.04: Implementing the Reinforcement Learning Loop with Gym   54

Activity 1.01: Measuring the Performance of a Random Agent   57

OpenAI Baselines   59

Getting Started with Baselines – DQN on CartPole 59

Applications of Reinforcement Learning   63

Games   64

Go   65

Dota 2   67

StarCraft 68

Robot Control   68

Autonomous Driving   69

Summary   71

2. Markov Decision Processes and Bellman Equations   73

Introduction   74

Markov Processes   75

The Markov Property   75

Markov Chains   80

Markov Reward Processes   81

Value Functions and Bellman Equations for MRPs 84

Solving Linear Systems of Equations Using SciPy 87

Exercise 2.01: Finding the Value Function in an MRP   88

Markov Decision Processes   91

The State-Value Function and the Action-Value Function 94

Bellman Optimality Equation 113

Solving the Bellman Optimality Equation 116

Solving MDPs   116

Algorithm Categorization 116

Value-Based Algorithms 118

Policy Search Algorithms 118

Linear Programming 118

Exercise 2.02: Determining the Best Policy for an MDP Using Linear Programming   121

Gridworld   127

Activity 2.01: Solving Gridworld   128

Summary   129

3. Deep Learning in Practice with TensorFlow 2   131

Introduction   132

An Introduction to TensorFlow and Keras   133

TensorFlow   133

Keras   137

Exercise 3.01: Building a Sequential Model with the Keras High-Level API   140

How to Implement a Neural Network Using TensorFlow   145

Model Creation   145

Model Training   147

Loss Function Definition   148

Optimizer Choice   148

Learning Rate Scheduling   150

Feature Normalization   152

Model Validation   153

Performance Metrics   154

Model Improvement   154

Overfitting 154

Regularization 155

Early Stopping 156

Dropout 156

Data Augmentation 157

Batch Normalization 158

Model Testing and Inference 159

Standard Fully Connected Neural Networks   159

Exercise 3.02: Building a Fully Connected Neural Network Model with the Keras High-Level API   160

Convolutional Neural Networks   162

Exercise 3.03: Building a Convolutional Neural Network Model with the Keras High-Level API   163

Recurrent Neural Networks   165

Exercise 3.04: Building a Recurrent Neural Network Model with the Keras High-Level API   168

Simple Regression Using TensorFlow   170

Exercise 3.05: Creating a Deep Neural Network to Predict the Fuel Efficiency of Cars   171

Simple Classification Using TensorFlow   182

Exercise 3.06: Creating a Deep Neural Network to Classify Events Generated by the ATLAS Experiment in the Quest for the Higgs Boson   184

TensorBoard – How to Visualize Data Using TensorBoard   197

Exercise 3.07: Creating a Deep Neural Network to Classify Events Generated by the ATLAS Experiment in the Quest for the Higgs Boson Using TensorBoard for Visualization   201

Activity 3.01: Classifying Fashion Clothes Using a TensorFlow Dataset and TensorFlow 2   207

Summary   209

4. Getting Started with OpenAI and TensorFlow for Reinforcement Learning   211

Introduction   212

OpenAI Gym   213

How to Interact with a Gym Environment   220

Exercise 4.01: Interacting with the Gym Environment   222

Action and Observation Spaces   224

How to Implement a Custom Gym Environment   228

OpenAI Universe – Complex Environment   230

OpenAI Universe Infrastructure   231

Environments   232

Atari Games 232

Flash Games 233

Browser Tasks 233

Running an OpenAI Universe Environment   234

Validating the Universe Infrastructure   236

TensorFlow for Reinforcement Learning   236

Implementing a Policy Network Using TensorFlow   236

Exercise 4.02: Building a Policy Network with TensorFlow   237

Exercise 4.03: Feeding the Policy Network with Environment State Representation   240

How to Save a Policy Network   242

OpenAI Baselines   243

Proximal Policy Optimization   243

Command-Line Usage   244

Methods in OpenAI Baselines   245

Custom Policy Network Architecture   245

Training an RL Agent to Solve a Classic Control Problem   246

Exercise 4.04: Solving a CartPole Environment with the PPO Algorithm   246

Activity 4.01: Training a Reinforcement Learning Agent to Play a Classic Video Game   254

Summary   257

5. Dynamic Programming   259

Introduction   260

Solving Dynamic Programming Problems   261

Memoization   266

The Tabular Method   269

Exercise 5.01: Memoization in Practice   270

Exercise 5.02: The Tabular Method in Practice   273

Identifying Dynamic Programming Problems   277

Optimal Substructures    277

Overlapping Subproblems   277

The Coin-Change Problem   278

Exercise 5.03: Solving the Coin-Change Problem   279

Dynamic Programming in RL   282

Policy and Value Iteration   284

State-Value Functions   284

Action-Value Functions   285

OpenAI Gym: Taxi-v3 Environment   286

Policy Iteration 290

Value Iteration 300

The FrozenLake-v0 Environment   302

Activity 5.01: Implementing Policy and Value Iteration on the FrozenLake-v0 Environment   303

Summary   305

6. Monte Carlo Methods   307

Introduction   308

The Workings of Monte Carlo Methods   309

Understanding Monte Carlo with Blackjack   309

Exercise 6.01: Implementing Monte Carlo in Blackjack   312

Types of Monte Carlo Methods   315

First Visit Monte Carlo Prediction for Estimating the Value Function   316

Exercise 6.02: First Visit Monte Carlo Prediction for Estimating the Value Function in Blackjack   317

Every Visit Monte Carlo Prediction for Estimating the Value Function   321

Exercise 6.03: Every Visit Monte Carlo Prediction for Estimating the Value Function    322

Exploration versus Exploitation Trade-Off   326

Importance Sampling   327

The Pseudocode for Monte Carlo Off-Policy Evaluation   329

Exercise 6.04: Importance Sampling with Monte Carlo   330

Solving Frozen Lake Using Monte Carlo   335

Activity 6.01: Exploring the Frozen Lake Problem – the Reward Function   338

The Pseudocode for Every Visit Monte Carlo Control for Epsilon Soft   340

Activity 6.02: Solving Frozen Lake Using Monte Carlo Control Every Visit Epsilon Soft   341

Summary   343

7. Temporal Difference Learning   345

Introduction to TD Learning    346

TD(0) – SARSA and Q-Learning   347

SARSA – On-Policy Control   349

Exercise 7.01: Using TD(0) SARSA to Solve FrozenLake-v0 Deterministic Transitions   353

The Stochasticity Test   363

Exercise 7.02: Using TD(0) SARSA to Solve FrozenLake-v0 Stochastic Transitions   367

Q-Learning – Off-Policy Control   377

Exercise 7.03: Using TD(0) Q-Learning to Solve FrozenLake-v0 Deterministic Transitions   379

Expected SARSA   388

N-Step TD and TD(λ) Algorithms   389

N-Step TD   389

N-Step SARSA 391

N-Step Off-Policy Learning 393

TD(λ)   395

SARSA(λ) 398

Exercise 7.04: Using TD(λ) SARSA to Solve FrozenLake-v0 Deterministic Transitions   400

Exercise 7.05: Using TD(λ) SARSA to Solve FrozenLake-v0 Stochastic Transitions   409

The Relationship between DP, Monte Carlo, and TD Learning   418

Activity 7.01: Using TD(0) Q-Learning to Solve FrozenLake-v0 Stochastic Transitions   419

Summary   421

8. The Multi-Armed Bandit Problem   423

Introduction   424

Formulation of the MAB Problem   424

Applications of the MAB Problem   425

Background and Terminology   426

MAB Reward Distributions   428

The Python Interface   429

The Greedy Algorithm   434

Implementing the Greedy Algorithm   434

The Explore-then-Commit Algorithm   440

The ε-Greedy Algorithm   441

Exercise 8.01: Implementing the ε-Greedy Algorithm   442

The Softmax Algorithm   450

The UCB Algorithm   451

Optimism in the Face of Uncertainty   452

Other Properties of UCB   454

Exercise 8.02: Implementing the UCB Algorithm   454

Thompson Sampling   459

Introduction to Bayesian Probability   460

The Thompson Sampling Algorithm   464

Exercise 8.03: Implementing the Thompson Sampling Algorithm   467

Contextual Bandits   472

Context That Defines a Bandit Problem   472

Queueing Bandits   473

Working with the Queueing API   475

Activity 8.01: Queueing Bandits   476

Summary   481

9. What Is Deep Q-Learning?   483

Introduction   484

Basics of Deep Learning   484

Basics of PyTorch   489

Exercise 9.01: Building a Simple Deep Learning Model in PyTorch   490

PyTorch Utilities   495

The view Function 496

The squeeze Function 496

The unsqueeze Function 497

The max Function 497

The gather Function 499

The State-Value Function and the Bellman Equation   500

Expected Value 501

The Value Function 501

The Value Function for a Deterministic Environment 502

The Value Function for a Stochastic Environment 503

The Action-Value Function (Q Value Function)    503

Implementing Q Learning to Find Optimal Actions   504

Advantages of Q Learning 506

OpenAI Gym Review   507

Exercise 9.02: Implementing the Q Learning Tabular Method   508

Deep Q Learning   514

Exercise 9.03: Implementing a Working DQN Network with PyTorch in a CartPole-v0 Environment   518

Challenges in DQN   526

Correlation between Steps and the Convergence Issue   526

Experience Replay   526

The Challenge of a Non-Stationary Target   529

The Concept of a Target Network   530

Exercise 9.04: Implementing a Working DQN Network with Experience Replay and a Target Network in PyTorch    533

The Challenge of Overestimation in a DQN   542

Double Deep Q Network (DDQN)   543

Activity 9.01: Implementing a Double Deep Q Network in PyTorch for the CartPole Environment   546

Summary   549

10. Playing an Atari Game with Deep Recurrent Q-Networks   551

Introduction   552

Understanding the Breakout Environment   552

Exercise 10.01: Playing Breakout with a Random Agent    555

CNNs in TensorFlow   557

Exercise 10.02: Designing a CNN Model with TensorFlow   560

Combining a DQN with a CNN   563

Activity 10.01: Training a DQN with CNNs to Play Breakout   564

RNNs in TensorFlow   565

Exercise 10.03: Designing a Combination of CNN and RNN Models with TensorFlow   567

Building a DRQN   571

Activity 10.02: Training a DRQN to Play Breakout   571

Introduction to the Attention Mechanism and DARQN   573

Activity 10.03: Training a DARQN to Play Breakout   575

Summary   577

11. Policy-Based Methods for Reinforcement Learning   579

Introduction   580

Introduction to Value-Based and Model-Based RL   581

Introduction to Actor-Critic Model   583

Policy Gradients   584

Exercise 11.01: Landing a Spacecraft on the Lunar Surface Using Policy Gradients and the Actor-Critic Method   587

Deep Deterministic Policy Gradients   592

Ornstein-Uhlenbeck Noise   592

The ReplayBuffer Class   593

The Actor-Critic Model   595

Exercise 11.02: Creating a Learning Agent   599

Activity 11.01: Creating an Agent That Learns a Model Using DDPG   606

Improving Policy Gradients   607

Trust Region Policy Optimization   608

Proximal Policy Optimization   609

Exercise 11.03: Improving the Lunar Lander Example Using PPO   611

The Advantage Actor-Critic Method   617

Activity 11.02: Loading the Saved Policy to Run the Lunar Lander Simulation   619

Summary   620

12. Evolutionary Strategies for RL   623

Introduction   624

Problems with Gradient-Based Methods   624

Exercise 12.01: Optimization Using Stochastic Gradient Descent   626

Introduction to Genetic Algorithms   629

Exercise 12.02: Implementing Fixed-Value and Uniform Distribution Optimization Using GAs   631

Components: Population Creation   634

Exercise 12.03: Population Creation   636

Components: Parent Selection   638

Exercise 12.04: Implementing the Tournament and Roulette Wheel Techniques   641

Components: Crossover Application   645

Exercise 12.05: Crossover for a New Generation   647

Components: Population Mutation   650

Exercise 12.06: New Generation Development Using Mutation   651

Application to Hyperparameter Selection    655

Exercise 12.07: Implementing GA Hyperparameter Optimization for RNN Training   657

NEAT and Other Formulations   664

Exercise 12.08: XNOR Gate Functionality Using NEAT   666

Activity 12.01: Cart-Pole Activity   674

Summary   677

Appendix   679
