Introduction

In March 2016, AlphaGo, the program developed by Google's DeepMind, defeated the world's best Go player, 18-time world champion Lee Sedol, by 4 games to 1. The match was historic because Go is a notoriously difficult game for computers to play, with:

208,168,199,381,979,984,699,478,633,344,862,770,286,522,
453,884,530,548,425,639,456,820,927,419,612,738,015,378,
525,648,451,698,519,643,907,259,916,015,628,128,546,089,
888,314,427,129,715,319,317,557,736,620,397,247,064,840,935

possible legal board positions (roughly 2 × 10^170). Playing and winning Go cannot be done by simple brute force; it requires skill, creativity, and, as professional Go players say, intuition.

AlphaGo accomplished this remarkable feat with the help of deep neural networks trained with reinforcement learning (RL), combined with a state-of-the-art tree search algorithm. This chapter introduces RL and some of the algorithms we use to perform it.

So, the first question that arises is: what is RL, and how is it different from the supervised and unsupervised learning that we explored in earlier chapters?

Anyone who owns a pet knows that the best strategy for training it is to reward desirable behavior and punish bad behavior. RL, also called learning with a critic, is a learning paradigm in which the agent learns in the same manner. The Agent here corresponds to our network (program); it can perform a set of Actions (a), each of which brings about a change in the State (s) of the environment, and, in turn, the Agent perceives whether it receives a reward or a punishment.

For example, in the case of a dog, the dog is our Agent, the voluntary muscle movements the dog makes are its actions, and its surroundings (including us) constitute the environment; the dog perceives our reaction to its action, such as being given a bone, as a reward:

[Figure: the agent-environment interaction loop. Adapted from Reinforcement Learning: An Introduction by Sutton and Barto]

Even our brain has a group of subcortical nuclei situated at the base of the forebrain, called the basal ganglia, which, according to neuroscience, are responsible for action selection, that is, they help us decide which of several possible actions to execute at any given time.
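
In code, this interaction boils down to a simple loop. The following is a minimal, runnable sketch of that loop; the Environment and RandomAgent classes here are hypothetical stand-ins written just for this illustration, not part of any library:

```python
import random

class Environment:
    """A toy 1-D corridor: the State (s) is a position from 0 to 4; position 4 is the goal."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # The Action (a) is -1 (move left) or +1 (move right)
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else -0.1   # reward at the goal, small punishment otherwise
        done = (self.state == 4)
        return self.state, reward, done

class RandomAgent:
    """Chooses actions at random -- a stand-in for a learned policy."""
    def act(self, state):
        return random.choice([-1, +1])

env, agent = Environment(), RandomAgent()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = agent.act(state)                  # the Agent chooses an Action (a)
    state, reward, done = env.step(action)     # the environment moves to a new State (s)
    total_reward += reward                     # the Agent perceives the reward or punishment
print('Total reward for this run:', total_reward)
```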

The aim of the agent is to maximize the rewards and reduce the punishments. There are various challenges involved in making this decision, the most important one being how to maximize future rewards, also known as the temporal credit assignment problem.

The agent decides its action based on some policy (π); the agent learns this policy (π) based on its interactions with the environment. There are various policy learning algorithms; we will explore some of them in this chapter. The agent infers the optimal policy (π*) by a process of trial and error, and to learn the optimal policy, the agent requires an environment to interact with; we will be using OpenAI Gym, which provides different environments.
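
As a first taste of OpenAI Gym, the following sketch creates an environment, inspects its state and action spaces, and takes a single step with a randomly sampled action. It assumes the classic gym API, in which reset() returns an observation and step() returns a four-tuple; newer Gymnasium releases change these return values slightly:

```python
import gym

env = gym.make('CartPole-v0')        # any other Gym environment ID works the same way
print(env.observation_space)         # the space of States (s) the agent can observe
print(env.action_space)              # the set of Actions (a) the agent can take

state = env.reset()                                 # start a new run and get the initial state
action = env.action_space.sample()                  # sample a random legal action -- a trivial policy
next_state, reward, done, info = env.step(action)   # apply the action to the environment
print('reward:', reward, 'done:', done)
env.close()
```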

We have given here only a review of the basic concepts involved in RL; we assume that you are familiar with the concepts of the Markov decision process, the discount factor, and the value function (state value and action value).
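
As a quick reminder, using the standard definitions, the discounted return from time step t, the state-value function, and the action-value function under a policy π are:

$$ G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} $$

$$ V^{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid s_t = s\right], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid s_t = s, a_t = a\right] $$

where γ ∈ [0, 1] is the discount factor.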

In this chapter, and in the recipes that follow, we define an episode as one run of the game, for example, solving one Sudoku puzzle. Typically, an agent plays many episodes in order to learn an optimal policy, one that maximizes the rewards.
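
A sketch of that idea using Gym (again assuming the classic gym API): play several episodes under a purely random policy and record the total reward obtained in each one. A learned policy (π) would replace the random action choice:

```python
import gym

env = gym.make('CartPole-v0')
episode_rewards = []
for episode in range(10):                      # one run of the game = one episode
    state = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()     # placeholder for a learned policy (pi)
        state, reward, done, _ = env.step(action)
        total_reward += reward
    episode_rewards.append(total_reward)
print('Reward per episode:', episode_rewards)
env.close()
```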

It is really amazing to see how an RL agent, without any prior knowledge of the game, learns not only to play these games but even to beat humans at them.
