Understanding RL

RL is an important area, but practitioners sometimes overlook it when tackling complex, real-world problems. It is unfortunate that most ML textbooks focus only on supervised and unsupervised learning while ignoring RL entirely.

RL has picked up momentum in recent years; however, its origins date back to around 1980, when it was developed by Rich Sutton and Andrew Barto, Rich's PhD thesis advisor. Even back then it was regarded as an archaic idea, but Rich believed in RL and its promise, maintaining that it would eventually be recognized.

A quick Google search for RL shows that its methods are often applied to games, such as checkers and chess. Gaming problems require taking actions over time to find a long-term optimal solution to a dynamic problem. They are dynamic in the sense that the conditions are constantly changing, sometimes in response to other agents, which can be adversarial.

Although RL's success is proven in the area of games, it is also an emerging area that is increasingly applied in other fields, such as finance, economics, and other interdisciplinary domains. A number of RL methods have grown up independently within the AI and operations research communities. It is therefore a key area for ML practitioners to learn about.

In simple terms, RL focuses on creating models that learn from their mistakes. Imagine that a person is put in a new environment. At first they will make mistakes, but they learn from them, so that when the same situation arises in the future, they do not make the same mistake again. RL uses the same approach to train a model, as follows:

Environment ----> Try and fail ----> Learn from failures ----> Reach goal
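
To make this loop concrete, here is a tiny, self-contained Python sketch. The Environment (a number-guessing game) and the Agent (which simply stops repeating guesses that failed) are illustrative stand-ins invented for this example rather than part of any particular RL library, but they follow the same try-and-fail, learn-from-failures, reach-the-goal pattern shown above:

import random

class Environment:
    """Reach the goal by guessing a secret number between 0 and 9."""
    def __init__(self):
        self.secret = random.randint(0, 9)

    def step(self, guess):
        return guess == self.secret          # True means the goal is reached

class Agent:
    def __init__(self):
        self.failed_guesses = set()          # memory of past mistakes

    def act(self):
        options = [g for g in range(10) if g not in self.failed_guesses]
        return random.choice(options)        # try something not yet known to fail

    def learn(self, guess, reached_goal):
        if not reached_goal:
            self.failed_guesses.add(guess)   # learn from the failure

env, agent = Environment(), Agent()
reached_goal = False
while not reached_goal:                      # try and fail until the goal is reached
    guess = agent.act()
    reached_goal = env.step(guess)
    agent.learn(guess, reached_goal)
print("Goal reached after", len(agent.failed_guesses) + 1, "tries")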

Historically, you couldn't use ML to get an algorithm to learn how to become better than a human at performing a certain task. All that could be done was to model the machine's behavior on a human's actions and, perhaps, have the computer run through them faster. RL, however, makes it possible to create models that become better than humans at performing certain tasks.

Isaac Abhadu, CEO and co-founder of SYBBIO, gave a wonderful explanation on Quora of how RL works compared to supervised learning. He stated that an RL framework is, in a nutshell, very similar to that of supervised learning.

Suppose we're trying to get an algorithm to excel at the game of Pong. We have input frames that we run through a model to produce some output actions, just as we would in a supervised learning setting. The difference in RL is that we ourselves do not know what the target labels are, so we cannot tell the machine the best thing to do in every specific situation. Instead, we apply something called a policy gradient method.

So, we start with a random network and feed it an input frame; it produces a random output action in response to that frame. This action is sent back to the game engine, which produces another frame, and the loop continues over and over. The only feedback the agent gets is the game's scoreboard. Whenever it does something right – that is, it produces a successful sequence – it gets a point, generally termed a reward. Whenever it produces a failing sequence, it gets a point removed – a penalty.
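
The loop described above might look roughly like the following sketch. There is no real Pong engine here, so a stand-in function (fake_pong_step) produces random frames and random scoreboard rewards; the 80 x 80 frame size, the two actions (up and down), and the simple linear softmax policy are illustrative assumptions, not details from the explanation itself:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(80 * 80, 2))     # random policy network: frame -> P(up), P(down)

def policy(frame):
    """Map a frame to action probabilities with a linear softmax policy."""
    logits = frame.reshape(-1) @ W
    p = np.exp(logits - logits.max())
    return p / p.sum()

def fake_pong_step(action):
    """Stand-in for the game engine: returns the next frame and a reward
    of +1, 0, or -1 taken from the 'scoreboard'."""
    return rng.random((80, 80)), rng.choice([-1, 0, 1], p=[0.1, 0.8, 0.1])

frames, actions, rewards = [], [], []
frame = rng.random((80, 80))                      # the first input frame
for t in range(1000):                             # the loop runs over and over
    p = policy(frame)
    action = rng.choice(2, p=p)                   # a (mostly random) output action
    next_frame, reward = fake_pong_step(action)   # engine returns the next frame + feedback
    frames.append(frame)
    actions.append(action)
    rewards.append(reward)
    frame = next_frame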

The agent's ultimate goal is to keep updating its policy so that it collects as much reward as possible. Over time, it will figure out how to beat a human at the game.
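
Continuing the sketch above, a policy-gradient (REINFORCE-style) update uses the collected frames, actions, and rewards to make reward-earning action sequences more likely. The discount factor, learning rate, and return normalization below are common but illustrative choices, not details prescribed by the text:

gamma, lr = 0.99, 1e-3

# Discounted return following each time step: an action is credited with the
# rewards that came after it.
returns = np.zeros(len(rewards))
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running
returns = (returns - returns.mean()) / (returns.std() + 1e-8)   # variance-reduction trick

grad = np.zeros_like(W)
for frame, action, G in zip(frames, actions, returns):
    p = policy(frame)
    one_hot = np.eye(2)[action]
    # d log pi(action | frame) / dW for the linear softmax policy, weighted by the return
    grad += np.outer(frame.reshape(-1), one_hot - p) * G

W += lr * grad   # gradient ascent: the policy shifts toward actions that led to reward

Repeating this collect-then-update cycle many times is what gradually moves the policy away from random play and toward behavior that scores points.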

RL is not quick. The agent is going to lose a lot at first. But we keep feeding it frames, it keeps producing output actions, and it stumbles upon actions that are successful. It keeps accumulating knowledge about which moves work and, after a while, becomes practically invincible.
