Reinforcement Learning fundamentals

Imagine that you want to learn to ride a bike and ask a friend for advice. They explain how the gears work, how to release the brake, and a few other technical details. In the end, you ask for the secret to keeping your balance. What kind of answer do you expect? In an imaginary supervised world, you would be able to perfectly quantify your actions and correct the errors by comparing the outcomes with precise reference values. In the real world, you have no idea about the quantities underlying your actions and, above all, you will never know what the right value is. Increasing the level of abstraction, the scenario we're considering can be described as follows: a generic agent performs actions inside an environment and receives feedback that is somehow proportional to the competence of its actions.

Based on this feedback, the agent can correct its actions in order to reach a specific goal. This basic schema is represented in the following diagram:

Basic RL schema

Returning to our initial example, when you ride a bike for the first time and try to keep your balance, you will notice that a wrong movement increases the tilt of the bike, which in turn increases the horizontal component of the gravitational force, pushing the bike sideways. As the vertical component is compensated for by the ground, the result is a rotation that ends only when the bike has fallen over completely. However, you can use your legs to control the balance: when the bike starts falling, the force on your leg increases (thanks to Newton's third law) and your brain understands that it's necessary to make a movement in the opposite direction. Even though this problem can easily be expressed in terms of physical laws, nobody learns to ride a bike by computing forces and moments.

This is one of the main concepts of RL: an agent must always make its choices based on a piece of information, usually called a reward, that represents the response provided by the environment. If the action is correct, the reward will be positive; otherwise, it will be negative. After receiving a reward, the agent can fine-tune its strategy, called a policy, in order to maximize the expected future reward. For example, after a few rides, you will be able to slightly shift your body so as to keep your balance while turning, but in the beginning you probably needed to extend your leg to avoid falling. Hence, your initial policy suggested a wrong action, which received repeated negative rewards, so your brain corrected it by increasing the probability of choosing another action. The implicit hypothesis underlying this approach is that the agent is always rational, meaning that its goal is to maximize the expected return of its actions (nobody would fall off on purpose just to feel a different emotion).
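To make this loop concrete, here is a minimal sketch in Python of how repeated rewards can shift the probabilities defined by a policy. The environment (a single state with two actions, where action 1 is assumed to be the correct one) and the naive update rule are purely illustrative assumptions, not an algorithm discussed in this chapter:

import numpy as np

# Hypothetical single-state environment with two actions:
# action 1 is the "correct" one and yields a positive reward.
def step(action):
    return 1.0 if action == 1 else -1.0

# The policy is a probability distribution over the two actions
policy = np.array([0.5, 0.5])
learning_rate = 0.1

for _ in range(100):
    action = np.random.choice(2, p=policy)
    reward = step(action)

    # Reinforce the chosen action proportionally to the reward received,
    # then renormalize so that the policy remains a valid distribution
    policy[action] = max(policy[action] + learning_rate * reward, 1e-3)
    policy /= policy.sum()

print(policy)  # The probability of action 1 is now close to 1

After enough iterations, the probability of the correct action approaches 1, mirroring the way the brain gradually suppresses the movement that keeps receiving negative rewards.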

Before discussing the individual components of an RL system, we need to add a couple of fundamental assumptions. The first one is that the agent can repeat its experiences an infinite number of times. In other words, we assume that it's possible to learn a valid policy (possibly the optimal one) only if we have enough time. Clearly, this is unacceptable in the animal world, and we all know that many experiences are extremely dangerous; however, this assumption is necessary to prove the convergence of some algorithms. Indeed, sub-optimal policies can sometimes be learned very quickly, but many iterations are needed to reach the optimal one. In real artificial systems, we always stop the learning process after a finite number of iterations, but it's almost impossible to find valid solutions if some experiences prevent the agent from continuing to interact with the environment. As many tasks have terminal states (either positive or negative), we assume that the agent can play any number of episodes (somewhat analogous to the epochs of supervised learning), exploiting the experience previously gained.
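The episodic structure can be sketched as follows; the agent and environment interfaces (reset, step, select_action, update) are hypothetical placeholders, and the point is only that each episode restarts from an initial state while the policy learned so far is carried over:

def train(agent, environment, n_episodes=1000, max_steps=200):
    # Hypothetical training loop: the agent and environment are placeholders
    for episode in range(n_episodes):
        state = environment.reset()              # new episode, same policy
        for t in range(max_steps):
            action = agent.select_action(state)  # action chosen by the current policy
            next_state, reward, done = environment.step(action)
            agent.update(state, action, reward, next_state)
            state = next_state
            if done:                             # terminal state (positive or negative)
                break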

The second assumption is a little more technical and is usually known as the Markov property. When the agent interacts with the environment, it observes a sequence of states. Even if it may seem like an oxymoron, we assume that each state is stateful. We can explain this concept with a simple example: suppose that you're filling a tank and every five seconds you measure the level. Imagine that at t = 0, the level is L = 10 and the water is flowing in. What do you expect at t = 1? Obviously, L > 10. In other words, without external unknown causes, we assume that a state contains the previous history, so that the sequence, even if discretized, represents a continuous evolution where no jumps are allowed.

When an RL task satisfies this property, it's called a Markov Decision Process (MDP) and it's very easy to employ simple algorithms to evaluate the actions. Luckily, the majority of natural events can be modeled as MDPs (when you're walking toward a door, every step in the right direction must decrease the distance), but there are some games that are intrinsically stateless. For example, if you want to employ an RL algorithm to learn how to guess the outcome of a probabilistic sequence of independent events (such as tossing a coin), the results could be dramatically wrong. The reason is clear: every state is independent of the previous ones, and any attempt to build up a history fails. Therefore, if you observe a sequence 0, 0, 0, 0, ..., you are not justified in increasing the value of betting on 0 unless, after considering the likelihood of the events, you suspect that the coin is biased. However, if there's no reason to do so, the process isn't an MDP and every episode (event) is completely independent. All the assumptions that we make, either implicitly or explicitly, are based on this fundamental concept, so pay attention when evaluating new, unusual scenarios, because you may discover that the use of a specific algorithm isn't theoretically justified.
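As a toy illustration of the difference (the inflow rate and the noise term are arbitrary values chosen for this sketch), the tank level at the next time step is a function of the current level alone, whereas a coin toss ignores the current observation entirely:

import numpy as np

# Tank example: the next level depends only on the current level (plus a
# constant inflow and a little measurement noise), so the Markov property holds
def tank_transition(level, inflow=2.0):
    return level + inflow + np.random.normal(0.0, 0.1)

# Counterexample: independent coin tosses. The next outcome does not depend on
# the current observation, so there is no history to exploit
def coin_transition(_previous_outcome):
    return np.random.randint(0, 2)

level = 10.0
for t in range(5):
    level = tank_transition(level)
    print("t = {}, L = {:.2f}".format(t + 1, level))

In the first case, knowing the current level is enough to predict the next one; in the second, no amount of observed history improves the prediction.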
