Markov Decision Process

To avoid load problems and computational difficulties, the Agent-Environment interaction is modeled as a Markov Decision Process (MDP). An MDP is a discrete-time stochastic control process. At each time step, the process is in a state s, and the decision maker may choose any action a that is available in state s. The process responds at the next time step by moving randomly into a new state s' and giving the decision maker a corresponding reward, r(s, s').

Under these hypotheses, the Agent-Environment interaction can be schematized as follows:

  • The agent and the environment interact at discrete time steps, t = 0, 1, 2, …, n.
  • At each time step, the agent receives a representation of the state s_t of the environment.
  • Each state s_t is an element of S, the set of possible states.
  • Once the state is recognized, the agent must take an action a_t ∈ A(s_t), where A(s_t) is the set of actions available in state s_t.
  • The choice of the action to be taken depends on the objective to be achieved and is mapped through the policy, indicated with the symbol π, which associates an action a_t ∈ A(s) with each state s. The term π_t(s, a) represents the probability that action a is carried out in state s.
  • During the next time step, t + 1, as a consequence of the action a_t, the agent receives a numerical reward r_(t+1) ∈ R corresponding to the action previously taken.
  • The other consequence of the action is the new state s_(t+1). At this point, the agent must again encode the state and choose the next action.
  • This iteration repeats until the agent achieves its objective (a minimal sketch of this loop follows the list).
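
The loop below is a minimal sketch of this interaction, assuming a hypothetical toy environment (a corridor of five states with a goal at one end) and a random policy; the names step, policy, and GOAL are illustrative and not part of any specific library.

    # A minimal sketch of the agent-environment loop described above.
    # The environment (a five-state corridor with the goal at state 4), the
    # step and policy functions, and all constants are hypothetical examples.
    import random

    STATES = [0, 1, 2, 3, 4]        # S: the set of possible states
    ACTIONS = ["left", "right"]     # A(s): here the same actions in every state
    GOAL = 4

    def step(state, action):
        """Environment response: return the next state s_(t+1) and reward r(s, s')."""
        next_state = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
        reward = 1.0 if next_state == GOAL else 0.0
        return next_state, reward

    def policy(state):
        """pi(s, a): here a uniform random choice over the available actions."""
        return random.choice(ACTIONS)

    state = 0                                  # s_0: initial state
    total_reward = 0.0
    for t in range(100):                       # discrete time steps t = 0, 1, 2, ...
        action = policy(state)                 # the agent chooses a_t in A(s_t)
        state, reward = step(state, action)    # the environment returns s_(t+1), r_(t+1)
        total_reward += reward
        if state == GOAL:                      # the loop ends when the objective is reached
            break

    print(f"Finished in state {state} after {t + 1} steps, total reward {total_reward}")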

In an MDP, the next state s_(t+1) depends only on the previous state and the action taken, that is:

s_(t+1) = δ(s_t, a_t)

Here, δ represents the state transition function.
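
As a small illustration, δ can be thought of as a lookup from a (state, action) pair to the next state. The transition table below is a hypothetical example, not taken from the text.

    # A minimal sketch of a deterministic state transition function delta(s, a).
    # The states, actions, and transition table are hypothetical examples.
    DELTA = {
        ("s0", "right"): "s1",
        ("s1", "right"): "s2",
        ("s1", "left"): "s0",
        ("s2", "left"): "s1",
    }

    def delta(state, action):
        """Return s_(t+1) = delta(s_t, a_t); undefined moves leave the state unchanged."""
        return DELTA.get((state, action), state)

    print(delta("s0", "right"))   # -> s1
    print(delta("s0", "left"))    # -> s0 (no transition defined)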

In summary:

  • In an MDP, the agent can perceive the state s ∈ S in which it is and has a set A of actions at its disposal
  • At each discrete time step t, the agent detects the current state s_t and decides to carry out an action a_t ∈ A
  • The environment responds by providing a reward (a reinforcement) r_t = r(s_t, a_t) and by moving into the state s_(t+1) = δ(s_t, a_t)
  • The r and δ functions are part of the environment; they depend only on the current state and action (not on previous ones) and are not necessarily known to the agent
  • The goal of reinforcement learning is to learn a policy that, for each state s the system finds itself in, indicates to the agent an action that maximizes the total reinforcement received over the entire action sequence (a learning sketch follows this list)
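
The sketch below shows one common way such a policy can be learned: tabular Q-learning on the same hypothetical corridor environment used earlier. It is an illustrative assumption, not a method prescribed here; note that the agent never uses r or δ directly, only the rewards and states the environment returns.

    # A minimal tabular Q-learning sketch on the hypothetical corridor environment
    # used above; the constants and the epsilon-greedy scheme are illustrative choices.
    import random

    STATES, ACTIONS, GOAL = range(5), ["left", "right"], 4
    ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1     # learning rate, discount, exploration rate

    def step(state, action):
        next_state = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
        return next_state, (1.0 if next_state == GOAL else 0.0)

    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

    for episode in range(200):
        s = 0
        while s != GOAL:
            # epsilon-greedy choice: explore occasionally, otherwise act greedily
            if random.random() < EPSILON:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: Q[(s, x)])
            s_next, r = step(s, a)
            # update Q(s, a) toward the reward plus the discounted value of the next state
            best_next = max(Q[(s_next, x)] for x in ACTIONS)
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s_next

    # The learned (greedy) policy: for each state, the action with the highest Q-value
    learned_policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
    print(learned_policy)   # expected: "right" in every non-goal state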

Let's go deeper into some of the terms used:

  • A reward function defines the goal in a reinforcement learning problem. It maps the detected states of the environment to a single number, the reward. As already mentioned, the agent's only goal is to maximize the total reward it receives in the long run. The reward function therefore defines which events are good and which are bad for the agent. The reward function must be defined correctly, and it can be used as a basis for changing the policy. For example, if an action selected by the policy is followed by a low reward, the policy can be changed to select a different action in that situation in the future.
  • A policy defines the behavior of the learning agent at a given time. It maps the detected states of the environment to the actions to take when in those states. It corresponds to what in psychology would be called a set of stimulus-response rules or associations. The policy is the fundamental part of a reinforcement learning agent, in the sense that it alone is enough to determine behavior.
  • A value function represents how good a state is for an agent. It is equal to the total reward an agent can expect to accumulate starting from state s. The value function depends on the policy with which the agent selects the actions to perform (a short value-estimation sketch follows this list).
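
As an illustration of a value function, the sketch below estimates V(s) for a fixed policy by repeated Bellman backups on the same hypothetical corridor environment; the setup is assumed for demonstration only.

    # A minimal iterative policy evaluation sketch: estimating the value V(s) of
    # each state under a fixed policy, on the same hypothetical corridor environment.
    STATES, GOAL, GAMMA = range(5), 4, 0.9

    def step(state, action):
        next_state = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
        return next_state, (1.0 if next_state == GOAL else 0.0)

    fixed_policy = {s: "right" for s in STATES}   # the policy being evaluated
    V = {s: 0.0 for s in STATES}                  # initial value estimates

    for _ in range(50):                           # sweep until the values stabilize
        for s in STATES:
            if s == GOAL:
                continue                          # the terminal state keeps value 0
            s_next, r = step(s, fixed_policy[s])
            V[s] = r + GAMMA * V[s_next]          # deterministic Bellman backup

    print({s: round(v, 3) for s, v in V.items()})   # values grow as states get closer to the goal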