Markov Decision Processes

This world that we've framed up happens to be a Markov Decision Process (MDP), which has the following properties:

  • It has a finite set of states, S
  • It has a finite set of actions, A
  • P_a(s, s') is the probability that taking action a in state s will transition to state s'
  • R_a(s, s') is the immediate reward for transitioning from s to s' under action a
  • γ is the discount factor, which is how much we discount future rewards over present rewards (more on this later)
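
Here's a minimal sketch of those pieces in Python, using a made-up three-state toy world. The state names, action names, and all the numbers are illustrative, not taken from the text.

    # Finite set of states S, finite set of actions A, and the discount factor.
    states = ["s0", "s1", "s2"]
    actions = ["left", "right"]
    gamma = 0.9

    # P[s][a] maps each possible next state s' to the probability of landing
    # there when taking action a in state s.
    P = {
        "s0": {"left": {"s0": 1.0}, "right": {"s1": 0.8, "s0": 0.2}},
        "s1": {"left": {"s0": 1.0}, "right": {"s2": 0.8, "s1": 0.2}},
        "s2": {"left": {"s1": 1.0}, "right": {"s2": 1.0}},
    }

    # R[s][a][s'] is the immediate reward for the transition s -> s' under action a.
    R = {
        "s0": {"left": {"s0": 0.0}, "right": {"s1": 0.0, "s0": 0.0}},
        "s1": {"left": {"s0": 0.0}, "right": {"s2": 1.0, "s1": 0.0}},
        "s2": {"left": {"s1": 0.0}, "right": {"s2": 0.0}},
    }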

Once we have a policy function π(s) that determines which action to take for each state, the MDP has been solved and becomes a Markov chain.

And good news, it's totally possible to solve an MDP perfectly, with one caveat. That caveat is that all the rewards and probabilities for the MDP have to be known. It turns out this caveat is rather important, because most of the time an agent can't know all the rewards and state-transition probabilities: its environment is chaotic, or at least non-deterministic.
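
To make "solve an MDP perfectly" concrete, here's a sketch of value iteration over the toy P, R, and gamma defined above. The function name, the stopping threshold, and the helper structure are assumptions for illustration; the point is just that with full knowledge of rewards and transition probabilities, a simple sweep converges to an optimal policy π.

    def value_iteration(states, actions, P, R, gamma, tol=1e-6):
        # Start with a value of 0 for every state.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                # Expected discounted return of each action from state s.
                q = {
                    a: sum(p * (R[s][a][s2] + gamma * V[s2])
                           for s2, p in P[s][a].items())
                    for a in actions
                }
                best = max(q.values())
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            # Stop once no state's value changes by more than the tolerance.
            if delta < tol:
                break
        # The greedy policy: for each state, pick the action with the best value.
        policy = {
            s: max(actions,
                   key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                     for s2, p in P[s][a].items()))
            for s in states
        }
        return V, policy

    V, policy = value_iteration(states, actions, P, R, gamma)
    print(policy)  # maps each state to the action the solved MDP would take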
