Q learning to balance Cart-Pole

As discussed in the introduction, we have an environment described by a state s (s∈S, where S is the set of all possible states) and an agent that can perform an action a (a∈A, where A is the set of all possible actions), resulting in the movement of the agent from one state to another. The agent is rewarded for its action, and the goal of the agent is to maximize the reward. In Q learning, the agent learns which action to take (the policy, π) by estimating the quality (Q) of a state-action combination that maximizes the reward (R). When choosing an action, the agent takes into account not only the immediate reward but also the discounted future rewards:

Q: S × A → R

The agent starts with some arbitrary initial value of Q. As the agent selects an action a and receives a reward r, it transitions to a new state s' (which depends on the current state s and the action a) and updates the Q value:

Q(s,a) = (1 - α) Q(s,a) + α [r + γ max_{a'} Q(s',a')]

Here, α is the learning rate and γ is the discount factor. The first term preserves the old value of Q, and the second term provides an improved estimate of the Q value (it includes the present reward and the discounted rewards for future actions). This reduces the Q value when the resulting state is undesirable, making the agent less likely to choose the same action the next time it encounters this state. Similarly, when the resulting state is desirable, the corresponding Q value increases.
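To make the update rule concrete, the following is a minimal sketch of it applied to a lookup table of Q values. The table shape, the integer state/action indices, and the hyperparameter values are arbitrary choices for illustration, not values from this recipe:

```python
import numpy as np

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply Q(s,a) = (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    target = reward + gamma * np.max(Q[next_state])          # improved estimate
    Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target
    return Q

# Example: a hypothetical 10-state, 2-action table initialized to zero.
Q = np.zeros((10, 2))
Q = q_update(Q, state=3, action=1, reward=1.0, next_state=4)
```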

The simplest implementation of Q learning maintains and updates a state-action value lookup table of size N × M, where N is the number of possible states and M is the number of possible actions. For most environments this table becomes very large; the larger it is, the more time is needed to search it and the more memory is needed to store it, so a lookup table is not a feasible solution. In this recipe, we will instead use a neural network (NN) implementation of Q learning, where the network is employed as a function approximator to predict the value function (Q). The NN has one output node per possible action, and each output represents the value function of the corresponding action.
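As a rough sketch of what such a function approximator could look like for Cart-Pole, assuming the standard Gym CartPole environment with a 4-dimensional state and 2 discrete actions, and using TensorFlow/Keras purely for illustration (the layer sizes and optimizer settings are arbitrary, not the recipe's actual configuration):

```python
import tensorflow as tf

num_states = 4    # cart position, cart velocity, pole angle, pole angular velocity
num_actions = 2   # push cart left or push cart right

# Network maps a state to one Q value per action (linear outputs).
q_network = tf.keras.Sequential([
    tf.keras.layers.Dense(24, activation='relu', input_shape=(num_states,)),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(num_actions)
])
q_network.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss='mse')
```

During training, the network's prediction for the chosen action is regressed toward the target r + γ max_{a'} Q(s',a'), mirroring the tabular update above.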
