Q Learning

Imagine that we have some function, Q, that can estimate the reward for taking an action:

Q(s, a) = estimated reward for taking action a in state s

For some state s and action a, it returns an estimate of the reward for taking that action in that state. If we knew all the rewards for our environment, we could simply loop through Q and pick the action that gives us the biggest reward. But, as we mentioned in the previous section, our agent can't know all of the environment's rewards and state transition probabilities. So our Q function needs to approximate the reward instead.
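To make this concrete, here is a minimal sketch of the "loop through Q and pick the best action" idea, using a small NumPy array as a stand-in Q-table. The state and action counts are placeholders for this example, not values from the text.

    import numpy as np

    # Hypothetical tabular Q function: q_table[s, a] holds our current
    # estimate of the reward for taking action a in state s.
    n_states, n_actions = 5, 3
    q_table = np.zeros((n_states, n_actions))

    def best_action(state):
        """Pick the action with the largest estimated reward in this state."""
        return int(np.argmax(q_table[state]))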

We can approximate this ideal Q function with a recursively defined Q function known as the Bellman Equation:

Q(s, a) = r + γ max_a' Q(s', a')

In this case, r is the reward for the next action, and we then apply the Q function recursively to the next state (over and over) to determine the future reward of the action. In doing so, we apply gamma (γ) as a discount to future rewards relative to current rewards. As long as gamma is less than 1, it keeps our reward series from growing without bound. More intuitively, a reward in a future state is less valuable than the same reward in the current state. Concretely, if someone offered you $100 today or $100 tomorrow, you should take the money now, because tomorrow is uncertain.
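One way to read the Bellman Equation as an update rule is to nudge the current estimate of Q(s, a) toward the observed reward plus the discounted value of the best next action. The sketch below assumes the same tabular q_table as above and introduces a learning rate, alpha, which is an assumption of this example rather than something defined in the text.

    import numpy as np

    gamma = 0.95   # discount factor for future rewards
    alpha = 0.1    # learning rate (an assumption of this sketch)

    def q_update(q_table, state, action, reward, next_state):
        """Move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
        target = reward + gamma * np.max(q_table[next_state])
        q_table[state, action] += alpha * (target - q_table[state, action])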

If we allowed our agent to experience every possible state transition and used this function to estimate our reward, we would eventually arrive at the ideal Q function we set out to approximate.
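Putting these pieces together, a tabular training loop might look like the sketch below. The env object and its reset()/step() methods are hypothetical stand-ins for an environment interface, and the epsilon-greedy exploration is one common way to help the agent visit as many state transitions as possible; none of these specifics come from the text.

    import numpy as np

    def train(env, q_table, episodes=1000, gamma=0.95, alpha=0.1, epsilon=0.1):
        """Hypothetical loop: env.reset() returns a state index and
        env.step(action) returns (next_state, reward, done)."""
        for _ in range(episodes):
            state = env.reset()
            done = False
            while not done:
                # Explore occasionally so we eventually see every transition.
                if np.random.rand() < epsilon:
                    action = np.random.randint(q_table.shape[1])
                else:
                    action = int(np.argmax(q_table[state]))
                next_state, reward, done = env.step(action)
                # Bellman-style update toward r + gamma * max_a' Q(s', a').
                target = reward + gamma * np.max(q_table[next_state])
                q_table[state, action] += alpha * (target - q_table[state, action])
                state = next_state
        return q_table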
