Q-learning

So, a very specific implementation of reinforcement learning is called Q-learning, and this formalizes what we just talked about a little bit more:

  • So again, you start with a set of environmental states of the agent (Is there a ghost next to me? Is there a power pill in front of me? Things like that.), we're going to call that s.
  • I have a set of possible actions that I can take in those states, we're going to call that set of actions a. In the case of Pac-Man, those possible actions are move up, down, left, or right.
  • Then we have a value for each state/action pair that we'll call Q; that's why we call it Q-learning. So, for each state, a given set of conditions surrounding Pac-Man, a given action will have a value Q. Moving up might have a given value Q, and moving down might have a negative Q value if it means encountering a ghost, for example. (A small sketch of how s, a, and Q might be represented follows this list.)

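As a concrete, purely illustrative sketch, here is one way s, a, and Q might be represented in Python; the state encoding, action names, and use of a dictionary are our own assumptions, not a fixed part of Q-learning:

from collections import defaultdict

# The moves Pac-Man can take in any state.
ACTIONS = ["up", "down", "left", "right"]

# A state is any hashable summary of Pac-Man's surroundings, for example
# a tuple of facts such as ("wall_west", "ghost_south").
example_state = ("wall_west", "ghost_south")

# Q maps (state, action) pairs to a learned value; unseen pairs start at 0.
Q = defaultdict(float)
print(Q[(example_state, "up")])    # 0.0 until we learn something better
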
So, we start off with a Q value of 0 for every possible state/action pair that Pac-Man could encounter. And, as Pac-Man explores the maze, as bad things happen to Pac-Man, we reduce the Q value for the action he took in the state he was in at the time. So, if Pac-Man ends up getting eaten by a ghost, we penalize whatever he did in that current state. As good things happen to Pac-Man, as he eats a power pill or eats a ghost, we'll increase the Q value for that action, for the state that he was in. Then, what we can do is use those Q values to inform Pac-Man's future choices, and sort of build a little intelligent agent that can perform optimally, and make a perfect little Pac-Man. From the same image of Pac-Man that we saw just above, we can further define the current state of Pac-Man by noting that he has a wall to the West, empty space to the North and East, and a ghost to the South.

We can look at the actions he can take: he can't actually move left at all, but he can move up, down, or right, and we can assign a value to all those actions. By going up or right, nothing really happens at all; there's no power pill or dots to consume. But if he goes down, that's definitely a negative value. We can say that for the state given by the current conditions surrounding Pac-Man, moving down would be a really bad choice; there should be a negative Q value for that. Moving left just can't be done at all. Moving up or right would just stay neutral, so the Q value would remain 0 for those action choices for that given state.
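
To make that concrete, a minimal sketch (not the book's code) of this bookkeeping might look like the following; the state encoding, reward numbers, and learning rate are all illustrative assumptions:

from collections import defaultdict

Q = defaultdict(float)                    # every state/action pair starts at 0
state = ("wall_west", "ghost_south")      # the situation described above

def update_q(Q, state, action, reward, learning_rate=0.1):
    """Nudge Q(state, action) toward the reward that was just observed."""
    Q[(state, action)] += learning_rate * (reward - Q[(state, action)])

def best_action(Q, state, legal_actions):
    """Pick the legal action with the highest learned Q value."""
    return max(legal_actions, key=lambda a: Q[(state, a)])

# Moving down meets the ghost, so penalize it; up and right stay at 0.
update_q(Q, state, "down", reward=-100)

# "left" is never offered as a choice because of the wall to the West.
print(best_action(Q, state, ["up", "down", "right"]))   # -> "up"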

Now, you can also look ahead a little bit, to make an even more intelligent agent. So, I'm actually two steps away from getting a power pill here. As Pac-Man explores this state, if eating that power pill happens on the next state, I can actually factor that into the Q value for the previous state. If you have some sort of a discount factor, based on how far away you are in time, how many steps away you are, you can factor that all in together. So, that's a way of actually building a little bit of memory into the system. You can "look ahead" more than one step by using a discount factor when computing Q (here s is the previous state and s' is the current state):

Q(s,a) += discount * (reward(s,a) + max(Q(s')) - Q(s,a))

So, the Q value that I experience when I consume that power pill might actually give a boost to the previous Q values that I encountered along the way. So, that's a way to make Q-learning even better.
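
Again as a purely illustrative sketch, here is that look-ahead update written out; the variable names mirror the formula above, and the states, reward values, and discount are invented:

from collections import defaultdict

Q = defaultdict(float)
ACTIONS = ["up", "down", "left", "right"]

def td_update(Q, s, a, reward, s_prime, discount=0.9):
    """Pull Q(s, a) toward the reward plus the best discounted future value."""
    best_future = max(Q[(s_prime, a2)] for a2 in ACTIONS)
    Q[(s, a)] += discount * (reward + best_future - Q[(s, a)])

s, s_prime = ("two_steps_from_pill",), ("one_step_from_pill",)

# Pretend the next state already looks valuable because a power pill is
# reachable from it.
Q[(s_prime, "up")] = 50.0

td_update(Q, s, "up", reward=0, s_prime=s_prime)
print(Q[(s, "up")])    # 45.0 -- the earlier move gets a discounted boost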
