To demonstrate the problem, let's say our agent is the taxi driver. There are four locations, and the agent has to pick up a passenger at one location and drop them off at another. The agent receives +20 points as a reward for a successful drop-off and loses 1 point for every time step it takes. It also loses 10 points for illegal pickups and drop-offs. So the goal of our agent is to learn to pick up and drop off passengers at the correct locations in a short time, without performing illegal pickups or drop-offs.
The environment is shown here, where the letters (R, G, Y, B) represent the different locations and a tiny rectangle is the agent driving the taxi:
Let's look at the coding part:
import gym
import random
Now, we create our environment using Gym:

env = gym.make("Taxi-v3")
We can render the environment to see what it looks like:
env.render()
Okay, first let us initialize our learning rate alpha, the discount factor gamma, and the epsilon value:
alpha = 0.4
gamma = 0.999
epsilon = 0.017
Then, we initialize the Q table as a dictionary that maps each (state, action) pair to its Q value:

q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        q[(s, a)] = 0.0
Next, we define a function for updating the Q table according to the Q learning update rule. In the following function, we take the maximum Q value over all actions in the next state and store it in the qa variable. Then we update the Q value of the previous state-action pair using the update rule:

def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):
    qa = max([q[(nextstate, a)] for a in range(env.action_space.n)])
    q[(prev_state, action)] += alpha * (reward + gamma * qa - q[(prev_state, action)])
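To make the update rule concrete, here is a small standalone sketch with made-up numbers; the states, actions, and reward below are hypothetical and not taken from the Taxi environment:

```python
# Toy Q table with two states (0, 1) and two actions (0, 1)
q = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 5.0, (1, 1): 2.0}
alpha, gamma = 0.4, 0.999

# Suppose we took action 1 in state 0, received reward -1,
# and landed in state 1
prev_state, action, reward, nextstate = 0, 1, -1, 1

# Best Q value achievable from the next state: max(5.0, 2.0) = 5.0
qa = max(q[(nextstate, a)] for a in (0, 1))

# Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
q[(prev_state, action)] += alpha * (reward + gamma * qa - q[(prev_state, action)])

# 0.0 + 0.4 * (-1 + 0.999 * 5.0 - 0.0) = 0.4 * 3.995 = 1.598
print(round(q[(0, 1)], 3))  # -> 1.598
```

Note how the immediate reward is negative, yet the Q value still increases, because the discounted value of the next state outweighs the step penalty.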
Then, we define a function for performing the epsilon-greedy policy, which takes the state and the epsilon value. We draw a random number from a uniform distribution; if it is less than epsilon, we explore by selecting a random action in that state, otherwise we exploit the action that has the maximum Q value:

def epsilon_greedy_policy(state, epsilon):
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key=lambda x: q[(state, x)])
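The explore/exploit split can be seen in isolation with a toy Q table; the states, actions, and values below are made up for illustration and do not come from the Taxi environment:

```python
import random

# Toy Q table for a single state 0 with three actions; action 2 has
# the highest value, so the greedy choice is action 2
q = {(0, 0): 1.0, (0, 1): 4.0, (0, 2): 9.0}
n_actions = 3

def epsilon_greedy(state, epsilon):
    # With probability epsilon, explore a random action;
    # otherwise exploit the action with the maximum Q value
    if random.uniform(0, 1) < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q[(state, a)])

# With epsilon = 0 the policy is purely greedy and always picks action 2
print(epsilon_greedy(0, 0.0))  # -> 2

# With epsilon = 1 the policy always explores, so the choices
# are spread roughly uniformly over the three actions
random.seed(0)
counts = [0, 0, 0]
for _ in range(3000):
    counts[epsilon_greedy(0, 1.0)] += 1
print(counts)
```

A small epsilon such as 0.017, as used above, means the agent exploits its current Q estimates almost all the time, with only occasional exploratory actions.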
Now we put all these functions together and see how Q learning is performed:

# For each episode
for i in range(8000):
    r = 0
    # First, we initialize the environment
    prev_state = env.reset()
    while True:
        # In each state, we select an action by the epsilon-greedy policy
        action = epsilon_greedy_policy(prev_state, epsilon)
        # Then we take the selected action and move to the next state
        nextstate, reward, done, _ = env.step(action)
        # And we update the Q value using the update_q_table() function,
        # which updates the Q table according to our update rule
        update_q_table(prev_state, action, reward, nextstate, alpha, gamma)
        # Then we update the previous state as the next state
        prev_state = nextstate
        # And accumulate the rewards in r
        r += reward
        # If done, i.e. if we reached the terminal state of the episode,
        # we break the loop and start the next episode
        if done:
            break
    print("total reward: ", r)

env.close()
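Once training finishes, the greedy policy can be read directly off the Q table: for every state, simply pick the action with the highest Q value. A minimal sketch on a hand-built Q table (the values below are hypothetical, not learned from the Taxi environment):

```python
# A hypothetical learned Q table for two states and three actions
q = {
    (0, 0): 1.2, (0, 1): 3.4, (0, 2): 0.5,
    (1, 0): 2.0, (1, 1): 1.1, (1, 2): 4.7,
}
n_states, n_actions = 2, 3

# For every state, the policy picks the action with the highest Q value
policy = {
    s: max(range(n_actions), key=lambda a: q[(s, a)])
    for s in range(n_states)
}
print(policy)  # -> {0: 1, 1: 2}
```

This is exactly what the epsilon-greedy policy does in its exploit branch; dropping the exploration term gives the final deterministic policy the agent has learned.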
The complete code is given here:
import random
import gym

env = gym.make('Taxi-v3')

alpha = 0.4
gamma = 0.999
epsilon = 0.017

q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        q[(s, a)] = 0.0

def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):
    qa = max([q[(nextstate, a)] for a in range(env.action_space.n)])
    q[(prev_state, action)] += alpha * (reward + gamma * qa - q[(prev_state, action)])

def epsilon_greedy_policy(state, epsilon):
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key=lambda x: q[(state, x)])

for i in range(8000):
    r = 0
    prev_state = env.reset()
    while True:
        env.render()
        # In each state, we select an action by the epsilon-greedy policy
        action = epsilon_greedy_policy(prev_state, epsilon)
        # Then we perform the action, move to the next state, and
        # receive the reward
        nextstate, reward, done, _ = env.step(action)
        # Next, we update the Q value using our update_q_table() function,
        # which updates the Q value by the Q learning update rule
        update_q_table(prev_state, action, reward, nextstate, alpha, gamma)
        # Finally, we update the previous state as the next state
        prev_state = nextstate
        # Accumulate all the rewards obtained
        r += reward
        # We break the loop if we are at the terminal
        # state of the episode
        if done:
            break
    print("total reward: ", r)

env.close()