Solving the taxi problem using Q learning

To demonstrate the problem, let's say our agent is the driver. There are four locations, and the agent has to pick up a passenger at one location and drop them off at another. The agent receives +20 points as a reward for a successful drop-off and loses 1 point for every time step it takes. The agent also loses 10 points for illegal pickup and drop-off actions. So the goal of our agent is to learn to pick up and drop off passengers at the correct locations in a short time, without attempting illegal pickups or drop-offs.

In the environment, the letters (R, G, Y, B) represent the four locations, and the small rectangle is the taxi driven by the agent.

Let's look at the coding part:

import gym
import random

Now we create our environment using Gym:

env = gym.make("Taxi-v1")

What does this taxi environment look like? Like so:

env.render()
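
The exact output depends on the Gym version, but the rendering is a small ASCII grid along these lines, where the pipes (|) are walls the taxi cannot cross, the letters are the four pickup/drop-off locations, and the highlighted cell marks the taxi's current position:

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+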

Next, let us initialize the learning rate alpha, the discount factor gamma, and the exploration rate epsilon:

alpha = 0.4
gamma = 0.999
epsilon = 0.017

Then we initialize a Q table as a dictionary that stores the value of each state-action pair, keyed by (state, action):

q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        q[(s,a)] = 0.0
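
As a quick sanity check (assuming the env created above), the Taxi environment has 500 discrete states and 6 actions, so this loop creates 3,000 entries, all initialized to zero:

print(env.observation_space.n)   # 500 states
print(env.action_space.n)        # 6 actions
print(len(q))                    # 3000 (state, action) entries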

We will define a function for updating the Q table via our Q learning update rule; if you look at the following function, you will see that we take the maximum Q value over all actions in the next state and store it in the qa variable. Then we update the Q value of the previous state-action pair via our update rule, as in:

def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):
    qa = max([q[(nextstate, a)] for a in range(env.action_space.n)])
    q[(prev_state, action)] += alpha * (reward + gamma * qa - q[(prev_state, action)])
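
To make the arithmetic of the update concrete, here is a tiny standalone sketch with made-up numbers (it does not touch the q dictionary above; the values are purely illustrative):

alpha, gamma = 0.4, 0.999
q_prev = 1.0    # hypothetical current Q(prev_state, action)
reward = -1     # per-step penalty from the environment
qa = 2.0        # hypothetical max Q(next_state, a) over all actions

q_prev += alpha * (reward + gamma * qa - q_prev)
print(q_prev)   # 1.0 + 0.4 * (-1 + 1.998 - 1.0) = 0.9992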

Then, we define a function for performing the epsilon-greedy policy, where we pass the state and the epsilon value. We draw a random number from a uniform distribution; if it is less than epsilon, we explore by sampling a random action, otherwise we exploit the action that has the maximum Q value:


def epsilon_greedy_policy(state, epsilon):
    if random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key = lambda x: q[(state,x)])
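
As a rough illustration of the exploration/exploitation split (a standalone sketch, not part of the agent), with epsilon = 0.017 about 1.7% of the chosen actions end up being random explorations:

import random
random.seed(0)
explores = sum(1 for _ in range(10000) if random.uniform(0, 1) < 0.017)
print(explores / 10000)   # close to 0.017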

Now we put all these functions together and see how to perform Q learning:

# For each episode
for i in range(8000):

    r = 0

    # First, we initialize the environment
    prev_state = env.reset()

    while True:

        # In each state, we select an action by the epsilon-greedy policy
        action = epsilon_greedy_policy(prev_state, epsilon)

        # Then we take the selected action and move to the next state
        nextstate, reward, done, _ = env.step(action)

        # And we update the Q value using the update_q_table() function,
        # which updates the Q table according to our update rule
        update_q_table(prev_state, action, reward, nextstate, alpha, gamma)

        # Then we update the previous state as the next state
        prev_state = nextstate

        # And accumulate the rewards in r
        r += reward

        # If done, i.e. if we reached the terminal state of the episode,
        # we break the loop and start the next episode
        if done:
            break

    print("total reward: ", r)

env.close()

The complete code is given here:


import random
import gym

env = gym.make('Taxi-v1')

alpha = 0.4
gamma = 0.999
epsilon = 0.017

q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        q[(s,a)] = 0.0


def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):
    qa = max([q[(nextstate, a)] for a in range(env.action_space.n)])
    q[(prev_state, action)] += alpha * (reward + gamma * qa - q[(prev_state, action)])

def epsilon_greedy_policy(state, epsilon):
    if random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key = lambda x: q[(state,x)])

for i in range(8000):
    r = 0
    prev_state = env.reset()

    while True:

        env.render()

        # In each state, we select the action by the epsilon-greedy policy
        action = epsilon_greedy_policy(prev_state, epsilon)

        # Then we perform the action, move to the next state, and
        # receive the reward
        nextstate, reward, done, _ = env.step(action)

        # Next we update the Q value using our update_q_table function,
        # which updates the Q value by the Q learning update rule
        update_q_table(prev_state, action, reward, nextstate, alpha, gamma)

        # Finally, we update the previous state as the next state
        prev_state = nextstate

        # Store all the rewards obtained
        r += reward

        # We will break the loop if we are at the terminal
        # state of the episode
        if done:
            break

    print("total reward: ", r)

env.close()
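
Once training finishes, you could roll out the learned greedy policy for a single episode to see how the agent behaves. This is only a sketch: it assumes the trained q dictionary is still in memory and re-creates the environment, since we closed it above:

env = gym.make('Taxi-v1')
state = env.reset()
total_reward = 0

while True:
    env.render()
    # Always pick the action with the highest learned Q value (no exploration)
    action = max(range(env.action_space.n), key=lambda x: q[(state, x)])
    state, reward, done, _ = env.step(action)
    total_reward += reward
    if done:
        break

print("greedy episode reward:", total_reward)
env.close()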