To demonstrate the problem, let's say our agent is the taxi driver. There are four locations, and the agent has to pick up a passenger at one location and drop them off at another. The agent receives +20 points as a reward for a successful drop-off and loses 1 point for every time step it takes. It also loses 10 points for illegal pickups and drop-offs. So the goal of our agent is to learn to pick up and drop off passengers at the correct locations in a short time, without performing illegal pickups or drop-offs.
The environment is shown here, where the letters (R, G, Y, B) represent the different locations and a tiny rectangle is the agent driving the taxi:
Let's look at the coding part:
import gym
import random
Now, we create our environment using Gym:

env = gym.make("Taxi-v3")
We can render the environment to see what it looks like:
env.render()
Okay, first let us initialize our learning rate alpha, the discount factor gamma, and the epsilon value:
alpha = 0.4
gamma = 0.999
epsilon = 0.017
Then, we initialize the Q table as a dictionary that maps each (state, action) pair to its Q value:

q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        q[(s, a)] = 0.0
Next, we define a function for updating the Q table according to the Q learning update rule. In the following function, we take the maximum Q value over all actions in the next state and store it in the qa variable. Then we update the Q value of the previous state-action pair using the update rule:

def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):
    qa = max([q[(nextstate, a)] for a in range(env.action_space.n)])
    q[(prev_state, action)] += alpha * (reward + gamma * qa - q[(prev_state, action)])
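To make the update rule concrete, here is a small standalone sketch with made-up numbers; the states, actions, and reward below are hypothetical and not taken from the Taxi environment:

```python
# Toy Q table with two states (0, 1) and two actions (0, 1)
q = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 5.0, (1, 1): 2.0}
alpha, gamma = 0.4, 0.999

# Suppose we took action 1 in state 0, received reward -1,
# and landed in state 1
prev_state, action, reward, nextstate = 0, 1, -1, 1

# Best Q value achievable from the next state: max(5.0, 2.0) = 5.0
qa = max(q[(nextstate, a)] for a in (0, 1))

# Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
q[(prev_state, action)] += alpha * (reward + gamma * qa - q[(prev_state, action)])

# 0.0 + 0.4 * (-1 + 0.999 * 5.0 - 0.0) = 0.4 * 3.995 = 1.598
print(round(q[(0, 1)], 3))  # -> 1.598
```

Note how the immediate reward is negative, yet the Q value still increases, because the discounted value of the next state outweighs the step penalty.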
Then, we define a function for performing the epsilon-greedy policy, which takes the state and the epsilon value. We draw a random number from a uniform distribution; if it is less than epsilon, we explore by selecting a random action in that state, otherwise we exploit the action that has the maximum Q value:

def epsilon_greedy_policy(state, epsilon):
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key=lambda x: q[(state, x)])
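The explore/exploit split can be seen in isolation with a toy Q table; the states, actions, and values below are made up for illustration and do not come from the Taxi environment:

```python
import random

# Toy Q table for a single state 0 with three actions; action 2 has
# the highest value, so the greedy choice is action 2
q = {(0, 0): 1.0, (0, 1): 4.0, (0, 2): 9.0}
n_actions = 3

def epsilon_greedy(state, epsilon):
    # With probability epsilon, explore a random action;
    # otherwise exploit the action with the maximum Q value
    if random.uniform(0, 1) < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q[(state, a)])

# With epsilon = 0 the policy is purely greedy and always picks action 2
print(epsilon_greedy(0, 0.0))  # -> 2

# With epsilon = 1 the policy always explores, so the choices
# are spread roughly uniformly over the three actions
random.seed(0)
counts = [0, 0, 0]
for _ in range(3000):
    counts[epsilon_greedy(0, 1.0)] += 1
print(counts)
```

A small epsilon such as 0.017, as used above, means the agent exploits its current Q estimates almost all the time, with only occasional exploratory actions.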
Now we put all these functions together and see how Q learning is performed:

# For each episode
for i in range(8000):
    r = 0
    # First, we initialize the environment
    prev_state = env.reset()
    while True:
        # In each state, we select an action by the epsilon-greedy policy
        action = epsilon_greedy_policy(prev_state, epsilon)
        # Then we take the selected action and move to the next state
        nextstate, reward, done, _ = env.step(action)
        # And we update the Q value using the update_q_table() function,
        # which updates the Q table according to our update rule
        update_q_table(prev_state, action, reward, nextstate, alpha, gamma)
        # Then we update the previous state as the next state
        prev_state = nextstate
        # And accumulate the rewards in r
        r += reward
        # If done, i.e. if we reached the terminal state of the episode,
        # we break the loop and start the next episode
        if done:
            break
    print("total reward: ", r)

env.close()
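Once training finishes, the greedy policy can be read directly off the Q table: for every state, simply pick the action with the highest Q value. A minimal sketch on a hand-built Q table (the values below are hypothetical, not learned from the Taxi environment):

```python
# A hypothetical learned Q table for two states and three actions
q = {
    (0, 0): 1.2, (0, 1): 3.4, (0, 2): 0.5,
    (1, 0): 2.0, (1, 1): 1.1, (1, 2): 4.7,
}
n_states, n_actions = 2, 3

# For every state, the policy picks the action with the highest Q value
policy = {
    s: max(range(n_actions), key=lambda a: q[(s, a)])
    for s in range(n_states)
}
print(policy)  # -> {0: 1, 1: 2}
```

This is exactly what the epsilon-greedy policy does in its exploit branch; dropping the exploration term gives the final deterministic policy the agent has learned.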
The complete code is given here:
import random
import gym

env = gym.make('Taxi-v3')

alpha = 0.4
gamma = 0.999
epsilon = 0.017

q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        q[(s, a)] = 0.0

def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):
    qa = max([q[(nextstate, a)] for a in range(env.action_space.n)])
    q[(prev_state, action)] += alpha * (reward + gamma * qa - q[(prev_state, action)])

def epsilon_greedy_policy(state, epsilon):
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key=lambda x: q[(state, x)])

for i in range(8000):
    r = 0
    prev_state = env.reset()
    while True:
        env.render()
        # In each state, we select an action by the epsilon-greedy policy
        action = epsilon_greedy_policy(prev_state, epsilon)
        # Then we perform the action, move to the next state, and
        # receive the reward
        nextstate, reward, done, _ = env.step(action)
        # Next, we update the Q value using our update_q_table() function,
        # which updates the Q value by the Q learning update rule
        update_q_table(prev_state, action, reward, nextstate, alpha, gamma)
        # Finally, we update the previous state as the next state
        prev_state = nextstate
        # Accumulate all the rewards obtained
        r += reward
        # We break the loop if we are at the terminal
        # state of the episode
        if done:
            break
    print("total reward: ", r)

env.close()