Algorithm explanation

The algorithm begins by initializing the Q-function, Q(S, A), with starting values (mostly 0). Then, for every episode, a state (S) is initialized and an action (A) is chosen from the available actions using a policy derived from Q, such as the epsilon-greedy method. This is followed by a loop over every time step, where the updated state (S') and reward (R) are observed for the chosen action. A new action (A') is then chosen for S' using the same epsilon-greedy policy, and the Q-table is filled in by applying the update equation. Finally, the updated state (S') is assigned as the current state (S), the new action (A') is assigned as the current action (A), and the loop continues until the current state (S) is terminal.
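
In other words, the value written into the Q-table at every time step follows the SARSA update rule. A minimal sketch of a single update, assuming Q is a NumPy array indexed as Q[state, action] (the names here are illustrative, not taken from the book's code), looks like this:

def sarsa_update(Q, s, a, r, s_, a_, alpha, gamma):
    # Q(S, A) <- Q(S, A) + alpha * [R + gamma * Q(S', A') - Q(S, A)]
    Q[s, a] += alpha * (r + gamma * Q[s_, a_] - Q[s, a])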

Now, let's try to implement this in our analogy. Let's start the table with all the values as zeros:

State Value
(0,0) [0., 0., 0., 0., 0., 0.],
(0,1) [0., 0., 0., 0., 0., 0.],
(0,2) [0., 0., 0., 0., 0., 0.],
...,
(4,2) [0., 0., 0., 0., 0., 0.],
(4,3) [0., 0., 0., 0., 0., 0.],
(4,4) [0., 0., 0., 0., 0., 0.]]
Table with initial values
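
In code, this starting table is simply a zero-filled NumPy array with one row per state and one column per action. A sketch for the 5 x 5 grid with 6 actions used in this analogy (the flattening of grid positions into row indices is an assumption of the sketch):

import numpy

n_states, n_actions = 5 * 5, 6            # 25 grid positions, 6 possible actions
Q = numpy.zeros((n_states, n_actions))    # every Q(S, A) starts at zero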

Considering the same conditions we chose for Q-learning, our equation is given as follows:

Q(2,2;2) ← 0 + 0.4*[(-1)+0.999*(0)-0]
=> Q(2,2;2) ← -0.4
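
Plugging the same numbers into plain Python confirms the arithmetic:

alpha, gamma = 0.4, 0.999
q_sa, reward, q_next = 0.0, -1.0, 0.0      # current Q(S, A), reward R, and Q(S', A')
q_sa += alpha * (reward + gamma * q_next - q_sa)
print(q_sa)                                # -0.4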

Our table becomes the following:

State Value

(0,0) [0., 0., 0., 0., 0., 0.],
(0,1) [0., 0., 0., 0., 0., 0.],
(0,2) [0., 0., 0., 0., 0., 0.],
...,
(2,2) [0., 0., -0.4, 0., 0., 0.]
...,
(4,2) [0., 0., 0., 0., 0., 0.],
(4,3) [0., 0., 0., 0., 0., 0.],
(4,4) [0., 0., 0., 0., 0., 0.]]
Updated table

You might not see much difference from Q-learning in this single update, but the key distinction is that, instead of taking the max over the next state's Q-values, SARSA uses the epsilon-greedy policy to choose the next action as well (a sketch of the corresponding Q-learning update follows the table below). Because that choice is partly random, training can take longer or converge quickly, depending on which actions happen to be sampled. As before, the table keeps being updated step by step over the episodes (as already explained in the Off-policy learning – the Q-learning algorithm section). Our final table may look like this:

State Value
(0,0) [ 1. 1. 1. 1. 1. 1. ],
(0,1) [ 1.21294807 2.30485594 1.73831 2.84424473 9.01048181 -5.74954889],
(0,2) [ 3.32374208 -2.67730041 2.0805796 1.83409763 8.14755201 -7.0017296 ],
...,
(4,2) [ -0.94623903 10.93045652 -1.11443659 -1.1139482 -5.40064 -3.16039984],
(4,3) [ -6.75173275 2.75158375 -7.07323206 -7.49864668 -8.74536711 -11.97352065],
(4,4) [ -0.42404557 -0.35805959 -0.28086999 18.86740811 -5.40064 -5.56063984]]
Table with values after 3,000 episodes
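
To make the contrast with Q-learning concrete, here is what the corresponding Q-learning update would look like next to the sarsa_update sketch shown earlier; the only difference is that Q-learning bootstraps from the maximum Q-value of the next state rather than from the action A' that was actually sampled (the function name is illustrative):

import numpy

def q_learning_update(Q, s, a, r, s_, alpha, gamma):
    # Off-policy: uses the best possible next action (the max), regardless of
    # which action the behaviour policy actually takes next
    Q[s, a] += alpha * (r + gamma * numpy.max(Q[s_, :]) - Q[s, a])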

The preceding explanation is shown in the following code:

import gym
import numpy

alpha = 0.4       # learning rate
gamma = 0.999     # discount factor
epsilon = 0.9     # parameter of the epsilon-greedy action selection
episodes = 3000   # number of training episodes
max_steps = 2500  # maximum time steps per episode

def sarsa(alpha, gamma, epsilon, episodes, max_steps):
    env = gym.make('Taxi-v2')
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = numpy.zeros((n_states, n_actions))              # Q-table initialized with zeros
    timestep_reward = []
    for episode in range(episodes):
        print("Episode: ", episode)
        s = env.reset()
        # Choose the first action A with the epsilon-greedy helper
        # (assumed to be defined as in the Q-learning section; a sketch follows this listing)
        a = epsilon_greedy(Q, s, n_actions, epsilon)
        t = 0
        total_reward = 0
        done = False
        while t < max_steps:
            t += 1
            s_, reward, done, info = env.step(a)            # take A, observe R and S'
            total_reward += reward
            a_ = epsilon_greedy(Q, s_, n_actions, epsilon)  # choose A' from S' (on-policy)
            if done:
                # Terminal state: there is no Q(S', A') term in the update
                Q[s, a] += alpha * (reward - Q[s, a])
                break
            else:
                # SARSA update: Q(S,A) <- Q(S,A) + alpha*[R + gamma*Q(S',A') - Q(S,A)]
                Q[s, a] += alpha * (reward + (gamma * Q[s_, a_]) - Q[s, a])
            s, a = s_, a_                                   # S <- S', A <- A'
        timestep_reward.append(total_reward)
    return timestep_reward
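
The preceding code relies on an epsilon_greedy helper for action selection (the same helper used for Q-learning earlier). The call signature used above is an assumption; a minimal sketch of such a helper, treating epsilon as the probability of acting greedily (which is consistent with epsilon = 0.9), could look like this:

import numpy

def epsilon_greedy(Q, s, n_actions, epsilon):
    # With probability epsilon, exploit the current estimates by taking the
    # action with the highest Q-value for state s; otherwise explore randomly.
    if numpy.random.rand() < epsilon:
        return int(numpy.argmax(Q[s, :]))
    return numpy.random.randint(n_actions)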

You can change how the a_ variable is chosen; here it also uses the epsilon-greedy method to select the next action. Now that we have seen how both algorithms are represented, let's look at the preceding analogy in simulation. For this, we need to install the OpenAI Gym, NumPy, and pandas libraries.
