On-policy learning – the SARSA algorithm

SARSA (short for State, Action, Reward, State, Action) is an on-policy learning technique. There is one small but important difference between SARSA and Q-learning. In Q-learning, we choose an action with the epsilon-greedy policy, but when computing the target for the Q-value update, we use the maximum Q-value over all actions available in the next state. In SARSA, we instead apply the epsilon-greedy policy once more to choose the next action, and it is that chosen action's Q-value that appears in the update; there is no max operation. The SARSA update equation is given as follows:

Q(S, A) ← Q(S, A) + α[R + γQ(S', A') − Q(S, A)]

The preceding equation is similar to the TD prediction equation we saw earlier. As indicated, the only difference is that instead of taking the max over next-state actions, we use the next action A' actually selected by the current epsilon-greedy policy; this is what makes SARSA on-policy. The SARSA algorithm is given as follows:

Initialize Q(s, a), for all s ∈ S, a ∈ A(s), arbitrarily, and Q(Terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Choose A from S using policy derived from Q (e.g., e-greedy)
    Repeat (for each step of episode):
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., e-greedy)
        Q(S, A) ← Q(S, A) + α[R + γQ(S', A') − Q(S, A)]
        S ← S'; A ← A'
    until S is Terminal
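
The loop above maps directly to code. The following is a minimal sketch of tabular SARSA in Python, assuming a Gymnasium environment with discrete state and action spaces; the environment name (FrozenLake-v1), the hyperparameter values, and the epsilon_greedy helper are illustrative assumptions, not specifics from the text.

import numpy as np
import gymnasium as gym

# Illustrative environment and hyperparameters (assumed for this sketch)
env = gym.make("FrozenLake-v1", is_slippery=False)
n_states = env.observation_space.n
n_actions = env.action_space.n

alpha = 0.1       # learning rate
gamma = 0.99      # discount factor
epsilon = 0.1     # exploration rate for the epsilon-greedy policy
n_episodes = 5000

# Initialize Q(s, a) arbitrarily (zeros here); terminal states keep Q = 0
# because the episode ends there and their entries are never updated.
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(state):
    """Choose an action from the policy derived from Q (e-greedy)."""
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(Q[state]))

for episode in range(n_episodes):
    state, _ = env.reset()
    action = epsilon_greedy(state)                  # Choose A from S
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_action = epsilon_greedy(next_state)    # Choose A' from S' (on-policy)
        # SARSA update: the target uses the action actually chosen,
        # not max over Q(S', a) as in Q-learning.
        td_target = reward + gamma * Q[next_state, next_action]
        Q[state, action] += alpha * (td_target - Q[state, action])
        state, action = next_state, next_action     # S <- S'; A <- A'

Because the same epsilon-greedy policy both generates the behavior and supplies A' in the target, SARSA evaluates and improves the policy it is actually following, which is the defining property of an on-policy method.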
