On-policy learning – the SARSA algorithm

SARSA (short for State, Action, Reward, State, Action) is an on-policy learning technique. There is one small but important difference between SARSA and Q-learning. In Q-learning, we choose an action with the epsilon-greedy policy, but when computing the target for the Q-value update, we use the maximum Q-value over all actions available in the next state. In SARSA, we instead apply the epsilon-greedy policy once more to choose the next action, and it is that chosen action's Q-value that appears in the update; there is no max operation. The SARSA update equation is given as follows:

Q(S, A) ← Q(S, A) + α[R + γQ(S', A') − Q(S, A)]

The preceding equation is similar to the TD prediction equation we saw earlier. As indicated, the only difference is that instead of taking the max over next-state actions, we use the next action A' actually selected by the current epsilon-greedy policy; this is what makes SARSA on-policy. The SARSA algorithm is given as follows:

Initialize Q(s, a), for all s ∈ S, a ∈ A(s), arbitrarily, and Q(Terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Choose A from S using policy derived from Q (e.g., e-greedy)
    Repeat (for each step of episode):
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., e-greedy)
        Q(S, A) ← Q(S, A) + α[R + γQ(S', A') − Q(S, A)]
        S ← S'; A ← A'
    until S is Terminal
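
The loop above maps directly to code. The following is a minimal sketch of tabular SARSA in Python, assuming a Gymnasium environment with discrete state and action spaces; the environment name (FrozenLake-v1), the hyperparameter values, and the epsilon_greedy helper are illustrative assumptions, not specifics from the text.

import numpy as np
import gymnasium as gym

# Illustrative environment and hyperparameters (assumed for this sketch)
env = gym.make("FrozenLake-v1", is_slippery=False)
n_states = env.observation_space.n
n_actions = env.action_space.n

alpha = 0.1       # learning rate
gamma = 0.99      # discount factor
epsilon = 0.1     # exploration rate for the epsilon-greedy policy
n_episodes = 5000

# Initialize Q(s, a) arbitrarily (zeros here); terminal states keep Q = 0
# because the episode ends there and their entries are never updated.
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(state):
    """Choose an action from the policy derived from Q (e-greedy)."""
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(Q[state]))

for episode in range(n_episodes):
    state, _ = env.reset()
    action = epsilon_greedy(state)                  # Choose A from S
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_action = epsilon_greedy(next_state)    # Choose A' from S' (on-policy)
        # SARSA update: the target uses the action actually chosen,
        # not max over Q(S', a) as in Q-learning.
        td_target = reward + gamma * Q[next_state, next_action]
        Q[state, action] += alpha * (td_target - Q[state, action])
        state, action = next_state, next_action     # S <- S'; A <- A'

Because the same epsilon-greedy policy both generates the behavior and supplies A' in the target, SARSA evaluates and improves the policy it is actually following, which is the defining property of an on-policy method.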
