SARSA (you can probably guess how it works from the name: State-Action-Reward-State-Action) works like this:
- The agent starts at state 1
- It then performs action 1 and gets reward 1
- Next, it moves on to state 2, performs action 2, and gets reward 2
- Then, the agent goes back and updates the value of action 1, using reward 1 and the value of action 2
As you can see, the difference between the two algorithms lies in how the future reward is estimated. Q-learning uses the value of the best action available from state 2, while SARSA uses the value of the action the agent actually takes.
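The contrast above can be sketched in code. This is a minimal illustration, not a full agent: the table `Q`, the learning rate `alpha`, the discount factor `gamma`, and the function names are all hypothetical, chosen just to make the two update rules easy to compare side by side.

```python
# Hypothetical setup: a small Q-table mapping (state, action) -> value,
# with assumed learning rate and discount factor.
alpha, gamma = 0.1, 0.9
actions = [0, 1]
Q = {(s, a): 0.0 for s in range(3) for a in actions}

def q_learning_update(s, a, r, s_next):
    # Q-learning: bootstrap from the BEST action available in s_next,
    # regardless of which action the agent will actually take there.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    # SARSA: bootstrap from the action the agent ACTUALLY takes in s_next.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

# One step of the story above: in state 1 the agent performs action 1,
# gets reward 1, lands in state 2, and picks its next action there
# before going back to update the value of action 1.
sarsa_update(s=1, a=1, r=1.0, s_next=2, a_next=0)
```

The only difference between the two functions is the bootstrap term: `max(...)` over all actions for Q-learning versus `Q[(s_next, a_next)]` for SARSA, which is exactly why Q-learning is called off-policy and SARSA on-policy.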
Here is the mathematical intuition for SARSA: