We have already learned a lot about the epsilon-greedy policy. Under this policy, we select the best arm (the one with the highest estimated value) with probability 1-epsilon, and we select an arm at random with probability epsilon.
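Written as a formula, the selection rule might look like the following. The notation is an assumption made here for illustration: Q(a) stands for the current estimated average reward of arm a, and a_t for the arm pulled in round t.

```latex
a_t =
\begin{cases}
\arg\max_{a} Q(a) & \text{with probability } 1 - \epsilon \\
\text{an arm chosen uniformly at random} & \text{with probability } \epsilon
\end{cases}
```

Note that the random draw can also land on the best arm, so with K arms the greedy arm is actually pulled with total probability 1 - epsilon + epsilon/K. For example, with 10 arms and epsilon = 0.5 this is 0.5 + 0.05 = 0.55.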
Now we will see how to select the best arm using the epsilon-greedy policy:
- First, let us initialize all variables:
# number of rounds (iterations)
num_rounds = 20000
# Number of times each arm was pulled
count = np.zeros(10)
# Sum of rewards of each arm
sum_rewards = np.zeros(10)
# Q value which is the average reward
Q = np.zeros(10)
- Now we define our epsilon_greedy function:
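def epsilon_greedy(epsilon):
    # Draw a uniform random number in [0, 1)
    rand = np.random.random()
    if rand < epsilon:
        # Explore: pick an arm at random
        action = env.action_space.sample()
    else:
        # Exploit: pick the arm with the highest Q value
        action = np.argmax(Q)
    return action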
- Start pulling the arm:
for i in range(num_rounds):
    # Select the arm using epsilon-greedy
    arm = epsilon_greedy(0.5)
    # Get the reward
    observation, reward, done, info = env.step(arm)
    # Update the count of that arm
    count[arm] += 1
    # Sum the rewards obtained from the arm
    sum_rewards[arm] += reward
    # Calculate the Q value, which is the average reward of the arm
    Q[arm] = sum_rewards[arm] / count[arm]
print('The optimal arm is {}'.format(np.argmax(Q)))
The following is the output:
The optimal arm is 3
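To experiment with the same loop without a bandit environment installed, the steps above can be sketched in a self-contained form. This is only an illustrative sketch under stated assumptions: the function run_epsilon_greedy, the simulated Gaussian rewards, and the values in true_means are hypothetical stand-ins for env and are not taken from the environment used above. The sketch also updates Q with an incremental mean, which gives the same value as sum_rewards[arm]/count[arm] without storing the sums.

```python
import numpy as np


def run_epsilon_greedy(true_means, epsilon=0.5, num_rounds=20000, seed=0):
    """Run epsilon-greedy on a simulated Gaussian bandit (hypothetical stand-in for env)."""
    rng = np.random.default_rng(seed)
    num_arms = len(true_means)
    count = np.zeros(num_arms)  # number of times each arm was pulled
    Q = np.zeros(num_arms)      # running average reward of each arm

    for _ in range(num_rounds):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random
            arm = int(rng.integers(num_arms))
        else:
            # Exploit: pick the arm with the highest estimate so far
            arm = int(np.argmax(Q))

        # Simulated reward: Gaussian noise around the arm's true mean (assumption)
        reward = rng.normal(true_means[arm], 1.0)

        count[arm] += 1
        # Incremental mean, equivalent to sum_rewards[arm] / count[arm]
        Q[arm] += (reward - Q[arm]) / count[arm]

    return Q, count


# Made-up true mean rewards; arm 3 is the best by construction
true_means = np.array([0.1, 0.5, 0.2, 0.9, 0.3])
Q, count = run_epsilon_greedy(true_means)
print('The estimated optimal arm is {}'.format(np.argmax(Q)))
```

Because the arms here are simulated with known means, the estimate can be checked against the arm that is best by construction. Lowering epsilon makes the loop pull the current best arm more often, at the cost of less accurate estimates for the other arms.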