The epsilon-greedy policy

We have already learned a lot about the epsilon-greedy policy. With the epsilon-greedy policy, we either select the best arm with probability 1-epsilon, or we select an arm at random with probability epsilon.
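In equation form, with Q(a) denoting the current estimate of the average reward of arm a, the selection rule can be written as:

$$
a_t =
\begin{cases}
\arg\max_a Q(a) & \text{with probability } 1 - \epsilon \\
\text{a random arm} & \text{with probability } \epsilon
\end{cases}
$$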

Now we will see how to select the best arm using the epsilon-greedy policy:

  1. First, let us initialize all variables:
import numpy as np

# Number of rounds (iterations)
num_rounds = 20000

# Count of the number of times each arm was pulled
count = np.zeros(10)

# Sum of rewards obtained from each arm
sum_rewards = np.zeros(10)

# Q value, which is the average reward of each arm
Q = np.zeros(10)
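The snippets in this section also assume a ten-armed bandit environment called env that was created earlier in the chapter. If you want to run the code on its own, a minimal stand-in with the same interface (action_space.sample() and step()) might look like the following sketch; the Gaussian reward means used here are made up purely for illustration:

class TenArmedBandit:
    """Minimal stand-in for the bandit environment used in this chapter."""

    class _ActionSpace:
        def __init__(self, n):
            self.n = n

        def sample(self):
            # Return a random arm index
            return np.random.randint(self.n)

    def __init__(self, n_arms=10):
        self.action_space = self._ActionSpace(n_arms)
        # Hypothetical true mean reward of each arm (for illustration only)
        self.means = np.random.normal(0, 1, n_arms)

    def step(self, arm):
        # Reward drawn from a Gaussian centred on the arm's true mean
        reward = np.random.normal(self.means[arm], 1)
        # Mimic the (observation, reward, done, info) interface of env.step()
        return 0, reward, False, {}

env = TenArmedBandit()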
  2. Now we define our epsilon_greedy function:
def epsilon_greedy(epsilon):

    # Generate a random number between 0 and 1
    rand = np.random.random()

    # If the random number is less than epsilon, explore: pick a random arm
    if rand < epsilon:
        action = env.action_space.sample()

    # Otherwise, exploit: pick the arm with the highest average reward so far
    else:
        action = np.argmax(Q)

    return action
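As a quick sanity check (using either the chapter's environment or the stand-in sketched above), the two extremes of epsilon behave as you would expect:

# epsilon = 1.0: always explore, so a random arm is returned each time
print(epsilon_greedy(1.0))

# epsilon = 0.0: always exploit, so np.argmax(Q) is returned
# (arm 0 while all Q values are still zero, since argmax breaks ties by index)
print(epsilon_greedy(0.0))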
  3. Start pulling the arms:
for i in range(num_rounds):

    # Select an arm using the epsilon-greedy policy
    arm = epsilon_greedy(0.5)

    # Pull the arm and get the reward
    observation, reward, done, info = env.step(arm)

    # Update the count of that arm
    count[arm] += 1

    # Sum the rewards obtained from the arm
    sum_rewards[arm] += reward

    # Compute the Q value, which is the average reward of the arm
    Q[arm] = sum_rewards[arm] / count[arm]

print('The optimal arm is {}'.format(np.argmax(Q)))

The following is the output:

The optimal arm is 3
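After the loop finishes, it can also be useful to inspect the estimates and the pull counts themselves, for example:

# Estimated average reward of each arm
print(Q)

# Number of times each arm was pulled
print(count)

# Average reward obtained per round
print(sum_rewards.sum() / num_rounds)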