The epsilon-greedy policy

We have already learned a lot about the epsilon-greedy policy. In the epsilon-greedy policy, we either select the best arm with probability 1-epsilon or select an arm at random with probability epsilon.
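As a minimal sketch of this rule (not the implementation used later in this section), the following self-contained function picks an arm from an array of estimated values. The names select_arm, q_values, and rng, and the example estimates, are illustrative assumptions. Because the random branch samples uniformly over all arms, the best arm can also be drawn while exploring, so with k arms it is selected with an overall probability of 1 - epsilon + epsilon/k.

```python
import numpy as np

def select_arm(q_values, epsilon, rng):
    """Return an arm index chosen by the epsilon-greedy rule."""
    if rng.random() < epsilon:
        # Explore: pick any arm uniformly at random
        return int(rng.integers(len(q_values)))
    # Exploit: pick the arm with the highest estimated value
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, 0.2, 0.9])  # made-up estimates, for illustration only
picks = [select_arm(q_values, 0.2, rng) for _ in range(10000)]

# Fraction of rounds in which the best arm (index 3) was chosen
best_fraction = picks.count(3) / len(picks)
print(best_fraction)  # should be close to 1 - 0.2 + 0.2/4 = 0.85
```

With epsilon set to 0.2 and four arms, the best arm should be chosen in roughly 85% of the rounds, not 80%.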

Now we will see how to find the best arm using the epsilon-greedy policy.

First, let us initialize all variables:

import numpy as np

# Number of rounds (iterations)
num_rounds = 20000

# Number of times each arm was pulled
count = np.zeros(10)

# Sum of the rewards obtained from each arm
sum_rewards = np.zeros(10)

# Q value of each arm, which is its average reward
Q = np.zeros(10)

Now we define our epsilon_greedy function:

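def epsilon_greedy(epsilon):
    # Uses the global Q array and the bandit environment env

    rand = np.random.random()
    if rand < epsilon:
        # Explore: pick a random arm
        action = env.action_space.sample()
    else:
        # Exploit: pick the arm with the highest Q value
        action = np.argmax(Q)

    return action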
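One detail of the greedy branch is worth a quick sketch. np.argmax returns the first index when several values are tied, so while every Q value is still zero the greedy choice is always arm 0. The helper below, whose name argmax_random_tie is a hypothetical one introduced only for illustration, breaks ties at random instead. It is an optional variation, not part of the function above.

```python
import numpy as np

def argmax_random_tie(q_values, rng):
    """Index of the maximum value, chosen at random among tied entries."""
    q_values = np.asarray(q_values)
    # Indices of all entries equal to the maximum
    best = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best))

rng = np.random.default_rng(0)
q_start = np.zeros(10)  # same starting point as Q: all estimates are zero

first_index = int(np.argmax(q_start))  # np.argmax picks the first of the tied entries
tie_picks = {argmax_random_tie(q_start, rng) for _ in range(1000)}

print(first_index)
print(sorted(tie_picks))
```

With a large epsilon the random branch already spreads the early pulls across arms, so this matters mostly when epsilon is small.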

Now we start pulling the arms:

for i in range(num_rounds):

    # Select the arm using epsilon-greedy
    arm = epsilon_greedy(0.5)

    # Get the reward (assumes the classic Gym API, where step returns four values)
    observation, reward, done, info = env.step(arm)

    # Update the count of that arm
    count[arm] += 1

    # Sum the rewards obtained from the arm
    sum_rewards[arm] += reward

    # Calculate the Q value, which is the average reward of the arm
    Q[arm] = sum_rewards[arm] / count[arm]

print('The optimal arm is {}'.format(np.argmax(Q)))

The following is the output:

The optimal arm is 3
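The loop above depends on the env object created earlier. As a rough, self-contained sketch of the same procedure, the code below replaces the environment with a simulated bandit. The function name run_epsilon_greedy, the Gaussian reward model with unit standard deviation, and the true_means values are all assumptions made for illustration and are not taken from the environment used in this section. It also updates Q incrementally, which gives the same result as dividing sum_rewards by count.

```python
import numpy as np

def run_epsilon_greedy(true_means, epsilon, num_rounds, seed=0):
    """Run epsilon-greedy on a simulated Gaussian bandit; return Q and pull counts."""
    rng = np.random.default_rng(seed)
    num_arms = len(true_means)
    count = np.zeros(num_arms)
    Q = np.zeros(num_arms)

    for _ in range(num_rounds):
        if rng.random() < epsilon:
            # Explore: pick a random arm
            arm = int(rng.integers(num_arms))
        else:
            # Exploit: pick the arm with the highest estimate
            arm = int(np.argmax(Q))

        # Simulated reward: Gaussian noise around the arm's true mean (assumption)
        reward = rng.normal(true_means[arm], 1.0)

        count[arm] += 1
        # Incremental mean: equivalent to sum_rewards[arm] / count[arm]
        Q[arm] += (reward - Q[arm]) / count[arm]

    return Q, count

# Hypothetical true mean reward of each arm; arm 2 is the best one
true_means = np.array([0.2, 0.5, 1.5, 0.8, 0.1])
Q, count = run_epsilon_greedy(true_means, epsilon=0.5, num_rounds=20000)

print('Estimated best arm: {}'.format(np.argmax(Q)))
print('Pulls per arm: {}'.format(count))
```

Because the gap between the best arm and the others is large compared with the noise, the estimated best arm should match the arm with the highest true mean. With closer means or fewer rounds, the estimate can differ from run to run.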
