The softmax exploration algorithm

Softmax exploration, also known as Boltzmann exploration, is another strategy for finding the optimal arm. In the epsilon-greedy policy, we treat all of the non-best arms equivalently, but in softmax exploration, we select an arm with a probability drawn from the Boltzmann distribution. The probability of selecting arm a is given by:

$$P_t(a) = \frac{\exp(Q_t(a)/\tau)}{\sum_{i=1}^{n} \exp(Q_t(i)/\tau)}$$

Here, τ is called the temperature factor, and it controls how randomly we explore the arms. When τ is high, all arms are explored almost equally, but when τ is low, high-rewarding arms are chosen with high probability. The short sketch below illustrates this effect of τ; after it, look at the following steps:
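For intuition, the following is a minimal sketch showing how τ changes the selection probabilities; the Q values and the boltzmann_probs helper are hypothetical and used only for illustration:

import numpy as np

def boltzmann_probs(Q, tau):
    # Scale each Q value by the temperature, exponentiate, and normalize
    exp_q = np.exp(np.array(Q) / tau)
    return exp_q / exp_q.sum()

Q_example = [1.0, 1.5, 0.5]  # hypothetical average rewards of three arms

# High temperature: the probabilities are nearly uniform (more exploration)
print(boltzmann_probs(Q_example, tau=10.0))  # roughly [0.33, 0.35, 0.32]

# Low temperature: almost all probability mass goes to the best arm
print(boltzmann_probs(Q_example, tau=0.1))   # roughly [0.007, 0.993, 0.000]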

  1. First, initialize the variables:

import math
import random
import numpy as np

# number of rounds (iterations)
num_rounds = 20000

# Count of the number of times an arm was pulled
count = np.zeros(10)

# Sum of rewards of each arm
sum_rewards = np.zeros(10)

# Q value, which is the average reward of an arm
Q = np.zeros(10)
  2. Now, we define the softmax function:

def softmax(tau):

    # Compute the Boltzmann probability of each arm from its Q value
    total = sum([math.exp(val/tau) for val in Q])
    probs = [math.exp(val/tau)/total for val in Q]

    # Sample an arm according to these probabilities
    threshold = random.random()
    cumulative_prob = 0.0
    for i in range(len(probs)):
        cumulative_prob += probs[i]
        if (cumulative_prob > threshold):
            return i
    return np.argmax(probs)
  3. Start pulling the arm:

for i in range(num_rounds):

    # Select the arm using softmax
    arm = softmax(0.5)

    # Get the reward by pulling the arm (env is the bandit environment
    # created earlier; see the sketch after the output)
    observation, reward, done, info = env.step(arm)

    # Update the count of that arm
    count[arm] += 1

    # Sum the rewards obtained from the arm
    sum_rewards[arm] += reward

    # Calculate the Q value, which is the average reward of the arm
    Q[arm] = sum_rewards[arm]/count[arm]

print('The optimal arm is {}'.format(np.argmax(Q)))

The following is the output:

The optimal arm is 3
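
Note that the preceding steps assume that env, the ten-armed bandit environment, was already created earlier in the chapter. If you are running this section on its own, the following is a minimal sketch of one way to create such an environment; it assumes the gym and gym_bandits packages and the BanditTenArmedGaussian-v0 environment ID, none of which are set up in this section:

import gym
import gym_bandits  # importing this package registers the bandit environments with gym

# Create a ten-armed bandit environment with Gaussian rewards
env = gym.make("BanditTenArmedGaussian-v0")
env.reset()

With env created, the three steps above run as written and print the index of the arm with the highest average reward.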