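Softmax exploration, also known as Boltzmann exploration, is another strategy for finding the optimal arm. The epsilon-greedy policy treats all non-best arms equally when it explores, whereas softmax exploration selects each arm with a probability taken from the Boltzmann distribution over the current value estimates. The probability of selecting arm $a$ at time $t$ is given by:

$$P_t(a) = \frac{\exp\left(Q_t(a)/\tau\right)}{\sum_{i=1}^{n} \exp\left(Q_t(i)/\tau\right)}$$

Here $Q_t(a)$ is the current estimate of the average reward of arm $a$, $n$ is the number of arms, and $\tau$ is called the temperature factor, which controls how much random exploration takes place. When $\tau$ is high, all arms are explored almost equally; when $\tau$ is low, arms with high estimated rewards are chosen far more often. Look at the following steps: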
- First, initialize the variables:
import math
import random
import numpy as np

# number of rounds (iterations)
num_rounds = 20000
# count of the number of times each arm was pulled
count = np.zeros(10)
# sum of the rewards of each arm
sum_rewards = np.zeros(10)
# Q value, which is the average reward of each arm
Q = np.zeros(10)
- Now we define the softmax function:
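def softmax(tau):
    # Boltzmann probabilities computed from the current Q estimates
    total = sum([math.exp(val/tau) for val in Q])
    probs = [math.exp(val/tau)/total for val in Q]
    # Sample an arm by walking the cumulative distribution
    threshold = random.random()
    cumulative_prob = 0.0
    for i in range(len(probs)):
        cumulative_prob += probs[i]
        if (cumulative_prob > threshold):
            return i
    # Fallback in case rounding keeps the cumulative sum below the threshold
    return np.argmax(probs)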
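To get a feel for how the temperature shapes the distribution that the function above samples from, the following minimal sketch computes the Boltzmann probabilities for a few values of the temperature. The helper name `boltzmann_probs` and the three Q values are illustrative assumptions and are not part of the original example; the helper also subtracts the maximum Q value before exponentiating, which leaves the probabilities unchanged but guards against overflow.

```python
import numpy as np

def boltzmann_probs(q_values, tau):
    """Return the Boltzmann (softmax) selection probabilities for the given Q values."""
    q = np.asarray(q_values, dtype=float)
    # Shifting by the maximum does not change the result but avoids overflow in exp
    z = (q - q.max()) / tau
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical Q estimates for a three-armed bandit (illustrative values only)
example_q = [1.0, 2.0, 3.0]

for tau in (0.1, 1.0, 100.0):
    print(tau, np.round(boltzmann_probs(example_q, tau), 3))
```

With the low temperature, almost all of the probability mass sits on the arm with the highest estimate; with the very high temperature, the three probabilities are each close to 1/3.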
- Start pulling the arm:
for i in range(num_rounds):
    # Select the arm using softmax
    arm = softmax(0.5)
    # Get the reward
    observation, reward, done, info = env.step(arm)
    # Update the count of that arm
    count[arm] += 1
    # Sum the rewards obtained from the arm
    sum_rewards[arm] += reward
    # Calculate the Q value, which is the average reward of the arm
    Q[arm] = sum_rewards[arm]/count[arm]

print('The optimal arm is {}'.format(np.argmax(Q)))
The following is the output:
The optimal arm is 3
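The loop above depends on an `env` object created elsewhere. For readers who want to try the procedure without that environment, here is a self-contained sketch under stated assumptions: the environment is replaced by a hypothetical 10-armed Gaussian bandit whose mean rewards (`true_means`) are made up for illustration, and the hand-written cumulative sampling is swapped for NumPy's `Generator.choice`. It is not the environment used in the text, so its result should not be read as reproducing the output shown above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical stand-in for env: assumed mean reward of each arm (unit-variance noise).
# Arm 3 is given the highest mean purely for illustration.
true_means = np.array([0.1, 0.5, -0.3, 1.5, 0.2, -0.8, 0.4, 0.0, 0.6, -0.1])

def pull(arm):
    """Draw a noisy reward for the chosen arm."""
    return rng.normal(true_means[arm], 1.0)

def softmax_probs(q, tau):
    """Boltzmann probabilities, shifted by max(q) for numerical stability."""
    z = (q - q.max()) / tau
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

num_rounds = 20000
tau = 0.5
count = np.zeros(10)        # times each arm was pulled
sum_rewards = np.zeros(10)  # total reward collected from each arm
Q = np.zeros(10)            # running average reward of each arm

for _ in range(num_rounds):
    # Sample an arm in proportion to its Boltzmann probability
    arm = rng.choice(10, p=softmax_probs(Q, tau))
    reward = pull(arm)
    count[arm] += 1
    sum_rewards[arm] += reward
    Q[arm] = sum_rewards[arm] / count[arm]

best_arm = int(np.argmax(Q))
print('Estimated best arm:', best_arm)
```

Because arm 3 has by far the highest assumed mean in this sketch, the estimate is expected to settle on it; with closer means or a higher temperature, more rounds may be needed before the estimates separate.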