The softmax exploration algorithm

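Softmax exploration, also known as Boltzmann exploration, is another strategy for finding the optimal arm. In the epsilon-greedy policy, all of the non-best arms are treated as equivalent during exploration, but in softmax exploration, we select an arm with a probability given by the Boltzmann distribution, so arms with a higher estimated value are chosen more often. The probability of selecting arm $a$ at round $t$ is given by:

$$P_t(a) = \frac{\exp(Q_t(a)/\tau)}{\sum_{i=1}^{n} \exp(Q_t(i)/\tau)}$$

Here $Q_t(a)$ is the current average reward of arm $a$, $n$ is the number of arms, and $\tau$ is called the temperature factor, which controls how much random exploration takes place. When $\tau$ is high, all arms are explored almost equally, but when $\tau$ is low, high-rewarding arms are chosen most of the time. Look at the following steps: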
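To make the effect of the temperature concrete, here is a minimal sketch that prints the selection probabilities for three arms at a high, a medium, and a low temperature. The helper name `boltzmann_probs` and the Q values are made up for illustration. Subtracting the largest Q value before exponentiating is an added safeguard against overflow and does not change the resulting probabilities.

```python
import math

def boltzmann_probs(q_values, tau):
    """Return Boltzmann (softmax) selection probabilities for the given Q values."""
    # Shifting by the max leaves the ratios unchanged but keeps exp() from overflowing
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical average rewards for three arms
q_values = [1.0, 2.0, 3.0]

# High temperature: close to uniform. Low temperature: close to greedy.
for tau in (100.0, 1.0, 0.1):
    probs = boltzmann_probs(q_values, tau)
    print(tau, [round(p, 3) for p in probs])
```

At the highest temperature the three probabilities are nearly equal, while at the lowest almost all of the probability mass sits on the arm with the largest Q value.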

First, initialize the variables:

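import math
import random

import numpy as np

# Assumes env, a 10-armed bandit environment, has already been created

# Number of rounds (iterations)
num_rounds = 20000

# Number of times each arm was pulled
count = np.zeros(10)

# Sum of rewards of each arm
sum_rewards = np.zeros(10)

# Q value of each arm, which is its average reward
Q = np.zeros(10)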

Now we define the softmax function:

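def softmax(tau):
    # Boltzmann probability of each arm, computed from the current Q values
    total = sum([math.exp(val/tau) for val in Q])
    probs = [math.exp(val/tau)/total for val in Q]

    # Sample an arm: walk the cumulative distribution until it passes a uniform random threshold
    threshold = random.random()
    cumulative_prob = 0.0
    for i in range(len(probs)):
        cumulative_prob += probs[i]
        if cumulative_prob > threshold:
            return i

    # Fallback in case floating-point rounding keeps the cumulative sum below the threshold
    return np.argmax(probs)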
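The cumulative-probability loop above is one way to sample from a discrete distribution. As an alternative sketch, not the version used in this section, the same draw can be delegated to `random.choices` from the standard library, which accepts unnormalized weights. The name `select_arm`, the explicit `q_values` argument (instead of the global `Q`), and the sample Q values are illustrative assumptions.

```python
import math
import random
from collections import Counter

def select_arm(q_values, tau, rng=random):
    """Sample an arm index with probability proportional to exp(Q/tau)."""
    # Shift by the max so that large Q values cannot overflow exp()
    q_max = max(q_values)
    # Unnormalized weights; random.choices normalizes them internally
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]

rng = random.Random(0)      # seeded only to make the sketch repeatable
q_values = [0.2, 0.5, 1.5]  # hypothetical Q estimates

# Empirical selection frequencies over many draws
draws = Counter(select_arm(q_values, 0.5, rng) for _ in range(10000))
print({arm: n / 10000 for arm, n in sorted(draws.items())})
```

Passing the Q values in as an argument keeps the function free of global state, which makes it easier to test in isolation.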

Now start pulling the arms:

for i in range(num_rounds):
    
    # Select the arm using softmax
    arm = softmax(0.5)
    
    # Get the reward
    observation, reward, done, info = env.step(arm) 
    
    # update the count of that arm
    count[arm] += 1
    
    # Sum the rewards obtained from the arm
    sum_rewards[arm] += reward

    # Calculate the Q value, which is the average reward of the arm
    Q[arm] = sum_rewards[arm]/count[arm]

print('The optimal arm is {}'.format(np.argmax(Q)))

The following is the output:

The optimal arm is 3
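The loop above depends on an `env` object that is not defined in this section. For readers who want to try the algorithm without that environment, the following is a self-contained sketch that swaps in a simulated bandit with Gaussian rewards. The function name `run_softmax_bandit`, the true mean rewards, the noise level, and the seed are all hypothetical choices made for illustration and are not taken from the environment used above.

```python
import math
import random

def run_softmax_bandit(true_means, tau=0.5, num_rounds=5000, noise_std=0.5, seed=0):
    """Run softmax exploration on a simulated Gaussian bandit (illustrative only)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    count = [0] * n_arms          # number of times each arm was pulled
    sum_rewards = [0.0] * n_arms  # sum of rewards of each arm
    Q = [0.0] * n_arms            # average reward of each arm

    for _ in range(num_rounds):
        # Select an arm with probability proportional to exp(Q/tau)
        q_max = max(Q)
        weights = [math.exp((q - q_max) / tau) for q in Q]
        arm = rng.choices(range(n_arms), weights=weights, k=1)[0]

        # Simulated reward: the arm's true mean plus Gaussian noise
        reward = rng.gauss(true_means[arm], noise_std)

        # Update the running statistics of the pulled arm
        count[arm] += 1
        sum_rewards[arm] += reward
        Q[arm] = sum_rewards[arm] / count[arm]

    return Q, count

# Hypothetical true mean rewards; the arm at index 3 is the best one
true_means = [0.0, 0.5, 0.2, 2.0, 0.8]
Q, count = run_softmax_bandit(true_means)
best = max(range(len(Q)), key=Q.__getitem__)
print('Estimated best arm: {}'.format(best))
print('Pull counts: {}'.format(count))
```

The pull counts show how the exploration budget is spent: the arm with the highest estimated reward receives most of the pulls, while the remaining arms are still sampled occasionally.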
