The upper confidence bound algorithm

With epsilon-greedy and softmax exploration, we select random actions with some probability. Random exploration is useful for trying out different arms, but it can also waste pulls on actions that never give a good reward. At the same time, we don't want to miss out on arms that are actually good but happened to give poor rewards in the initial rounds. To address both concerns, we use a new algorithm called the upper confidence bound (UCB). It is based on a principle called optimism in the face of uncertainty.

The UCB algorithm helps us select the best arm based on a confidence interval. Okay, what is a confidence interval? Let us say we have two arms. We pull both of these arms once and find that arm one gives us a reward of 0.3 and arm two gives us a reward of 0.8. But after a single pull of each arm, we should not conclude that arm two is the better arm. We have to pull the arms several times, take the mean of the rewards obtained by each arm, and select the arm whose mean is highest. But how can we find the correct mean value for each of these arms? This is where the confidence interval comes into the picture. The confidence interval specifies the interval within which the true mean reward of an arm lies. If the confidence interval of arm one is [0.2, 0.9], it implies that the mean reward of arm one lies somewhere between 0.2 and 0.9. Here, 0.2 is called the lower confidence bound and 0.9 is called the upper confidence bound (UCB). The UCB algorithm selects for exploration the arm that has the highest upper confidence bound.
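To make the idea concrete, here is a minimal sketch (my own illustration, not from the text) that simulates repeatedly pulling a single arm with Bernoulli rewards and prints a Hoeffding-style confidence interval around the sample mean. The true mean true_mean and the reward model are illustrative assumptions:

import numpy as np

# Simulate pulling one arm and watch its confidence interval shrink.
# We assume rewards in [0, 1] and use the bound sqrt(2 * log(n) / n),
# matching the UCB formula used later in this section (with t = n here,
# since we only track a single arm).
rng = np.random.default_rng(0)
true_mean = 0.55  # unknown in practice; fixed here only for the simulation

for n in [10, 100, 1000]:
    rewards = rng.uniform(0, 1, size=n) < true_mean  # Bernoulli rewards
    mean = rewards.mean()
    bound = np.sqrt(2 * np.log(n) / n)
    print(f"pulls={n:5d}  interval=[{mean - bound:.2f}, {mean + bound:.2f}]")

The more we pull the arm, the smaller the bound term gets, so the interval tightens around the arm's true mean.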

Let us say we have three slot machines and we have played each of the slot machines ten times. The confidence intervals of these three slot machines are shown in the following diagram:

[Figure: confidence intervals of the three slot machines after ten pulls each]

We can see that slot machine 3 has the highest UCB. But we should not conclude that slot machine 3 will give us a good reward after only ten pulls. As we pull the arms more times, the confidence intervals become more accurate: over time, each interval narrows and shrinks toward the arm's actual mean reward, as shown in the next diagram. At that point, we can select slot machine 2, which now has the highest UCB:

[Figure: narrowed confidence intervals after many pulls]

The idea behind UCB is very simple:

  1. Select the action (arm) that has the highest sum of average reward and upper confidence bound
  2. Pull the arm and receive a reward
  3. Update the arm's reward and confidence bound

But how do we calculate UCB?

We can calculate UCB using the formula $\sqrt{\frac{2 \log t}{N(a)}}$, where N(a) is the number of times arm a was pulled and t is the total number of rounds.
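As a quick worked example (my own, not from the text), we can plug in some numbers to see how this bound behaves. After t = 100 total rounds, an arm that has been pulled only once gets a much larger bound than an arm pulled 50 times, which is exactly the optimism that pushes UCB to explore uncertain arms:

import math

# Exploration bonus sqrt(2 * log(t) / N(a)) after t = 100 total rounds.
t = 100
for n_a in [1, 10, 50]:
    bonus = math.sqrt((2 * math.log(t)) / n_a)
    print(f"N(a)={n_a:3d}  bonus={bonus:.3f}")

# Prints approximately:
# N(a)=  1  bonus=3.035
# N(a)= 10  bonus=0.960
# N(a)= 50  bonus=0.429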

So, in UCB, we select an arm with the following formula:

$$\text{arm} = \arg\max_a \left( Q(a) + \sqrt{\frac{2 \log t}{N(a)}} \right)$$

That is, we pick the arm whose average reward plus exploration bonus is highest.
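The code that follows assumes a ten-armed bandit environment already bound to env. One way to create such an environment (an assumption on my part; your setup may differ) is with the gym_bandits package:

import gym
import gym_bandits  # registers the bandit environments with gym (assumed installed)

# Assumed setup: a ten-armed Gaussian bandit environment from gym_bandits.
# Substitute whatever ten-armed bandit environment you are actually using.
env = gym.make("BanditTenArmedGaussian-v0")
env.reset()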

First, initialize the variables:

import math
import numpy as np

# number of rounds (iterations)
num_rounds = 20000

# count of the number of times each arm was pulled
count = np.zeros(10)

# sum of rewards of each arm
sum_rewards = np.zeros(10)

# Q value, which is the average reward of each arm
Q = np.zeros(10)

Now, let us define our UCB function:

def UCB(iters):

    ucb = np.zeros(10)

    # explore each of the ten arms once before computing any bounds
    if iters < 10:
        return iters

    else:
        for arm in range(10):

            # calculate the upper bound
            upper_bound = math.sqrt((2 * math.log(sum(count))) / count[arm])

            # add the upper bound to the Q value
            ucb[arm] = Q[arm] + upper_bound

        # return the arm that has the maximum UCB value
        return np.argmax(ucb)
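Note that the if iters < 10 branch matters: it pulls each of the ten arms once before any bound is computed, since otherwise count[arm] would be zero and the division (and the logarithm of zero) would fail. As a quick sanity check (my own, assuming the initialization above), the first ten calls simply cycle through the arms:

# The first ten rounds just return the round index, so every arm
# gets selected exactly once before any upper bounds are computed.
print([UCB(i) for i in range(10)])  # [0, 1, 2, ..., 9]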

Let us start pulling the arms:

for i in range(num_rounds):

    # select the arm using UCB
    arm = UCB(i)

    # pull the arm and get the reward
    observation, reward, done, info = env.step(arm)

    # update the count of that arm
    count[arm] += 1

    # sum the rewards obtained from the arm
    sum_rewards[arm] += reward

    # calculate the Q value, which is the average reward of the arm
    Q[arm] = sum_rewards[arm] / count[arm]

print('The optimal arm is {}'.format(np.argmax(Q)))

The output is as follows:

The optimal arm is 1
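
To see how UCB concentrates its pulls on the best arm, you can inspect the per-arm statistics after the loop finishes (a small follow-up of my own, not from the text):

# Inspect how often each arm was pulled and its estimated value.
for arm in range(10):
    print(f"arm {arm}: pulled {int(count[arm]):6d} times, Q = {Q[arm]:.3f}")

Arms with higher Q estimates should have been pulled far more often than the rest.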