The upper confidence bound algorithm

The upper confidence bound (UCB) algorithm is the most popular and widely used solution for MABPs. This algorithm is based on the principle of optimism in the face of uncertainty. This essentially means, the less uncertain we are about an arm, the more important it becomes to explore that arm.

Assume that we have two arms that can be tried out. If we have tried out the first arm 100 times but the second arm only once, then we are probably reasonably confident about the payoff of the first arm. However, we are very uncertain about the payoff of the second arm. This gives rise to the family of UCB algorithms. This can be further explained through the following diagram:

Illustration to explain upper confident bound algorithm

In the preceding diagram, each bar represents a different arm or an action. The red dot is the true expected reward and the center of the bar represents the observed average reward. The width of the bar represents the confidence interval. We are already aware that, by the law of large numbers, the more samples we have, the closer the observed average gets to the true average, and the more the bar shrinks.

The idea behind UCB algorithms is to always pick the arm or action with the highest upper bound, which is the sum of the observed average and the one-sided width of the confidence interval. This balances the exploration of arms that have not been tried many times with the exploitation of arms that have.

Table of Contents for The upper confidence bound algorithm

Create new playlist

Sign In

Sign Up

Table of Contents for
The upper confidence bound algorithm