The MAB problem

The MAB problem is one of the classical problems in RL. An MAB is actually a slot machine, a gambling game played in a casino where you pull the arm (lever) and get a payout (reward) based on a randomly generated probability distribution. A single slot machine is called a one-armed bandit and, when there are multiple slot machines it is called multi-armed bandits or k-armed bandits.

MABs are shown as follows:

As each slot machine gives us the reward from its own probability distribution, our goal is to find out which slot machine will give us the maximum cumulative reward over a sequence of time. So, at each time step t, the agent performs an action a_t, that is, pulls an arm from the slot machine and receives a reward r_t_, and the goal of our agent is to maximize the cumulative reward.

We define the value of an arm Q(a) as average rewards received by pulling the arm:

So the optimal arm is the one that gives us the maximum cumulative reward, that is:

The goal of our agent is to find the optimal arm and also to minimize the regret, which can be defined as the cost of knowing which of the k arms is optimal. Now, how do we find the best arm? Should we explore all the arms or choose the arm that already gave us a maximum cumulative reward? Here comes the exploration-exploitation dilemma. Now we will see how to solve this dilemma using various exploration strategies as follows:

Epsilon-greedy policy
Softmax exploration
Upper confidence bound algorithm
Thomson sampling technique

Before going ahead, let us install bandit environments in the OpenAI Gym; you can install the bandit environment by typing the following command in your Terminal:

git clone https://github.com/JKCooper2/gym-bandits.git
cd gym-bandits
pip install -e .

After installing, let us import gym and gym_bandits:

import gym_bandits
import gym

Now we will initialize the environment; we use an MAB with ten arms:

env = gym.make("BanditTenArmedGaussian-v0")

Our action space will be 10, as we have 10 arms:

env.action_space

The output is as follows:

Table of Contents for The MAB problem

Create new playlist

Sign In

Sign Up

Table of Contents for
The MAB problem