Decayed epsilon greedy

The value of epsilon is key in determining how well the epsilon-greedy algorithm works for a given problem. Instead of fixing this value at the start, we can make epsilon depend on time. For example, epsilon can be set to 1 / log(t + 0.00001), where t is the time step. As time passes, the value of epsilon keeps shrinking. This works because, over the same period in which epsilon is reduced, we become more confident about which action is optimal, so less exploration is required.
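The schedule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names decayed_epsilon, select_arm, and run_bandit are hypothetical, the Bernoulli rewards are made up for the demo, and the clipping to the range [0, 1] is an added assumption. The clipping is needed because 1 / log(t + 0.00001) is negative or greater than 1 for the first couple of time steps, so it cannot be used directly as a probability there.

```python
import math
import random


def decayed_epsilon(t, offset=0.00001):
    """Epsilon schedule 1 / log(t + offset), clipped to [0, 1] (assumption)."""
    value = math.log(t + offset)
    if value <= 0:
        # For t close to 0 the log is non-positive: explore fully.
        return 1.0
    return min(1.0, 1.0 / value)


def select_arm(q_values, t, rng):
    """Explore with probability epsilon(t), otherwise pick the greedy arm."""
    if rng.random() < decayed_epsilon(t):
        return rng.randrange(len(q_values))  # uniform random arm
    return max(range(len(q_values)), key=q_values.__getitem__)


def run_bandit(true_means, steps=5000, seed=0):
    """Run decayed epsilon-greedy on a hypothetical Bernoulli bandit."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    q_values = [0.0] * len(true_means)
    for t in range(1, steps + 1):
        arm = select_arm(q_values, t, rng)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental sample-average update of the value estimate.
        q_values[arm] += (reward - q_values[arm]) / counts[arm]
    return q_values, counts


if __name__ == "__main__":
    estimates, pulls = run_bandit([0.2, 0.5, 0.8])
    print("estimated values:", estimates)
    print("pull counts:", pulls)
```

Note that this particular schedule decays slowly: at t = 1000 epsilon is still roughly 0.14, so a noticeable share of pulls remains exploratory late in the run.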

The problem with selecting actions at random is that, even after enough time steps to know that some arm is bad, the algorithm keeps choosing that arm with probability epsilon/n, where n is the number of arms. Essentially, we keep exploring an action we already know to be bad, which is inefficient. One way around this is to favor exploring arms with strong potential in order to get an optimal value.
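To make the epsilon/n figure concrete, the following sketch computes the probability with which epsilon-greedy selects each arm, assuming exploration draws uniformly from all n arms (the greedy arm included). The function name selection_probabilities and the value estimates are hypothetical, chosen only for illustration.

```python
def selection_probabilities(q_values, epsilon):
    """Per-arm selection probability under epsilon-greedy.

    Assumes the exploratory draw is uniform over all n arms.
    """
    n = len(q_values)
    greedy = max(range(n), key=q_values.__getitem__)
    probs = [epsilon / n] * n      # every arm gets the exploration share
    probs[greedy] += 1.0 - epsilon  # the greedy arm also gets the exploit share
    return probs


if __name__ == "__main__":
    # Hypothetical estimates: one good arm, one mediocre arm, one clearly bad arm.
    print(selection_probabilities([0.9, 0.5, 0.05], epsilon=0.1))
```

The mediocre arm and the clearly bad arm receive exactly the same probability, epsilon/n, which is the inefficiency described above: random exploration ignores how much we already know about each arm.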
