
plays arms alternately until it decides, based on its observations, to commit to a
single arm for the remainder (Exercise 6.5).
Experiment 6.1 Fig. 6.1 shows the expected regret of ETC when playing a
Gaussian bandit with k = 2 and means μ1 = 0 and μ2 = −∆. The horizon is set
to n = 1000 and the suboptimality gap ∆ is varied between 0 and 1. Each data
point is the average of 10^5 simulations, which makes the error bars invisible. The
results show that the theoretical upper bound provided by Theorem 6.1 is quite
close to the actual performance.

[Figure 6.1 The expected regret of ETC and the upper bound in Eq. (6.6). Horizontal axis: suboptimality gap ∆ (from 0 to 1); vertical axis: expected regret. Curves: "Upper bound in (6.6)" and "ETC with m in (6.5)".]
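
As a rough illustration, the simulation behind an experiment of this kind can be sketched in a few lines. The following is a minimal sketch, not the authors' code: the function names, the run count and the random seed are illustrative, and the exploration time uses a gap-dependent choice of the form m = max{1, ⌈(4/∆²) log(n∆²/4)⌉}, intended to correspond to Eq. (6.5).

```python
import numpy as np

def etc_regret(delta, n, m, rng):
    """One run of explore-then-commit on a 2-armed Gaussian bandit with
    means mu1 = 0 and mu2 = -delta (unit variance); returns the pseudo-regret."""
    means = np.array([0.0, -delta])
    # Explore: pull each arm m times.
    rewards = rng.normal(means, 1.0, size=(m, 2))
    best_empirical = int(np.argmax(rewards.mean(axis=0)))
    # Commit: play the empirically best arm for the remaining n - 2m rounds.
    pulls = np.array([m, m])
    pulls[best_empirical] += n - 2 * m
    # Pseudo-regret is the gap times the number of pulls of the suboptimal arm.
    return delta * pulls[1]

def expected_regret(delta, n=1000, runs=10_000, seed=0):
    """Monte Carlo estimate of the expected regret (illustrative run count)."""
    rng = np.random.default_rng(seed)
    if delta > 0:
        # Gap-dependent exploration time, assuming the form of Eq. (6.5).
        m = max(1, int(np.ceil(4 / delta**2 * np.log(n * delta**2 / 4))))
        m = min(m, n // 2)
    else:
        m = n // 2  # with zero gap the regret is zero regardless of m
    return np.mean([etc_regret(delta, n, m, rng) for _ in range(runs)])

if __name__ == "__main__":
    for delta in np.linspace(0.05, 1.0, 20):
        print(f"gap {delta:.2f}: expected regret ≈ {expected_regret(delta):.2f}")
```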

6.2 Notes


1 An algorithm is called anytime if it does not require advance knowledge
of the horizon n. Explore-then-commit is not anytime because the choice of
commitment time depends on the horizon. This limitation can be addressed by
the doubling trick, which is a simple way to convert a horizon-dependent
algorithm into an anytime algorithm (Exercise 6.6); a schematic sketch is given after these notes.
2 By allowing the exploration time m to be a data-dependent random variable it
is possible to recover near-optimal regret without knowing the suboptimality
gap. For more details see Exercise 6.5. Another idea is to use an elimination
algorithm that acts in phases and eliminates arms using increasingly sensitive
hypothesis tests (Exercise 6.8). Elimination algorithms are often easy to analyze
and can work well in practice.
3 The ε-greedy algorithm is a randomized relative of ETC that in each round t
explores with probability ε by choosing an arm uniformly at random and otherwise
plays the arm with the largest empirical mean (a sketch follows these notes).
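
To make the doubling trick of Note 1 concrete, here is a minimal sketch under assumed interfaces: make_policy(horizon) is a hypothetical factory returning a policy with select() and update() methods, and bandit.pull(arm) is a hypothetical environment call; neither is notation from the book. The wrapper simply restarts the horizon-dependent algorithm on a geometrically growing sequence of horizons.

```python
def doubling_trick(make_policy, bandit, total_rounds):
    """Run a horizon-dependent policy as an anytime algorithm by restarting
    it with horizons 2, 4, 8, ... (the doubling trick of Note 1).

    Assumed interfaces (illustrative, not the book's notation):
      make_policy(horizon) -> object with select() -> arm and update(arm, reward)
      bandit.pull(arm)     -> reward
    """
    t = 0
    horizon = 2
    while t < total_rounds:
        policy = make_policy(horizon)  # fresh instance tuned to this phase length
        for _ in range(min(horizon, total_rounds - t)):
            arm = policy.select()
            reward = bandit.pull(arm)
            policy.update(arm, reward)
            t += 1
        horizon *= 2                   # next phase is twice as long
```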
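
Finally, a minimal sketch of the ε-greedy algorithm from Note 3, using a constant exploration probability for simplicity; the bandit.pull interface and the initial round of forced pulls are illustrative assumptions rather than the book's definition.

```python
import numpy as np

def epsilon_greedy(bandit, n, k, epsilon=0.1, seed=0):
    """Sketch of epsilon-greedy (Note 3): in each round explore a uniformly
    random arm with probability epsilon, otherwise play the arm with the
    largest empirical mean. A constant epsilon is used for simplicity;
    decaying schedules epsilon_t are also common."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    for t in range(n):
        if t < k:                        # pull each arm once to initialise estimates
            arm = t
        elif rng.random() < epsilon:     # explore
            arm = int(rng.integers(k))
        else:                            # exploit the empirically best arm
            arm = int(np.argmax(sums / counts))
        reward = bandit.pull(arm)        # assumed environment interface
        counts[arm] += 1
        sums[arm] += reward
    return sums / counts                 # empirical means after n rounds
```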