
plays arms alternately until it decides, based on its observations, to commit to a
single arm for the remainder (Exercise 6.5).
Experiment 6.1 Fig. 6.1 shows the expected regret of ETC when playing a
Gaussian bandit with k = 2 and means μ1 = 0 and μ2 = −∆. The horizon is set
to n = 1000 and the suboptimality gap ∆ is varied between 0 and 1. Each data
point is the average of 10^5 simulations, which makes the error bars invisible. The
results show that the theoretical upper bound provided by Theorem 6.1 is quite
close to the actual performance.

[Figure 6.1 The expected regret of ETC and the upper bound in Eq. (6.6). Horizontal axis: suboptimality gap ∆ (from 0 to 1); vertical axis: expected regret. Curves: "Upper bound in (6.6)" and "ETC with m in (6.5)".]
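
As a rough illustration, the simulation behind an experiment of this kind can be sketched in a few lines. The following is a minimal sketch, not the authors' code: the function names, the run count and the random seed are illustrative, and the exploration time uses a gap-dependent choice of the form m = max{1, ⌈(4/∆²) log(n∆²/4)⌉}, intended to correspond to Eq. (6.5).

```python
import numpy as np

def etc_regret(delta, n, m, rng):
    """One run of explore-then-commit on a 2-armed Gaussian bandit with
    means mu1 = 0 and mu2 = -delta (unit variance); returns the pseudo-regret."""
    means = np.array([0.0, -delta])
    # Explore: pull each arm m times.
    rewards = rng.normal(means, 1.0, size=(m, 2))
    best_empirical = int(np.argmax(rewards.mean(axis=0)))
    # Commit: play the empirically best arm for the remaining n - 2m rounds.
    pulls = np.array([m, m])
    pulls[best_empirical] += n - 2 * m
    # Pseudo-regret is the gap times the number of pulls of the suboptimal arm.
    return delta * pulls[1]

def expected_regret(delta, n=1000, runs=10_000, seed=0):
    """Monte Carlo estimate of the expected regret (illustrative run count)."""
    rng = np.random.default_rng(seed)
    if delta > 0:
        # Gap-dependent exploration time, assuming the form of Eq. (6.5).
        m = max(1, int(np.ceil(4 / delta**2 * np.log(n * delta**2 / 4))))
        m = min(m, n // 2)
    else:
        m = n // 2  # with zero gap the regret is zero regardless of m
    return np.mean([etc_regret(delta, n, m, rng) for _ in range(runs)])

if __name__ == "__main__":
    for delta in np.linspace(0.05, 1.0, 20):
        print(f"gap {delta:.2f}: expected regret ≈ {expected_regret(delta):.2f}")
```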

6.2 Notes


1 An algorithm is called anytime if it does not require advance knowledge
of the horizon n. Explore-then-commit is not anytime because the choice of
commitment time depends on the horizon. This limitation can be addressed by
the doubling trick, which is a simple way to convert a horizon-dependent
algorithm into an anytime algorithm (Exercise 6.6); a schematic sketch is given after these notes.
2 By allowing the exploration time m to be a data-dependent random variable it
is possible to recover near-optimal regret without knowing the suboptimality
gap. For more details see Exercise 6.5. Another idea is to use an elimination
algorithm that acts in phases and eliminates arms using increasingly sensitive
hypothesis tests (Exercise 6.8). Elimination algorithms are often easy to analyze
and can work well in practice.
3 The ε-greedy algorithm is a randomized relative of ETC that in each round t
explores with probability ε by choosing an arm uniformly at random and otherwise
plays the arm with the largest empirical mean (a sketch follows these notes).
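
To make the doubling trick of Note 1 concrete, here is a minimal sketch under assumed interfaces: make_policy(horizon) is a hypothetical factory returning a policy with select() and update() methods, and bandit.pull(arm) is a hypothetical environment call; neither is notation from the book. The wrapper simply restarts the horizon-dependent algorithm on a geometrically growing sequence of horizons.

```python
def doubling_trick(make_policy, bandit, total_rounds):
    """Run a horizon-dependent policy as an anytime algorithm by restarting
    it with horizons 2, 4, 8, ... (the doubling trick of Note 1).

    Assumed interfaces (illustrative, not the book's notation):
      make_policy(horizon) -> object with select() -> arm and update(arm, reward)
      bandit.pull(arm)     -> reward
    """
    t = 0
    horizon = 2
    while t < total_rounds:
        policy = make_policy(horizon)  # fresh instance tuned to this phase length
        for _ in range(min(horizon, total_rounds - t)):
            arm = policy.select()
            reward = bandit.pull(arm)
            policy.update(arm, reward)
            t += 1
        horizon *= 2                   # next phase is twice as long
```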
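
Finally, a minimal sketch of the ε-greedy algorithm from Note 3, using a constant exploration probability for simplicity; the bandit.pull interface and the initial round of forced pulls are illustrative assumptions rather than the book's definition.

```python
import numpy as np

def epsilon_greedy(bandit, n, k, epsilon=0.1, seed=0):
    """Sketch of epsilon-greedy (Note 3): in each round explore a uniformly
    random arm with probability epsilon, otherwise play the arm with the
    largest empirical mean. A constant epsilon is used for simplicity;
    decaying schedules epsilon_t are also common."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    for t in range(n):
        if t < k:                        # pull each arm once to initialise estimates
            arm = t
        elif rng.random() < epsilon:     # explore
            arm = int(rng.integers(k))
        else:                            # exploit the empirically best arm
            arm = int(np.argmax(sums / counts))
        reward = bandit.pull(arm)        # assumed environment interface
        counts[arm] += 1
        sums[arm] += reward
    return sums / counts                 # empirical means after n rounds
```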