
$\sum_{i=1}^k \min\{1/\Delta_i^2,\, 1/\Delta_{\min}^2\}$. Then
$$H_2(\mu) \le H_1(\mu) \le (1 + \log(k))\, H_2(\mu). \tag{33.8}$$
Furthermore, both inequalities are essentially tight (Exercise 33.8).
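
To make (33.8) concrete, here is a small numerical check. This sketch is not from the book: it assumes the usual definition $H_2(\mu) = \max_i i\,\Delta_{(i)}^{-2}$ with the gaps sorted in increasing order and the optimal arm's zero gap replaced by $\Delta_{\min}$, mirroring the truncation used in the definition of $H_1$ above.

```python
import math

def complexity_measures(gaps):
    """Return (H1, H2) for a list of suboptimality gaps.

    `gaps` contains Delta_i for every arm, with Delta_i = 0 for the optimal
    arm.  Zero gaps are replaced by the smallest positive gap Delta_min,
    matching the truncation min{1/Delta_i^2, 1/Delta_min^2} used for H1
    (and assumed here for H2 as well).
    """
    dmin = min(g for g in gaps if g > 0)
    adjusted = sorted(max(g, dmin) for g in gaps)          # increasing gaps
    h1 = sum(1.0 / g ** 2 for g in adjusted)
    h2 = max((i + 1) / g ** 2 for i, g in enumerate(adjusted))
    return h1, h2

# Example: one nearly optimal arm and many clearly suboptimal ones.
gaps = [0.0, 0.05] + [1.0] * 18
h1, h2 = complexity_measures(gaps)
k = len(gaps)
assert h2 <= h1 <= (1 + math.log(k)) * h2                  # inequality (33.8)
print(f"H1 = {h1:.1f}, H2 = {h2:.1f}, (1 + log k) H2 = {(1 + math.log(k)) * h2:.1f}")
```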
Let’s see how the bound in Theorem 33.10 compares to uniform exploration,
which is the same as Algorithm 19. As in the proof of Theorem 33.1, the
probability that uniform exploration selects a suboptimal arm is easily controlled
using Theorem 5.3 and Lemma 5.4:
$$\mathbb{P}_{\nu,\mathrm{UE}}(\Delta_{A_{n+1}} > 0) \le \sum_{i=2}^k \mathbb{P}\left(\hat\mu_i(n) \ge \hat\mu_1(n)\right) \le \sum_{i=2}^k \exp\left(-\frac{\lfloor n/k \rfloor \Delta_i^2}{4}\right).$$
Suppose that $\Delta = \Delta_2 = \cdots = \Delta_k$ so that all suboptimal arms have the same
suboptimality gap. Then $H_2 = k/\Delta^2$ and the terms in the exponent for sequential
halving and uniform exploration are $O(n\Delta^2/(k\log k))$ and $O(n\Delta^2/k)$ respectively,
which means that uniform exploration is actually moderately better than
sequential halving, at least if $n$ is sufficiently large. On the other hand, if $\Delta_2$
is small but $\Delta_i = 1$ for all $i > 2$, then $H_2 = O(1/\Delta_2^2)$ and the exponents are
$O(n\Delta_2^2)$ and $O(n\Delta_2^2/k)$ respectively, so sequential halving is significantly better.
The reason for the disparity is the non-adaptivity of uniform exploration, which
wastes many samples on arms $i > 2$. Although there are no asymptotically
matching upper and lower bounds in the fixed-budget setting, the bound for
sequential halving is known to be roughly optimal.
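
The qualitative comparison above is easy to reproduce in simulation. The following is a minimal sketch, not the book's pseudocode: it assumes unit-variance Gaussian rewards and implements sequential halving in its usual form ($\lceil\log_2 k\rceil$ elimination rounds, each splitting a budget of roughly $n/\lceil\log_2 k\rceil$ samples evenly over the surviving arms and discarding the worse half), alongside the uniform-exploration baseline analysed above.

```python
import math
import random

def sequential_halving(means, n, rng):
    """Fixed-budget best-arm identification by halving the candidate set.

    Each of the ceil(log2(k)) rounds spends roughly n / ceil(log2(k))
    samples split evenly over the surviving arms, then keeps the better
    half by empirical mean.  Rewards are unit-variance Gaussian (an
    assumption of this sketch, not part of the algorithm itself).
    """
    survivors = list(range(len(means)))
    rounds = max(1, math.ceil(math.log2(len(means))))
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        t = max(1, n // (len(survivors) * rounds))   # samples per surviving arm
        est = {i: sum(rng.gauss(means[i], 1.0) for _ in range(t)) / t
               for i in survivors}
        survivors.sort(key=est.__getitem__, reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
    return survivors[0]

def uniform_exploration(means, n, rng):
    """Play every arm floor(n/k) times and recommend the empirical best."""
    t = max(1, n // len(means))
    est = [sum(rng.gauss(m, 1.0) for _ in range(t)) / t for m in means]
    return max(range(len(means)), key=est.__getitem__)

# One small gap (arm 1) and many large gaps: the regime in which the text
# predicts a clear advantage for sequential halving.
means = [0.0, -0.2] + [-1.0] * 14
rng = random.Random(1)
trials, budget = 200, 2000
sh_err = sum(sequential_halving(means, budget, rng) != 0 for _ in range(trials))
ue_err = sum(uniform_exploration(means, budget, rng) != 0 for _ in range(trials))
print(f"misidentification rate: SH {sh_err / trials:.2f}, UE {ue_err / trials:.2f}")
```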


33.4 Notes


1 The problems studied in this chapter belong to the literature on stochastic
optimization, where the simple regret is called the expected suboptimality
(gap). There are many variants of pure exploration. In the example at the start
of the chapter, a medical researcher may be interested in getting the most
reliable information about differences between treatments. This falls into the
class of pure information-seeking problems, the subject of optimal experimental
design from statistics, which we have met earlier.
2 We mentioned that algorithms with logarithmic cumulative regret are not well
suited for pure exploration. Suppose $\pi$ has asymptotically optimal cumulative
regret on $\mathcal{E} = \mathcal{E}_N^k$, which means that $\lim_{n\to\infty} \mathbb{E}_{\nu\pi}[T_i(n)]/\log(n) = 2/\Delta_i(\nu)^2$ for
all $\nu \in \mathcal{E}$. You will show in Exercise 33.6 that for any $\varepsilon > 0$ there exists a $\nu \in \mathcal{E}$
with a unique optimal arm such that
$$\liminf_{n\to\infty} \frac{-\log\left(\mathbb{P}_{\nu\pi}\left(A_{n+1} \notin i^*(\nu)\right)\right)}{\log(n)} \le 1 + \varepsilon.$$
This shows that using an asymptotically optimal policy for cumulative regret
minimization leads to a best-arm identification policy for which the probability
of selecting a suboptimal arm cannot decay faster than polynomially in $n$.