33.6 Exercises 391
(a)For eachε >0 andδ∈(0,1) and number of armsk >1 design a policyπ
and stopping timeτsuch that for allν∈E,
Pνπ(∆Aτ≥ε)≤δ and Eνπ[τ]≤Ck
ε^2
log
(
k
δ
)
,
for universal constantC >0.
(b)It turns out the logarithmic dependence onkcan be eliminated. Design a
policyπand stopping timeτsuch that for allν∈E,
Pνπ(∆Aτ≥ε)≤δ and Eνπ[τ]≤Ck
ε^2
log
(
1
δ
)
.
(c)Prove a lower bound showing that the bound in part (b) is tight up to
constant factors in the worst case.
Hint Part (b) of the above exercise is a challenging problem. The simplest
approach is to use an elimination algorithm that operates in phases where at
the end of each phase the bottom half of the arms (in terms of their empirical
estimates) are eliminated. For details see the paper by Even-Dar et al. [2002].
33.6(Suboptimality of cumulative regret algorithms for best-arm
identification) Supposeπis an asymptotically optimal bandit policy inE=ENk
in the sense that
nlim→∞Rn(π,ν)
log(n)
=
∑
i:∆i(ν)> 0
2
∆i(ν)
for allν∈E.
(a)For anyε >0 prove there exists aν∈Ewith a unique optimal arm such that
lim infn→∞
−log(Pνπ(∆An+1>0))
log(n)
≤1 +ε.
(b) Can you prove the same result with lim inf replaced by lim sup?
(c)What happens if the assumption thatπis asymptotically optimal is replaced
with the assumption that there exists a universal constantC >0 such that
Rn(π,ν)≤C
∑
i:∆i(ν)> 0
(
∆i(ν) +
log(n)
∆i(ν)
)
.
33.7(Analysis of TrackAndStop) In this exercise you will complete the
proof of Theorem 33.6. Assume thatνhas a unique optimal arm. MakeEa
metric space via the metricd(ν 1 ,ν 2 ) =‖μ(ν 1 )−μ(ν 2 )‖∞. Letε >0 be a small
constant and define random times
τν(ε) = max{t:d(ˆνt,ν)≥ε}
τα(ε) = max{t:‖α∗(ν)−α∗(ˆνt)‖∞≥ε}
τT(ε) = max{t:‖T(t)/t−α∗i(ν)‖∞≥ε}.
Show that