Bandit Algorithms

33.6 Exercises 391

(a)For eachε >0 andδ∈(0,1) and number of armsk >1 design a policyπ
and stopping timeτsuch that for allν∈E,

Pνπ(∆Aτ≥ε)≤δ and Eνπ[τ]≤Ck ε^2

log

(

k δ

)

,

for universal constantC >0.
(b)It turns out the logarithmic dependence onkcan be eliminated. Design a
policyπand stopping timeτsuch that for allν∈E,

Pνπ(∆Aτ≥ε)≤δ and Eνπ[τ]≤Ck ε^2

log

(

1

δ

)

.

(c)Prove a lower bound showing that the bound in part (b) is tight up to constant factors in the worst case.

Hint Part (b) of the above exercise is a challenging problem. The simplest approach is to use an elimination algorithm that operates in phases where at the end of each phase the bottom half of the arms (in terms of their empirical estimates) are eliminated. For details see the paper by Even-Dar et al. [2002]. 33.6(Suboptimality of cumulative regret algorithms for best-arm identification) Supposeπis an asymptotically optimal bandit policy inE=ENk in the sense that

nlim→∞Rn(π,ν) log(n)

=

∑

i:∆i(ν)> 0

2

∆i(ν)

for allν∈E.

(a)For anyε >0 prove there exists aν∈Ewith a unique optimal arm such that

lim infn→∞

−log(Pνπ(∆An+1>0)) log(n)

≤1 +ε.

(b) Can you prove the same result with lim inf replaced by lim sup?
(c)What happens if the assumption thatπis asymptotically optimal is replaced
with the assumption that there exists a universal constantC >0 such that

Rn(π,ν)≤C

∑

i:∆i(ν)> 0

(

∆i(ν) +

log(n) ∆i(ν)

)

.

33.7(Analysis of TrackAndStop) In this exercise you will complete the proof of Theorem 33.6. Assume thatνhas a unique optimal arm. MakeEa metric space via the metricd(ν 1 ,ν 2 ) =‖μ(ν 1 )−μ(ν 2 )‖∞. Letε >0 be a small constant and define random times

τν(ε) = max{t:d(ˆνt,ν)≥ε} τα(ε) = max{t:‖α∗(ν)−α∗(ˆνt)‖∞≥ε} τT(ε) = max{t:‖T(t)/t−α∗i(ν)‖∞≥ε}.

Show that

Bandit Algorithms

(

)

,

(

1

)

.

=

∑

2

∑

(

)

.

Get our desktop app

Company

Features

Documentation

Resources