Bandit Algorithms


Proof By the regret decomposition identity (4.5),
\[
R_n(\pi,\nu)
= n\,\mathbb{E}\left[\sum_{i=1}^k \Delta_i\,\frac{T_i(n)}{n}\right]
= n\,\mathbb{E}\left[\Delta_{A_{n+1}}\right]
= n\,R_n^{\mathrm{simple}}\bigl((\pi_t)_{t=1}^{n+1},\nu\bigr),
\]

where the first equality follows from the definition of the cumulative regret, the
second from the definition of $\pi_{n+1}$ and the last from the definition of the
simple regret.
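The recommendation rule used in the proof is easy to make concrete. Below is a minimal Python sketch, not taken from the book: after running any cumulative-regret policy for $n$ rounds, the extra action $A_{n+1}$ is drawn from the empirical distribution of past plays, so that arm $i$ is recommended with probability $T_i(n)/n$. The function name and interface are ours.

    import numpy as np

    def recommend_from_play_counts(actions, k, rng=None):
        # pi_{n+1} as in the proof: recommend arm i with probability T_i(n) / n,
        # where T_i(n) is the number of times arm i was played in rounds 1..n.
        rng = np.random.default_rng() if rng is None else rng
        counts = np.bincount(np.asarray(actions), minlength=k)
        return int(rng.choice(k, p=counts / counts.sum()))

Conditioned on the history, the expected gap of the arm recommended in this way is $\sum_{i=1}^k \Delta_i T_i(n)/n$, which is exactly the quantity inside the expectation in the display above.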

Corollary 33.3. For all $n$ there exists a policy $\pi$ such that for all $\nu \in \mathcal{E}^k_{\mathrm{SG}}(1)$
with $\Delta(\nu) \in [0,1]^k$ it holds that $R_n^{\mathrm{simple}}(\pi,\nu) \le C\sqrt{k/n}$, where $C$ is a universal
constant.

Proof Combine the previous result with Theorem 9.1.
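To spell out the combination: take the policy of Theorem 9.1 and append the recommendation rule $\pi_{n+1}$ of Proposition 33.2. Assuming, as the form of the corollary suggests, that Theorem 9.1 provides a worst-case bound $R_n(\pi,\nu) \le C\sqrt{kn}$ whenever $\Delta(\nu) \in [0,1]^k$ (we do not restate that theorem here), Proposition 33.2 yields
\[
R_n^{\mathrm{simple}}(\pi,\nu) = \frac{R_n(\pi,\nu)}{n} \le \frac{C\sqrt{kn}}{n} = C\sqrt{\frac{k}{n}}\,.
\]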

Proposition 33.2 raises our hopes that policies designed for minimizing the
cumulative regret might also have well-behaved simple regret. Unfortunately this
is only true in the intermediate regimes where the best arm is hard to identify.
Policies with small cumulative regret spend most of their time playing the optimal
arm and play suboptimal arms just barely often enough to confirm that they are
not optimal. In pure exploration this leads to a highly suboptimal policy whose
simple regret decays only polynomially, while we know from Theorem 33.1 that the
simple regret can be made to decrease exponentially fast. More details and
pointers to the literature are given in Note 2 at the end of the chapter.
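The contrast is easy to observe numerically. The following Python sketch is ours and purely illustrative: on a two-armed Gaussian bandit it estimates the misidentification frequency of (a) a UCB-style policy combined with the counts-proportional recommendation of Proposition 33.2 and (b) uniform exploration followed by recommending the arm with the largest empirical mean; the exploration constants and horizon are arbitrary choices.

    import numpy as np

    def ucb_then_recommend(means, n, rng):
        # Run a UCB-style policy for n rounds, then recommend an arm sampled in
        # proportion to the play counts T_i(n) (the rule of Proposition 33.2).
        k = len(means)
        counts, sums = np.zeros(k, dtype=int), np.zeros(k)
        for t in range(n):
            if t < k:
                a = t  # play each arm once to initialise
            else:
                index = sums / counts + np.sqrt(2.0 * np.log(1.0 + t) / counts)
                a = int(np.argmax(index))
            counts[a] += 1
            sums[a] += rng.normal(means[a], 1.0)
        return int(rng.choice(k, p=counts / n))

    def uniform_then_recommend(means, n, rng):
        # Explore uniformly for n rounds, then recommend the empirically best arm.
        k = len(means)
        counts, sums = np.zeros(k, dtype=int), np.zeros(k)
        for t in range(n):
            a = t % k
            counts[a] += 1
            sums[a] += rng.normal(means[a], 1.0)
        return int(np.argmax(sums / counts))

    rng = np.random.default_rng(0)
    means = np.array([0.5, 0.0])  # arm 0 is optimal, gap 0.5
    n, repetitions = 2000, 500
    for name, algo in [("ucb", ucb_then_recommend), ("uniform", uniform_then_recommend)]:
        errors = sum(algo(means, n, rng) != 0 for _ in range(repetitions))
        print(name, "misidentification frequency:", errors / repetitions)

The UCB-based recommendation errs on a small but persistent fraction of runs, roughly $\mathbb{E}[T_2(n)]/n$, which decays only polynomially in $n$, whereas the error probability of uniform exploration with the empirically best recommendation decays exponentially in $n\Delta^2$.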

33.2 Best-arm identification with a fixed confidence


Best-arm identification is a variant of pure exploration where the learner is
rewarded only for identifying an exactly optimal arm. There are two variants of
best-arm identification. In this section we consider the so-called fixed confidence
setting, in which the learner is given a confidence level $\delta \in (0,1)$ and should use as
few samples as possible to output an arm that is optimal with probability at least
$1 - \delta$. The other variant, the so-called fixed budget setting, is considered in the
next section. In this variant the learner has a budget of $n$ rounds and the goal is
to minimize the probability of selecting a suboptimal arm.
In the fixed confidence setting the learner chooses a policy $\pi = (\pi_t)_{t=1}^\infty$ as
normal. However, the number of rounds is not fixed in advance and in addition
to the policy the learner also chooses when to stop. Denoting by $\tau$ the time
when the learner stops, it holds that $\tau$ is an $\mathbb{F} = (\mathcal{F}_t)_{t=0}^\infty$-stopping time, where
$\mathcal{F}_t = \sigma(A_1, X_1, \ldots, A_t, X_t, A_{t+1})$. The action at time $\tau + 1$ serves as the output.

Definition 33.4. A policy and stopping time pair $(\pi,\tau)$ is sound at confidence
level $\delta \in (0,1)$ for environment class $\mathcal{E}$ if for all $\nu \in \mathcal{E}$,
\[
\mathbb{P}_{\nu\pi}\bigl(\tau < \infty \text{ and } \Delta_{A_{\tau+1}}(\nu) > 0\bigr) \le \delta\,. \tag{33.2}
\]
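To make the definition concrete, here is a minimal Python sketch of one policy and stopping time pair intended to satisfy (33.2) for $k \ge 2$ arms with 1-sub-Gaussian rewards. It is ours and not an algorithm from this chapter: the policy samples the arms in round-robin order, maintains anytime confidence intervals, stops at the first time $\tau$ at which some arm's lower confidence bound dominates every other arm's upper confidence bound, and outputs that arm as $A_{\tau+1}$. The particular width below is meant to come from a union bound over arms and sample counts; we do not verify soundness formally here.

    import numpy as np

    def fixed_confidence_best_arm(pull, k, delta, max_rounds=10**6):
        # pull(a) returns a single reward from arm a (assumed 1-sub-Gaussian).
        # Returns the recommended arm and the number of samples used.
        counts, sums = np.zeros(k, dtype=int), np.zeros(k)
        for t in range(1, max_rounds + 1):
            a = (t - 1) % k  # policy pi_t: round-robin sampling
            sums[a] += pull(a)
            counts[a] += 1
            if counts.min() == 0:
                continue
            means = sums / counts
            # Anytime confidence width; the constants are chosen so that a union
            # bound over arms and sample counts keeps the failure probability <= delta.
            width = np.sqrt(2.0 * np.log(4.0 * k * counts**2 / delta) / counts)
            lower, upper = means - width, means + width
            i = int(np.argmax(lower))
            if lower[i] >= np.max(np.delete(upper, i)):  # stopping time tau
                return i, t
        # The cap max_rounds is only practical; Definition 33.4 allows tau to be
        # infinite, and this fallback return is outside the soundness guarantee.
        return int(np.argmax(sums / counts)), max_rounds

On the event that every confidence interval contains the corresponding mean, the stopping condition can only be triggered by an optimal arm, so the probability of stopping with a suboptimal arm is bounded by the probability of that event failing, which is exactly the structure required by (33.2).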