Bandit Algorithms


Proof By the regret decomposition identity (4.5),
\[
R_n(\pi,\nu)
= n\,\mathbb{E}\left[\sum_{i=1}^k \Delta_i\,\frac{T_i(n)}{n}\right]
= n\,\mathbb{E}\left[\Delta_{A_{n+1}}\right]
= n\,R_n^{\mathrm{simple}}\bigl((\pi_t)_{t=1}^{n+1},\nu\bigr),
\]

where the first equality follows from the definition of the cumulative regret, the
second from the definition of $\pi_{n+1}$ and the last from the definition of the
simple regret.
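The recommendation rule used in the proof is easy to make concrete. Below is a minimal Python sketch, not taken from the book: after running any cumulative-regret policy for $n$ rounds, the extra action $A_{n+1}$ is drawn from the empirical distribution of past plays, so that arm $i$ is recommended with probability $T_i(n)/n$. The function name and interface are ours.

    import numpy as np

    def recommend_from_play_counts(actions, k, rng=None):
        # pi_{n+1} as in the proof: recommend arm i with probability T_i(n) / n,
        # where T_i(n) is the number of times arm i was played in rounds 1..n.
        rng = np.random.default_rng() if rng is None else rng
        counts = np.bincount(np.asarray(actions), minlength=k)
        return int(rng.choice(k, p=counts / counts.sum()))

Conditioned on the history, the expected gap of the arm recommended in this way is $\sum_{i=1}^k \Delta_i T_i(n)/n$, which is exactly the quantity inside the expectation in the display above.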

Corollary 33.3. For all $n$ there exists a policy $\pi$ such that for all $\nu \in \mathcal{E}^k_{\mathrm{SG}}(1)$
with $\Delta(\nu) \in [0,1]^k$ it holds that $R_n^{\mathrm{simple}}(\pi,\nu) \le C\sqrt{k/n}$, where $C$ is a universal
constant.

Proof Combine the previous result with Theorem 9.1.
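To spell out the combination: take the policy of Theorem 9.1 and append the recommendation rule $\pi_{n+1}$ of Proposition 33.2. Assuming, as the form of the corollary suggests, that Theorem 9.1 provides a worst-case bound $R_n(\pi,\nu) \le C\sqrt{kn}$ whenever $\Delta(\nu) \in [0,1]^k$ (we do not restate that theorem here), Proposition 33.2 yields
\[
R_n^{\mathrm{simple}}(\pi,\nu) = \frac{R_n(\pi,\nu)}{n} \le \frac{C\sqrt{kn}}{n} = C\sqrt{\frac{k}{n}}\,.
\]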

Proposition 33.2 raises our hopes that policies designed for minimizing the
cumulative regret might also have well-behaved simple regret. Unfortunately this
is only true in the intermediate regimes where the best arm is hard to identify.
Policies with small cumulative regret spend most of their time playing the optimal
arm and play suboptimal arms just barely often enough to confirm that they are
not optimal. In pure exploration this leads to a highly suboptimal policy whose
simple regret decays only polynomially, while we know from Theorem 33.1 that the
simple regret can be made to decrease exponentially fast. More details and
pointers to the literature are given in Note 2 at the end of the chapter.
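The contrast is easy to observe numerically. The following Python sketch is ours and purely illustrative: on a two-armed Gaussian bandit it estimates the misidentification frequency of (a) a UCB-style policy combined with the counts-proportional recommendation of Proposition 33.2 and (b) uniform exploration followed by recommending the arm with the largest empirical mean; the exploration constants and horizon are arbitrary choices.

    import numpy as np

    def ucb_then_recommend(means, n, rng):
        # Run a UCB-style policy for n rounds, then recommend an arm sampled in
        # proportion to the play counts T_i(n) (the rule of Proposition 33.2).
        k = len(means)
        counts, sums = np.zeros(k, dtype=int), np.zeros(k)
        for t in range(n):
            if t < k:
                a = t  # play each arm once to initialise
            else:
                index = sums / counts + np.sqrt(2.0 * np.log(1.0 + t) / counts)
                a = int(np.argmax(index))
            counts[a] += 1
            sums[a] += rng.normal(means[a], 1.0)
        return int(rng.choice(k, p=counts / n))

    def uniform_then_recommend(means, n, rng):
        # Explore uniformly for n rounds, then recommend the empirically best arm.
        k = len(means)
        counts, sums = np.zeros(k, dtype=int), np.zeros(k)
        for t in range(n):
            a = t % k
            counts[a] += 1
            sums[a] += rng.normal(means[a], 1.0)
        return int(np.argmax(sums / counts))

    rng = np.random.default_rng(0)
    means = np.array([0.5, 0.0])  # arm 0 is optimal, gap 0.5
    n, repetitions = 2000, 500
    for name, algo in [("ucb", ucb_then_recommend), ("uniform", uniform_then_recommend)]:
        errors = sum(algo(means, n, rng) != 0 for _ in range(repetitions))
        print(name, "misidentification frequency:", errors / repetitions)

The UCB-based recommendation errs on a small but persistent fraction of runs, roughly $\mathbb{E}[T_2(n)]/n$, which decays only polynomially in $n$, whereas the error probability of uniform exploration with the empirically best recommendation decays exponentially in $n\Delta^2$.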

33.2 Best-arm identification with a fixed confidence


Best-arm identification is a variant of pure exploration where the learner is
rewarded only for identifying an exactly optimal arm. There are two variants of
best-arm identification. In this section we consider the so-called fixed confidence
setting, in which the learner is given a confidence level $\delta \in (0,1)$ and should use as
few samples as possible to output an arm that is optimal with probability at least
$1 - \delta$. The other variant, the so-called fixed budget setting, is considered in the
next section. In this variant the learner has a budget of $n$ rounds and the goal is
to minimize the probability of selecting a suboptimal arm.
In the fixed confidence setting the learner chooses a policy $\pi = (\pi_t)_{t=1}^\infty$ as
normal. However, the number of rounds is not fixed in advance and in addition
to the policy the learner also chooses when to stop. Denoting by $\tau$ the time
when the learner stops, it holds that $\tau$ is an $\mathbb{F} = (\mathcal{F}_t)_{t=0}^\infty$-stopping time, where
$\mathcal{F}_t = \sigma(A_1, X_1, \ldots, A_t, X_t, A_{t+1})$. The action at time $\tau + 1$ serves as the output.

Definition 33.4. A policy and stopping time pair $(\pi,\tau)$ is sound at confidence
level $\delta \in (0,1)$ for environment class $\mathcal{E}$ if for all $\nu \in \mathcal{E}$,
\[
\mathbb{P}_{\nu\pi}\bigl(\tau < \infty \text{ and } \Delta_{A_{\tau+1}}(\nu) > 0\bigr) \le \delta\,. \tag{33.2}
\]
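To make the definition concrete, here is a minimal Python sketch of one policy and stopping time pair intended to satisfy (33.2) for $k \ge 2$ arms with 1-sub-Gaussian rewards. It is ours and not an algorithm from this chapter: the policy samples the arms in round-robin order, maintains anytime confidence intervals, stops at the first time $\tau$ at which some arm's lower confidence bound dominates every other arm's upper confidence bound, and outputs that arm as $A_{\tau+1}$. The particular width below is meant to come from a union bound over arms and sample counts; we do not verify soundness formally here.

    import numpy as np

    def fixed_confidence_best_arm(pull, k, delta, max_rounds=10**6):
        # pull(a) returns a single reward from arm a (assumed 1-sub-Gaussian).
        # Returns the recommended arm and the number of samples used.
        counts, sums = np.zeros(k, dtype=int), np.zeros(k)
        for t in range(1, max_rounds + 1):
            a = (t - 1) % k  # policy pi_t: round-robin sampling
            sums[a] += pull(a)
            counts[a] += 1
            if counts.min() == 0:
                continue
            means = sums / counts
            # Anytime confidence width; the constants are chosen so that a union
            # bound over arms and sample counts keeps the failure probability <= delta.
            width = np.sqrt(2.0 * np.log(4.0 * k * counts**2 / delta) / counts)
            lower, upper = means - width, means + width
            i = int(np.argmax(lower))
            if lower[i] >= np.max(np.delete(upper, i)):  # stopping time tau
                return i, t
        # The cap max_rounds is only practical; Definition 33.4 allows tau to be
        # infinite, and this fallback return is outside the soundness guarantee.
        return int(np.argmax(sums / counts)), max_rounds

On the event that every confidence interval contains the corresponding mean, the stopping condition can only be triggered by an optimal arm, so the probability of stopping with a suboptimal arm is bounded by the probability of that event failing, which is exactly the structure required by (33.2).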