Bandit Algorithms

33.2 Best-arm identification with a fixed confidence 379

The objective in fixed confidence best-arm identification is to find a sound learner for whichEνπ[τ] is minimized over environmentsν∈ E. Since this is a multi-objective criteria there is a priori no reason to believe that a single optimal learner should exist. Conveniently, however, the condition that the learner must satisfy Eq. (33.2) plays the role of the consistency assumption in the asymptotic lower bounds in Chapter 16, which allows for a sense of instance-dependent asymptotic optimality. The situation in finite-time is more complicated, as we discuss in Note 6.

IfEis sufficiently rich andνhas multiple optimal arms, then no sound learner can stop in finite time with positive probability. The reason is that there is no way to reject the hypothesis that one optimal arm is fractionally better than another. You will investigate this in Exercise 33.11. Also note that in our definitionI{τ=t}is a deterministic function ofA 1 ,X 1 ,...,At,Xt,At+1. None of the results that follow would change if you allowedτto also depend on some exogenous source of randomness.

33.2.1 Lower bound

We start with the lower bound, which serves as a target for the upper bound to follow. LetEbe an arbitrary set ofk-armed stochastic bandit environments and forν∈Edefine

i∗(ν) = argmaxi∈[k]μi(ν) and Ealt(ν) ={ν′∈E:i∗(ν′)∩i∗(ν) =∅}, which is the set of bandits inEwith different optimal arms thanν.

Theorem33.5. Assume that(π,τ)is sound forEat confidence levelδ∈(0,1) and letν∈E. ThenEνπ[τ]≥c∗(ν) log

( 4

δ

)

, where

c∗(ν)−^1 = sup α∈Pk− 1

(

inf ν′∈Ealt(ν)

(k ∑

i=1

αiD(νi,νi′)

))

(33.3)

withc∗(ν) =∞whenc∗(ν)−^1 = 0.

Proof The result is trivial whenEνπ[τ] =∞. For the remainder assume that Eνπ[τ]<∞, which implies thatPνπ(τ=∞) = 0. Next letν′∈Ealt(ν) and define eventE={τ <∞andAτ+1∈/i∗(ν′)}∈Fτ. Then, 2 δ≥Pνπ(τ <∞andAτ+1∈/i∗(ν)) +Pν′π(τ <∞andAτ+1∈/i∗(ν′)) ≥Pνπ(Ec) +Pν′π(E)

≥^1 2

exp

(

−

∑k

i=1

Eνπ[Ti(τ)] D(νi,ν′i)

)

, (33.4)

where the first inequality follows from the definition of soundness and the last

Bandit Algorithms

( 4

)

(

))

(33.3)

(

−

)

, (33.4)

Get our desktop app

Company

Features

Documentation

Resources