Bandit Algorithms


M                                         P           dinf(P, μ∗, M)

{N(μ, σ²) : μ ∈ R}                        N(μ, σ²)    (μ − μ∗)² / (2σ²)

{N(μ, σ²) : μ ∈ R, σ² ∈ (0, ∞)}           N(μ, σ²)    (1/2) log(1 + (μ − μ∗)² / (2σ²))

{B(μ) : μ ∈ [0, 1]}                       B(μ)        μ log(μ/μ∗) + (1 − μ) log((1 − μ)/(1 − μ∗))

{U(a, b) : a, b ∈ R}                      U(a, b)     log(1 + 2((a + b)/2 − μ∗)² / (b − a))

Table 16.1: Expressions for dinf for different parametric families when the mean of P is less
than μ∗.
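The closed forms in Table 16.1 are easy to evaluate directly. As an illustration (a minimal sketch; the function names are mine, not from the text), the Bernoulli row is the binary relative entropy and the first Gaussian row is a scaled squared distance:

```python
import math

def dinf_bernoulli(mu: float, mu_star: float) -> float:
    # Bernoulli row of Table 16.1: KL(B(mu), B(mu_star)),
    # intended for 0 < mu_star < 1 and mu < mu_star.
    if mu == 0.0:
        return math.log(1.0 / (1.0 - mu_star))
    return (mu * math.log(mu / mu_star)
            + (1.0 - mu) * math.log((1.0 - mu) / (1.0 - mu_star)))

def dinf_gaussian_known_variance(mu: float, mu_star: float, sigma2: float) -> float:
    # First row of Table 16.1: (mu - mu_star)^2 / (2 sigma^2).
    return (mu - mu_star) ** 2 / (2.0 * sigma2)
```

The Gaussian row makes explicit the quadratic dependence of dinf on the gap μ∗ − μ.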

16.2 Finite-time bounds


By making a finite-time analogue of consistency it is possible to prove a finite-time
instance-dependent bound. First, a lemma that summarizes what can be obtained
by chaining the high-probability Pinsker inequality (Theorem 14.2) with the
divergence decomposition lemma (Lemma 15.1).

Lemma 16.3. Let ν = (Pi) and ν′ = (Pi′) be k-action stochastic bandits that
differ only in the distribution of the reward for action i ∈ [k]. Assume that i is
suboptimal in ν and uniquely optimal in ν′. Let λ = μi(ν′) − μi(ν). Then for any
policy π,

Eνπ[Ti(n)] ≥ ( log( min{λ − ∆i(ν), ∆i(ν)} / 4 ) + log(n) − log( Rn(ν) + Rn(ν′) ) ) / D(Pi, Pi′).   (16.4)
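To get a feel for (16.4), one can plug in numbers. A hedged sketch (the helper name is my own; unit-variance Gaussian rewards are assumed, for which D(Pi, Pi′) = λ²/2):

```python
import math

def play_count_lower_bound(n, gap, lam, regret_nu, regret_nu_prime, kl):
    # Right-hand side of (16.4): a lower bound on E[T_i(n)], the expected
    # number of plays of the suboptimal arm i, intended for lam > gap > 0.
    return (math.log(min(lam - gap, gap) / 4.0)
            + math.log(n)
            - math.log(regret_nu + regret_nu_prime)) / kl

# Unit-variance Gaussians: D(P_i, P_i') = lam^2 / 2.
lam, gap = 2.0, 1.0
bound = play_count_lower_bound(n=10**6, gap=gap, lam=lam,
                               regret_nu=100.0, regret_nu_prime=100.0,
                               kl=lam ** 2 / 2.0)
```

Note how the bound grows with log(n) but shrinks as the regret of π on either instance grows: the smaller the regret a policy achieves on ν and ν′, the more often it is forced to play arm i in ν.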


The lemma holds for finite n and any ν, and can be used to derive finite-
time instance-dependent lower bounds for any environment class E that is rich
enough. The following result provides a finite-time instance-dependent bound
for Gaussian bandits where the asymptotic notion of consistency is replaced by
an assumption that the minimax regret is not too large. This assumption alone
is enough to show that no policy that is remotely close to minimax optimal can
be much better than UCB on any instance.

Theorem 16.4. Let ν ∈ EkN be a k-armed Gaussian bandit with mean vector
μ ∈ Rk and suboptimality gaps ∆ ∈ [0, ∞)k. Let

E(ν) = {ν′ ∈ EkN : μi(ν′) ∈ [μi, μi + 2∆i]}.

Suppose C > 0 and p ∈ (0, 1) are constants and π is a policy such that