Bandit Algorithms


M                                         P           dinf(P, μ∗, M)

{N(μ, σ²) : μ ∈ R}                        N(μ, σ²)    (μ − μ∗)² / (2σ²)

{N(μ, σ²) : μ ∈ R, σ² ∈ (0, ∞)}           N(μ, σ²)    (1/2) log(1 + (μ − μ∗)² / (2σ²))

{B(μ) : μ ∈ [0, 1]}                       B(μ)        μ log(μ/μ∗) + (1 − μ) log((1 − μ)/(1 − μ∗))

{U(a, b) : a, b ∈ R}                      U(a, b)     log(1 + 2((a + b)/2 − μ∗)² / (b − a))

Table 16.1: Expressions for dinf for different parametric families when the mean of P is less
than μ∗.
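The closed forms in Table 16.1 are easy to evaluate directly. As an illustration (a minimal sketch; the function names are mine, not from the text), the Bernoulli row is the binary relative entropy and the first Gaussian row is a scaled squared distance:

```python
import math

def dinf_bernoulli(mu: float, mu_star: float) -> float:
    # Bernoulli row of Table 16.1: KL(B(mu), B(mu_star)),
    # intended for 0 < mu_star < 1 and mu < mu_star.
    if mu == 0.0:
        return math.log(1.0 / (1.0 - mu_star))
    return (mu * math.log(mu / mu_star)
            + (1.0 - mu) * math.log((1.0 - mu) / (1.0 - mu_star)))

def dinf_gaussian_known_variance(mu: float, mu_star: float, sigma2: float) -> float:
    # First row of Table 16.1: (mu - mu_star)^2 / (2 sigma^2).
    return (mu - mu_star) ** 2 / (2.0 * sigma2)
```

The Gaussian row makes explicit the quadratic dependence of dinf on the gap μ∗ − μ.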

16.2 Finite-time bounds


By making a finite-time analogue of consistency it is possible to prove a finite-time
instance-dependent bound. First, a lemma that summarizes what can be obtained
by chaining the high-probability Pinsker inequality (Theorem 14.2) with the
divergence decomposition lemma (Lemma 15.1).

Lemma 16.3. Let ν = (Pi) and ν′ = (Pi′) be k-action stochastic bandits that
differ only in the distribution of the reward for action i ∈ [k]. Assume that i is
suboptimal in ν and uniquely optimal in ν′. Let λ = μi(ν′) − μi(ν). Then for any
policy π,

Eνπ[Ti(n)] ≥ ( log( min{λ − ∆i(ν), ∆i(ν)} / 4 ) + log(n) − log( Rn(ν) + Rn(ν′) ) ) / D(Pi, Pi′).   (16.4)
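To get a feel for (16.4), one can plug in numbers. A hedged sketch (the helper name is my own; unit-variance Gaussian rewards are assumed, for which D(Pi, Pi′) = λ²/2):

```python
import math

def play_count_lower_bound(n, gap, lam, regret_nu, regret_nu_prime, kl):
    # Right-hand side of (16.4): a lower bound on E[T_i(n)], the expected
    # number of plays of the suboptimal arm i, intended for lam > gap > 0.
    return (math.log(min(lam - gap, gap) / 4.0)
            + math.log(n)
            - math.log(regret_nu + regret_nu_prime)) / kl

# Unit-variance Gaussians: D(P_i, P_i') = lam^2 / 2.
lam, gap = 2.0, 1.0
bound = play_count_lower_bound(n=10**6, gap=gap, lam=lam,
                               regret_nu=100.0, regret_nu_prime=100.0,
                               kl=lam ** 2 / 2.0)
```

Note how the bound grows with log(n) but shrinks as the regret of π on either instance grows: the smaller the regret a policy achieves on ν and ν′, the more often it is forced to play arm i in ν.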


The lemma holds for finite n and any ν, and can be used to derive finite-
time instance-dependent lower bounds for any environment class E that is rich
enough. The following result provides a finite-time instance-dependent bound
for Gaussian bandits where the asymptotic notion of consistency is replaced by
an assumption that the minimax regret is not too large. This assumption alone
is enough to show that no policy that is remotely close to minimax optimal can
be much better than UCB on any instance.

Theorem 16.4. Let ν ∈ EkN be a k-armed Gaussian bandit with mean vector
μ ∈ Rk and suboptimality gaps ∆ ∈ [0, ∞)k. Let

E(ν) = {ν′ ∈ EkN : μi(ν′) ∈ [μi, μi + 2∆i]}.

Suppose C > 0 and p ∈ (0, 1) are constants and π is a policy such that