Bandit Algorithms


On the other hand, if the high-probability bound only holds for a single $\delta$ as in (17.2), then it seems hard to do much better than
$$R_n \leq n\delta + O\left(\sqrt{kn \log\left(\frac{k}{\delta}\right)}\right),$$
which with the best choice of $\delta$ leads to a bound of $O(\sqrt{kn \log(n)})$.
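To see where the final bound comes from, a quick calculation (a sketch, assuming $k \leq n$): with the choice $\delta = 1/n$ the first term is constant, and
$$n\delta + \sqrt{kn \log\left(\frac{k}{\delta}\right)} = 1 + \sqrt{kn \log(kn)} = O\left(\sqrt{kn \log(n)}\right),$$
using $\log(kn) = \log(k) + \log(n) \leq 2\log(n)$.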

17.1 Stochastic bandits


For simplicity we start with the stochastic setting before explaining how to convert the arguments to the adversarial model. There is no randomness in the expected regret, so in order to derive a high-probability bound we define the random pseudo-regret by

$$\bar{R}_n = \sum_{i=1}^{k} T_i(n) \Delta_i,$$

which is a random variable through the pull counts $T_i(n)$.
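As a concrete illustration (not from the text; the function names and the simplified policy interface here are my own), a minimal Python sketch of how the random pseudo-regret is computed from the realized pull counts in a simulated unit-variance Gaussian bandit:

import numpy as np

def random_pseudo_regret(mu, policy, n, rng):
    """Play `policy` on a unit-variance Gaussian bandit with means `mu`
    for n rounds; return the random pseudo-regret sum_i T_i(n) * Delta_i.
    `policy(t, pulls) -> arm` is a simplified, hypothetical interface."""
    mu = np.asarray(mu, dtype=float)
    gaps = mu.max() - mu                  # suboptimality gaps Delta_i
    pulls = np.zeros(len(mu), dtype=int)  # pull counts T_i(t)
    for t in range(n):
        arm = policy(t, pulls)
        _reward = rng.normal(mu[arm], 1.0)  # drawn but unused by this toy policy
        pulls[arm] += 1
    return float(gaps @ pulls)            # \bar{R}_n: random only through T_i(n)

# Example: a policy picking arms uniformly at random. The expected regret
# is deterministic, but \bar{R}_n varies from run to run via the pull counts.
rng = np.random.default_rng(0)
uniform = lambda t, pulls: int(rng.integers(len(pulls)))
print(random_pseudo_regret([0.5, 0.0, 0.0], uniform, n=1000, rng=rng))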

For all results in this section we let $\mathcal{E}_k \subset \mathcal{E}_N^k$ denote the set of $k$-armed Gaussian bandits with suboptimality gaps bounded by one. For $\mu \in [0,1]^k$ we let $\nu_\mu \in \mathcal{E}_k$ be the Gaussian bandit with means $\mu$.

Theorem 17.1. Let $n \geq 1$ and $k \geq 2$ and $B > 0$ and $\pi$ be a policy such that for any $\nu \in \mathcal{E}_k$,
$$R_n(\pi,\nu) \leq B\sqrt{(k-1)n}. \tag{17.4}$$

Let $\delta \in (0,1)$. Then there exists a bandit $\nu \in \mathcal{E}_k$ such that
$$\mathbb{P}\left(\bar{R}_n(\pi,\nu) \geq \frac{1}{4}\min\left\{n,\; \frac{1}{B}\sqrt{(k-1)n}\,\log\left(\frac{1}{4\delta}\right)\right\}\right) \geq \delta.$$
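As a sanity check on the theorem's shape (my reading, not a claim from the text): for a policy achieving (17.4) with a constant $B$, taking $\delta = 1/n$ gives, with probability at least $1/n$,
$$\bar{R}_n \geq \frac{1}{4}\min\left\{n,\; \frac{1}{B}\sqrt{(k-1)n}\,\log\left(\frac{n}{4}\right)\right\} = \Omega\left(\sqrt{(k-1)n}\,\log(n)\right)$$
once $n$ is large enough relative to $k$, matching the $\log(n)$ factor in the single-$\delta$ upper bound above.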

Proof Let $\Delta \in (0, 1/2]$ be a constant to be tuned subsequently and $\nu = \nu_\mu$ where the mean vector $\mu \in \mathbb{R}^k$ is defined by $\mu_1 = \Delta$ and $\mu_i = 0$ for $i > 1$. Abbreviate $R_n = R_n(\pi,\nu)$ and $\mathbb{P} = \mathbb{P}_{\nu\pi}$ and $\mathbb{E} = \mathbb{E}_{\nu\pi}$. Let $i = \operatorname{argmin}_{i > 1} \mathbb{E}[T_i(n)]$. Then by Lemma 4.5 and the assumption in Eq. (17.4),
$$\mathbb{E}[T_i(n)] \leq \frac{R_n}{\Delta(k-1)} \leq \frac{B}{\Delta}\sqrt{\frac{n}{k-1}}. \tag{17.5}$$
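To unpack the first inequality in (17.5): Lemma 4.5 (the regret decomposition) and the choice of $i$ as the least-pulled suboptimal arm give
$$R_n = \sum_{j=1}^{k} \Delta_j \mathbb{E}[T_j(n)] = \Delta \sum_{j > 1} \mathbb{E}[T_j(n)] \geq \Delta(k-1)\,\mathbb{E}[T_i(n)],$$
and the second inequality follows by substituting the assumption $R_n \leq B\sqrt{(k-1)n}$.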


Define the alternative bandit $\nu' = \nu_{\mu'}$ where $\mu' \in \mathbb{R}^k$ is equal to $\mu$ except $\mu'_i = \mu_i + 2\Delta$. Abbreviate $\mathbb{P}' = \mathbb{P}_{\nu'\pi}$ and $\bar{R}_n = \bar{R}_n(\pi,\nu)$ and $\bar{R}'_n = \bar{R}_n(\pi,\nu')$. By Lemma 4.5