Bandit Algorithms

17.1 Stochastic bandits 206

and Pinsker’s inequality (Theorem 14.2) and the divergence decomposition
(Lemma 15.1) we have

P

(

R ̄n≥∆n 2

)

+P′

(

R ̄′n≥∆n 2

)

≥P

(

Ti(n)≥

n 2

)

+P′

(

Ti(n)<

n 2

)

≥

1

2

exp (−D(P,P′))≥

1

2

exp

(

− 2 B∆

√

n k− 1

)

≥ 2 δ ,

where the last line follows by choosing

∆ = min

{

1

2 ,

1

2 B

√

k− 1 n log

(

1

4 δ

)}

.

The result follows since max{a,b}≥(a+b)/2.

Corollary17.2. Letn≥ 1 andk≥ 2. Then for any policyπandδ∈(0,1)
such that

nδ≤

√

n(k−1) log

(

1

4 δ

)

(17.6)

there exists a bandit problemν∈Eksuch that

P

(

R ̄n(π,ν)≥^1 4

min

{

n,

√

n(k−1) 2

log

(

1

4 δ

)})

≥δ. (17.7)

Proof We prove the result by contradiction. Assume that the conclusion does not hold forπand letδ∈(0,1) satisfy(17.6). Then for any bandit problem ν∈Ekthe expected regret ofπis bounded by

Rn(π,ν)≤nδ+

√

n(k−1) 2

log

(

1

4 δ

)

≤

√

2 n(k−1) log

(

1

4 δ

)

.

Thereforeπsatisfies the conditions of Theorem 17.1 withB=

√

2 log(1/(4δ)),
which implies that there exists some bandit problemν∈ Eksuch that(17.7)
holds, contradicting our assumption.

Corollary17.3.Letk≥ 2 andp∈(0,1)andB > 0. Then there does not exist
a policyπsuch that for al ln≥ 1 ,δ∈(0,1)andν∈Ek,

P

(

R ̄n(π,ν)≥B√(k−1)nlogp

(

1

δ

))

< δ

Proof We proceed by contradiction. Suppose that such a policy exists. Choosing δsufficiently small andnsufficiently large ensures that 1 B

log

(

1

4 δ

)

≥Blogp

(

1

δ

)

and

1

B

√

n(k−1) log

(

1

4 δ

)

≤n.

Bandit Algorithms

P

(

)

+P′

(

)

≥P

(

)

+P′

(

)

≥

1

2

1

2

(

− 2 B∆

√

)

{

1

2 ,

1

2 B

√

(

1

)}

.

√

(

1

)

(17.6)

P

(

{

√

(

1

)})

√

(

1

)

≤

√

(

1

)

.

√

P

(

(

1

))

(

1

)

(

1

)

1

B

√

(

1

)

Get our desktop app

Company

Features

Documentation

Resources