Bandit Algorithms

(Jeff_L) #1
17.1 Stochastic bandits 206

and Pinsker’s inequality (Theorem 14.2) and the divergence decomposition
(Lemma 15.1) we have


P


(


R ̄n≥∆n
2

)


+P′


(


R ̄′n≥∆n
2

)


≥P


(


Ti(n)≥

n
2

)


+P′


(


Ti(n)<

n
2

)



1


2


exp (−D(P,P′))≥

1


2


exp

(


− 2 B∆



n
k− 1

)


≥ 2 δ ,

where the last line follows by choosing


∆ = min

{


1


2 ,


1


2 B



k− 1
n log

(


1


4 δ

)}


.


The result follows since max{a,b}≥(a+b)/2.


Corollary17.2. Letn≥ 1 andk≥ 2. Then for any policyπandδ∈(0,1)
such that


nδ≤


n(k−1) log

(


1


4 δ

)


(17.6)


there exists a bandit problemν∈Eksuch that


P


(


R ̄n(π,ν)≥^1
4

min

{


n,


n(k−1)
2

log

(


1


4 δ

)})


≥δ. (17.7)

Proof We prove the result by contradiction. Assume that the conclusion does
not hold forπand letδ∈(0,1) satisfy(17.6). Then for any bandit problem
ν∈Ekthe expected regret ofπis bounded by

Rn(π,ν)≤nδ+


n(k−1)
2

log

(


1


4 δ

)




2 n(k−1) log

(


1


4 δ

)


.


Thereforeπsatisfies the conditions of Theorem 17.1 withB=



2 log(1/(4δ)),
which implies that there exists some bandit problemν∈ Eksuch that(17.7)
holds, contradicting our assumption.


Corollary17.3.Letk≥ 2 andp∈(0,1)andB > 0. Then there does not exist
a policyπsuch that for al ln≥ 1 ,δ∈(0,1)andν∈Ek,


P


(


R ̄n(π,ν)≥B√(k−1)nlogp

(


1


δ

))


< δ

Proof We proceed by contradiction. Suppose that such a policy exists. Choosing
δsufficiently small andnsufficiently large ensures that
1
B

log

(


1


4 δ

)


≥Blogp

(


1


δ

)


and

1


B



n(k−1) log

(


1


4 δ

)


≤n.
Free download pdf