Bandit Algorithms

(Jeff_L) #1
15.5 Exercises 195

whereC >0 is a universal constant. Prove that the dependence on the sum
cannot be eliminated.


Hint You will have to use thatTi(t) is an integer for allt.

15.6(Lower bound for Explore-Then-Commit) LetETCnmbe the Explore-
Then-Commit policy with inputsnandmrespectively (Algorithm 1). Prove that
for allmthere exists aμ∈[0,1]ksuch that


Rn(ETCnm,νμ)≥cmin

{


n, n^2 /^3 k^1 /^3

}


,


wherec >0 is a universal constant.


15.7(Stopping time version of divergence decomposition) Consider
the setting of Lemma 15.1 and letFt=σ(A 1 ,X 1 ,...,At,Xt) andτbe an (Ft)-
measurable stopping time. Then for any random elementXthat isFτ-measurable,


D(PνX,Pν′X)≤

∑k

i=1

Eν[Ti(τ)] D(Pi,Pi′),

wherePνXandPν′Xare the lawsXunderνandν′respectively.


15.8(Divergence decomposition for more general action spaces) The
purpose of this exercise is to show that the divergence decomposition lemma
(Lemma 15.1) continues to hold for more general action spaces (A,G). Starting
from the setup of Section 4.7, letPν=PνπandPν′=Pν′πbe the measures on the
canonical bandit model induced by the interconnection ofπandν(respectively,
πandν′).


(a) Prove that


D(Pν,Pν′) =


A

D(Pa,Pa′)dGν(a) (15.5)

whereGνis a measure on (A,G) defined byGν(B) =Eν[

∑n
t=1I{At∈B}].
(b) Prove that


D(Pν,Pν′) =E

[n

t=1

D(PAt,PA′t)

]


.


Hint Use an appropriately adjusted form of the chain rule for relative entropy
from Exercise 14.11.
Free download pdf