Bandit Algorithms

38.7 Proof of lower bound 495

One could almost claim victory here and not bother with the proof. As usual, however, there are some technical difficulties, which in this case arise because the number of visits to the decision states◦is a random quantity. For this reason we give the proof, leaving as exercises the parts that are both obvious and annoying.

Proof of Theorem 38.7 The proof follows the path suggested in Exercise 15.2.
We break things up into two steps. Throughout we fix an arbitrary policyπ.

Step 1: Notation and facts about the MDP Letdbe the depth of the tree in the MDP construction,Lthe number of leaves andk=LA. Define the set of state/action pairs for which the state is a leaf of the tree by

{(s,a) :a∈Aandsis a leaf of the tree}.

By definition haskelements. LetM 0 be the MDP withε(s,a) = 0 for all (s,a)L.
Then letMjbe the MDP withε(s,a) = ∆ for thejth state-action pair in the
above set. Define stopping timeτby

τ=n∧min

{

t:

∑t

u=1

I{Su=s◦}≥ n D

− 1

}

,

which is the first round when the number of visits to states◦is at leastn/D−1,
ornifs◦is visited fewer times thann/D. Next letTjbe the number of visits to
state-action pairj∈[k] until stopping timeτandTσ=

∑k j=1Tj. For 0≤j≤k, letPj be the law ofT 1 ,...,Tkinduced by the interaction ofπandMj. And Ej[·] be the expectation with respect toPj. None of the following claims are surprising, but they are all tiresome to prove to some extent. The claims are listed in increasing order of difficulty and left to the reader in Exercise 38.25.

Claim38.9.For al lj∈[k]the diameter is bounded byD(Mj)≤D.

Claim38.10.There exist universal constants 0 < c 1 < c 2 <∞such that

DE 0 [Tσ]/n∈[c 1 ,c 2 ].

Claim38.11.LetRnj be the expected regret of policyπin MDPMjovern
rounds. There exists a universal constantc 3 > 0 such that

Rnj≥c 3 ∆DEj[Tσ−Tj].

Step 2: Bounding the regret Notice thatM 0 andMj only differ when state-action pair j is visited. In Exercise 38.30 you are invited to use this fact and the chain rule for relative entropy given in Exercise 14.12 to prove that

D(P 0 ,Pj) =E 0 [Tj]d(1/ 2 , 1 /2 + ∆), (38.22)

Bandit Algorithms

{

− 1

}

,

Get our desktop app

Company

Features

Documentation

Resources