38.7 Proof of lower bound 495
One could almost claim victory here and not bother with the proof. As usual,
however, there are some technical difficulties, which in this case arise because the
number of visits to the decision states◦is a random quantity. For this reason we
give the proof, leaving as exercises the parts that are both obvious and annoying.
Proof of Theorem 38.7 The proof follows the path suggested in Exercise 15.2.
We break things up into two steps. Throughout we fix an arbitrary policyπ.
Step 1: Notation and facts about the MDP
Letdbe the depth of the tree in the MDP construction,Lthe number of leaves
andk=LA. Define the set of state/action pairs for which the state is a leaf of
the tree by
{(s,a) :a∈Aandsis a leaf of the tree}.
By definition haskelements. LetM 0 be the MDP withε(s,a) = 0 for all (s,a)L.
Then letMjbe the MDP withε(s,a) = ∆ for thejth state-action pair in the
above set. Define stopping timeτby
τ=n∧min
{
t:
∑t
u=1
I{Su=s◦}≥ n
D
− 1
}
,
which is the first round when the number of visits to states◦is at leastn/D−1,
ornifs◦is visited fewer times thann/D. Next letTjbe the number of visits to
state-action pairj∈[k] until stopping timeτandTσ=
∑k
j=1Tj. For 0≤j≤k,
letPj be the law ofT 1 ,...,Tkinduced by the interaction ofπandMj. And
Ej[·] be the expectation with respect toPj. None of the following claims are
surprising, but they are all tiresome to prove to some extent. The claims are
listed in increasing order of difficulty and left to the reader in Exercise 38.25.
Claim38.9.For al lj∈[k]the diameter is bounded byD(Mj)≤D.
Claim38.10.There exist universal constants 0 < c 1 < c 2 <∞such that
DE 0 [Tσ]/n∈[c 1 ,c 2 ].
Claim38.11.LetRnj be the expected regret of policyπin MDPMjovern
rounds. There exists a universal constantc 3 > 0 such that
Rnj≥c 3 ∆DEj[Tσ−Tj].
Step 2: Bounding the regret
Notice thatM 0 andMj only differ when state-action pair j is visited. In
Exercise 38.30 you are invited to use this fact and the chain rule for relative
entropy given in Exercise 14.12 to prove that
D(P 0 ,Pj) =E 0 [Tj]d(1/ 2 , 1 /2 + ∆), (38.22)