38.6 Proof of upper bound
Step 1: Failure events and decomposition
Let $K$ be the (random) number of phases and for $k \in [K]$ let $E_k = \{\tau_k, \tau_k + 1, \ldots, \tau_{k+1} - 1\}$ be the set of rounds in the $k$th phase, where $\tau_{K+1}$ is defined to be $n+1$. Let $T^{(k)}(s,a)$ be the number of times the state-action pair $(s,a)$ is visited in the $k$th phase:
\[
T^{(k)}(s,a) = \sum_{t \in E_k} \mathbb{I}\{S_t = s, A_t = a\}\,.
\]
Define $F$ as the failure event that $P \notin \mathcal{C}_{\tau_k}$ for some $k \in [K]$.
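As a concrete illustration of the visit counts $T^{(k)}(s,a)$, the following sketch tallies state-action visits phase by phase. All names here (`phase_counts`, `tau`, `S`, `A`) are illustrative choices, not notation fixed by the text; phases are encoded by their start rounds $\tau_1, \ldots, \tau_K$ with $\tau_{K+1} = n+1$ appended.

```python
import numpy as np

def phase_counts(S, A, tau, n_states, n_actions):
    """Return one count matrix per phase:
    counts[k][s, a] = number of rounds t in E_k with S_t = s, A_t = a.
    tau holds the phase start rounds with tau[-1] = n + 1 appended."""
    counts = []
    for k in range(len(tau) - 1):
        T_k = np.zeros((n_states, n_actions), dtype=int)
        for t in range(tau[k], tau[k + 1]):  # rounds in phase E_k
            T_k[S[t], A[t]] += 1             # indicator I{S_t = s, A_t = a}
        counts.append(T_k)
    return counts
```

Summing `counts[k]` over all phases recovers the total visit counts up to round $n$.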
Lemma 38.8. $\mathbb{P}(F) \le \delta/2$.
The proof is based on a concentration inequality derived for categorical distributions and is left for Exercise 38.21. When $F$ does not hold, the true transition kernel is in $\mathcal{C}_{\tau_k}$ for all $k$, which means that $\rho^* \le \rho_k$ and
\[
\hat{R}_n = \sum_{t=1}^n (\rho^* - r_{A_t}(S_t)) \le \sum_{k=1}^K \underbrace{\sum_{t \in E_k} (\rho_k - r_{A_t}(S_t))}_{\tilde{R}_k}\,.
\]
In the next step we bound $\tilde{R}_k$ under the assumption that $F$ does not hold.
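The decomposition above can be sanity-checked numerically: whenever each optimistic gain $\rho_k$ is at least $\rho^*$, summing the per-phase terms $\tilde{R}_k$ upper bounds $\hat{R}_n$. The function and all numbers below are made up for illustration only.

```python
import numpy as np

def phase_regret_bound(rewards, tau, rho_star, rho):
    """rewards[t] = r_{A_t}(S_t); tau = phase start rounds with tau[-1] = n;
    rho[k] = optimistic gain in phase k.
    Returns (hat_R_n, sum over k of tilde_R_k)."""
    rewards = np.asarray(rewards, dtype=float)
    hat_R = np.sum(rho_star - rewards)  # hat{R}_n = sum_t (rho* - r_{A_t}(S_t))
    tilde = [np.sum(rho[k] - rewards[tau[k]:tau[k + 1]])  # tilde{R}_k
             for k in range(len(tau) - 1)]
    return hat_R, sum(tilde)
```

With $\rho_k \ge \rho^*$ for every phase, the first return value never exceeds the second, mirroring the displayed inequality.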
Step 2: Bounding the regret in each phase
Assume that $F$ does not occur and fix $k \in [K]$. Recall that $v_k$ is a value function satisfying the Bellman optimality equation in the optimistic MDP $M_k$ and $\rho_k$ is its gain. Hence
\[
\rho_k = r_{\pi_k(s)}(s) - v_k(s) + \langle P_{k,\pi_k(s)}(s), v_k \rangle \qquad \text{for all } s \in \mathcal{S}\,. \tag{38.18}
\]
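For a fixed policy, Eq. (38.18) is the average-reward evaluation identity $\rho = r(s) - v(s) + \langle P(s), v \rangle$, which can be verified on a toy chain. The two-state kernel, rewards, and the normalization $v(0) = 0$ below are all arbitrary choices for illustration; in the two-state case the bias solves in closed form.

```python
import numpy as np

# Made-up 2-state MDP under a fixed policy.
P = np.array([[0.7, 0.3],   # P(s' | s = 0)
              [0.4, 0.6]])  # P(s' | s = 1)
r = np.array([1.0, 0.0])    # per-state reward

# Solve rho + v = r + P v with the normalization v[0] = 0.
# Closed form for two states: v[1] = (r[1] - r[0]) / (P[0,1] + P[1,0]).
v1 = (r[1] - r[0]) / (P[0, 1] + P[1, 0])
v = np.array([0.0, v1])
rho = r[0] - v[0] + P[0] @ v  # gain, read off from state 0

# The identity rho = r(s) - v(s) + <P(s), v> must hold in every state.
for s in range(2):
    assert np.isclose(rho, r[s] - v[s] + P[s] @ v)
```

Here the computed gain also matches the stationary-distribution average of the rewards, as it should for this chain.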
As noted earlier, solutions to the Bellman optimality equation remain solutions when translated, so we may assume without loss of generality that $v_k$ is such that $\|v_k\|_\infty \le \operatorname{span}(v_k)/2$, which means that
\[
\|v_k\|_\infty \le \frac{1}{2}\operatorname{span}(v_k) \le \frac{D}{2}\,, \tag{38.19}
\]
where the second inequality follows from Lemma 38.3 and the fact that when $F$ does not hold the diameter of the extended MDP $\tilde{M}_k$ is at most $D$ and $v_k$ also satisfies the Bellman optimality equation in this MDP. By the definition of the policy we have $A_t = \pi_k(S_t)$ for $t \in E_k$, which implies that
\[
\rho_k = r_{A_t}(S_t) - v_k(S_t) + \langle P_{k,A_t}(S_t), v_k \rangle \qquad \text{for all } t \in E_k\,.
\]
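The translation argument behind (38.19) is easy to check numerically: shifting $v$ by a constant $c$ leaves $r(s) - v(s) + \langle P(s), v \rangle$ unchanged (the $+c$ inside the inner product cancels against $-v(s)$), so $v$ may be centered to make $\|v\|_\infty = \operatorname{span}(v)/2$. The vectors and numbers below are arbitrary illustrations.

```python
import numpy as np

def center(v):
    """Translate v so that max(v) = -min(v), i.e. ||v||_inf = span(v)/2."""
    v = np.asarray(v, dtype=float)
    return v - (v.max() + v.min()) / 2.0

v = np.array([3.0, -1.0, 5.0])
w = center(v)
span = v.max() - v.min()                      # span(v) = 6
assert np.isclose(np.abs(w).max(), span / 2)  # centering halves the sup norm

# Shifting v does not change r(s) - v(s) + <P(s), v>.
P_row = np.array([0.2, 0.5, 0.3])  # a probability vector P(s)
r_s, s = 1.0, 0
lhs_before = r_s - v[s] + P_row @ v
lhs_after = r_s - w[s] + P_row @ w
assert np.isclose(lhs_before, lhs_after)  # the constant shift cancels
```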