38.6 Proof of upper bound 492Rearranging and substituting yieldsR ̃k=∑
t∈Ek(−vk(St) +〈Pk,At(St),vk〉)=
∑
t∈Ek(−vk(St) +〈PAt(St),vk〉) +∑
t∈Ek〈Pk,At(St)−PAt(St),vk〉≤
∑
t∈Ek(−vk(St) +〈PAt(St),vk〉)
︸ ︷︷ ︸
(A)+
D
2
∑
t∈Ek‖Pk,At(St)−PAt(St)‖ 1
︸ ︷︷ ︸
(B),(38.20)
where the inequality follows from H ̈older’s inequality and Eq. (38.19). Let
Et[·] denote the conditional expectation with respect toP conditioned on
σ(S 1 ,A 1 ,...,St− 1 ,At− 1 ,St). To bound (A) we reorder the terms and use the
fact that span(vk)≤Don the eventFc. We get
(A) =∑
t∈Ek(vk(St+1)−vk(St) +〈PAt(St),vk〉−vk(St+1))=vk(Sτk+1)−vk(Sτk) +∑
t∈Ek(〈PAt(St),vk〉−vk(St+1))≤D+
∑
t∈Ek(Et[vk(St+1)]−vk(St+1)),where the second equality used thatmaxEk=τk+1−1 andminEk=τk. We
leave this here for now and move on to term (B) in Eq. (38.20). The definition of
the confidence intervals and the assumption thatFdoes not occur shows that
(B)≤D
√
LS
2
∑
(s,a)∈S×AT(k)(s,a)
√
1 ∨Tτk− 1 (s,a).
Combining the bounds (A) and (B) yieldsR ̃k≤D+∑
t∈Ek(Et[vk(St+1)]−vk(St+1)) +D√
LS
2
∑
(s,a)∈S×AT(k)(s,a)
√
1 ∨Tτk− 1 (s,a).
Step 3: Bounding the number of phases and summing
LetKtbe the phase in roundtso thatt∈EKt. By the work in the previous two
steps, ifFdoes not occur thenRˆn≤∑K
k=1R ̃k≤KD+∑nt=1(Et[vKt(St+1)]−vKt(St+1))+
D
√
LS
2
∑
(s,a)∈S×A∑K
k=1T(k)(s,a)
√
1 ∨Tτk− 1 (s,a)