38.6 Proof of upper bound 492
Rearranging and substituting yields
R ̃k=
∑
t∈Ek
(−vk(St) +〈Pk,At(St),vk〉)
=
∑
t∈Ek
(−vk(St) +〈PAt(St),vk〉) +
∑
t∈Ek
〈Pk,At(St)−PAt(St),vk〉
≤
∑
t∈Ek
(−vk(St) +〈PAt(St),vk〉)
︸ ︷︷ ︸
(A)
+
D
2
∑
t∈Ek
‖Pk,At(St)−PAt(St)‖ 1
︸ ︷︷ ︸
(B)
,(38.20)
where the inequality follows from H ̈older’s inequality and Eq. (38.19). Let
Et[·] denote the conditional expectation with respect toP conditioned on
σ(S 1 ,A 1 ,...,St− 1 ,At− 1 ,St). To bound (A) we reorder the terms and use the
fact that span(vk)≤Don the eventFc. We get
(A) =
∑
t∈Ek
(vk(St+1)−vk(St) +〈PAt(St),vk〉−vk(St+1))
=vk(Sτk+1)−vk(Sτk) +
∑
t∈Ek
(〈PAt(St),vk〉−vk(St+1))
≤D+
∑
t∈Ek
(Et[vk(St+1)]−vk(St+1)),
where the second equality used thatmaxEk=τk+1−1 andminEk=τk. We
leave this here for now and move on to term (B) in Eq. (38.20). The definition of
the confidence intervals and the assumption thatFdoes not occur shows that
(B)≤D
√
LS
2
∑
(s,a)∈S×A
T(k)(s,a)
√
1 ∨Tτk− 1 (s,a)
.
Combining the bounds (A) and (B) yields
R ̃k≤D+
∑
t∈Ek
(Et[vk(St+1)]−vk(St+1)) +D
√
LS
2
∑
(s,a)∈S×A
T(k)(s,a)
√
1 ∨Tτk− 1 (s,a)
.
Step 3: Bounding the number of phases and summing
LetKtbe the phase in roundtso thatt∈EKt. By the work in the previous two
steps, ifFdoes not occur then
Rˆn≤
∑K
k=1
R ̃k≤KD+
∑n
t=1
(Et[vKt(St+1)]−vKt(St+1))
+
D
√
LS
2
∑
(s,a)∈S×A
∑K
k=1
T(k)(s,a)
√
1 ∨Tτk− 1 (s,a)