Bandit Algorithms

(Jeff_L) #1
38.6 Proof of upper bound 492

Rearranging and substituting yields

R ̃k=


t∈Ek

(−vk(St) +〈Pk,At(St),vk〉)

=



t∈Ek

(−vk(St) +〈PAt(St),vk〉) +


t∈Ek

〈Pk,At(St)−PAt(St),vk〉



t∈Ek

(−vk(St) +〈PAt(St),vk〉)
︸ ︷︷ ︸
(A)

+


D


2



t∈Ek

‖Pk,At(St)−PAt(St)‖ 1
︸ ︷︷ ︸
(B)

,(38.20)


where the inequality follows from H ̈older’s inequality and Eq. (38.19). Let
Et[·] denote the conditional expectation with respect toP conditioned on
σ(S 1 ,A 1 ,...,St− 1 ,At− 1 ,St). To bound (A) we reorder the terms and use the
fact that span(vk)≤Don the eventFc. We get


(A) =


t∈Ek

(vk(St+1)−vk(St) +〈PAt(St),vk〉−vk(St+1))

=vk(Sτk+1)−vk(Sτk) +


t∈Ek

(〈PAt(St),vk〉−vk(St+1))

≤D+



t∈Ek

(Et[vk(St+1)]−vk(St+1)),

where the second equality used thatmaxEk=τk+1−1 andminEk=τk. We
leave this here for now and move on to term (B) in Eq. (38.20). The definition of
the confidence intervals and the assumption thatFdoes not occur shows that


(B)≤D



LS


2



(s,a)∈S×A

T(k)(s,a)

1 ∨Tτk− 1 (s,a)

.


Combining the bounds (A) and (B) yields

R ̃k≤D+


t∈Ek

(Et[vk(St+1)]−vk(St+1)) +D


LS


2



(s,a)∈S×A

T(k)(s,a)

1 ∨Tτk− 1 (s,a)

.


Step 3: Bounding the number of phases and summing
LetKtbe the phase in roundtso thatt∈EKt. By the work in the previous two
steps, ifFdoes not occur then

Rˆn≤

∑K


k=1

R ̃k≤KD+

∑n

t=1

(Et[vKt(St+1)]−vKt(St+1))

+


D



LS


2



(s,a)∈S×A

∑K


k=1

T(k)(s,a)

1 ∨Tτk− 1 (s,a)

.

Free download pdf