Bandit Algorithms

38.6 Proof of upper bound 492

Rearranging and substituting yields

R ̃k=

∑

t∈Ek

(−vk(St) +〈Pk,At(St),vk〉)

=

∑

t∈Ek

(−vk(St) +〈PAt(St),vk〉) +

∑

t∈Ek

〈Pk,At(St)−PAt(St),vk〉

≤

∑

t∈Ek

(−vk(St) +〈PAt(St),vk〉) ︸︷︷︸ (A)

+

D

2

∑

t∈Ek

‖Pk,At(St)−PAt(St)‖ 1 ︸︷︷︸ (B)

,(38.20)

where the inequality follows from H ̈older’s inequality and Eq. (38.19). Let
Et[·] denote the conditional expectation with respect toP conditioned on
σ(S 1 ,A 1 ,...,St− 1 ,At− 1 ,St). To bound (A) we reorder the terms and use the
fact that span(vk)≤Don the eventFc. We get

(A) =

∑

t∈Ek

(vk(St+1)−vk(St) +〈PAt(St),vk〉−vk(St+1))

=vk(Sτk+1)−vk(Sτk) +

∑

t∈Ek

(〈PAt(St),vk〉−vk(St+1))

≤D+

∑

t∈Ek

(Et[vk(St+1)]−vk(St+1)),

where the second equality used thatmaxEk=τk+1−1 andminEk=τk. We
leave this here for now and move on to term (B) in Eq. (38.20). The definition of
the confidence intervals and the assumption thatFdoes not occur shows that

(B)≤D

√

LS

2

∑

(s,a)∈S×A

T(k)(s,a) √ 1 ∨Tτk− 1 (s,a)

.

Combining the bounds (A) and (B) yields

R ̃k≤D+

∑

t∈Ek

(Et[vk(St+1)]−vk(St+1)) +D

√

LS

2

∑

(s,a)∈S×A

T(k)(s,a) √ 1 ∨Tτk− 1 (s,a)

.

Step 3: Bounding the number of phases and summing LetKtbe the phase in roundtso thatt∈EKt. By the work in the previous two steps, ifFdoes not occur then

Rˆn≤

∑K

k=1

R ̃k≤KD+

∑n

t=1

(Et[vKt(St+1)]−vKt(St+1))

+

D

√

LS

2

∑

(s,a)∈S×A

∑K

k=1

T(k)(s,a) √ 1 ∨Tτk− 1 (s,a)

.

Bandit Algorithms

∑

=

∑

∑

≤

∑

+

D

2

∑

,(38.20)

∑

∑

≤D+

∑

(B)≤D

√

LS

2

∑

.

∑

√

LS

2

∑

.

∑K

+

D

√

LS

2

∑

∑K

.

Get our desktop app

Company

Features

Documentation

Resources