38.6 Proof of upper bound

Step 1: Failure events and decomposition
Let $K$ be the (random) number of phases and for $k \in [K]$ let $E_k = \{\tau_k, \tau_k + 1, \ldots, \tau_{k+1} - 1\}$ be the set of rounds in the $k$th phase, where $\tau_{K+1}$ is defined to be $n + 1$. Let $T^{(k)}(s, a)$ be the number of times the state-action pair $(s, a)$ is visited in the $k$th phase:
$$
T^{(k)}(s, a) = \sum_{t \in E_k} \mathbb{I}\{S_t = s, A_t = a\}.
$$
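For concreteness, the following minimal Python sketch tallies these per-phase visit counts from a trajectory; the function name, argument layout and indexing convention are illustrative assumptions rather than part of the algorithm as stated.

from collections import defaultdict

def phase_visit_counts(states, actions, phase_starts, n):
    """Tally T^(k)(s, a), the number of visits to (s, a) during phase k.

    states, actions : lists holding S_1, ..., S_n and A_1, ..., A_n (0-indexed lists)
    phase_starts    : [tau_1, ..., tau_K], the (1-indexed) first rounds of the phases;
                      tau_{K+1} is taken to be n + 1
    Returns a list with one dict per phase, mapping (s, a) to its visit count.
    """
    boundaries = list(phase_starts) + [n + 1]
    counts = []
    for k in range(len(phase_starts)):
        visits = defaultdict(int)
        # Rounds t in E_k = {tau_k, ..., tau_{k+1} - 1}; shift by one for 0-indexed lists.
        for t in range(boundaries[k], boundaries[k + 1]):
            visits[(states[t - 1], actions[t - 1])] += 1
        counts.append(dict(visits))
    return counts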
Define $F$ as the failure event that $P \notin \mathcal{C}_{\tau_k}$ for some $k \in [K]$.

Lemma 38.8. $\mathbb{P}(F) \le \delta/2$.

The proof is based on a concentration inequality derived for categorical distributions and is left for Exercise 38.21. When $F$ does not hold, the true transition kernel is in $\mathcal{C}_{\tau_k}$ for all $k$, which means that $\rho^* \le \rho_k$ and
$$
\hat{R}_n = \sum_{t=1}^{n} \left(\rho^* - r_{A_t}(S_t)\right) \le \sum_{k=1}^{K} \underbrace{\sum_{t \in E_k} \left(\rho_k - r_{A_t}(S_t)\right)}_{\tilde{R}_k} .
$$
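To see why $\rho^* \le \rho_k$ off the failure event, note that $P \in \mathcal{C}_{\tau_k}$ means the true kernel is among the plausible ones; assuming, as in the standard optimistic construction (not restated in this excerpt), that the optimistic MDP $M_k$ is chosen so that its gain dominates the optimal gain of every kernel in $\mathcal{C}_{\tau_k}$,
$$
\rho_k \ge \sup_{\tilde{P} \in \mathcal{C}_{\tau_k}} \rho^*(\tilde{P}) \ge \rho^*(P) = \rho^* .
$$
Summing $\rho^* - r_{A_t}(S_t) \le \rho_k - r_{A_t}(S_t)$ over the rounds of each phase, which together partition $[n]$, gives the displayed inequality.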

In the next step we bound $\tilde{R}_k$ under the assumption that $F$ does not hold.

Step 2: Bounding the regret in each phase
Assume that $F$ does not occur and fix $k \in [K]$. Recall that $v_k$ is a value function satisfying the Bellman optimality equation in the optimistic MDP $M_k$ and $\rho_k$ is its gain. Hence
$$
\rho_k = r_{\pi_k(s)}(s) - v_k(s) + \langle P_{k, \pi_k(s)}(s), v_k \rangle \quad \text{for all } s \in \mathcal{S}. \qquad (38.18)
$$

As noted earlier, solutions to the Bellman optimality equation remain solutions when translated, so we may assume without loss of generality that $v_k$ is such that $\|v_k\|_\infty \le \operatorname{span}(v_k)/2$, which means that
$$
\|v_k\|_\infty \le \frac{1}{2} \operatorname{span}(v_k) \le \frac{D}{2}, \qquad (38.19)
$$
where the second inequality follows from Lemma 38.3 and the fact that when $F$ does not hold the diameter of the extended MDP $\tilde{M}_k$ is at most $D$ and $v_k$ also satisfies the Bellman optimality equation in this MDP.
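To spell out the centring step (a brief verification added here, with $c$ an auxiliary constant): since $\langle P_a(s), \mathbf{1} \rangle = 1$, replacing $v_k$ by $v_k - c\mathbf{1}$ shifts both sides of the Bellman optimality equation by $c$ and so preserves it, and choosing $c = \tfrac{1}{2}\left(\max_s v_k(s) + \min_s v_k(s)\right)$ gives
$$
\|v_k - c\mathbf{1}\|_\infty = \max\left(\max_s v_k(s) - c,\; c - \min_s v_k(s)\right) = \frac{\max_s v_k(s) - \min_s v_k(s)}{2} = \frac{\operatorname{span}(v_k)}{2}.
$$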
By the definition of the policy we have $A_t = \pi_k(S_t)$ for $t \in E_k$, which implies that
$$
\rho_k = r_{A_t}(S_t) - v_k(S_t) + \langle P_{k, A_t}(S_t), v_k \rangle \quad \text{for all } t \in E_k.
$$