38.6 Proof of upper bound

Step 1: Failure events and decomposition
Let $K$ be the (random) number of phases and for $k \in [K]$ let $E_k = \{\tau_k, \tau_k + 1, \ldots, \tau_{k+1} - 1\}$ be the set of rounds in the $k$th phase, where $\tau_{K+1}$ is defined to be $n + 1$. Let $T^{(k)}(s, a)$ be the number of times the state-action pair $(s, a)$ is visited in the $k$th phase:
$$
T^{(k)}(s, a) = \sum_{t \in E_k} \mathbb{I}\{S_t = s, A_t = a\}.
$$
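For concreteness, the following minimal Python sketch tallies these per-phase visit counts from a trajectory; the function name, argument layout and indexing convention are illustrative assumptions rather than part of the algorithm as stated.

from collections import defaultdict

def phase_visit_counts(states, actions, phase_starts, n):
    """Tally T^(k)(s, a), the number of visits to (s, a) during phase k.

    states, actions : lists holding S_1, ..., S_n and A_1, ..., A_n (0-indexed lists)
    phase_starts    : [tau_1, ..., tau_K], the (1-indexed) first rounds of the phases;
                      tau_{K+1} is taken to be n + 1
    Returns a list with one dict per phase, mapping (s, a) to its visit count.
    """
    boundaries = list(phase_starts) + [n + 1]
    counts = []
    for k in range(len(phase_starts)):
        visits = defaultdict(int)
        # Rounds t in E_k = {tau_k, ..., tau_{k+1} - 1}; shift by one for 0-indexed lists.
        for t in range(boundaries[k], boundaries[k + 1]):
            visits[(states[t - 1], actions[t - 1])] += 1
        counts.append(dict(visits))
    return counts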
Define $F$ as the failure event that $P \notin \mathcal{C}_{\tau_k}$ for some $k \in [K]$.

Lemma 38.8. $\mathbb{P}(F) \le \delta/2$.

The proof is based on a concentration inequality derived for categorical distributions and is left for Exercise 38.21. When $F$ does not hold, the true transition kernel is in $\mathcal{C}_{\tau_k}$ for all $k$, which means that $\rho^* \le \rho_k$ and
$$
\hat{R}_n = \sum_{t=1}^{n} \left(\rho^* - r_{A_t}(S_t)\right) \le \sum_{k=1}^{K} \underbrace{\sum_{t \in E_k} \left(\rho_k - r_{A_t}(S_t)\right)}_{\tilde{R}_k} .
$$
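To see why $\rho^* \le \rho_k$ off the failure event, note that $P \in \mathcal{C}_{\tau_k}$ means the true kernel is among the plausible ones; assuming, as in the standard optimistic construction (not restated in this excerpt), that the optimistic MDP $M_k$ is chosen so that its gain dominates the optimal gain of every kernel in $\mathcal{C}_{\tau_k}$,
$$
\rho_k \ge \sup_{\tilde{P} \in \mathcal{C}_{\tau_k}} \rho^*(\tilde{P}) \ge \rho^*(P) = \rho^* .
$$
Summing $\rho^* - r_{A_t}(S_t) \le \rho_k - r_{A_t}(S_t)$ over the rounds of each phase, which together partition $[n]$, gives the displayed inequality.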

In the next step we bound $\tilde{R}_k$ under the assumption that $F$ does not hold.

Step 2: Bounding the regret in each phase
Assume that $F$ does not occur and fix $k \in [K]$. Recall that $v_k$ is a value function satisfying the Bellman optimality equation in the optimistic MDP $M_k$ and $\rho_k$ is its gain. Hence
$$
\rho_k = r_{\pi_k(s)}(s) - v_k(s) + \langle P_{k, \pi_k(s)}(s), v_k \rangle \quad \text{for all } s \in \mathcal{S}. \qquad (38.18)
$$

As noted earlier, solutions to the Bellman optimality equation remain solutions when translated, so we may assume without loss of generality that $v_k$ is such that $\|v_k\|_\infty \le \operatorname{span}(v_k)/2$, which means that
$$
\|v_k\|_\infty \le \frac{1}{2} \operatorname{span}(v_k) \le \frac{D}{2}, \qquad (38.19)
$$
where the second inequality follows from Lemma 38.3 and the fact that when $F$ does not hold the diameter of the extended MDP $\tilde{M}_k$ is at most $D$ and $v_k$ also satisfies the Bellman optimality equation in this MDP.
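To spell out the centring step (a brief verification added here, with $c$ an auxiliary constant): since $\langle P_a(s), \mathbf{1} \rangle = 1$, replacing $v_k$ by $v_k - c\mathbf{1}$ shifts both sides of the Bellman optimality equation by $c$ and so preserves it, and choosing $c = \tfrac{1}{2}\left(\max_s v_k(s) + \min_s v_k(s)\right)$ gives
$$
\|v_k - c\mathbf{1}\|_\infty = \max\left(\max_s v_k(s) - c,\; c - \min_s v_k(s)\right) = \frac{\max_s v_k(s) - \min_s v_k(s)}{2} = \frac{\operatorname{span}(v_k)}{2}.
$$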
By the definition of the policy we have $A_t = \pi_k(S_t)$ for $t \in E_k$, which implies that
$$
\rho_k = r_{A_t}(S_t) - v_k(S_t) + \langle P_{k, A_t}(S_t), v_k \rangle \quad \text{for all } t \in E_k.
$$