Bandit Algorithms

(Jeff_L) #1
38.7 Proof of lower bound 493

The first sum is bounded using a version of Hoeffding–Azuma (Exercise 20.8):


P


(


Fcand

∑n

t=1

(Et[vKt(St+1)]−vKt(St+1))≥D


nlog(2/δ)
2

)


≤δ
2

For the second term we note thatT(k)(s,a)/


1 ∨Tτk− 1 (s,a)cannot be large
too often. A continuous approximation often provides intuition for the correct
form. Recalling the thousands of integrals you did at school, for any differentiable
f: [0,∞)→Rwe have
∫K

0

√f′(k)
f(k)

dk= 2


f(K)− 2


f(0). (38.21)

Here we are thinking off(k) as the continuous approximation ofTτk− 1 (s,a) and
its derivative asT(k)(s,a). In Exercise 38.22 we ask you to make this argument
rigorous by showing that
∑K

k=1

T(k)(s,a)

1 ∨Tτk− 1 (s,a)


(√


2 + 1


)√


Tn(s,a).

Then by Cauchy-Schwarz and the fact that



s,a∈S×ATn(s,a) =n,

s∈S


a∈A


Tn(s,a)≤


SAn.

It remains to bound the number of phases. A new phase starts when the visit
count for some state-action pair doubles. HenceKcannot be more than the
number of times the counters double in total for each of the states. It is easy to
see that 1 +log 2 Tn(s,a) gives an upper bound on how many times the counter
for this pair may double (the constant 1 is there to account for the counter
changing from zero to one). ThusK≤K′=


s,a1 +log^2 Tn(s,a). Noting that
0 ≤Tn(s,a) and


s,aTn(s,a) =nand relaxingTn(s,a) to take real values we
find that the value ofK′is the largest whenTn(s,a) =n/(SA), which shows that

K≤SA

(


1 + log 2

(n
SA

))


Putting everything together gives the desired result.

38.7 Proof of lower bound


The lower bound is proven by crafting a difficult MDP that models a bandit
with approximately SA arms. This a cumbersome endeavour, but intuitively
straightforward and the explanations that follow should be made clear in Fig. 38.3.
Given S and A, the first step is to construct a tree of minimum depth with at
most A children for each node using exactly S−2 states. The root of the tree is
denoted bys◦and transitions within the tree are deterministic, so in any given
node the learner can simply select which child to transition to. LetLbe the

Free download pdf