Bandit Algorithms

38.7 Proof of lower bound 493

The first sum is bounded using a version of Hoeffding–Azuma (Exercise 20.8):

P

(

Fcand

∑n

t=1

(Et[vKt(St+1)]−vKt(St+1))≥D

√

nlog(2/δ) 2

)

≤δ 2

For the second term we note thatT(k)(s,a)/

√

1 ∨Tτk− 1 (s,a)cannot be large too often. A continuous approximation often provides intuition for the correct form. Recalling the thousands of integrals you did at school, for any differentiable f: [0,∞)→Rwe have ∫K

0

√f′(k) f(k)

dk= 2

√

f(K)− 2

√

f(0). (38.21)

Here we are thinking off(k) as the continuous approximation ofTτk− 1 (s,a) and its derivative asT(k)(s,a). In Exercise 38.22 we ask you to make this argument rigorous by showing that ∑K

k=1

T(k)(s,a) √ 1 ∨Tτk− 1 (s,a)

≤

(√

2 + 1

)√

Tn(s,a).

Then by Cauchy-Schwarz and the fact that

∑

s,a∈S×ATn(s,a) =n, ∑

s∈S

∑

a∈A

√

Tn(s,a)≤

√

SAn.

It remains to bound the number of phases. A new phase starts when the visit count for some state-action pair doubles. HenceKcannot be more than the number of times the counters double in total for each of the states. It is easy to see that 1 +log 2 Tn(s,a) gives an upper bound on how many times the counter for this pair may double (the constant 1 is there to account for the counter changing from zero to one). ThusK≤K′=

∑

s,a1 +log^2 Tn(s,a). Noting that 0 ≤Tn(s,a) and

∑

s,aTn(s,a) =nand relaxingTn(s,a) to take real values we find that the value ofK′is the largest whenTn(s,a) =n/(SA), which shows that

K≤SA

(

1 + log 2

(n SA

))

Putting everything together gives the desired result.

38.7 Proof of lower bound

The lower bound is proven by crafting a difficult MDP that models a bandit
with approximately SA arms. This a cumbersome endeavour, but intuitively
straightforward and the explanations that follow should be made clear in Fig. 38.3.
Given S and A, the first step is to construct a tree of minimum depth with at
most A children for each node using exactly S−2 states. The root of the tree is
denoted bys◦and transitions within the tree are deterministic, so in any given
node the learner can simply select which child to transition to. LetLbe the

Bandit Algorithms

P

(

√

)

√

√

√

≤

(√

2 + 1

)√

∑

∑

√

√

∑

∑

(

))

38.7 Proof of lower bound

Get our desktop app

Company

Features

Documentation

Resources