38.7 Proof of lower bound 493The first sum is bounded using a version of Hoeffding–Azuma (Exercise 20.8):
P
(
Fcand∑nt=1(Et[vKt(St+1)]−vKt(St+1))≥D√
nlog(2/δ)
2)
≤δ
2For the second term we note thatT(k)(s,a)/√
1 ∨Tτk− 1 (s,a)cannot be large
too often. A continuous approximation often provides intuition for the correct
form. Recalling the thousands of integrals you did at school, for any differentiable
f: [0,∞)→Rwe have
∫K0√f′(k)
f(k)dk= 2√
f(K)− 2√
f(0). (38.21)Here we are thinking off(k) as the continuous approximation ofTτk− 1 (s,a) and
its derivative asT(k)(s,a). In Exercise 38.22 we ask you to make this argument
rigorous by showing that
∑Kk=1T(k)(s,a)
√
1 ∨Tτk− 1 (s,a)≤
(√
2 + 1
)√
Tn(s,a).Then by Cauchy-Schwarz and the fact that
∑
s,a∈S×ATn(s,a) =n,
∑s∈S∑
a∈A√
Tn(s,a)≤√
SAn.It remains to bound the number of phases. A new phase starts when the visit
count for some state-action pair doubles. HenceKcannot be more than the
number of times the counters double in total for each of the states. It is easy to
see that 1 +log 2 Tn(s,a) gives an upper bound on how many times the counter
for this pair may double (the constant 1 is there to account for the counter
changing from zero to one). ThusK≤K′=∑
s,a1 +log^2 Tn(s,a). Noting that
0 ≤Tn(s,a) and∑
s,aTn(s,a) =nand relaxingTn(s,a) to take real values we
find that the value ofK′is the largest whenTn(s,a) =n/(SA), which shows thatK≤SA(
1 + log 2(n
SA))
Putting everything together gives the desired result.38.7 Proof of lower bound
The lower bound is proven by crafting a difficult MDP that models a bandit
with approximately SA arms. This a cumbersome endeavour, but intuitively
straightforward and the explanations that follow should be made clear in Fig. 38.3.
Given S and A, the first step is to construct a tree of minimum depth with at
most A children for each node using exactly S−2 states. The root of the tree is
denoted bys◦and transitions within the tree are deterministic, so in any given
node the learner can simply select which child to transition to. LetLbe the