Bandit Algorithms

8.1 Asymptotically optimal UCB

Furthermore,


\[
\limsup_{n\to\infty} \frac{R_n}{\log(n)} \leq \sum_{i:\Delta_i>0} \frac{2}{\Delta_i}\,. \tag{8.2}
\]


Choosing $\varepsilon = \Delta_i/2$ inside the sum shows that
\[
R_n \leq \sum_{i:\Delta_i>0} \left( \Delta_i + \frac{1}{\Delta_i}\left( 8 \log f(n) + 8\sqrt{\pi \log f(n)} + 28 \right) \right). \tag{8.3}
\]


Even more concretely, there exists some universal constant $C > 0$ such that
\[
R_n \leq C \sum_{i:\Delta_i>0} \left( \Delta_i + \frac{\log(n)}{\Delta_i} \right),
\]
which by the same argument as in the proof of Theorem 7.2 leads to a worst-case bound of $R_n \leq C \sum_{i=1}^k \Delta_i + 2\sqrt{C n k \log(n)}$.
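The worst-case step can be filled in as in the proof of Theorem 7.2. For any threshold $\Delta > 0$, the arms with $\Delta_i < \Delta$ contribute at most $n\Delta$ to the regret in total, while for the remaining arms the term $\log(n)/\Delta_i$ is at most $\log(n)/\Delta$, so that
\[
R_n \leq n\Delta + C \sum_{i:\Delta_i \geq \Delta} \left( \Delta_i + \frac{\log(n)}{\Delta_i} \right)
\leq C \sum_{i=1}^k \Delta_i + n\Delta + \frac{Ck\log(n)}{\Delta}\,.
\]
Choosing $\Delta = \sqrt{Ck\log(n)/n}$ equalizes the last two terms, each of which becomes $\sqrt{Cnk\log(n)}$, giving the stated worst-case bound.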

Taking the limit of the ratio of the bound in (8.3) and $\log(n)$ does not result in the same constant as in the theorem: since $\log f(n)/\log(n) \to 1$, the bound (8.3) only yields $\limsup_{n\to\infty} R_n/\log(n) \leq \sum_{i:\Delta_i>0} 8/\Delta_i$, a factor of four larger than (8.2). This is the main justification for introducing the more complicated regret bound. You will see in Chapter 15 that the asymptotic bound on the regret given in (8.2) is unimprovable in a strong sense.

We start with a useful lemma to bound the number of times the index of a
suboptimal arm will be larger than some threshold above its mean.
Lemma 8.2. Let $X_1,\ldots,X_n$ be a sequence of independent 1-subgaussian random variables, $\hat\mu_t = \frac{1}{t}\sum_{s=1}^t X_s$, $\varepsilon > 0$, $a > 0$ and
\[
\kappa = \sum_{t=1}^n \mathbb{I}\left\{ \hat\mu_t + \sqrt{\frac{2a}{t}} \geq \varepsilon \right\}\,, \qquad
\kappa' = u + \sum_{t=\lceil u \rceil}^n \mathbb{I}\left\{ \hat\mu_t + \sqrt{\frac{2a}{t}} \geq \varepsilon \right\}\,,
\]
where $u = 2a\varepsilon^{-2}$. Then it holds that $\mathbb{E}[\kappa] \leq \mathbb{E}[\kappa'] \leq 1 + \frac{2}{\varepsilon^2}\left( a + \sqrt{\pi a} + 1 \right)$.

The intuition for this result is as follows. Since the $X_i$ are 1-subgaussian and independent we have $\mathbb{E}[\hat\mu_t] = 0$, so we cannot expect $\hat\mu_t + \sqrt{2a/t}$ to be smaller than $\varepsilon$ until $t$ is at least $2a/\varepsilon^2$. The lemma confirms that this is the right order as an estimate for $\mathbb{E}[\kappa]$.
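As a sanity check (not part of the book's argument), a small Monte Carlo simulation with standard Gaussian samples, which are 1-subgaussian, estimates $\mathbb{E}[\kappa]$ and compares it with the bound of Lemma 8.2; the values of $n$, $a$ and $\varepsilon$ below are arbitrary choices.

```python
import numpy as np

# Monte Carlo estimate of E[kappa] from Lemma 8.2 for standard Gaussian
# samples (which are 1-subgaussian). n, a, eps and trials are arbitrary.
rng = np.random.default_rng(0)
n, a, eps, trials = 200, 2.0, 1.0, 2000

t = np.arange(1, n + 1)
bonus = np.sqrt(2 * a / t)                     # the sqrt(2a/t) confidence term

kappas = np.empty(trials)
for i in range(trials):
    x = rng.standard_normal(n)
    mu_hat = np.cumsum(x) / t                  # running empirical means
    kappas[i] = np.sum(mu_hat + bonus >= eps)  # kappa on this sample path

bound = 1 + (2 / eps**2) * (a + np.sqrt(np.pi * a) + 1)
print(f"estimated E[kappa] = {kappas.mean():.2f}, bound = {bound:.2f}")
```

With these parameters the bound evaluates to about $12.0$, and the empirical mean of $\kappa$ comes out well below it, consistent with the lemma being a (somewhat loose) upper bound.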
Proof. By Corollary 5.5 we have
\begin{align*}
\mathbb{E}[\kappa] \leq \mathbb{E}[\kappa']
&= u + \sum_{t=\lceil u \rceil}^n \mathbb{P}\left( \hat\mu_t + \sqrt{\frac{2a}{t}} \geq \varepsilon \right) \\
&\leq u + \sum_{t=\lceil u \rceil}^n \exp\left( -\frac{t\left( \varepsilon - \sqrt{\frac{2a}{t}} \right)^2}{2} \right) \\
&\leq 1 + u + \int_u^\infty \exp\left( -\frac{t\left( \varepsilon - \sqrt{\frac{2a}{t}} \right)^2}{2} \right) dt
= 1 + \frac{2}{\varepsilon^2}\left( a + \sqrt{\pi a} + 1 \right)\,.
\end{align*}