Bandit Algorithms

8.1 Asymptotically optimal UCB

Furthermore,


\[
\limsup_{n\to\infty} \frac{R_n}{\log(n)} \leq \sum_{i:\Delta_i>0} \frac{2}{\Delta_i}\,. \tag{8.2}
\]


Choosing $\varepsilon = \Delta_i/2$ inside the sum shows that
\[
R_n \leq \sum_{i:\Delta_i>0} \left( \Delta_i + \frac{1}{\Delta_i}\left( 8 \log f(n) + 8\sqrt{\pi \log f(n)} + 28 \right) \right). \tag{8.3}
\]


Even more concretely, there exists some universal constant $C > 0$ such that
\[
R_n \leq C \sum_{i:\Delta_i>0} \left( \Delta_i + \frac{\log(n)}{\Delta_i} \right),
\]
which by the same argument as in the proof of Theorem 7.2 leads to a worst-case bound of $R_n \leq C \sum_{i=1}^k \Delta_i + 2\sqrt{C n k \log(n)}$.
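The worst-case step can be filled in as in the proof of Theorem 7.2. For any threshold $\Delta > 0$, the arms with $\Delta_i < \Delta$ contribute at most $n\Delta$ to the regret in total, while for the remaining arms the term $\log(n)/\Delta_i$ is at most $\log(n)/\Delta$, so that
\[
R_n \leq n\Delta + C \sum_{i:\Delta_i \geq \Delta} \left( \Delta_i + \frac{\log(n)}{\Delta_i} \right)
\leq C \sum_{i=1}^k \Delta_i + n\Delta + \frac{Ck\log(n)}{\Delta}\,.
\]
Choosing $\Delta = \sqrt{Ck\log(n)/n}$ equalizes the last two terms, each of which becomes $\sqrt{Cnk\log(n)}$, giving the stated worst-case bound.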

Taking the limit of the ratio of the bound in (8.3) and $\log(n)$ does not result in the same constant as in the theorem: since $\log f(n)/\log(n) \to 1$, the bound (8.3) only yields $\limsup_{n\to\infty} R_n/\log(n) \leq \sum_{i:\Delta_i>0} 8/\Delta_i$, a factor of four larger than (8.2). This is the main justification for introducing the more complicated regret bound. You will see in Chapter 15 that the asymptotic bound on the regret given in (8.2) is unimprovable in a strong sense.

We start with a useful lemma to bound the number of times the index of a
suboptimal arm will be larger than some threshold above its mean.
Lemma 8.2. Let $X_1,\ldots,X_n$ be a sequence of independent 1-subgaussian random variables, $\hat\mu_t = \frac{1}{t}\sum_{s=1}^t X_s$, $\varepsilon > 0$, $a > 0$ and
\[
\kappa = \sum_{t=1}^n \mathbb{I}\left\{ \hat\mu_t + \sqrt{\frac{2a}{t}} \geq \varepsilon \right\}\,, \qquad
\kappa' = u + \sum_{t=\lceil u \rceil}^n \mathbb{I}\left\{ \hat\mu_t + \sqrt{\frac{2a}{t}} \geq \varepsilon \right\}\,,
\]
where $u = 2a\varepsilon^{-2}$. Then it holds that $\mathbb{E}[\kappa] \leq \mathbb{E}[\kappa'] \leq 1 + \frac{2}{\varepsilon^2}\left( a + \sqrt{\pi a} + 1 \right)$.

The intuition for this result is as follows. Since the $X_i$ are 1-subgaussian and independent we have $\mathbb{E}[\hat\mu_t] = 0$, so we cannot expect $\hat\mu_t + \sqrt{2a/t}$ to be smaller than $\varepsilon$ until $t$ is at least $2a/\varepsilon^2$. The lemma confirms that this is the right order as an estimate for $\mathbb{E}[\kappa]$.
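As a sanity check (not part of the book's argument), a small Monte Carlo simulation with standard Gaussian samples, which are 1-subgaussian, estimates $\mathbb{E}[\kappa]$ and compares it with the bound of Lemma 8.2; the values of $n$, $a$ and $\varepsilon$ below are arbitrary choices.

```python
import numpy as np

# Monte Carlo estimate of E[kappa] from Lemma 8.2 for standard Gaussian
# samples (which are 1-subgaussian). n, a, eps and trials are arbitrary.
rng = np.random.default_rng(0)
n, a, eps, trials = 200, 2.0, 1.0, 2000

t = np.arange(1, n + 1)
bonus = np.sqrt(2 * a / t)                     # the sqrt(2a/t) confidence term

kappas = np.empty(trials)
for i in range(trials):
    x = rng.standard_normal(n)
    mu_hat = np.cumsum(x) / t                  # running empirical means
    kappas[i] = np.sum(mu_hat + bonus >= eps)  # kappa on this sample path

bound = 1 + (2 / eps**2) * (a + np.sqrt(np.pi * a) + 1)
print(f"estimated E[kappa] = {kappas.mean():.2f}, bound = {bound:.2f}")
```

With these parameters the bound evaluates to about $12.0$, and the empirical mean of $\kappa$ comes out well below it, consistent with the lemma being a (somewhat loose) upper bound.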
Proof. By Corollary 5.5 we have
\begin{align*}
\mathbb{E}[\kappa] \leq \mathbb{E}[\kappa']
&= u + \sum_{t=\lceil u \rceil}^n \mathbb{P}\left( \hat\mu_t + \sqrt{\frac{2a}{t}} \geq \varepsilon \right) \\
&\leq u + \sum_{t=\lceil u \rceil}^n \exp\left( -\frac{t\left( \varepsilon - \sqrt{\frac{2a}{t}} \right)^2}{2} \right) \\
&\leq 1 + u + \int_u^\infty \exp\left( -\frac{t\left( \varepsilon - \sqrt{\frac{2a}{t}} \right)^2}{2} \right) dt
= 1 + \frac{2}{\varepsilon^2}\left( a + \sqrt{\pi a} + 1 \right)\,.
\end{align*}