Bandit Algorithms

10.2 The KL-UCB algorithm 133

Proof Letε∈(0,∆) andu=a/d(μ+ε,μ+ ∆). Then

E[κ] =

∑n

s=1

P

(

d(ˆμs,μ+ ∆)≤a s

)

≤

∑n

s=1

P

(

μˆs≥μ+εord(μ+ε,μ+ ∆)≤a s

)

(d(·,μ+ ∆) is decreasing on [0,μ+ ∆])

≤u+

∑n

s=due

P(ˆμs≥μ+ε)

≤u+

∑∞

s=1

exp (−sd(μ+ε,μ)) (Lemma 10.3)

≤

a d(μ+ε,μ+ ∆)+

1

d(μ+ε,μ)

≤

a d(μ+ε,μ+ ∆)

+

1

2 ε^2

(Pinsker’s inequality/Lemma 10.2(b))

as required.

Proof of Theorem 10.6 As in other proofs we assume without loss of generality thatμ 1 =μ∗and boundE[Ti(n)] for suboptimal armsi. To this end, fix a suboptimal armiand letε 1 +ε 2 ∈(0,∆i) with bothε 1 andε 2 positive. Define

τ= min

{

t: max 1 ≤s≤nd(ˆμ 1 s,μ 1 −ε 2 )−

logf(t) s

≤ 0

}

,and

κ=

∑n

s=1

I

{

d(ˆμis,μi+ ∆i−ε 2 )≤

logf(n) s

}

.

By a similar reasoning as in the proof of Theorem 8.1,

E[Ti(n)] =E

[n ∑

t=1

I{At=i}

]

≤E[τ] +E

[ n ∑

t=τ+1

I{At=i}

]

≤E[τ] +E

[n ∑

t=1

I

{

At=iandd(ˆμi,Ti(t−1),μ 1 −ε 2 )≤

logf(t) Ti(t−1)

}]

≤E[τ] +E[κ]

≤

2

ε^22

+

f(n) d(μi+ε 1 ,μ∗−ε 2 )

+

1

2 ε^21

,

where the second inequality follows since by the definition ofτ, ift > τ, then
the index of the optimal arm is at least as large asμ 1 −ε 2. The third inequality
follows from the definition ofκas in the proof of Theorem 8.1. The final inequality

Bandit Algorithms

P

(

)

≤

P

(

)

∑∞

≤

1

+

1

{

≤ 0

}

I

{

}

.

]

]

I

{

}]

2

+

+

1

,

Get our desktop app

Company

Features

Documentation

Resources