Bandit Algorithms

(Jeff_L) #1
10.2 The KL-UCB algorithm 133

Proof Letε∈(0,∆) andu=a/d(μ+ε,μ+ ∆). Then

E[κ] =

∑n

s=1

P


(


d(ˆμs,μ+ ∆)≤a
s

)



∑n

s=1

P


(


μˆs≥μ+εord(μ+ε,μ+ ∆)≤a
s

)


(d(·,μ+ ∆) is decreasing on [0,μ+ ∆])

≤u+

∑n

s=due

P(ˆμs≥μ+ε)

≤u+

∑∞


s=1

exp (−sd(μ+ε,μ)) (Lemma 10.3)


a
d(μ+ε,μ+ ∆)+

1


d(μ+ε,μ)


a
d(μ+ε,μ+ ∆)

+


1


2 ε^2

(Pinsker’s inequality/Lemma 10.2(b))

as required.

Proof of Theorem 10.6 As in other proofs we assume without loss of generality
thatμ 1 =μ∗and boundE[Ti(n)] for suboptimal armsi. To this end, fix a
suboptimal armiand letε 1 +ε 2 ∈(0,∆i) with bothε 1 andε 2 positive. Define

τ= min

{


t: max 1 ≤s≤nd(ˆμ 1 s,μ 1 −ε 2 )−

logf(t)
s

≤ 0


}


,and

κ=

∑n

s=1

I


{


d(ˆμis,μi+ ∆i−ε 2 )≤

logf(n)
s

}


.


By a similar reasoning as in the proof of Theorem 8.1,

E[Ti(n)] =E

[n

t=1

I{At=i}

]


≤E[τ] +E

[ n

t=τ+1

I{At=i}

]


≤E[τ] +E

[n

t=1

I


{


At=iandd(ˆμi,Ti(t−1),μ 1 −ε 2 )≤

logf(t)
Ti(t−1)

}]


≤E[τ] +E[κ]


2


ε^22

+


f(n)
d(μi+ε 1 ,μ∗−ε 2 )

+


1


2 ε^21

,


where the second inequality follows since by the definition ofτ, ift > τ, then
the index of the optimal arm is at least as large asμ 1 −ε 2. The third inequality
follows from the definition ofκas in the proof of Theorem 8.1. The final inequality

Free download pdf