Bandit Algorithms

(Jeff_L) #1
10.2 The KL-UCB algorithm 132

second shows that the index of any other arm is not often much larger than the
same value. These results mirror those given for UCB, but things are complicated
by the non-symmetric and hard-to-invert divergence function.
For the next results we defined(p,q) =d(p,q)I{p≤q}.

Lemma10.7.LetX 1 ,X 2 ,...,Xnbe independent Bernoul li random variables with
meanμ∈[0,1],ε > 0 and

τ= min

{


t: max
1 ≤s≤n

d(ˆμs,μ−ε)−logf(t)
s

≤ 0


}


.


ThenE[τ]≤^2
ε^2


.


Proof We start with a high probability bound and then integrate to control the
expectation.

P(τ > t)≤P

(


∃ 1 ≤s≤n:d(ˆμs,μ−ε)>

logf(t)
s

)



∑n

s=1

P


(


d(ˆμs,μ−ε)>

logf(t)
s

)


=


∑n

s=1

P


(


d(ˆμs,μ−ε)>

logf(t)
s

,μˆs< μ−ε

)



∑n

s=1

P


(


d(ˆμs,μ)>

logf(t)
s

+ 2ε^2 ,μˆs< μ

)


((c) of Lemma 10.2)


∑n

s=1

exp

(


−s

(


2 ε^2 +

logf(t)
s

))


(Eq. (10.3) of Corollary 10.4)

≤^1


f(t)

∑n

s=1

exp

(


− 2 sε^2

)



1


2 f(t)ε^2

.


To finish, we integrate the tail,

E[τ]≤

∫∞


0

P(τ≥t)dt≤

1


2 ε^2

∫∞


0

dt
f(t)


2


ε^2

.


Lemma10.8.LetX 1 ,X 2 ,...,Xnbe independent Bernoul li random variables with
meanμ. Further, let∆> 0 ,a > 0 and define

κ=

∑n

s=1

I


{


d(ˆμs,μ+ ∆)≤a
s

}


.


ThenE[κ]≤ inf
ε∈(0,∆)


(


1 +


a
d(μ+ε,μ+ ∆)

+


1


2 ε^2

)


.

Free download pdf