Bandit Algorithms

10.2 The KL-UCB algorithm 132

second shows that the index of any other arm is not often much larger than the same value. These results mirror those given for UCB, but things are complicated by the non-symmetric and hard-to-invert divergence function. For the next results we defined(p,q) =d(p,q)I{p≤q}.

Lemma10.7.LetX 1 ,X 2 ,...,Xnbe independent Bernoul li random variables with meanμ∈[0,1],ε > 0 and

τ= min

{

t: max 1 ≤s≤n

d(ˆμs,μ−ε)−logf(t) s

≤ 0

}

.

ThenE[τ]≤^2
ε^2

.

Proof We start with a high probability bound and then integrate to control the expectation.

P(τ > t)≤P

(

∃ 1 ≤s≤n:d(ˆμs,μ−ε)>

logf(t) s

)

≤

∑n

s=1

P

(

d(ˆμs,μ−ε)>

logf(t) s

)

=

∑n

s=1

P

(

d(ˆμs,μ−ε)>

logf(t) s

,μˆs< μ−ε

)

≤

∑n

s=1

P

(

d(ˆμs,μ)>

logf(t) s

+ 2ε^2 ,μˆs< μ

)

((c) of Lemma 10.2)

≤

∑n

s=1

exp

(

−s

(

2 ε^2 +

logf(t) s

))

(Eq. (10.3) of Corollary 10.4)

≤^1

f(t)

∑n

s=1

exp

(

− 2 sε^2

)

≤

1

2 f(t)ε^2

.

To finish, we integrate the tail,

E[τ]≤

∫∞

0

P(τ≥t)dt≤

1

2 ε^2

∫∞

0

dt f(t)

≤

2

ε^2

.

Lemma10.8.LetX 1 ,X 2 ,...,Xnbe independent Bernoul li random variables with meanμ. Further, let∆> 0 ,a > 0 and define

κ=

∑n

s=1

I

{

d(ˆμs,μ+ ∆)≤a s

}

.

ThenE[κ]≤ inf
ε∈(0,∆)

(

1 +

a d(μ+ε,μ+ ∆)

+

1

2 ε^2

)

.

Bandit Algorithms

{

≤ 0

}

.

.

(

)

≤

P

(

)

=

P

(

)

≤

P

(

)

≤

(

(

))

≤^1

(

)

≤

1

.

∫∞

1

∫∞

≤

2

.

I

{

}

.

(

1 +

+

1

)

.

Get our desktop app

Company

Features

Documentation

Resources