Bandit Algorithms


10.2 The KL-UCB algorithm


The difference between KL-UCB and UCB is that Chernoff’s bound is used to
define the upper confidence bound instead of Lemma 5.5.

1: Input $k$
2: Choose each arm once
3: Subsequently choose
$$A_t = \operatorname*{argmax}_i \max\left\{\tilde\mu \in [0,1] : d(\hat\mu_i(t-1), \tilde\mu) \le \frac{\log f(t)}{T_i(t-1)}\right\},$$
where $f(t) = 1 + t \log^2(t)$.

Algorithm 8: KL-UCB.
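As a concrete illustration, here is a minimal Python sketch of the index computation in Algorithm 8 for Bernoulli rewards: since $\tilde\mu \mapsto d(\hat\mu_i(t-1), \tilde\mu)$ is increasing on $[\hat\mu_i(t-1), 1]$, the largest feasible $\tilde\mu$ can be found by bisection. The function names, the number of bisection steps and the example values are illustrative choices, not from the text.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """Relative entropy d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # clip away from 0 and 1 to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mu_hat: float, pulls: int, t: int) -> float:
    """Largest mu_tilde in [0, 1] with d(mu_hat, mu_tilde) <= log(f(t)) / T_i(t-1)."""
    threshold = math.log(1 + t * math.log(t) ** 2) / pulls  # log f(t) / T_i(t-1)
    lo, hi = mu_hat, 1.0
    for _ in range(50):  # bisection; d(mu_hat, .) is increasing on [mu_hat, 1]
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo


# Example: an arm with empirical mean 0.4 from 20 pulls, at round t = 100.
print(kl_ucb_index(0.4, 20, 100))
```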

Theorem 10.6. If the reward in round $t$ is $X_t \sim \mathcal{B}(\mu_{A_t})$, then the regret of Algorithm 8 is bounded by

$$R_n \le \sum_{i:\Delta_i>0} \inf_{\substack{\varepsilon_1, \varepsilon_2 > 0 \\ \varepsilon_1 + \varepsilon_2 \in (0, \Delta_i)}} \Delta_i \left( \frac{\log f(n)}{d(\mu_i + \varepsilon_1, \mu^* - \varepsilon_2)} + \frac{1}{2\varepsilon_1^2} + \frac{1}{\varepsilon_2^2} \right).$$

Furthermore,

$$\limsup_{n\to\infty} \frac{R_n}{\log(n)} \le \sum_{i:\Delta_i>0} \frac{\Delta_i}{d(\mu_i, \mu^*)}.$$
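To get a feel for the finite-time bound, one can evaluate the infimum over $\varepsilon_1, \varepsilon_2$ numerically. The sketch below does this on a coarse grid for a single suboptimal arm, using the bound exactly as displayed above; the example means, horizon, grid size and function names are arbitrary choices of this illustration.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """Relative entropy d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def theorem_10_6_term(mu_i: float, mu_star: float, n: int, grid: int = 200) -> float:
    """Contribution of one suboptimal arm to the bound, minimised over a grid
    of (eps_1, eps_2) with eps_1, eps_2 > 0 and eps_1 + eps_2 < Delta_i."""
    delta = mu_star - mu_i
    log_f_n = math.log(1 + n * math.log(n) ** 2)  # log f(n)
    best = float("inf")
    for a in range(1, grid):
        for b in range(1, grid - a):  # ensures eps_1 + eps_2 < delta
            e1, e2 = a * delta / grid, b * delta / grid
            val = delta * (log_f_n / kl_bernoulli(mu_i + e1, mu_star - e2)
                           + 1 / (2 * e1 ** 2) + 1 / e2 ** 2)
            best = min(best, val)
    return best


# Example: mu* = 0.9, one suboptimal arm with mean 0.8, horizon n = 10_000.
print(theorem_10_6_term(0.8, 0.9, 10_000))
```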

It is worth comparing the regret in Theorem 10.6 to what would be obtained when using UCB from Chapter 8, which for subgaussian constant $\sigma = 1/2$ satisfies

$$\limsup_{n\to\infty} \frac{R_n}{\log(n)} \le \sum_{i:\Delta_i>0} \frac{1}{2\Delta_i}.$$


By Pinsker's inequality (part (b) of Lemma 10.2) we see that $d(\mu_i, \mu^*) \ge 2(\mu^* - \mu_i)^2 = 2\Delta_i^2$, which means that the asymptotic regret of KL-UCB is never worse than that of UCB. On the other hand, a Taylor expansion shows that when $\mu_i$ and $\mu^*$ are close (the hard case in the asymptotic regime),

$$d(\mu_i, \mu^*) = \frac{\Delta_i^2}{2\mu_i(1-\mu_i)} + o(\Delta_i^2),$$

indicating that the regret of KL-UCB is approximately

$$\limsup_{n\to\infty} \frac{R_n}{\log(n)} \lessapprox \sum_{i:\Delta_i>0} \frac{2\mu_i(1-\mu_i)}{\Delta_i}. \qquad (10.5)$$


Notice that $\mu_i(1-\mu_i)$ is the variance of a Bernoulli distribution with mean $\mu_i$.
The approximation indicates that KL-UCB will improve on UCB in regimes
where $\mu_i$ is close to zero or one.
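A quick numerical check makes this concrete: the sketch below evaluates the asymptotic constant $\Delta_i/d(\mu_i, \mu^*)$ of KL-UCB, UCB's $1/(2\Delta_i)$, and the approximation $2\mu_i(1-\mu_i)/\Delta_i$ from (10.5), for one mean near the boundary and one near $1/2$; the specific values are arbitrary.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """Relative entropy d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


# One pair of means near the boundary, one near 1/2, both with gap 0.05.
for mu_i, mu_star in [(0.01, 0.06), (0.45, 0.50)]:
    delta = mu_star - mu_i
    kl_ucb_const = delta / kl_bernoulli(mu_i, mu_star)   # KL-UCB, Theorem 10.6
    ucb_const = 1 / (2 * delta)                          # UCB, Chapter 8
    approx = 2 * mu_i * (1 - mu_i) / delta               # approximation (10.5)
    print(f"mu_i={mu_i:.2f}: KL-UCB {kl_ucb_const:.2f}, "
          f"UCB {ucb_const:.2f}, approx {approx:.2f}")
```

Near the boundary the KL-UCB constant is several times smaller than UCB's, while near $1/2$ the three quantities nearly coincide.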
The proof of Theorem 10.6 relies on two lemmas. The first is used to show
that the index of the optimal arm is never too far below its true value, while the