Bandit Algorithms


10.2 The KL-UCB algorithm


The difference between KL-UCB and UCB is that Chernoff’s bound is used to
define the upper confidence bound instead of Lemma 5.5.

1: Input $k$
2: Choose each arm once
3: Subsequently choose
$$A_t = \operatorname*{argmax}_i \max\left\{\tilde\mu \in [0,1] : d(\hat\mu_i(t-1), \tilde\mu) \le \frac{\log f(t)}{T_i(t-1)}\right\},$$
where $f(t) = 1 + t \log^2(t)$.

Algorithm 8: KL-UCB.
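As a concrete illustration, here is a minimal Python sketch of the index computation in Algorithm 8 for Bernoulli rewards: since $\tilde\mu \mapsto d(\hat\mu_i(t-1), \tilde\mu)$ is increasing on $[\hat\mu_i(t-1), 1]$, the largest feasible $\tilde\mu$ can be found by bisection. The function names, the number of bisection steps and the example values are illustrative choices, not from the text.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """Relative entropy d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # clip away from 0 and 1 to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mu_hat: float, pulls: int, t: int) -> float:
    """Largest mu_tilde in [0, 1] with d(mu_hat, mu_tilde) <= log(f(t)) / T_i(t-1)."""
    threshold = math.log(1 + t * math.log(t) ** 2) / pulls  # log f(t) / T_i(t-1)
    lo, hi = mu_hat, 1.0
    for _ in range(50):  # bisection; d(mu_hat, .) is increasing on [mu_hat, 1]
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo


# Example: an arm with empirical mean 0.4 from 20 pulls, at round t = 100.
print(kl_ucb_index(0.4, 20, 100))
```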

Theorem 10.6. If the reward in round $t$ is $X_t \sim \mathcal{B}(\mu_{A_t})$, then the regret of Algorithm 8 is bounded by

$$R_n \le \sum_{i:\Delta_i>0} \inf_{\substack{\varepsilon_1, \varepsilon_2 > 0 \\ \varepsilon_1 + \varepsilon_2 \in (0, \Delta_i)}} \Delta_i \left( \frac{\log f(n)}{d(\mu_i + \varepsilon_1, \mu^* - \varepsilon_2)} + \frac{1}{2\varepsilon_1^2} + \frac{1}{\varepsilon_2^2} \right).$$

Furthermore,

$$\limsup_{n\to\infty} \frac{R_n}{\log(n)} \le \sum_{i:\Delta_i>0} \frac{\Delta_i}{d(\mu_i, \mu^*)}.$$
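To get a feel for the finite-time bound, one can evaluate the infimum over $\varepsilon_1, \varepsilon_2$ numerically. The sketch below does this on a coarse grid for a single suboptimal arm, using the bound exactly as displayed above; the example means, horizon, grid size and function names are arbitrary choices of this illustration.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """Relative entropy d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def theorem_10_6_term(mu_i: float, mu_star: float, n: int, grid: int = 200) -> float:
    """Contribution of one suboptimal arm to the bound, minimised over a grid
    of (eps_1, eps_2) with eps_1, eps_2 > 0 and eps_1 + eps_2 < Delta_i."""
    delta = mu_star - mu_i
    log_f_n = math.log(1 + n * math.log(n) ** 2)  # log f(n)
    best = float("inf")
    for a in range(1, grid):
        for b in range(1, grid - a):  # ensures eps_1 + eps_2 < delta
            e1, e2 = a * delta / grid, b * delta / grid
            val = delta * (log_f_n / kl_bernoulli(mu_i + e1, mu_star - e2)
                           + 1 / (2 * e1 ** 2) + 1 / e2 ** 2)
            best = min(best, val)
    return best


# Example: mu* = 0.9, one suboptimal arm with mean 0.8, horizon n = 10_000.
print(theorem_10_6_term(0.8, 0.9, 10_000))
```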

It is worth comparing the regret in Theorem 10.6 to what would be obtained when using UCB from Chapter 8, which for subgaussian constant $\sigma = 1/2$ satisfies

$$\limsup_{n\to\infty} \frac{R_n}{\log(n)} \le \sum_{i:\Delta_i>0} \frac{1}{2\Delta_i}.$$


By Pinsker's inequality (part (b) of Lemma 10.2) we see that $d(\mu_i, \mu^*) \ge 2(\mu^* - \mu_i)^2 = 2\Delta_i^2$, which means that the asymptotic regret of KL-UCB is never worse than that of UCB. On the other hand, a Taylor expansion shows that when $\mu_i$ and $\mu^*$ are close (the hard case in the asymptotic regime),

$$d(\mu_i, \mu^*) = \frac{\Delta_i^2}{2\mu_i(1-\mu_i)} + o(\Delta_i^2),$$

indicating that the regret of KL-UCB is approximately

$$\limsup_{n\to\infty} \frac{R_n}{\log(n)} \lessapprox \sum_{i:\Delta_i>0} \frac{2\mu_i(1-\mu_i)}{\Delta_i}. \qquad (10.5)$$


Notice that $\mu_i(1-\mu_i)$ is the variance of a Bernoulli distribution with mean $\mu_i$.
The approximation indicates that KL-UCB will improve on UCB in regimes
where $\mu_i$ is close to zero or one.
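A quick numerical check makes this concrete: the sketch below evaluates the asymptotic constant $\Delta_i/d(\mu_i, \mu^*)$ of KL-UCB, UCB's $1/(2\Delta_i)$, and the approximation $2\mu_i(1-\mu_i)/\Delta_i$ from (10.5), for one mean near the boundary and one near $1/2$; the specific values are arbitrary.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """Relative entropy d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


# One pair of means near the boundary, one near 1/2, both with gap 0.05.
for mu_i, mu_star in [(0.01, 0.06), (0.45, 0.50)]:
    delta = mu_star - mu_i
    kl_ucb_const = delta / kl_bernoulli(mu_i, mu_star)   # KL-UCB, Theorem 10.6
    ucb_const = 1 / (2 * delta)                          # UCB, Chapter 8
    approx = 2 * mu_i * (1 - mu_i) / delta               # approximation (10.5)
    print(f"mu_i={mu_i:.2f}: KL-UCB {kl_ucb_const:.2f}, "
          f"UCB {ucb_const:.2f}, approx {approx:.2f}")
```

Near the boundary the KL-UCB constant is several times smaller than UCB's, while near $1/2$ the three quantities nearly coincide.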
The proof of Theorem 10.6 relies on two lemmas. The first is used to show
that the index of the optimal arm is never too far below its true value, while the