Bandit Algorithms


1: Input: n, k, η, γ
2: Set L̂_{0,i} = 0 for all i
3: for t = 1, ..., n do
4:   Calculate the sampling distribution P_t:

         P_{t,i} = exp(−η L̂_{t−1,i}) / Σ_{j=1}^k exp(−η L̂_{t−1,j})

5:   Sample A_t ∼ P_t and observe reward X_t
6:   Calculate L̂_{t,i} = L̂_{t−1,i} + I{A_t = i}(1 − X_t) / (P_{t,i} + γ)
7: end for

Algorithm 10: Exp3-IX.
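The pseudocode above can be sketched in Python as follows. This is only an illustrative implementation: the interface `reward_fn(t, arm)` returning a reward in [0, 1] and the Bernoulli test bandit at the bottom are assumptions of this sketch, not part of the book's presentation.

```python
import math
import random

def exp3_ix(n, k, eta, gamma, reward_fn, seed=0):
    """A minimal sketch of Exp3-IX (Algorithm 10 above).

    `reward_fn(t, arm)` must return a reward X_t in [0, 1]; this calling
    convention is a hypothetical choice made for the sketch.
    """
    rng = random.Random(seed)
    L_hat = [0.0] * k            # line 2: cumulative loss estimates, L̂_{0,i} = 0
    total_reward = 0.0
    P = [1.0 / k] * k
    for t in range(1, n + 1):
        # Line 4: P_{t,i} proportional to exp(-eta * L̂_{t-1,i}).
        # Subtracting the minimum before exponentiating changes nothing
        # mathematically and is only for numerical stability.
        m = min(L_hat)
        w = [math.exp(-eta * (L - m)) for L in L_hat]
        s = sum(w)
        P = [wi / s for wi in w]
        # Line 5: sample A_t ~ P_t and observe the reward X_t.
        A = rng.choices(range(k), weights=P)[0]
        X = reward_fn(t, A)
        total_reward += X
        # Line 6: implicit-exploration loss estimate. The extra gamma in the
        # denominator biases the estimate but keeps its variance under control.
        L_hat[A] += (1.0 - X) / (P[A] + gamma)
    return total_reward, P

# Usage on a made-up Bernoulli bandit, with eta = eta_1 from Theorem 12.1
# and gamma = eta / 2 as the theorem prescribes.
n, k = 20000, 5
means = [0.3, 0.5, 0.4, 0.7, 0.45]    # hypothetical arm means
arm_rng = random.Random(1)
eta = math.sqrt(2 * math.log(k + 1) / (n * k))
reward, P = exp3_ix(n, k, eta, eta / 2,
                    lambda t, a: float(arm_rng.random() < means[a]))
```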

12.2 Regret analysis


We now prove the following theorem bounding the random regret of Exp3-IX
with high probability.

Theorem 12.1. Let δ ∈ (0, 1) and define

    η₁ = √( 2 log(k + 1) / (nk) )   and   η₂ = √( (log(k) + log((k + 1)/δ)) / (nk) ).

The following statements hold:
1. If Exp3-IX is run with parameters η = η₁ and γ = η/2, then

    P( R̂_n ≥ √(8nk log(k + 1)) + √( nk / (2 log(k + 1)) ) log(1/δ) + log((k + 1)/δ) ) ≤ δ.   (12.5)
2. If Exp3-IX is run with parameters η = η₂ and γ = η/2, then

    P( R̂_n ≥ 2√( (2 log(k + 1) + log(1/δ)) nk ) + log((k + 1)/δ) ) ≤ δ.   (12.6)

The value of η₁ is independent of δ, which means that this choice of learning
rate leads to a single algorithm with a high-probability bound for all δ. On
the other hand, η₂ does depend on δ, so the user must choose a confidence
level from the beginning. The advantage is an improved bound, but only at the
specified confidence level. We will show in Chapter 17 that this tradeoff is
unavoidable.
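As a quick numerical sanity check of this tradeoff, one can evaluate the two learning rates and the right-hand sides of Eqs. (12.5) and (12.6) for concrete values. The horizon, arm count, and confidence level below are made up purely for illustration.

```python
import math

# Illustrative (made-up) horizon, number of arms, and confidence level.
n, k, delta = 100_000, 10, 0.01

# The two learning rates from Theorem 12.1.
eta1 = math.sqrt(2 * math.log(k + 1) / (n * k))
eta2 = math.sqrt((math.log(k) + math.log((k + 1) / delta)) / (n * k))

# Right-hand sides of the high-probability regret bounds (12.5) and (12.6).
bound1 = (math.sqrt(8 * n * k * math.log(k + 1))
          + math.sqrt(n * k / (2 * math.log(k + 1))) * math.log(1 / delta)
          + math.log((k + 1) / delta))
bound2 = (2 * math.sqrt((2 * math.log(k + 1) + math.log(1 / delta)) * n * k)
          + math.log((k + 1) / delta))

print(f"eta1 = {eta1:.5f}, eta2 = {eta2:.5f}")
print(f"bound (12.5) = {bound1:.0f}, bound (12.6) = {bound2:.0f}")
```

For these values the δ-dependent rate η₂ is larger than η₁ and the bound (12.6) comes out smaller than (12.5), matching the discussion above: tuning to a specific confidence level buys a tighter guarantee at that level.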

The proof follows by bounding each of the terms in Eq. (12.3), which we do via
a series of lemmas. The first of these lemmas is a new concentration bound, the