Bandit Algorithms


1: Input: n, k, η, γ
2: Set L̂_{0,i} = 0 for all i
3: for t = 1, ..., n do
4:   Calculate the sampling distribution P_t:

         P_{t,i} = exp(−η L̂_{t−1,i}) / Σ_{j=1}^k exp(−η L̂_{t−1,j})

5:   Sample A_t ∼ P_t and observe reward X_t
6:   Calculate L̂_{t,i} = L̂_{t−1,i} + I{A_t = i}(1 − X_t) / (P_{t,i} + γ)
7: end for

Algorithm 10: Exp3-IX.
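The pseudocode above can be sketched in Python as follows. This is only an illustrative implementation: the interface `reward_fn(t, arm)` returning a reward in [0, 1] and the Bernoulli test bandit at the bottom are assumptions of this sketch, not part of the book's presentation.

```python
import math
import random

def exp3_ix(n, k, eta, gamma, reward_fn, seed=0):
    """A minimal sketch of Exp3-IX (Algorithm 10 above).

    `reward_fn(t, arm)` must return a reward X_t in [0, 1]; this calling
    convention is a hypothetical choice made for the sketch.
    """
    rng = random.Random(seed)
    L_hat = [0.0] * k            # line 2: cumulative loss estimates, L̂_{0,i} = 0
    total_reward = 0.0
    P = [1.0 / k] * k
    for t in range(1, n + 1):
        # Line 4: P_{t,i} proportional to exp(-eta * L̂_{t-1,i}).
        # Subtracting the minimum before exponentiating changes nothing
        # mathematically and is only for numerical stability.
        m = min(L_hat)
        w = [math.exp(-eta * (L - m)) for L in L_hat]
        s = sum(w)
        P = [wi / s for wi in w]
        # Line 5: sample A_t ~ P_t and observe the reward X_t.
        A = rng.choices(range(k), weights=P)[0]
        X = reward_fn(t, A)
        total_reward += X
        # Line 6: implicit-exploration loss estimate. The extra gamma in the
        # denominator biases the estimate but keeps its variance under control.
        L_hat[A] += (1.0 - X) / (P[A] + gamma)
    return total_reward, P

# Usage on a made-up Bernoulli bandit, with eta = eta_1 from Theorem 12.1
# and gamma = eta / 2 as the theorem prescribes.
n, k = 20000, 5
means = [0.3, 0.5, 0.4, 0.7, 0.45]    # hypothetical arm means
arm_rng = random.Random(1)
eta = math.sqrt(2 * math.log(k + 1) / (n * k))
reward, P = exp3_ix(n, k, eta, eta / 2,
                    lambda t, a: float(arm_rng.random() < means[a]))
```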

12.2 Regret analysis


We now prove the following theorem bounding the random regret of Exp3-IX
with high probability.

Theorem 12.1. Let δ ∈ (0, 1) and define

    η₁ = √( 2 log(k + 1) / (nk) )   and   η₂ = √( (log(k) + log((k + 1)/δ)) / (nk) ).

The following statements hold:
1. If Exp3-IX is run with parameters η = η₁ and γ = η/2, then

    P( R̂_n ≥ √(8nk log(k + 1)) + √( nk / (2 log(k + 1)) ) log(1/δ) + log((k + 1)/δ) ) ≤ δ.   (12.5)
2. If Exp3-IX is run with parameters η = η₂ and γ = η/2, then

    P( R̂_n ≥ 2√( (2 log(k + 1) + log(1/δ)) nk ) + log((k + 1)/δ) ) ≤ δ.   (12.6)

The value of η₁ is independent of δ, which means that this choice of learning
rate leads to a single algorithm with a high-probability bound for all δ. On
the other hand, η₂ does depend on δ, so the user must choose a confidence
level from the beginning. The advantage is an improved bound, but only at the
specified confidence level. We will show in Chapter 17 that this tradeoff is
unavoidable.
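As a quick numerical sanity check of this tradeoff, one can evaluate the two learning rates and the right-hand sides of Eqs. (12.5) and (12.6) for concrete values. The horizon, arm count, and confidence level below are made up purely for illustration.

```python
import math

# Illustrative (made-up) horizon, number of arms, and confidence level.
n, k, delta = 100_000, 10, 0.01

# The two learning rates from Theorem 12.1.
eta1 = math.sqrt(2 * math.log(k + 1) / (n * k))
eta2 = math.sqrt((math.log(k) + math.log((k + 1) / delta)) / (n * k))

# Right-hand sides of the high-probability regret bounds (12.5) and (12.6).
bound1 = (math.sqrt(8 * n * k * math.log(k + 1))
          + math.sqrt(n * k / (2 * math.log(k + 1))) * math.log(1 / delta)
          + math.log((k + 1) / delta))
bound2 = (2 * math.sqrt((2 * math.log(k + 1) + math.log(1 / delta)) * n * k)
          + math.log((k + 1) / delta))

print(f"eta1 = {eta1:.5f}, eta2 = {eta2:.5f}")
print(f"bound (12.5) = {bound1:.0f}, bound (12.6) = {bound2:.0f}")
```

For these values the δ-dependent rate η₂ is larger than η₁ and the bound (12.6) comes out smaller than (12.5), matching the discussion above: tuning to a specific confidence level buys a tighter guarantee at that level.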

The proof follows by bounding each of the terms in Eq. (12.3), which we do via
a series of lemmas. The first of these lemmas is a new concentration bound, the