11.4 Regret analysis
[Figure 11.2: The approximation errors $\exp(x)-(1+x)$, $\exp(x)-(1+x+x^2)$ and $\exp(x)-(1+x+x^2/2)$ for $x \in [-1/2, 1/2]$.]
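As a quick numerical companion to the figure, the three remainder curves can be tabulated directly. This is a sketch; the grid resolution and tolerance are arbitrary choices, not from the text.

```python
import math

# Tabulate the three remainder curves from Figure 11.2 on [-1/2, 1/2].
xs = [-0.5 + i / 1000 for i in range(1001)]
r1 = [math.exp(x) - (1 + x) for x in xs]             # exp(x) - (1 + x)
r2 = [math.exp(x) - (1 + x + x**2) for x in xs]      # exp(x) - (1 + x + x^2)
r3 = [math.exp(x) - (1 + x + x**2 / 2) for x in xs]  # exp(x) - (1 + x + x^2/2)

# exp(x) >= 1 + x everywhere, so the first remainder is nonnegative.
assert min(r1) >= 0
# exp(x) <= 1 + x + x^2/2 for x <= 0: the inequality the proof relies on,
# since there it is applied with x = eta * (Xhat_tj - 1) <= 0.
assert all(r3[i] <= 1e-12 for i, x in enumerate(xs) if x <= 0)
```

The second curve, $\exp(x)-(1+x+x^2)$, is plotted in the figure for comparison but is not used in the proof.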
Theorem 11.2. Let $x \in [0,1]^{n \times k}$ be an adversarial bandit and $\pi$ be the policy of Exp3 with learning rate $\eta = \sqrt{2\log(k)/(nk)}$. Then
\[
R_n(\pi, x) \le \sqrt{2nk\log(k)}.
\]
Proof. By construction $\hat X_{tj} \le 1$. Therefore, using $\exp(x) \le 1 + x + x^2/2$ for $x \le 0$ (see Figure 11.2) with $x = \eta(\hat X_{tj} - 1)$,
\[
\exp\bigl(\eta \hat X_{tj}\bigr)
= \exp(\eta)\exp\bigl(\eta(\hat X_{tj} - 1)\bigr)
\le \exp(\eta)\left\{1 + \eta(\hat X_{tj} - 1) + \frac{\eta^2}{2}(\hat X_{tj} - 1)^2\right\}.
\]
Using the fact that $\sum_j P_{tj} = 1$ and the inequality $1 + x \le \exp(x)$, we get
\[
\frac{W_t}{W_{t-1}}
= \sum_{j=1}^k P_{tj}\exp\bigl(\eta \hat X_{tj}\bigr)
\le \exp\left(\eta \sum_{j=1}^k P_{tj}\hat X_{tj} + \frac{\eta^2}{2}\sum_{j=1}^k P_{tj}(\hat X_{tj} - 1)^2\right),
\]
where the equality is from Eq. (11.11). We see that here we need to bound $\sum_j P_{tj}(\hat X_{tj} - 1)^2$. Let $\hat Y_{tj} = 1 - \hat X_{tj}$. Then
\[
P_{tj}(\hat X_{tj} - 1)^2 = P_{tj}\hat Y_{tj}\hat Y_{tj} = \mathbb{I}\{A_t = j\}\,y_{tj}\,\hat Y_{tj} \le \hat Y_{tj},
\]
where the last inequality used $\hat Y_{tj} \ge 0$ and $y_{tj} \le 1$. Thus,
\[
\sum_{j=1}^k P_{tj}(\hat X_{tj} - 1)^2 \le \sum_{j=1}^k \hat Y_{tj}.
\]
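This variance bound is easy to probe numerically. The sketch below (random distributions, losses and sample count are arbitrary choices, not from the text) draws $A_t \sim P_t$, forms $\hat Y_{tj} = \mathbb{I}\{A_t = j\}\,y_{tj}/P_{tj}$ and checks $\sum_j P_{tj}\hat Y_{tj}^2 \le \sum_j \hat Y_{tj}$:

```python
import random

# Check  sum_j P_tj * (Xhat_tj - 1)^2 <= sum_j Yhat_tj  on random instances,
# where Yhat_tj = I{A_t = j} * y_tj / P_tj with losses y_tj in [0, 1].
random.seed(1)
k = 5
for _ in range(1000):
    w = [random.random() + 0.01 for _ in range(k)]  # random positive weights
    z = sum(w)
    p = [wj / z for wj in w]                         # sampling distribution P_t
    y = [random.random() for _ in range(k)]          # losses y_tj in [0, 1]
    a = random.choices(range(k), weights=p)[0]       # A_t ~ P_t
    y_hat = [y[j] / p[j] if j == a else 0.0 for j in range(k)]
    lhs = sum(p[j] * y_hat[j] ** 2 for j in range(k))  # sum_j P_tj (Xhat-1)^2
    rhs = sum(y_hat)                                   # sum_j Yhat_tj
    assert lhs <= rhs + 1e-12
```

Only the chosen arm contributes, so the check reduces to $y_{ta}^2/P_{ta} \le y_{ta}/P_{ta}$, which holds precisely because $y_{ta} \le 1$, mirroring the argument above.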
With the same calculations as before, we get
\[
\hat S_{ni} - \hat S_n \le \frac{\log(k)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \sum_{j=1}^k \hat Y_{tj}. \tag{11.15}
\]
The result is completed by taking expectations of both sides, using
\[
\mathbb{E}\sum_{t,j}\hat Y_{tj} = \mathbb{E}\sum_{t,j}\mathbb{E}_t\hat Y_{tj} = \mathbb{E}\sum_{t,j} y_{tj} \le nk,
\]
and substituting the learning rate.
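For completeness, the final substitution can be written out as a short worked check, using only quantities defined above: with $\eta = \sqrt{2\log(k)/(nk)}$ the two terms of the bound are balanced,

```latex
\frac{\log(k)}{\eta} + \frac{\eta}{2}\, nk
= \log(k)\sqrt{\frac{nk}{2\log(k)}} + \frac{nk}{2}\sqrt{\frac{2\log(k)}{nk}}
= \sqrt{\frac{nk\log(k)}{2}} + \sqrt{\frac{nk\log(k)}{2}}
= \sqrt{2nk\log(k)},
```

which is the stated regret bound; this choice of $\eta$ in fact minimizes $\log(k)/\eta + \eta nk/2$ over $\eta > 0$.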