Bandit Algorithms

11.4 Regret analysis

[Figure 11.2: Approximations for $\exp(x)$ on $[-1/2, 1/2]$. The plot shows the errors $\exp(x) - (1 + x)$, $\exp(x) - (1 + x + x^2)$ and $\exp(x) - (1 + x + x^2/2)$ for $x \in [-1/2, 1/2]$.]

Theorem 11.2. Let $x \in [0,1]^{n \times k}$ be an adversarial bandit and $\pi$ be the policy of Exp3 with learning rate $\eta = \sqrt{2\log(k)/(nk)}$. Then
\[
R_n(\pi, x) \le \sqrt{2nk\log(k)}.
\]
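As a concrete companion to the theorem, here is a minimal Python sketch of Exp3 with this learning rate. It assumes the loss-based importance-weighted estimator $\hat X_{tj} = 1 - \mathbb{I}\{A_t = j\}(1 - x_{tj})/P_{tj}$ used in this chapter; the function name exp3 and the argument loss_matrix are illustrative, not from the text.

    import numpy as np

    def exp3(loss_matrix, rng=None):
        """Minimal Exp3 sketch for an (n, k) matrix of losses y_tj in [0, 1]."""
        rng = np.random.default_rng() if rng is None else rng
        n, k = loss_matrix.shape
        eta = np.sqrt(2.0 * np.log(k) / (n * k))  # learning rate from Theorem 11.2
        S_hat = np.zeros(k)                       # cumulative reward estimates
        total_reward = 0.0
        for t in range(n):
            # Exponential-weights distribution P_tj (max subtracted for stability).
            w = np.exp(eta * (S_hat - S_hat.max()))
            P = w / w.sum()
            A = rng.choice(k, p=P)
            y = loss_matrix[t, A]                 # observed loss of the played arm
            total_reward += 1.0 - y
            # Importance-weighted estimate: X_hat_tj = 1 - I{A_t = j} * y_tj / P_tj.
            X_hat = np.ones(k)
            X_hat[A] -= y / P[A]
            S_hat += X_hat
        return total_reward

On a fixed loss matrix, the empirical regret (best column's total reward minus the collected reward), averaged over independent runs, can be checked against the $\sqrt{2nk\log(k)}$ bound.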

Proof. By construction $\hat X_{tj} \le 1$, so $\eta(\hat X_{tj} - 1) \le 0$ and therefore
\[
\exp\big(\eta \hat X_{tj}\big) = \exp(\eta)\exp\big(\eta(\hat X_{tj} - 1)\big)
\le \exp(\eta)\left\{1 + \eta\big(\hat X_{tj} - 1\big) + \frac{\eta^2}{2}\big(\hat X_{tj} - 1\big)^2\right\}.
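\]
The second step uses the bound $\exp(x) \le 1 + x + x^2/2$, valid for all $x \le 0$, applied with $x = \eta(\hat X_{tj} - 1) \le 0$; this is the bound visualized in Figure 11.2. A quick numerical sanity check (a sketch, not from the text):

    import numpy as np

    # Check exp(x) <= 1 + x + x**2 / 2 on a grid of non-positive x,
    # the quadratic bound used in the display above (cf. Figure 11.2).
    x = np.linspace(-10.0, 0.0, 100001)
    assert np.all(np.exp(x) <= 1.0 + x + x**2 / 2.0)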


Using the fact that $\sum_j P_{tj} = 1$ and the inequality $1 + x \le \exp(x)$ we get
\[
\frac{W_t}{W_{t-1}} = \sum_{j=1}^k P_{tj}\exp\big(\eta \hat X_{tj}\big)
\le \exp\left(\eta\sum_{j=1}^k P_{tj}\hat X_{tj} + \frac{\eta^2}{2}\sum_{j=1}^k P_{tj}\big(\hat X_{tj} - 1\big)^2\right),
\]
where the equality is from Eq. (11.11). It remains to bound $\sum_j P_{tj}(\hat X_{tj} - 1)^2$. Let $\hat Y_{tj} = 1 - \hat X_{tj}$. Then
\[
P_{tj}\big(\hat X_{tj} - 1\big)^2 = P_{tj}\hat Y_{tj}\hat Y_{tj} = \mathbb{I}\{A_t = j\}\,y_{tj}\hat Y_{tj} \le \hat Y_{tj},
\]
where the last inequality used $\hat Y_{tj} \ge 0$ and $y_{tj} \le 1$. Thus,
\[
\sum_{j=1}^k P_{tj}\big(\hat X_{tj} - 1\big)^2 \le \sum_{j=1}^k \hat Y_{tj}.
\]

With the same calculations as before, we get
\[
\hat S_{ni} - \hat S_n \le \frac{\log(k)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n\sum_{j=1}^k \hat Y_{tj}. \tag{11.15}
\]
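For the expectation step below, it may help to recall (assuming the importance-weighted loss estimator introduced earlier in the chapter) that $\hat Y_{tj} = \mathbb{I}\{A_t = j\}\,y_{tj}/P_{tj}$, so
\[
\mathbb{E}_t\big[\hat Y_{tj}\big] = P_{tj}\cdot\frac{y_{tj}}{P_{tj}} = y_{tj}.
\]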

The result is completed by taking expectations of both sides, using
\[
\mathbb{E}\left[\sum_{t,j}\hat Y_{tj}\right] = \mathbb{E}\left[\sum_{t,j}\mathbb{E}_t\big[\hat Y_{tj}\big]\right] = \mathbb{E}\left[\sum_{t,j} y_{tj}\right] \le nk,
\]
and substituting the learning rate.
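Spelling out the final substitution: taking expectations in Eq. (11.15) and using $\mathbb{E}\big[\sum_{t,j}\hat Y_{tj}\big] \le nk$ gives
\[
R_n(\pi, x) \le \frac{\log(k)}{\eta} + \frac{\eta nk}{2}
= \sqrt{\frac{nk\log(k)}{2}} + \sqrt{\frac{nk\log(k)}{2}}
= \sqrt{2nk\log(k)},
\]
where the last line plugs in $\eta = \sqrt{2\log(k)/(nk)}$; this relies on the regret being bounded by $\mathbb{E}[\hat S_{ni} - \hat S_n]$ for the optimal arm $i$, as established earlier in the chapter.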