27.2 Regret analysis
1: Input: Action set $\mathcal{A} \subset \mathbb{R}^d$, learning rate $\eta$, exploration distribution $\pi$, exploration parameter $\gamma$
2: for $t = 1, 2, \ldots, n$ do
3:   Compute sampling distribution:
     \[
     P_t(a) = \gamma \pi(a) + (1 - \gamma)\, \frac{\exp\bigl(-\eta \sum_{s=1}^{t-1} \hat Y_s(a)\bigr)}{\sum_{a' \in \mathcal{A}} \exp\bigl(-\eta \sum_{s=1}^{t-1} \hat Y_s(a')\bigr)}\,.
     \]
4:   Sample action $A_t \sim P_t$
5:   Observe loss $Y_t = \langle A_t, y_t\rangle$ and compute loss estimates:
     \[
     \hat Y_t = Q_t^{-1} A_t Y_t \qquad \text{and} \qquad \hat Y_t(a) = \langle a, \hat Y_t\rangle\,.
     \]
6: end for

Algorithm 15: Exp3 for linear bandits.
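As a concrete illustration, here is a minimal NumPy sketch of Algorithm 15. The function name exp3_linear_bandit and the environment callback loss_vector are illustrative choices, not from the text; the sketch assumes a finite action set whose span is $\mathbb{R}^d$ (so that $Q_t$ is invertible) and losses satisfying $|Y_t| \le 1$.

\begin{verbatim}
import numpy as np

def exp3_linear_bandit(actions, loss_vector, n, eta, gamma, pi, rng=None):
    """Sketch of Exp3 for linear bandits (Algorithm 15).

    actions:     (k, d) array whose rows are the actions in the finite set A
    loss_vector: callable t -> y_t in R^d chosen by the adversary (|<a, y_t>| <= 1)
    pi:          length-k exploration distribution over the rows of `actions`
    """
    rng = np.random.default_rng() if rng is None else rng
    k, d = actions.shape
    cum_est = np.zeros(k)                # sum_{s < t} hat Y_s(a) for each action a
    observed_losses = []
    for t in range(1, n + 1):
        # Step 3: mixture of exploration and exponentially weighted distribution.
        # Subtracting the minimum rescales numerator and denominator alike, so the
        # distribution is unchanged; it only avoids numerical underflow.
        w = np.exp(-eta * (cum_est - cum_est.min()))
        P = gamma * pi + (1.0 - gamma) * w / w.sum()
        # Step 4: sample A_t ~ P_t.
        idx = rng.choice(k, p=P)
        A_t = actions[idx]
        # Step 5: observe Y_t = <A_t, y_t> and form the loss estimates.
        Y_t = A_t @ loss_vector(t)
        observed_losses.append(Y_t)
        Q_t = actions.T @ (P[:, None] * actions)   # Q_t = sum_a P_t(a) a a^T
        hat_Y = np.linalg.solve(Q_t, A_t) * Y_t    # hat Y_t = Q_t^{-1} A_t Y_t
        cum_est += actions @ hat_Y                 # hat Y_t(a) = <a, hat Y_t>
    return np.array(observed_losses)
\end{verbatim}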
bounded by
\[
R_n \le \frac{\log k}{\eta} + 2\gamma n + \eta \sum_{t=1}^{n} \mathbb{E}\left[\sum_{a \in \mathcal{A}} P_t(a)\, \hat Y_t^2(a)\right]. \tag{27.2}
\]
Note that we cannot use the proof that leads to the tighter constant ($\eta$ getting replaced by $\eta/2$ in the second term above) because there is no guarantee that the loss estimates are upper bounded by one. To get a regret bound it remains to set $\gamma$ and $\eta$ so that (27.1) is satisfied and to bound $\mathbb{E}\bigl[\sum_a P_t(a) \hat Y_t^2(a)\bigr]$. We start with the latter. Let $M_t = \sum_a P_t(a) \hat Y_t^2(a)$. By the definition of the loss estimate,
\[
\hat Y_t^2(a) = (a^\top Q_t^{-1} A_t Y_t)^2 = Y_t^2\, A_t^\top Q_t^{-1} a a^\top Q_t^{-1} A_t\,,
\]
which means that
\[
M_t = \sum_a P_t(a) \hat Y_t^2(a)
    = Y_t^2\, A_t^\top Q_t^{-1}\Bigl(\sum_a P_t(a)\, a a^\top\Bigr) Q_t^{-1} A_t
    = Y_t^2\, A_t^\top Q_t^{-1} A_t
    \le A_t^\top Q_t^{-1} A_t
    = \operatorname{trace}(A_t A_t^\top Q_t^{-1})\,,
\]
where the second equality uses $Q_t = \sum_a P_t(a)\, a a^\top$ and the inequality uses $Y_t^2 \le 1$. By the linearity of trace,
\[
\mathbb{E}_t[M_t] \le \operatorname{trace}\Bigl(\sum_{a \in \mathcal{A}} P_t(a)\, a a^\top Q_t^{-1}\Bigr) = d\,.
\]
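The identity behind the last display can be checked numerically: since $Q_t = \sum_a P_t(a)\, a a^\top$, the trace equals $\operatorname{trace}(Q_t Q_t^{-1}) = \operatorname{trace}(I_d) = d$. A short sketch with an arbitrary action set and sampling distribution (all names illustrative):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
k, d = 7, 3
actions = rng.normal(size=(k, d))        # rows span R^d almost surely
P = rng.dirichlet(np.ones(k))            # an arbitrary sampling distribution P_t
Q = actions.T @ (P[:, None] * actions)   # Q_t = sum_a P_t(a) a a^T

# trace(sum_a P_t(a) a a^T Q_t^{-1}) = trace(Q_t Q_t^{-1}) = trace(I_d) = d
print(np.trace(Q @ np.linalg.inv(Q)))    # prints 3.0 (up to rounding)
\end{verbatim}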
It remains to choose $\gamma$ and $\eta$. Strengthen (27.1) to $|\eta \hat Y_t(a)| \le 1$ and note that since $|Y_t| \le 1$,
\[
|\eta \hat Y_t(a)| = |\eta\, a^\top Q_t^{-1} A_t Y_t| \le \eta\, |a^\top Q_t^{-1} A_t|\,.
\]
Recall that $Q(\pi) = \sum_{a \in \mathcal{A}} \pi(a)\, a a^\top$. Clearly $Q_t \succeq \gamma Q(\pi)$ and hence $Q_t^{-1} \preceq Q(\pi)^{-1}/\gamma$ by Exercise 27.4. Using this and the Cauchy-Schwarz inequality shows that
\[
|a^\top Q_t^{-1} A_t| \le \|a\|_{Q_t^{-1}} \|A_t\|_{Q_t^{-1}} \le \max_{\nu \in \mathcal{A}} \nu^\top Q_t^{-1} \nu \le \frac{1}{\gamma} \max_{\nu \in \mathcal{A}} \nu^\top Q(\pi)^{-1} \nu\,,
\]