27.2 Regret analysis
1: Input: Action set $\mathcal{A} \subset \mathbb{R}^d$, learning rate $\eta$, exploration distribution $\pi$, exploration parameter $\gamma$
2: for $t = 1, 2, \ldots, n$ do
3:   Compute sampling distribution:
     \[
     P_t(a) = \gamma\,\pi(a) + (1-\gamma)\,\frac{\exp\bigl(-\eta \sum_{s=1}^{t-1} \hat{Y}_s(a)\bigr)}{\sum_{a' \in \mathcal{A}} \exp\bigl(-\eta \sum_{s=1}^{t-1} \hat{Y}_s(a')\bigr)}\,.
     \]
4:   Sample action $A_t \sim P_t$
5:   Observe loss $Y_t = \langle A_t, y_t \rangle$ and compute loss estimates:
     \[
     \hat{Y}_t = Q_t^{-1} A_t Y_t \qquad \text{and} \qquad \hat{Y}_t(a) = \langle a, \hat{Y}_t \rangle\,.
     \]
6: end for

Algorithm 15: Exp3 for linear bandits.

bounded by
\[
R_n \le \frac{\log k}{\eta} + 2\gamma n + \eta \sum_{t=1}^{n} \mathbb{E}\left[\sum_{a \in \mathcal{A}} P_t(a) \hat{Y}_t^2(a)\right]. \tag{27.2}
\]
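As a concrete illustration of Algorithm 15, here is a minimal Python sketch for a finite action set. The function name exp3_linear, the array layout of the inputs and the numerical-stability shift in the weights are assumptions made for this sketch, not part of the text.

import numpy as np

def exp3_linear(actions, losses, eta, gamma, pi, rng=None):
    # actions: (k, d) array, one action per row.
    # losses:  (n, d) array of adversarial loss vectors y_t with |<a, y_t>| <= 1.
    # pi:      (k,) exploration distribution whose support spans R^d, so Q_t is invertible.
    rng = np.random.default_rng() if rng is None else rng
    k, d = actions.shape
    cum_est = np.zeros(k)                              # running sums of \hat Y_s(a)
    total_loss = 0.0
    for y_vec in losses:
        # Step 3: exponential weights mixed with the exploration distribution
        w = np.exp(-eta * (cum_est - cum_est.min()))   # shift keeps the weights bounded
        p = gamma * pi + (1 - gamma) * w / w.sum()
        # Step 4: sample A_t ~ P_t
        a_t = actions[rng.choice(k, p=p)]
        # Step 5: observe Y_t = <A_t, y_t> and form the loss estimates
        y_t = float(a_t @ y_vec)
        total_loss += y_t
        q_t = (actions * p[:, None]).T @ actions       # Q_t = sum_a P_t(a) a a^T
        y_hat = np.linalg.solve(q_t, a_t) * y_t        # \hat Y_t = Q_t^{-1} A_t Y_t
        cum_est += actions @ y_hat                     # \hat Y_t(a) = <a, \hat Y_t>
    return total_loss

Subtracting the minimum cumulative estimate before exponentiating rescales numerator and denominator by the same factor, so it leaves $P_t$ unchanged while avoiding overflow.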


Note that we cannot use the proof that leads to the tighter constant ($\eta$ getting replaced by $\eta/2$ in the second term above) because there is no guarantee that the loss estimates are upper bounded by one. To get a regret bound it remains to set $\gamma$ and $\eta$ so that (27.1) is satisfied and to bound $\mathbb{E}\bigl[\sum_a P_t(a) \hat{Y}_t^2(a)\bigr]$. We start with the latter. Let $M_t = \sum_a P_t(a) \hat{Y}_t^2(a)$. By the definition of the loss estimate,

\[
\hat{Y}_t^2(a) = (a^\top Q_t^{-1} A_t Y_t)^2 = Y_t^2\, A_t^\top Q_t^{-1} a a^\top Q_t^{-1} A_t\,,
\]
which means that $M_t = \sum_a P_t(a) \hat{Y}_t^2(a) = Y_t^2\, A_t^\top Q_t^{-1} A_t \le A_t^\top Q_t^{-1} A_t = \operatorname{trace}(A_t A_t^\top Q_t^{-1})$, where the equality uses $Q_t = \sum_a P_t(a) a a^\top$ and the inequality uses $Y_t^2 \le 1$. By the linearity of trace,
\[
\mathbb{E}_t[M_t] \le \operatorname{trace}\left(\sum_{a \in \mathcal{A}} P_t(a) a a^\top Q_t^{-1}\right) = d\,.
\]
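The identities used here are easy to check numerically: $\sum_a P_t(a) \hat{Y}_t^2(a) = Y_t^2 A_t^\top Q_t^{-1} A_t$ and $\sum_a P_t(a)\, a^\top Q_t^{-1} a = \operatorname{trace}(Q_t Q_t^{-1}) = d$. The short Python check below does this on a randomly generated instance (the instance and all names are illustrative only).

import numpy as np

rng = np.random.default_rng(0)
k, d = 20, 5
actions = rng.standard_normal((k, d))        # finite action set spanning R^d
p = rng.dirichlet(np.ones(k))                # stand-in for the sampling distribution P_t
q_inv = np.linalg.inv((actions * p[:, None]).T @ actions)   # Q_t^{-1}

# One round: sample A_t and take some observed loss with |Y_t| <= 1
a_t = actions[rng.choice(k, p=p)]
y_t = 0.7
y_hat = q_inv @ a_t * y_t                    # \hat Y_t = Q_t^{-1} A_t Y_t
m_t = np.sum(p * (actions @ y_hat) ** 2)     # M_t = sum_a P_t(a) \hat Y_t(a)^2
assert np.isclose(m_t, y_t**2 * (a_t @ q_inv @ a_t))

# E_t[A_t^T Q_t^{-1} A_t] = trace(sum_a P_t(a) a a^T Q_t^{-1}) = d
expected = np.sum(p * np.einsum('id,de,ie->i', actions, q_inv, actions))
assert np.isclose(expected, float(d))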

It remains to choose $\gamma$ and $\eta$. Strengthen (27.1) to $|\eta \hat{Y}_t(a)| \le 1$ and note that since $|Y_t| \le 1$,
\[
|\eta \hat{Y}_t(a)| = |\eta\, a^\top Q_t^{-1} A_t Y_t| \le \eta\, |a^\top Q_t^{-1} A_t|\,.
\]

Recall that $Q(\pi) = \sum_{a \in \mathcal{A}} \pi(a) a a^\top$. Clearly $Q_t \succeq \gamma Q(\pi)$ and hence $Q_t^{-1} \preceq Q(\pi)^{-1}/\gamma$ by Exercise 27.4. Using this and the Cauchy-Schwarz inequality shows that
\[
|a^\top Q_t^{-1} A_t| \le \|a\|_{Q_t^{-1}} \|A_t\|_{Q_t^{-1}} \le \max_{\nu \in \mathcal{A}} \nu^\top Q_t^{-1} \nu \le \frac{1}{\gamma} \max_{\nu \in \mathcal{A}} \nu^\top Q(\pi)^{-1} \nu\,,
\]