27.2 Regret analysis
1: Input: Action set $\mathcal{A} \subset \mathbb{R}^d$, learning rate $\eta$, exploration distribution $\pi$, exploration parameter $\gamma$
2: for $t = 1, 2, \ldots, n$ do
3:   Compute sampling distribution:
     \[
     P_t(a) = \gamma \pi(a) + (1 - \gamma)\, \frac{\exp\bigl(-\eta \sum_{s=1}^{t-1} \hat Y_s(a)\bigr)}{\sum_{a' \in \mathcal{A}} \exp\bigl(-\eta \sum_{s=1}^{t-1} \hat Y_s(a')\bigr)}\,.
     \]
4:   Sample action $A_t \sim P_t$
5:   Observe loss $Y_t = \langle A_t, y_t\rangle$ and compute loss estimates:
     \[
     \hat Y_t = Q_t^{-1} A_t Y_t \qquad \text{and} \qquad \hat Y_t(a) = \langle a, \hat Y_t\rangle\,.
     \]
6: end for

Algorithm 15: Exp3 for linear bandits.
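As a concrete illustration, here is a minimal NumPy sketch of Algorithm 15. The function name exp3_linear_bandit and the environment callback loss_vector are illustrative choices, not from the text; the sketch assumes a finite action set whose span is $\mathbb{R}^d$ (so that $Q_t$ is invertible) and losses satisfying $|Y_t| \le 1$.

\begin{verbatim}
import numpy as np

def exp3_linear_bandit(actions, loss_vector, n, eta, gamma, pi, rng=None):
    """Sketch of Exp3 for linear bandits (Algorithm 15).

    actions:     (k, d) array whose rows are the actions in the finite set A
    loss_vector: callable t -> y_t in R^d chosen by the adversary (|<a, y_t>| <= 1)
    pi:          length-k exploration distribution over the rows of `actions`
    """
    rng = np.random.default_rng() if rng is None else rng
    k, d = actions.shape
    cum_est = np.zeros(k)                # sum_{s < t} hat Y_s(a) for each action a
    observed_losses = []
    for t in range(1, n + 1):
        # Step 3: mixture of exploration and exponentially weighted distribution.
        # Subtracting the minimum rescales numerator and denominator alike, so the
        # distribution is unchanged; it only avoids numerical underflow.
        w = np.exp(-eta * (cum_est - cum_est.min()))
        P = gamma * pi + (1.0 - gamma) * w / w.sum()
        # Step 4: sample A_t ~ P_t.
        idx = rng.choice(k, p=P)
        A_t = actions[idx]
        # Step 5: observe Y_t = <A_t, y_t> and form the loss estimates.
        Y_t = A_t @ loss_vector(t)
        observed_losses.append(Y_t)
        Q_t = actions.T @ (P[:, None] * actions)   # Q_t = sum_a P_t(a) a a^T
        hat_Y = np.linalg.solve(Q_t, A_t) * Y_t    # hat Y_t = Q_t^{-1} A_t Y_t
        cum_est += actions @ hat_Y                 # hat Y_t(a) = <a, hat Y_t>
    return np.array(observed_losses)
\end{verbatim}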
bounded by
\[
R_n \le \frac{\log k}{\eta} + 2\gamma n + \eta \sum_{t=1}^{n} \mathbb{E}\left[\sum_{a \in \mathcal{A}} P_t(a)\, \hat Y_t^2(a)\right]. \tag{27.2}
\]
Note that we cannot use the proof that leads to the tighter constant ($\eta$ getting replaced by $\eta/2$ in the second term above) because there is no guarantee that the loss estimates are upper bounded by one. To get a regret bound it remains to set $\gamma$ and $\eta$ so that (27.1) is satisfied and to bound $\mathbb{E}\bigl[\sum_a P_t(a) \hat Y_t^2(a)\bigr]$. We start with the latter. Let $M_t = \sum_a P_t(a) \hat Y_t^2(a)$. By the definition of the loss estimate,
\[
\hat Y_t^2(a) = (a^\top Q_t^{-1} A_t Y_t)^2 = Y_t^2\, A_t^\top Q_t^{-1} a a^\top Q_t^{-1} A_t\,,
\]
which means that
\[
M_t = \sum_a P_t(a) \hat Y_t^2(a)
    = Y_t^2\, A_t^\top Q_t^{-1}\Bigl(\sum_a P_t(a)\, a a^\top\Bigr) Q_t^{-1} A_t
    = Y_t^2\, A_t^\top Q_t^{-1} A_t
    \le A_t^\top Q_t^{-1} A_t
    = \operatorname{trace}(A_t A_t^\top Q_t^{-1})\,,
\]
where the second equality uses $Q_t = \sum_a P_t(a)\, a a^\top$ and the inequality uses $Y_t^2 \le 1$. By the linearity of trace,
\[
\mathbb{E}_t[M_t] \le \operatorname{trace}\Bigl(\sum_{a \in \mathcal{A}} P_t(a)\, a a^\top Q_t^{-1}\Bigr) = d\,.
\]
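The identity behind the last display can be checked numerically: since $Q_t = \sum_a P_t(a)\, a a^\top$, the trace equals $\operatorname{trace}(Q_t Q_t^{-1}) = \operatorname{trace}(I_d) = d$. A short sketch with an arbitrary action set and sampling distribution (all names illustrative):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
k, d = 7, 3
actions = rng.normal(size=(k, d))        # rows span R^d almost surely
P = rng.dirichlet(np.ones(k))            # an arbitrary sampling distribution P_t
Q = actions.T @ (P[:, None] * actions)   # Q_t = sum_a P_t(a) a a^T

# trace(sum_a P_t(a) a a^T Q_t^{-1}) = trace(Q_t Q_t^{-1}) = trace(I_d) = d
print(np.trace(Q @ np.linalg.inv(Q)))    # prints 3.0 (up to rounding)
\end{verbatim}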
It remains to choose $\gamma$ and $\eta$. Strengthen (27.1) to $|\eta \hat Y_t(a)| \le 1$ and note that since $|Y_t| \le 1$,
\[
|\eta \hat Y_t(a)| = |\eta\, a^\top Q_t^{-1} A_t Y_t| \le \eta\, |a^\top Q_t^{-1} A_t|\,.
\]
Recall that $Q(\pi) = \sum_{a \in \mathcal{A}} \pi(a)\, a a^\top$. Clearly $Q_t \succeq \gamma Q(\pi)$ and hence $Q_t^{-1} \preceq Q(\pi)^{-1}/\gamma$ by Exercise 27.4. Using this and the Cauchy-Schwarz inequality shows that
\[
|a^\top Q_t^{-1} A_t| \le \|a\|_{Q_t^{-1}} \|A_t\|_{Q_t^{-1}} \le \max_{\nu \in \mathcal{A}} \nu^\top Q_t^{-1} \nu \le \frac{1}{\gamma} \max_{\nu \in \mathcal{A}} \nu^\top Q(\pi)^{-1} \nu\,,
\]