27.2 Regret analysis

1: Input: Action set $\mathcal{A} \subset \mathbb{R}^d$, learning rate $\eta$, exploration distribution $\pi$, exploration parameter $\gamma$
2: for $t = 1, 2, \ldots, n$ do
3:   Compute sampling distribution:
$$P_t(a) = \gamma\pi(a) + (1 - \gamma)\,\frac{\exp\bigl(-\eta \sum_{s=1}^{t-1} \hat{Y}_s(a)\bigr)}{\sum_{a' \in \mathcal{A}} \exp\bigl(-\eta \sum_{s=1}^{t-1} \hat{Y}_s(a')\bigr)}\,.$$
4:   Sample action $A_t \sim P_t$
5:   Observe loss $Y_t = \langle A_t, y_t \rangle$ and compute loss estimates:
$$\hat{Y}_t = Q_t^{-1} A_t Y_t \quad\text{and}\quad \hat{Y}_t(a) = \langle a, \hat{Y}_t \rangle\,.$$
6: end for

Algorithm 15: Exp3 for linear bandits.
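For concreteness, here is a minimal sketch of Algorithm 15 for a finite action set, written in Python with numpy. The function name `exp3_linear` and its inputs are illustrative, not from the book, and the sketch assumes the actions span $\mathbb{R}^d$ and $\pi$ has full support, so that $Q_t = \sum_a P_t(a) a a^\top$ is invertible.

```python
import numpy as np

def exp3_linear(actions, losses, pi, eta, gamma, rng=None):
    """Sketch of Algorithm 15. actions: (k, d) array; losses: (n, d) array
    of the adversary's loss vectors y_t; pi: length-k exploration
    distribution. Returns the total incurred loss."""
    rng = np.random.default_rng() if rng is None else rng
    k, d = actions.shape
    cum_est = np.zeros(k)   # cumulative estimated losses sum_s Yhat_s(a)
    total = 0.0
    for y_t in losses:
        # Step 3: exponential weights mixed with the exploration distribution.
        w = np.exp(-eta * (cum_est - cum_est.min()))  # shift for stability
        P = gamma * pi + (1.0 - gamma) * w / w.sum()
        # Step 4: sample A_t ~ P_t.
        idx = rng.choice(k, p=P)
        A_t = actions[idx]
        # Step 5: observe Y_t = <A_t, y_t> and form the loss estimate
        # Yhat_t = Q_t^{-1} A_t Y_t with Q_t = sum_a P_t(a) a a^T.
        Y_t = A_t @ y_t
        Q_t = (actions.T * P) @ actions
        Yhat_t = np.linalg.solve(Q_t, A_t) * Y_t
        cum_est += actions @ Yhat_t   # Yhat_t(a) = <a, Yhat_t> for every a
        total += Y_t
    return total
```

Subtracting the minimum before exponentiating rescales numerator and denominator alike, so $P_t$ is unchanged; it merely avoids overflow when the cumulative estimates grow large.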
By the same argument as for the standard Exp3 algorithm, the regret of Algorithm 15 on an action set of size $k = |\mathcal{A}|$ is bounded by
$$R_n \le \frac{\log k}{\eta} + 2\gamma n + \eta \sum_{t=1}^{n} \mathbb{E}\left[\sum_{a \in \mathcal{A}} P_t(a) \hat{Y}_t^2(a)\right]. \tag{27.2}$$
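The last term in (27.2) is the expected second moment of the loss estimates. As a sanity check, the following sketch (hypothetical data, assuming numpy) confirms that $\hat{Y}_t$ is unbiased for $y_t$, and illustrates that individual estimates $\hat{Y}_t(a)$ need not be bounded by one, which is the obstacle noted next.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 5, 3
actions = rng.standard_normal((k, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)  # unit-norm actions
y_t = rng.standard_normal(d)
y_t *= 0.9 / np.linalg.norm(y_t)        # ensures |Y_t| = |<a, y_t>| < 1
P = np.array([0.9, 0.025, 0.025, 0.025, 0.025])  # skewed sampling distribution
Q_inv = np.linalg.inv((actions.T * P) @ actions)  # Q_t^{-1}

# Unbiasedness: E_t[Yhat_t] = Q_t^{-1} (sum_a P_t(a) a a^T) y_t = y_t.
mean_est = sum(p * (Q_inv @ a) * (a @ y_t) for p, a in zip(P, actions))
assert np.allclose(mean_est, y_t)

# Magnitude: realised values Yhat_t(a) = a^T Q_t^{-1} A_t Y_t, one column per
# possible A_t; rarely sampled directions typically push these well above one.
all_vals = (actions @ Q_inv @ actions.T) * (actions @ y_t)
print(np.abs(all_vals).max())
```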
Note that we cannot use the proof that leads to the tighter constant ($\eta$ getting replaced by $\eta/2$ in the second term above) because there is no guarantee that the loss estimates are upper bounded by one. To get a regret bound it remains to set $\gamma$ and $\eta$ so that (27.1) is satisfied and to bound $\mathbb{E}\bigl[\sum_a P_t(a) \hat{Y}_t^2(a)\bigr]$. We start with the latter. Let $M_t = \sum_a P_t(a) \hat{Y}_t^2(a)$. By the definition of the loss estimate,
$$\hat{Y}_t^2(a) = (a^\top Q_t^{-1} A_t Y_t)^2 = Y_t^2 A_t^\top Q_t^{-1} a a^\top Q_t^{-1} A_t\,,$$
which means that
$$M_t = \sum_a P_t(a) \hat{Y}_t^2(a) = Y_t^2 A_t^\top Q_t^{-1} A_t \le A_t^\top Q_t^{-1} A_t = \operatorname{trace}(A_t A_t^\top Q_t^{-1})\,,$$
where the inequality uses that $Y_t^2 \le 1$. By the linearity of the trace,
$$\mathbb{E}_t[M_t] \le \operatorname{trace}\Bigl(\sum_{a \in \mathcal{A}} P_t(a) a a^\top Q_t^{-1}\Bigr) = \operatorname{trace}(Q_t Q_t^{-1}) = d\,.$$
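The final equality holds because $\sum_a P_t(a) a a^\top = Q_t$, so the trace collapses to $\operatorname{trace}(I_d) = d$. A quick numerical check of this identity, assuming numpy and with arbitrary test data:

```python
import numpy as np

rng = np.random.default_rng(1)
k, d = 7, 4
actions = rng.standard_normal((k, d))   # arbitrary spanning action set
P = rng.dirichlet(np.ones(k))           # arbitrary sampling distribution P_t
Q = (actions.T * P) @ actions           # Q_t = sum_a P_t(a) a a^T

# sum_a P_t(a) a^T Q_t^{-1} a = trace(Q_t^{-1} Q_t) = d, whatever P_t is.
expected_quad = sum(p * (a @ np.linalg.solve(Q, a)) for p, a in zip(P, actions))
assert np.isclose(expected_quad, d)     # equals d exactly, up to rounding
```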
It remains to choose $\gamma$ and $\eta$. Strengthen (27.1) to $|\eta \hat{Y}_t(a)| \le 1$ and note that since $|Y_t| \le 1$,
$$|\eta \hat{Y}_t(a)| = |\eta a^\top Q_t^{-1} A_t Y_t| \le \eta |a^\top Q_t^{-1} A_t|\,.$$
Recall that $Q(\pi) = \sum_{a \in \mathcal{A}} \pi(a) a a^\top$. Clearly $Q_t \succeq \gamma Q(\pi)$ and hence $Q_t^{-1} \preceq Q(\pi)^{-1}/\gamma$ by Exercise 27.4. Using this and the Cauchy-Schwarz inequality shows that
$$|a^\top Q_t^{-1} A_t| \le \|a\|_{Q_t^{-1}} \|A_t\|_{Q_t^{-1}} \le \max_{\nu \in \mathcal{A}} \nu^\top Q_t^{-1} \nu \le \frac{1}{\gamma} \max_{\nu \in \mathcal{A}} \nu^\top Q(\pi)^{-1} \nu\,,$$