function $\tilde P_t : \mathcal{A} \to [0,1]$ given by
\[
\tilde P_t(a) \propto \exp\left(-\eta \sum_{s=1}^{t-1} \hat Y_s(a)\right),
\]
where $\eta > 0$ is the learning rate. To control the variance of the loss estimates, it will be useful to mix this distribution with an exploration distribution $\pi$ ($\pi : \mathcal{A} \to [0,1]$ and $\sum_{a \in \mathcal{A}} \pi(a) = 1$). The mixture distribution is
\[
P_t(a) = (1-\gamma)\tilde P_t(a) + \gamma\, \pi(a),
\]
where $\gamma$ is a constant mixing factor to be chosen later. The algorithm then simply samples its action $A_t$ from $P_t$:
\[
A_t \sim P_t.
\]
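As a concrete illustration, the following sketch computes the exponential-weights distribution, mixes it with the exploration distribution and samples an action. It assumes the $k$ actions are indexed $0,\dots,k-1$ and that the cumulative loss estimates and exploration distribution are held in arrays; the names (`cum_loss_est`, `pi` and so on) are illustrative and not part of the text.

```python
import numpy as np

def sample_action(cum_loss_est, pi, eta, gamma, rng):
    """One sampling step: exponential weights mixed with exploration.

    cum_loss_est : shape (k,), sum_{s<t} hat{Y}_s(a) for each action a
    pi           : shape (k,), exploration distribution over the actions
    eta, gamma   : learning rate and mixing factor
    """
    # Exponential-weights distribution tilde{P}_t(a) proportional to
    # exp(-eta * cumulative loss estimate); subtracting the minimum only
    # rescales the weights and keeps the exponentials numerically stable.
    weights = np.exp(-eta * (cum_loss_est - cum_loss_est.min()))
    p_tilde = weights / weights.sum()
    # Mixture distribution P_t = (1 - gamma) * tilde{P}_t + gamma * pi.
    p = (1 - gamma) * p_tilde + gamma * pi
    # Sample the index of A_t from P_t.
    return rng.choice(len(p), p=p)
```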
Recall that $Y_t = \langle A_t, y_t\rangle$ is the observed loss after taking action $A_t$. We need a way to estimate $y_t(a) \doteq \langle a, y_t\rangle$. The idea is to use least squares to estimate $y_t$ with $\hat Y_t = R_t A_t Y_t$, where $R_t \in \mathbb{R}^{d\times d}$ is selected so that $\hat Y_t$ is an unbiased estimate of $y_t$ given the history. Then the loss for a given action is estimated by $\hat Y_t(a) = \langle a, \hat Y_t\rangle$.
To find the choice of $R_t$ that makes $\hat Y_t$ unbiased, let $\mathbb{E}_t[\cdot] = \mathbb{E}[\,\cdot \mid A_1,\dots,A_{t-1}]$ and calculate
\[
\mathbb{E}_t[\hat Y_t] = R_t\, \mathbb{E}_t[A_t A_t^\top]\, y_t = R_t \underbrace{\left(\sum_{a \in \mathcal{A}} P_t(a)\, a a^\top\right)}_{Q_t} y_t.
\]
Using $R_t = Q_t^{-1}$ leads to $\mathbb{E}_t[\hat Y_t] = y_t$ as desired. Of course $Q_t$ should be non-singular, which will follow by choosing $\pi$ so that
\[
Q(\pi) = \sum_{a \in \mathcal{A}} \pi(a)\, a a^\top
\]
is non-singular. The complete algorithm is summarized in Algorithm 15.
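To make the estimation step concrete, here is a minimal sketch of one round of the least-squares estimator described above. It assumes, purely for illustration, that the action vectors are stored as the rows of an array `actions`, that `p` is the mixture distribution $P_t$, and that `a_idx` and `y_obs` are the index of the played action $A_t$ and the observed loss $Y_t$; the function forms $Q_t$, solves for $\hat Y_t = Q_t^{-1} A_t Y_t$ and accumulates $\hat Y_t(a) = \langle a, \hat Y_t\rangle$ for the next round.

```python
import numpy as np

def update_estimates(actions, p, a_idx, y_obs, cum_loss_est):
    """Least-squares loss estimate for one round (illustrative sketch).

    actions      : shape (k, d), one action vector per row
    p            : shape (k,), the mixture distribution P_t
    a_idx, y_obs : index of the played action A_t and observed loss Y_t = <A_t, y_t>
    cum_loss_est : shape (k,), running sums of hat{Y}_s(a), updated in place
    """
    # Q_t = sum_a P_t(a) a a^T, non-singular as long as Q(pi) is non-singular.
    Q = actions.T @ (actions * p[:, None])
    # hat{Y}_t = Q_t^{-1} A_t Y_t; solving the linear system avoids an explicit inverse.
    y_hat = np.linalg.solve(Q, actions[a_idx]) * y_obs
    # hat{Y}_t(a) = <a, hat{Y}_t> for every action, accumulated for the next round.
    cum_loss_est += actions @ y_hat
    return cum_loss_est
```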
27.2 Regret analysis
Theorem 27.1. Assume that $\operatorname{span}(\mathcal{A}) = \mathbb{R}^d$. There exists an exploration distribution $\pi$ and parameters $\eta$ and $\gamma$ such that for all $(y_t)_t$ with $y_t \in \mathcal{L}$ the regret of Algorithm 15 is at most $R_n \le 2\sqrt{3 d n \log(k)}$.
Proof Assume that the learning rate $\eta$ is chosen so that for each round $t$ the loss estimates satisfy
\[
\eta \hat Y_t(a) \ge -1, \qquad \forall a \in \mathcal{A}. \tag{27.1}
\]
Then by modifying the proof of Theorem 11.1 (see Exercise 27.1) the regret is