Bandit Algorithms


function $\tilde P_t : \mathcal{A} \to [0,1]$ given by
$$\tilde P_t(a) \propto \exp\left(-\eta \sum_{s=1}^{t-1} \hat Y_s(a)\right),$$
where $\eta > 0$ is the learning rate. To control the variance of the loss estimates, it will be useful to mix this distribution with an exploration distribution $\pi$ ($\pi : \mathcal{A} \to [0,1]$ and $\sum_{a \in \mathcal{A}} \pi(a) = 1$). The mixture distribution is
$$P_t(a) = (1 - \gamma)\tilde P_t(a) + \gamma \pi(a),$$
where $\gamma$ is a constant mixing factor to be chosen later. The algorithm then simply samples its action $A_t$ from $P_t$:
$$A_t \sim P_t.$$
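
As a concrete illustration, the sampling step might be implemented as in the following sketch. It assumes the finite action set is stored as the rows of a numpy array and that rng is a numpy.random.Generator (e.g. np.random.default_rng()); the function and variable names are illustrative, not taken from the book.

```python
import numpy as np

def sample_action(actions, loss_estimates, eta, gamma, pi, rng):
    """Sample A_t from P_t = (1 - gamma) * P_tilde_t + gamma * pi.

    actions        -- (k, d) array, one action vector per row
    loss_estimates -- (k,) array holding sum_{s < t} hat{Y}_s(a) for each action
    eta, gamma     -- learning rate and mixing factor
    pi             -- (k,) exploration distribution over the actions
    """
    # Exponential weights on the accumulated loss estimates; subtracting the
    # minimum is a standard numerical-stability trick that leaves the
    # normalized distribution unchanged.
    weights = np.exp(-eta * (loss_estimates - loss_estimates.min()))
    p_tilde = weights / weights.sum()
    # Mix with the exploration distribution to control the variance of the
    # loss estimates.
    p = (1 - gamma) * p_tilde + gamma * pi
    # Sample the index of the action A_t played in this round.
    a_index = rng.choice(len(actions), p=p)
    return p, a_index
```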

Recall that $Y_t = \langle A_t, y_t\rangle$ is the observed loss after taking action $A_t$. We need a way to estimate $y_t(a) \doteq \langle a, y_t\rangle$. The idea is to use least squares to estimate $y_t$ with $\hat Y_t = R_t A_t Y_t$, where $R_t \in \mathbb{R}^{d \times d}$ is selected so that $\hat Y_t$ is an unbiased estimate of $y_t$ given the history. Then the loss for a given action is estimated by $\hat Y_t(a) = \langle a, \hat Y_t\rangle$. To find the choice of $R_t$ that makes $\hat Y_t$ unbiased, let $\mathbb{E}_t[\cdot] = \mathbb{E}[\,\cdot \mid A_1,\ldots,A_{t-1}]$ and calculate

$$\mathbb{E}_t[\hat Y_t] = R_t\, \mathbb{E}_t[A_t A_t^\top]\, y_t = R_t \underbrace{\left(\sum_{a \in \mathcal{A}} P_t(a)\, a a^\top\right)}_{Q_t} y_t.$$
Using $R_t = Q_t^{-1}$ leads to $\mathbb{E}_t[\hat Y_t] = y_t$ as desired. Of course $Q_t$ should be non-singular, which will follow by choosing $\pi$ so that
$$Q(\pi) = \sum_{a \in \mathcal{A}} \pi(a)\, a a^\top$$
is non-singular. The complete algorithm is summarized in Algorithm 15.
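
To make the estimation step concrete, here is a minimal numpy sketch of computing $Q_t$ and the unbiased loss estimate, under the same illustrative conventions as the sampling sketch above (finite action set as rows of an array; names are not taken from Algorithm 15).

```python
import numpy as np

def estimate_losses(actions, p, a_index, Y_t):
    """Estimate hat{Y}_t(a) = <a, hat{Y}_t> for every action after observing Y_t.

    actions -- (k, d) array of action vectors, assumed to span R^d
    p       -- (k,) sampling distribution P_t used in this round
    a_index -- index of the played action A_t = actions[a_index]
    Y_t     -- observed scalar loss <A_t, y_t>
    """
    # Q_t = sum_a P_t(a) a a^T; non-singular whenever Q(pi) is, since gamma > 0.
    Q_t = (actions.T * p) @ actions
    # hat{Y}_t = Q_t^{-1} A_t Y_t, computed by solving a linear system rather
    # than forming the inverse explicitly.
    y_hat = np.linalg.solve(Q_t, actions[a_index]) * Y_t
    # Per-action estimates hat{Y}_t(a), to be added to the accumulated losses
    # that drive the exponential weights in the next round.
    return actions @ y_hat
```

One round then amounts to calling sample_action, observing the scalar loss $Y_t = \langle A_t, y_t\rangle$, and adding the output of estimate_losses to the accumulated loss estimates.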

27.2 Regret analysis


Theorem 27.1. Assume that $\operatorname{span}(\mathcal{A}) = \mathbb{R}^d$. There exists an exploration distribution $\pi$ and parameters $\eta$ and $\gamma$ such that for all $(y_t)_t$ with $y_t \in \mathcal{L}$, the regret of Algorithm 15 is at most $R_n \le 2\sqrt{3 d n \log(k)}$.

Proof. Assume that the learning rate $\eta$ is chosen so that for each round $t$ the loss estimates satisfy
$$\eta \hat Y_t(a) \ge -1, \qquad \forall a \in \mathcal{A}. \tag{27.1}$$

Then by modifying the proof of Theorem 11.1 (see Exercise 27.1) the regret is