Bandit Algorithms


function $\tilde P_t : \mathcal{A} \to [0,1]$ given by
$$\tilde P_t(a) \propto \exp\left(-\eta \sum_{s=1}^{t-1} \hat Y_s(a)\right),$$
where $\eta > 0$ is the learning rate. To control the variance of the loss estimates, it will be useful to mix this distribution with an exploration distribution $\pi$ ($\pi : \mathcal{A} \to [0,1]$ and $\sum_{a \in \mathcal{A}} \pi(a) = 1$). The mixture distribution is
$$P_t(a) = (1 - \gamma)\tilde P_t(a) + \gamma \pi(a),$$
where $\gamma$ is a constant mixing factor to be chosen later. The algorithm then simply samples its action $A_t$ from $P_t$:
$$A_t \sim P_t.$$
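
As a concrete illustration, the sampling step might be implemented as in the following sketch. It assumes the finite action set is stored as the rows of a numpy array and that rng is a numpy.random.Generator (e.g. np.random.default_rng()); the function and variable names are illustrative, not taken from the book.

```python
import numpy as np

def sample_action(actions, loss_estimates, eta, gamma, pi, rng):
    """Sample A_t from P_t = (1 - gamma) * P_tilde_t + gamma * pi.

    actions        -- (k, d) array, one action vector per row
    loss_estimates -- (k,) array holding sum_{s < t} hat{Y}_s(a) for each action
    eta, gamma     -- learning rate and mixing factor
    pi             -- (k,) exploration distribution over the actions
    """
    # Exponential weights on the accumulated loss estimates; subtracting the
    # minimum is a standard numerical-stability trick that leaves the
    # normalized distribution unchanged.
    weights = np.exp(-eta * (loss_estimates - loss_estimates.min()))
    p_tilde = weights / weights.sum()
    # Mix with the exploration distribution to control the variance of the
    # loss estimates.
    p = (1 - gamma) * p_tilde + gamma * pi
    # Sample the index of the action A_t played in this round.
    a_index = rng.choice(len(actions), p=p)
    return p, a_index
```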

Recall that $Y_t = \langle A_t, y_t\rangle$ is the observed loss after taking action $A_t$. We need a way to estimate $y_t(a) \doteq \langle a, y_t\rangle$. The idea is to use least squares to estimate $y_t$ with $\hat Y_t = R_t A_t Y_t$, where $R_t \in \mathbb{R}^{d \times d}$ is selected so that $\hat Y_t$ is an unbiased estimate of $y_t$ given the history. Then the loss for a given action is estimated by $\hat Y_t(a) = \langle a, \hat Y_t\rangle$. To find the choice of $R_t$ that makes $\hat Y_t$ unbiased, let $\mathbb{E}_t[\cdot] = \mathbb{E}[\,\cdot \mid A_1,\ldots,A_{t-1}]$ and calculate

$$\mathbb{E}_t[\hat Y_t] = R_t\, \mathbb{E}_t[A_t A_t^\top]\, y_t = R_t \underbrace{\left(\sum_{a \in \mathcal{A}} P_t(a)\, a a^\top\right)}_{Q_t} y_t.$$
Using $R_t = Q_t^{-1}$ leads to $\mathbb{E}_t[\hat Y_t] = y_t$ as desired. Of course $Q_t$ should be non-singular, which will follow by choosing $\pi$ so that
$$Q(\pi) = \sum_{a \in \mathcal{A}} \pi(a)\, a a^\top$$
is non-singular. The complete algorithm is summarized in Algorithm 15.
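
To make the estimation step concrete, here is a minimal numpy sketch of computing $Q_t$ and the unbiased loss estimate, under the same illustrative conventions as the sampling sketch above (finite action set as rows of an array; names are not taken from Algorithm 15).

```python
import numpy as np

def estimate_losses(actions, p, a_index, Y_t):
    """Estimate hat{Y}_t(a) = <a, hat{Y}_t> for every action after observing Y_t.

    actions -- (k, d) array of action vectors, assumed to span R^d
    p       -- (k,) sampling distribution P_t used in this round
    a_index -- index of the played action A_t = actions[a_index]
    Y_t     -- observed scalar loss <A_t, y_t>
    """
    # Q_t = sum_a P_t(a) a a^T; non-singular whenever Q(pi) is, since gamma > 0.
    Q_t = (actions.T * p) @ actions
    # hat{Y}_t = Q_t^{-1} A_t Y_t, computed by solving a linear system rather
    # than forming the inverse explicitly.
    y_hat = np.linalg.solve(Q_t, actions[a_index]) * Y_t
    # Per-action estimates hat{Y}_t(a), to be added to the accumulated losses
    # that drive the exponential weights in the next round.
    return actions @ y_hat
```

One round then amounts to calling sample_action, observing the scalar loss $Y_t = \langle A_t, y_t\rangle$, and adding the output of estimate_losses to the accumulated loss estimates.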

27.2 Regret analysis


Theorem 27.1. Assume that $\operatorname{span}(\mathcal{A}) = \mathbb{R}^d$. There exists an exploration distribution $\pi$ and parameters $\eta$ and $\gamma$ such that for all $(y_t)_t$ with $y_t \in \mathcal{L}$, the regret of Algorithm 15 is at most $R_n \le 2\sqrt{3 d n \log(k)}$.

Proof. Assume that the learning rate $\eta$ is chosen so that for each round $t$ the loss estimates satisfy
$$\eta \hat Y_t(a) \ge -1, \qquad \forall a \in \mathcal{A}. \tag{27.1}$$

Then by modifying the proof of Theorem 11.1 (see Exercise 27.1) the regret is