Bandit Algorithms


and assuming that \(\eta \hat Y_t + \nabla F(a) \in \nabla F(\operatorname{dom}(F))\) for all \(a \in \mathcal{A}\) almost surely, then the regret of the mirror descent variation is bounded by
\[
R_n \le \frac{\operatorname{diam}_F(\mathcal{A})}{\eta} + \frac{1}{\eta} \sum_{t=1}^n \mathbb{E}\left[ D(\bar A_t, \tilde A_{t+1}) \right].
\]
Proof Using the definition of the algorithm and the assumptions that \(\hat Y_t\) is unbiased given \(\bar A_t\) and that \(P_t\) has mean \(\bar A_t\) leads to
\[
\mathbb{E}[\langle A_t, y_t\rangle]
= \mathbb{E}\left[\langle \bar A_t, y_t\rangle\right]
= \mathbb{E}\left[\mathbb{E}\left[\langle \bar A_t, y_t\rangle \,\middle|\, \bar A_t\right]\right]
= \mathbb{E}\left[\mathbb{E}\left[\langle \bar A_t, \hat Y_t\rangle \,\middle|\, \bar A_t\right]\right],
\]

where the last equality used the unbiasedness of \(\hat Y_t\) given \(\bar A_t\) and the linearity of expectation. Hence,


\[
R_n(a)
= \mathbb{E}\left[\sum_{t=1}^n \langle A_t, y_t\rangle - \langle a, y_t\rangle\right]
= \mathbb{E}\left[\sum_{t=1}^n \langle \bar A_t - a, \hat Y_t\rangle\right],
\]


which is the expected random regret of mirror descent or FTRL on the recursively
constructed sequence \(\hat Y_t\). The result follows from Theorem 28.4 or Theorem 28.5
and the note at the end of the previous section, which says that these theorems continue to
hold even for recursively constructed sequences.


1: Input Legendre potential \(F\), action set \(\mathcal{A}\) and learning rate \(\eta > 0\)
2: Choose \(\bar A_1 = \operatorname{argmin}_{a \in \mathcal{A} \cap \operatorname{dom}(F)} F(a)\)
3: for \(t = 1, \dots, n\) do
4:   Choose measure \(P_t\) on \(\mathcal{A}\) with mean \(\bar A_t\)
5:   Sample action \(A_t\) from \(P_t\) and observe \(\langle A_t, y_t\rangle\)
6:   Compute estimate \(\hat Y_t\)
7:   Update:
\[
\bar A_{t+1} = \operatorname{argmin}_{a \in \mathcal{A} \cap \operatorname{dom}(F)} \eta \langle a, \hat Y_t\rangle + D_F(a, \bar A_t) \qquad \text{(Mirror descent)}
\]
\[
\bar A_{t+1} = \operatorname{argmin}_{a \in \mathcal{A} \cap \operatorname{dom}(F)} \eta \sum_{s=1}^t \langle a, \hat Y_s\rangle + F(a) \qquad \text{(FTRL)}
\]
8: end for
Algorithm 16: Online stochastic mirror descent/FTRL.
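To make the algorithm concrete, the following is a minimal Python sketch of one possible instantiation on the unit ball. The quadratic potential \(F(a) = \|a\|^2/2\) (so the mirror descent step becomes a projected gradient step), the coordinate-perturbation sampling rule for \(P_t\), the one-point estimator \(\hat Y_t\) and the projection radius \(1 - \sigma\) are all illustrative assumptions made for this sketch; they are not the potential or sampling scheme analysed in the book.

import numpy as np


def osmd_unit_ball(loss_vectors, eta=0.05, sigma=0.1, seed=0):
    """Sketch of Algorithm 16 with F(a) = ||a||^2 / 2 on the unit ball.

    Returns the total loss incurred over the rounds in loss_vectors."""
    rng = np.random.default_rng(seed)
    n, d = loss_vectors.shape
    a_bar = np.zeros(d)          # step 2: argmin of F over the unit ball
    total_loss = 0.0
    for t in range(n):
        y_t = loss_vectors[t]
        # Steps 4-5: P_t perturbs a_bar along a random signed coordinate,
        # so its mean is a_bar; only the scalar loss <A_t, y_t> is observed.
        i = rng.integers(d)
        xi = rng.choice([-1.0, 1.0])
        e_i = np.zeros(d)
        e_i[i] = 1.0
        a_t = a_bar + sigma * xi * e_i
        loss = float(a_t @ y_t)
        total_loss += loss
        # Step 6: one-point importance-weighted estimate; conditioned on
        # a_bar it satisfies E[y_hat] = y_t under this perturbation scheme.
        y_hat = (d / sigma) * xi * loss * e_i
        # Step 7 with the quadratic potential: gradient step followed by
        # Euclidean projection onto the ball of radius 1 - sigma, which
        # keeps the perturbed action A_t inside the unit ball.
        a_bar = a_bar - eta * y_hat
        norm = np.linalg.norm(a_bar)
        if norm > 1.0 - sigma:
            a_bar *= (1.0 - sigma) / norm
    return total_loss


# Example: losses fluctuate around a fixed direction theta, so the learner
# should drift towards the (near) optimal action -theta / ||theta||.
rng = np.random.default_rng(1)
theta = rng.normal(size=5)
losses = theta + 0.1 * rng.normal(size=(10000, 5))
print(osmd_unit_ball(losses, eta=0.01, sigma=0.2))

With these choices the update in step 7 reduces to \(\bar A_{t+1} = \Pi(\bar A_t - \eta \hat Y_t)\), where \(\Pi\) denotes Euclidean projection onto the shrunken ball.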

28.4 Linear bandits on the unit ball


To illustrate the power of these methods we return to adversarial linear bandits
and the special case where the action set is the unit ball. In the previous chapter we
showed that continuous exponential weights on the unit ball with Kiefer-Wolfowitz
exploration has a regret of


\[
R_n = O\big(d\sqrt{n \log(n)}\big).
\]