and assuming that $\eta \hat Y_t + \nabla F(a) \in \nabla F(\operatorname{dom}(F))$ for all $a \in \mathcal{A}$ almost surely, then the regret of the mirror descent variation is bounded by
\[
R_n \le \frac{\operatorname{diam}_F(\mathcal{A})}{\eta} + \frac{1}{\eta} \sum_{t=1}^n \mathbb{E}\left[ D_F(\bar A_t, \tilde A_{t+1}) \right].
\]
Proof Using the definition of the algorithm and the assumption that $\hat Y_t$ is unbiased given $\bar A_t$ and $P_t$ has mean $\bar A_t$ leads to
\[
\mathbb{E}\left[\langle A_t, y_t\rangle\right]
= \mathbb{E}\left[\langle \bar A_t, y_t\rangle\right]
= \mathbb{E}\left[\mathbb{E}\left[\langle \bar A_t, y_t\rangle \,\middle|\, \bar A_t\right]\right]
= \mathbb{E}\left[\mathbb{E}\left[\langle \bar A_t, \hat Y_t\rangle \,\middle|\, \bar A_t\right]\right],
\]
where the last equality used the linearity of expectations. Hence,
\[
R_n(a)
= \mathbb{E}\left[\sum_{t=1}^n \langle A_t, y_t\rangle - \langle a, y_t\rangle\right]
= \mathbb{E}\left[\sum_{t=1}^n \langle \bar A_t - a, \hat Y_t\rangle\right],
\]
which is the expected random regret of mirror descent or FTRL on the recursively
constructed sequence $\hat Y_t$. The result follows from Theorem 28.4 or Theorem 28.5
and the note at the end of the last section that says these theorems continue to
hold even for recursively constructed sequences.
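To make the bound concrete, suppose (only for illustration) that $\mathcal{A}$ is the unit ball and $F(a) = \tfrac12\|a\|_2^2$, and take $\operatorname{diam}_F(\mathcal{A}) = \max_{a,b \in \mathcal{A}} F(a) - F(b) = 1/2$. For the quadratic potential the Bregman divergence is $D_F(x, y) = \tfrac12\|x - y\|_2^2$ and the intermediate point $\tilde A_{t+1}$ is obtained from $\bar A_t$ by a step of length at most $\eta\|\hat Y_t\|_2$, so each term in the sum is at most $\eta^2\|\hat Y_t\|_2^2/2$ and the bound specializes to
\[
R_n \le \frac{1}{2\eta} + \frac{\eta}{2} \sum_{t=1}^n \mathbb{E}\left[\|\hat Y_t\|_2^2\right],
\]
which makes explicit that, beyond unbiasedness, the quantity to control is the second moment of the loss estimates.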
1: Input: Legendre potential $F$, action set $\mathcal{A}$ and learning rate $\eta > 0$
2: Choose $\bar A_1 = \operatorname{argmin}_{a \in \mathcal{A} \cap \operatorname{dom}(F)} F(a)$
3: for $t = 1, \ldots, n$ do
4:   Choose measure $P_t$ on $\mathcal{A}$ with mean $\bar A_t$
5:   Sample action $A_t$ from $P_t$ and observe $\langle A_t, y_t\rangle$
6:   Compute estimate $\hat Y_t$
7:   Update:
     $\bar A_{t+1} = \operatorname{argmin}_{a \in \mathcal{A} \cap \operatorname{dom}(F)} \eta \langle a, \hat Y_t\rangle + D_F(a, \bar A_t)$   (Mirror descent)
     $\bar A_{t+1} = \operatorname{argmin}_{a \in \mathcal{A} \cap \operatorname{dom}(F)} \eta \sum_{s=1}^t \langle a, \hat Y_s\rangle + F(a)$   (FTRL)
8: end for
Algorithm 16: Online stochastic mirror descent/FTRL.
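As a sanity check on how the pieces of Algorithm 16 fit together, the following is a minimal Python sketch on the unit ball. It assumes the quadratic potential $F(a) = \tfrac12\|a\|_2^2$ (so the mirror descent update becomes a projected gradient step), one particular sampling distribution $P_t$ with mean $\bar A_t$, and the importance-weighted least-squares estimator $\hat Y_t = Q_t^{-1} A_t \langle A_t, y_t\rangle$ with $Q_t = \mathbb{E}[A_t A_t^\top \mid \bar A_t]$. Algorithm 16 leaves $P_t$ and $\hat Y_t$ abstract, so these concrete choices (and the function name) are illustrative rather than the construction analysed in the next section.

import numpy as np

def osmd_unit_ball(loss_vectors, eta, gamma=0.1, rng=None):
    """Illustrative sketch of Algorithm 16 on the unit ball (quadratic potential).

    Assumptions beyond Algorithm 16 (which leaves P_t and Y_hat_t abstract):
      * F(a) = ||a||^2 / 2, so the mirror descent update is a projected
        gradient step onto the ball of radius 1 - gamma;
      * P_t plays A_bar_t / ||A_bar_t|| with probability ||A_bar_t|| and a
        uniformly random signed coordinate vector otherwise (mean A_bar_t);
      * Y_hat_t = Q_t^{-1} A_t <A_t, y_t> with Q_t = E[A_t A_t^T | A_bar_t].
    """
    rng = np.random.default_rng() if rng is None else rng
    d = len(loss_vectors[0])
    a_bar = np.zeros(d)  # argmin of F over the ball, as in line 2 of Algorithm 16
    total_loss = 0.0
    for y in loss_vectors:
        y = np.asarray(y, dtype=float)
        r = np.linalg.norm(a_bar)
        # Lines 4-5: sample A_t from a distribution P_t with mean A_bar_t.
        if rng.random() < r:
            a = a_bar / r
        else:
            a = np.zeros(d)
            a[rng.integers(d)] = rng.choice([-1.0, 1.0])
        loss = float(a @ y)  # only the scalar <A_t, y_t> is observed
        total_loss += loss
        # Line 6: unbiased estimate of y_t, since E[A_t A_t^T | A_bar_t] = Q_t.
        q = np.outer(a_bar, a_bar) / max(r, 1e-12) + (1.0 - r) / d * np.eye(d)
        y_hat = np.linalg.solve(q, a) * loss
        # Line 7 (mirror descent, quadratic potential): gradient step, then
        # projection onto the ball of radius 1 - gamma to keep Q_t invertible.
        a_bar = a_bar - eta * y_hat
        norm = np.linalg.norm(a_bar)
        if norm > 1.0 - gamma:
            a_bar *= (1.0 - gamma) / norm
    return total_loss

Capping $\|\bar A_t\|_2 \le 1 - \gamma$ keeps $Q_t$ invertible so that the estimator is well defined, at the price of competing only with actions in the slightly shrunken ball.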
28.4 Linear bandits on the unit ball
To illustrate the power of these methods we return to adversarial linear bandits
and the special case where the action set is the unit ball. In the previous chapter we
showed that continuous exponential weights on the unit ball with Kiefer-Wolfowitz
exploration has a regret of
\[
R_n = O\left(d \sqrt{n \log(n)}\right).
\]