Bandit Algorithms

28.1 Online linear optimization 311

occurs in the interior ofD ⊆Aand that∇Φt(at+1) = 0 (see Exercise 28.1). This means thatηyt=∇F(at)−∇F(at+1) and so

∇F(at+1) =−ηyt+∇F(at) =∇F(a 1 )−η

∑t

s=1

ys=−η

∑t

s=1

ys,

where the last equality is true becausea 1 is chosen as the minimizer ofF in
A∩D=Dand again the fact thatFis Legendre ensures this minimum occurs
at an interior point where the gradient vanishes. Follow the regularized leader
choosesat+1to minimize Φ′t(a) =η

∑t s=1〈a,ys〉+F(a). The same argument shows that∇Φ′t(at+1) = 0, which means that

∇F(at+1) =−η

∑t

s=1

ys.

The last two displays and the fact that the gradient for Legendre functions is
invertible shows that mirror descent and follow the regularized leader are the
same in this setting.

The equivalence between these algorithms is far from universal. First of all, it does not generally hold whenF is not Legendre or its domain is larger thanA. Second, in many applications of these algorithms the learning rate or potential change with time and in either case the algorithms will typically produce different action sequences. For example, if a learning rateηtis used rather thanηin the definition of Φt, then mirror descent chooses∇F(at+1) =−

∑t s=1ηsyswhile follow the regularized leader chooses ∇F(at+1) =−ηt

∑t s=1ys. We return to this issue in the notes and exercises.

Example 28.1. LetA = Rd andF(a) =^12 ‖a‖^22. Then ∇F(a) = a and D(a,at) =^12 ‖a−at‖^22. ClearlyF is Legendre andD=Aso mirror descent and follow the regularized leader are the same. By simple calculus we see that

at+1= argmina∈Rdη〈a,yt〉+

1

2 ‖a−at‖

2 2 =at−ηyt,

which may be familiar asonline gradient descent.

Example28.2. LetAbe a compact convex subset ofRdandF(a) =^12 ‖a‖^22.
Then mirror descent chooses

at+1= argmina∈Aη〈a,yt〉+

1

2

‖a−at‖^22 = Π(at−ηyt), (28.4)

where Π(a) is the Euclidean projection ofaontoA. This algorithm is usually called
online projected gradient descent. On the other hand, for follow the regularized
leader we have

at+1= argmina∈Aη

∑t

s=1

〈a,ys〉+

1

2

‖a−at‖^22 = Π

(

−η

∑t

s=1

ys

)

,

Bandit Algorithms

1

1

2

1

2

(

)

,

Get our desktop app

Company

Features

Documentation

Resources