28 Follow the Regularized Leader and Mirror Descent
In the last chapter we showed that if $\mathcal{A} \subset \mathbb{R}^d$ has $k$ elements, then the regret of Exp3 with a careful exploration distribution is
\[
R_n = O\bigl(\sqrt{dn \log(k)}\bigr)\,.
\]
When $\mathcal{A}$ is a convex set we also showed that the continuous version of this algorithm has regret at most
\[
R_n = O\bigl(d\sqrt{n \log(n)}\bigr)\,.
\]
Although this algorithm can often be made to run in polynomial time, the degree
is high and the implementation is complicated and impractical. In many cases this
can be improved, both in terms of the regret and computation. In this chapter we
show that when $\mathcal{A}$ is the unit ball there is an efficient, low-complexity algorithm for which the regret is $R_n = O(\sqrt{dn \log(n)})$. More importantly, however, we introduce a pair of related algorithms called follow the regularized leader and mirror descent, which have proven to be powerful tools for the design and analysis of bandit algorithms. In fact, the exponential weights algorithm turns out to be a special case.
28.1 Online linear optimization
Mirror descent originated in the convex optimization literature. The idea has since
been adapted to online learning and specifically to online linear optimization.
Online linear optimization is the full information version of the adversarial linear
bandit where at the end of each round the learner observes the full vector $y_t$.
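To make the interaction protocol concrete, the following is a minimal sketch in Python of the full-information loop and the regret it accumulates against a fixed comparator, as defined formally below; the learner object with its act() and observe() methods is a hypothetical placeholder, not something specified in the text.

    import numpy as np

    def run_protocol(learner, loss_vectors, comparator):
        # Full-information online linear optimization: in round t the
        # learner plays a_t, then the entire loss vector y_t is revealed
        # (in the bandit setting only the loss <a_t, y_t> would be seen).
        # Returns R_n(a) = sum_t <a_t - a, y_t> for the fixed comparator a.
        regret = 0.0
        for y_t in loss_vectors:
            a_t = learner.act()       # learner chooses a_t in the action set
            learner.observe(y_t)      # full vector y_t is observed
            regret += float(np.dot(a_t - comparator, y_t))
        return regret

Maximizing this quantity over comparators $a \in \mathcal{A}$ gives the regret $R_n$ defined next.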
Let $\mathcal{A} \subset \mathbb{R}^d$ be a convex set and $\mathcal{L} \subset \mathbb{R}^d$ be an arbitrary set called the loss space. Let $y_1, \ldots, y_n$ be a sequence of loss vectors with $y_t \in \mathcal{L}$ for all $t \in [n]$. In each round the learner chooses $a_t \in \mathcal{A}$ and subsequently observes $y_t$. The regret relative to a fixed comparator $a \in \mathcal{A}$ is
\[
R_n(a) = \sum_{t=1}^n \langle a_t - a, y_t \rangle
\]
and the regret is $R_n = \max_{a \in \mathcal{A}} R_n(a)$. We emphasize that the only difference relative to the adversarial linear bandit is that now $y_t$ is observed rather than