28 Follow the Regularized Leader and Mirror Descent
In the last chapter we showed that if $\mathcal{A} \subset \mathbb{R}^d$ has $k$ elements, then the regret of Exp3 with a careful exploration distribution is
\[
R_n = O\bigl(\sqrt{dn \log(k)}\bigr)\,.
\]
When $\mathcal{A}$ is a convex set we also showed that the continuous version of this algorithm has regret at most
\[
R_n = O\bigl(d\sqrt{n \log(n)}\bigr)\,.
\]
Although this algorithm can often be made to run in polynomial time, the degree
is high and the implementation is complicated and impractical. In many cases this
can be improved, both in terms of the regret and computation. In this chapter we
show that when $\mathcal{A}$ is the unit ball there is an efficient, low-complexity algorithm for which the regret is $R_n = O(\sqrt{dn \log(n)})$. More importantly, however, we introduce a pair of related algorithms called follow the regularized leader and mirror descent, which have proven to be powerful tools for the design and analysis of bandit algorithms. In fact, the exponential weights algorithm turns out to be a special case.
28.1 Online linear optimization
Mirror descent originated in the convex optimization literature. The idea has since
been adapted to online learning and specifically to online linear optimization.
Online linear optimization is the full information version of the adversarial linear
bandit where at the end of each round the learner observes the full vector $y_t$.
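To make the interaction protocol concrete, the following is a minimal sketch in Python of the full-information loop and the regret it accumulates against a fixed comparator, as defined formally below; the learner object with its act() and observe() methods is a hypothetical placeholder, not something specified in the text.

    import numpy as np

    def run_protocol(learner, loss_vectors, comparator):
        # Full-information online linear optimization: in round t the
        # learner plays a_t, then the entire loss vector y_t is revealed
        # (in the bandit setting only the loss <a_t, y_t> would be seen).
        # Returns R_n(a) = sum_t <a_t - a, y_t> for the fixed comparator a.
        regret = 0.0
        for y_t in loss_vectors:
            a_t = learner.act()       # learner chooses a_t in the action set
            learner.observe(y_t)      # full vector y_t is observed
            regret += float(np.dot(a_t - comparator, y_t))
        return regret

Maximizing this quantity over comparators $a \in \mathcal{A}$ gives the regret $R_n$ defined next.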
Let $\mathcal{A} \subset \mathbb{R}^d$ be a convex set and $\mathcal{L} \subset \mathbb{R}^d$ be an arbitrary set called the loss space. Let $y_1, \ldots, y_n$ be a sequence of loss vectors with $y_t \in \mathcal{L}$ for all $t \in [n]$. In each round the learner chooses $a_t \in \mathcal{A}$ and subsequently observes $y_t$. The regret relative to a fixed comparator $a \in \mathcal{A}$ is
\[
R_n(a) = \sum_{t=1}^n \langle a_t - a, y_t \rangle
\]
and the regret is $R_n = \max_{a \in \mathcal{A}} R_n(a)$. We emphasize that the only difference relative to the adversarial linear bandit is that now $y_t$ is observed rather than