Bandit Algorithms

33.5 Bibliographical remarks 388

is to find anε-optimal arm with high probability with as few samples as possible.
After a dry spell, the field was restarted by Bubeck et al. [2009], Audibert and
Bubeck [2010b]. The asymptotically optimal algorithm for the fixed confidence
setting of Section 33.2 was introduced by Garivier and Kaufmann [2016], who
also provide results for exponential families as well as in-depth intuition and
historical background. A similar problem is studied in a Bayesian setting by Russo
[2016] who focuses on designing algorithms for which the posterior probability of
choosing a suboptimal arm converges to zero exponentially fast with an optimal
rate. Even more recently, Qin et al. [2017] designed a policy that is optimal in both
the frequentist and Bayesian settings. The stopping rule used by Garivier and
Kaufmann [2016] is inspired by similar rules by Chernoff [1959]. The sequential
halving algorithm is by Karnin et al. [2013] and the best summary of lower bounds
is by Carpentier and Locatelli [2016]. Besides this there have been many other
approaches, with a summary by Jamieson and Nowak [2014]. The negative result
discussed in Note 2 is due to Bubeck et al. [2009]. Pure exploration has recently
become a hot topic and is expanding beyond the finite-armed case. For example,
to linear bandits [Soare et al., 2014] and continuous-armed bandits [Valko et al.,
2013a], tree search [Garivier et al., 2016a, Huang et al., 2017a] and combinatorial
bandits [Chen et al., 2014, Huang et al., 2018].
The continuous-armed case is also known aszeroth-order(or derivative-
free)stochastic optimizationand is studied under various assumptions on the
unknown reward function, usually assuming thatA⊂Rd. Because of the obvious
connection to optimization, this literature usually considers losses, or cost, rather
than reward and the reward function is then called the objective function. A big
part of this literature poses only weak assumptions, such as smoothness, on the
objective function. Note that in the continuous-armed case, regret minimization
may only be marginally more difficult than minimizing the simple regret because
even the instance-dependent simple regret can decay at a slow, polynomial rate.
While the literature is vast, most of it is focused on heuristic methods and their
empirical evaluation and lack a rigorous, finite-time analysis. Methods developed
for this case maintain an approximation to the unknown objective and often use
branch-and-bound techniques to focus the search for the optimal value. For a
taster of the algorithmic ideas see [Conn et al., 2009, Rios and Sahinidis, 2013].
When the search for the optimum is organized cleverly, the methods can adapt to
‘local smoothness’ and enjoy various optimality guarantees [Valko et al., 2013a].
A huge portion of this literature considers the easier problem of finding a local
minimizer, or just a stationary point. Another large portion of this literature
is concerned with the case when the objective function is convex. Chapter 9 of
the classic book by Nemirovsky and Yudin [1983] describes two complementary
approaches (a geometric, and an analytic) and sketches their analysis. For the
class of strongly convex and smooth functions, it is known that the minimax
simple regret is Θ(

√

d^2 /n) [Shamir, 2013]. The main outstanding challenge is to understand the dependence of simple regret on the dimension beyond the strongly convex and smooth case. Hu et al. [2016] prove a lower bound of Ω(n−^1 /^3 ) on

Bandit Algorithms

√

Get our desktop app

Company

Features

Documentation

Resources