33.5 Bibliographical remarks 389
the simple regret for algorithms that construct gradient estimates by injecting
random noise (as is done by Katkovnik and Kulchitsky [1972], Nemirovsky and
Yudin [1983] and others), which, together with the O(n^{-1/2}) upper bound by
Nemirovsky and Yudin [1983] (see also Agarwal et al. 2013, Liang et al. 2014),
establishes the inferiority of this approach in the n ≫ d regime. Interestingly,
empirical evidence favors these gradient-based techniques in comparison to the
'optimal algorithms'. Thus, much room remains to improve our understanding of
this problem. This setting should be contrasted with the one where unbiased, noisy
estimates of the gradient can be obtained, in which case methods such as mirror
descent (see Chapter 28) give optimal rates. This is a much better-understood problem
with matching lower and upper bounds available on the minimax simple regret
for various settings (for example, Chapter 5 of Nemirovsky and Yudin [1983], or
Rakhlin et al. [2012]).
Variants of the pure exploration problem are studied in a branch of statistics
called ranking and selection. The earliest literature on ranking and selection
goes back to at least the 1950s. A relatively recent paper that gives a glimpse
into a small corner of this literature is by Chan and Lai [2006]. We cite
this paper because it is particularly relevant to this chapter. Using our
terminology, Chan and Lai consider the PAC problem in the parametric setting
where the distributions underlying the arms belong to some known exponential
family of distributions. A procedure that is similar to the track-and-stop procedure
considered here is shown to be both sound and asymptotically optimal as the
confidence parameter approaches one. We also like the short and readable review
of the literature up to the 1980s from the perspective of simulation optimization
by Goldsman [1983].
A related setting studied mostly in the operations research community is
ordinal optimization. In its simplest form, ordinal optimization is concerned
with finding an arm amongst the αk arms with the highest payoffs. Ho et al. [1992],
who defined this problem in the stochastic simulation optimization literature,
emphasized that the probability of failing to find one of the ‘good arms’ decays
exponentially with the number of observations n per arm, in contrast to the
slow n^{-1/2} decay of the error of estimating the value of the best arm, which
this literature calls the problem of cardinal optimization. Given the results in
this chapter, this should not be too surprising. A nice twist in this literature is
that the error probability does not need to depend on k (see Exercise 33.10).
The price, of course, is that the simple regret is in general uncontrolled. In a
way, ordinal optimization is a natural generalization of best-arm identification.
As such, it also leads to algorithmic choices that are not the best fit when the
actual goal is to keep the simple regret small. Based on Bayesian reasoning, a
heuristic expression for the asymptotically optimal allocation of samples for the
Gaussian best-arm identification problem is given by Chen et al. [2000]. They
call the problem of finding an optimal allocation the “optimal computing budget
allocation” (OCBA) problem. Their work can be viewed as the precursor to the