33.5 Bibliographical remarks 389
the simple regret for algorithms that construct gradient estimates by injecting
random noise (as is done by Katkovnik and Kulchitsky [1972], Nemirovsky and
Yudin [1983] and others), which, together with the O(n^{-1/2}) upper bound by
Nemirovsky and Yudin [1983] (see also Agarwal et al. 2013, Liang et al. 2014),
establishes the inferiority of this approach in the n ≫ d regime. Interestingly,
empirical evidence favors these gradient-based techniques in comparison to the
'optimal algorithms'. Thus, much room remains to improve our understanding of
this problem. This setting should be contrasted with the one where unbiased, noisy
estimates of the gradient can be obtained, in which case methods such as mirror
descent (see Chapter 28) give optimal rates. This is a much better-understood problem
with matching lower and upper bounds available on the minimax simple regret
for various settings (for example, Chapter 5 of Nemirovsky and Yudin [1983], or
Rakhlin et al. [2012]).
Variants of the pure exploration problem are studied in a branch of statistics
called ranking and selection. The earliest literature on ranking and selection
goes back to at least the 1950s. A relatively recent paper that gives a glimpse
into a small corner of this literature is by Chan and Lai [2006]. We cite
this paper because it is particularly relevant to this chapter. Using our
terminology, Chan and Lai consider the PAC problem in the parametric setting
where the distributions underlying the arms belong to some known exponential
family of distributions. A procedure that is similar to the track-and-stop procedure
considered here is shown to be both sound and asymptotically optimal as the
confidence parameter approaches one. We also like the short and readable review
of the literature up to the 1980s from the perspective of simulation optimization
by Goldsman [1983].
A related setting studied mostly in the operations research community is
ordinal optimization. In its simplest form, ordinal optimization is concerned
with finding an arm amongst the αk arms with the highest payoffs. Ho et al. [1992],
who defined this problem in the stochastic simulation optimization literature,
emphasized that the probability of failing to find one of the ‘good arms’ decays
exponentially with the number of observations n per arm, in contrast to the
slow n^{-1/2} decay of the error of estimating the value of the best arm, which
this literature calls the problem of cardinal optimization. Given the results in
this chapter, this should not be too surprising. A nice twist in this literature is
that the error probability does not need to depend on k (see Exercise 33.10).
The price, of course, is that the simple regret is in general uncontrolled. In a
way, ordinal optimization is a natural generalization of best-arm identification.
As such, it also leads to algorithmic choices that are not the best fit when the
actual goal is to keep the simple regret small. Based on Bayesian reasoning, a
heuristic expression for the asymptotically optimal allocation of samples for the
Gaussian best-arm identification problem is given by Chen et al. [2000]. They
call the problem of finding an optimal allocation the “optimal computing budget
allocation” (OCBA) problem. Their work can be viewed as the precursor to the