of selecting a suboptimal arm decays only polynomially with $n$. This result
holds no matter how $A_{n+1}$ is selected.
3 Although there is no exploration/exploitation dilemma in the pure exploration
setting, there is still an ‘exploration dilemma’ in the sense that the optimal
exploration policy depends on an unknown quantity. This means the policy
must balance (to some extent) the number of samples dedicated to learning
how to explore relative to those actually exploring.
4 Best-arm identification is a popular topic that lends itself to simple analysis and
algorithms. The focus on the correct identification of an optimal arm makes us
question the practicality of the setting, however. In reality any suboptimal arm is
acceptable provided its suboptimality gap is small enough relative to the budget,
which is more faithfully captured by the simple regret criterion. Of course
the simple regret may be bounded naively by $R_n^{\mathrm{simple}} \le \max_i \Delta_i\, \mathbb{P}(\Delta_{A_{n+1}} > 0)$,
which is tight in some circumstances and loose in others.
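To see where this naive bound comes from, here is a one-line derivation (a sketch, using only the definition of the simple regret as the expected suboptimality gap of the recommended arm $A_{n+1}$):
\[
R_n^{\mathrm{simple}}
= \mathbb{E}\big[\Delta_{A_{n+1}}\big]
= \mathbb{E}\big[\Delta_{A_{n+1}} \mathbb{1}\{\Delta_{A_{n+1}} > 0\}\big]
\le \Big(\max_i \Delta_i\Big)\, \mathbb{P}\big(\Delta_{A_{n+1}} > 0\big)\,.
\]
The inequality holds with equality when all suboptimal arms share the same gap, and it can be very loose when the recommendation rule errs mostly on arms with tiny gaps.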
5 An equivalent form of the bound shown in Theorem 33.5 is
\[
\mathbb{E}_{\nu\pi}[\tau] \;\ge\; \min\left\{ \sum_{i=1}^k \alpha_i \,:\, \alpha_1,\dots,\alpha_k \ge 0,\ \inf_{\nu' \in E_{\mathrm{alt}}(\nu)} \sum_{i=1}^k \alpha_i\, \mathrm{D}(\nu_i, \nu'_i) \ge \log\!\left(\frac{1}{4\delta}\right) \right\}.
\]
This form follows immediately from Eq. (33.5) by noting that $\mathbb{E}_{\nu\pi}[\tau] = \sum_i \mathbb{E}_{\nu\pi}[T_i(\tau)]$.
The version given in the theorem is preferred because it is
a closed form expression. Exercise 33.3 asks you to explore the relation between
the two forms.
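In slightly more detail (a brief sketch, on the assumption that Eq. (33.5) asserts $\inf_{\nu' \in E_{\mathrm{alt}}(\nu)} \sum_{i=1}^k \mathbb{E}_{\nu\pi}[T_i(\tau)]\, \mathrm{D}(\nu_i, \nu'_i) \ge \log(1/(4\delta))$ for any sound policy): the choice $\alpha_i = \mathbb{E}_{\nu\pi}[T_i(\tau)]$ is then feasible for the optimisation problem above, and hence
\[
\mathbb{E}_{\nu\pi}[\tau] = \sum_{i=1}^k \mathbb{E}_{\nu\pi}[T_i(\tau)] = \sum_{i=1}^k \alpha_i \ge \min\left\{ \sum_{i=1}^k \alpha_i : \cdots \right\},
\]
which is the displayed bound.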
6 The forced exploration in the Track-and-Stop algorithm is sufficient for
asymptotic optimality. We are uneasy about the fact that the proof would work
for any threshold $Ct^p$ with $p \in (0,1)$. There is nothing fundamental about $\sqrt{t}$.
We do not currently know of a principled way to tune the amount of forced
exploration, or whether there is a better algorithm design for best-arm identification.
Ideally one would provide finite-time upper bounds that match the finite-time
lower bound of Theorem 33.5. The extent to which this is possible
appears to be an open question.
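To make the role of the forced exploration concrete, the following is a minimal sketch (in Python) of a Track-and-Stop style sampling rule. The function name, the bare $\sqrt{t}$ threshold and the particular tracking rule are illustrative assumptions rather than the exact algorithm analysed in this chapter.

import numpy as np

def next_arm(counts, target_alloc, t):
    # counts[i] is T_i(t), the number of plays of arm i so far.
    # target_alloc is the current estimate of the optimal proportions
    # (the maximiser of the lower-bound optimisation problem), assumed
    # to be computed elsewhere from the empirical means.
    counts = np.asarray(counts, dtype=float)
    target_alloc = np.asarray(target_alloc, dtype=float)
    # Forced exploration: if some arm has fewer than sqrt(t) plays, play
    # the least-played arm. As discussed above, any threshold C * t**p
    # with p in (0, 1) would serve for the asymptotic analysis.
    if np.min(counts) < np.sqrt(t):
        return int(np.argmin(counts))
    # Tracking: otherwise play the arm whose play count lags furthest
    # behind its target allocation t * target_alloc[i].
    return int(np.argmax(t * target_alloc - counts))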
7 The choice of $\beta_t(\delta)$ significantly influences the practical performance of Track-
and-Stop. We believe the analysis given here is mostly tight except that the
naive concentration bound given in Lemma 33.8 can be improved.
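For context (stated informally, and assuming the generalised likelihood ratio form of the stopping rule, where $\hat\nu(t)$ denotes the empirical model after $t$ rounds): the algorithm stops at the first $t$ with
\[
\inf_{\nu' \in E_{\mathrm{alt}}(\hat\nu(t))} \sum_{i=1}^k T_i(t)\, \mathrm{D}\big(\hat\nu_i(t), \nu'_i\big) \ge \beta_t(\delta)\,,
\]
so any slack in the concentration argument behind $\beta_t(\delta)$ translates directly into a larger stopping time.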
8 Perhaps the most practical setup in pure exploration has not yet received any
attention: instance-dependent upper and lower bounds on the simple regret.
Even better would be an analysis of the distribution of $\Delta_{A_{n+1}}$.
33.5 Bibliographical remarks
In the machine learning literature, pure exploration for bandits seems to have been
first studied by Even-Dar et al. [2002], Mannor and Tsitsiklis [2004], Even-Dar
et al. [2006] in the 'Probably Approximately Correct' setting where the objective