33 Pure Exploration
All the policies proposed in this book so far were designed to maximize the
cumulative reward. As a consequence, the policies must carefully balance
exploration against exploitation. But what happens if there is no price to be paid
for exploring? Imagine, for example, that a researcher has k configurations of a
new drug and a budget to experiment on n mice. The researcher wants to find
the most promising drug configuration for subsequent human trials, but is not
concerned with the outcomes for the mice. Problems of this nature are called
pure exploration problems. Although there are similarities to the cumulative
regret setting, there are also differences. This chapter outlines a variety of pure
exploration problems and describes the basic algorithmic ideas.
33.1 Simple regret
Let $\nu$ be a $k$-armed stochastic bandit and $\pi = (\pi_t)_{t=1}^{n+1}$ be a policy. One way to
measure the performance of a policy in the pure exploration setting is the simple
regret,
$$
R_n^{\mathrm{simple}}(\pi, \nu) = \mathbb{E}_{\nu\pi}\left[\Delta_{A_{n+1}}(\nu)\right]\,.
$$
The action chosen in round n + 1 has a special role. In the example with the mice
it represents the configuration recommended for further investigation at the end
of the trial. We start by analyzing the uniform exploration (UE) policy, which
explores deterministically for the first n rounds and recommends the empirically
best arm in round n + 1. The pseudocode is provided in Algorithm 19.
1: for t = 1, ..., n do
2:     Choose $A_t = 1 + (t \bmod k)$
3: end for
4: Choose $A_{n+1} = \operatorname{argmax}_{i \in [k]} \hat{\mu}_i(n)$
Algorithm 19: Uniform exploration.
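To make the procedure concrete, the following is a minimal Python sketch (not part of the book's text) that simulates Algorithm 19 on a Gaussian bandit and estimates its simple regret by Monte Carlo. The function name, the arm means, the horizon and the number of repetitions are arbitrary illustrative choices.

import numpy as np

def uniform_exploration(mu, n, rng):
    """Run Algorithm 19 once and return the gap of the recommended arm."""
    k = len(mu)
    sums = np.zeros(k)    # sum of observed rewards per arm
    counts = np.zeros(k)  # number of pulls per arm
    for t in range(1, n + 1):
        a = t % k                          # A_t = 1 + (t mod k), 0-indexed here
        sums[a] += rng.normal(mu[a], 1.0)  # unit-variance Gaussian (hence 1-subgaussian) rewards
        counts[a] += 1
    mu_hat = sums / np.maximum(counts, 1)  # empirical means after n rounds
    recommended = int(np.argmax(mu_hat))   # A_{n+1}
    return max(mu) - mu[recommended]       # suboptimality gap Delta_{A_{n+1}}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = np.array([0.5, 0.4, 0.3])         # hypothetical arm means
    gaps = [uniform_exploration(mu, n=300, rng=rng) for _ in range(2000)]
    print("estimated simple regret:", np.mean(gaps))  # Monte Carlo estimate of E[Delta_{A_{n+1}}]

Averaging the gap of the recommended arm over many independent runs approximates the expectation in the definition of the simple regret above.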
Theorem 33.1. Let $\pi$ be the policy of Algorithm 19 and $\nu \in \mathcal{E}^k_{\mathrm{SG}}(1)$ be a 1-