33 Pure Exploration
All the policies proposed in this book so far were designed to maximize the
cumulative reward. As a consequence, the policies must carefully balance
exploration against exploitation. But what happens if there is no price to be paid
for exploring? Imagine, for example, that a researcher has k configurations of a
new drug and a budget to experiment on n mice. The researcher wants to find
the most promising drug configuration for subsequent human trials, but is not
concerned with the outcomes for the mice. Problems of this nature are called
pure exploration problems. Although there are similarities to the cumulative
regret setting, there are also differences. This chapter outlines a variety of pure
exploration problems and describes the basic algorithmic ideas.
33.1 Simple regret
Let $\nu$ be a $k$-armed stochastic bandit and $\pi = (\pi_t)_{t=1}^{n+1}$ be a policy. One way to
measure the performance of a policy in the pure exploration setting is the simple
regret,
$$
R_n^{\mathrm{simple}}(\pi, \nu) = \mathbb{E}_{\nu\pi}\left[\Delta_{A_{n+1}}(\nu)\right]\,.
$$
The action chosen in round n + 1 has a special role. In the example with the mice
it represents the configuration recommended for further investigation at the end
of the trial. We start by analyzing the uniform exploration (UE) policy, which
explores deterministically for the first n rounds and recommends the empirically
best arm in round n + 1. The pseudocode is provided in Algorithm 19.
1: for t = 1, ..., n do
2:     Choose $A_t = 1 + (t \bmod k)$
3: end for
4: Choose $A_{n+1} = \operatorname{argmax}_{i \in [k]} \hat{\mu}_i(n)$
Algorithm 19: Uniform exploration.
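To make the procedure concrete, the following is a minimal Python sketch (not part of the book's text) that simulates Algorithm 19 on a Gaussian bandit and estimates its simple regret by Monte Carlo. The function name, the arm means, the horizon and the number of repetitions are arbitrary illustrative choices.

import numpy as np

def uniform_exploration(mu, n, rng):
    """Run Algorithm 19 once and return the gap of the recommended arm."""
    k = len(mu)
    sums = np.zeros(k)    # sum of observed rewards per arm
    counts = np.zeros(k)  # number of pulls per arm
    for t in range(1, n + 1):
        a = t % k                          # A_t = 1 + (t mod k), 0-indexed here
        sums[a] += rng.normal(mu[a], 1.0)  # unit-variance Gaussian (hence 1-subgaussian) rewards
        counts[a] += 1
    mu_hat = sums / np.maximum(counts, 1)  # empirical means after n rounds
    recommended = int(np.argmax(mu_hat))   # A_{n+1}
    return max(mu) - mu[recommended]       # suboptimality gap Delta_{A_{n+1}}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = np.array([0.5, 0.4, 0.3])         # hypothetical arm means
    gaps = [uniform_exploration(mu, n=300, rng=rng) for _ in range(2000)]
    print("estimated simple regret:", np.mean(gaps))  # Monte Carlo estimate of E[Delta_{A_{n+1}}]

Averaging the gap of the recommended arm over many independent runs approximates the expectation in the definition of the simple regret above.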
Theorem 33.1. Let $\pi$ be the policy of Algorithm 19 and $\nu \in \mathcal{E}^k_{\mathrm{SG}}(1)$ be a 1-