Bandit Algorithms


13 Lower Bounds: Basic Ideas


Let $\mathcal{E}$ be a set of stochastic bandits. The worst-case regret of a policy $\pi$ on $\mathcal{E}$ is
\[
R_n(\pi, \mathcal{E}) = \sup_{\nu \in \mathcal{E}} R_n(\pi, \nu)\,.
\]

Let $\Pi$ be the set of all policies. The minimax regret is
\[
R_n^*(\mathcal{E}) = \inf_{\pi \in \Pi} R_n(\pi, \mathcal{E}) = \inf_{\pi \in \Pi} \sup_{\nu \in \mathcal{E}} R_n(\pi, \nu)\,.
\]

A policy $\pi$ is called minimax optimal for $\mathcal{E}$ if $R_n(\pi, \mathcal{E}) = R_n^*(\mathcal{E})$. The value $R_n^*(\mathcal{E})$
is of interest by itself. A small value of $R_n^*(\mathcal{E})$ indicates that the underlying bandit
problem is less challenging in the worst-case sense. A core activity in bandit
theory is to understand what makes $R_n^*(\mathcal{E})$ large or small, often focusing on its
behavior as a function of the number of rounds $n$.
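
To make the definition concrete, here is a minimal simulation sketch, not from the book: it approximates the worst-case regret $R_n(\pi, \mathcal{E})$ of one fixed policy by Monte Carlo over a small finite set of unit-variance Gaussian bandits. The basic UCB index rule and the toy environment set are illustrative assumptions only, chosen just to show what "sup over environments of the expected regret" means operationally.

```python
# A minimal sketch, not from the book: approximate the worst-case regret
# R_n(pi, E) = sup_{nu in E} R_n(pi, nu) of one fixed policy by Monte Carlo
# simulation over a small finite set E of unit-variance Gaussian bandits.
# The UCB index rule and the toy environment set below are illustrative
# assumptions only; they are not the specific algorithms analysed in the text.
import numpy as np

def run_ucb(means, n, rng):
    """One run of a basic UCB policy; returns the random pseudo-regret
    n * max(means) minus the sum of the means of the arms actually played."""
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    played = 0.0
    for t in range(n):
        if t < k:                                  # play each arm once first
            arm = t
        else:                                      # then maximise the UCB index
            index = sums / counts + np.sqrt(2.0 * np.log(n) / counts)
            arm = int(np.argmax(index))
        reward = rng.normal(means[arm], 1.0)       # unit-variance Gaussian reward
        counts[arm] += 1
        sums[arm] += reward
        played += means[arm]
    return n * np.max(means) - played

def worst_case_regret(env_set, n, reps=200, seed=0):
    """Estimate sup over the finite set env_set of the expected regret."""
    rng = np.random.default_rng(seed)
    return max(
        np.mean([run_ucb(means, n, rng) for _ in range(reps)])
        for means in env_set
    )

# Toy environment class: two-armed bandits with means in [0, 1].
E = [np.array([0.5, 0.5 + gap]) for gap in (0.02, 0.1, 0.3)]
print(worst_case_regret(E, n=2000))
```

Replacing the finite toy set by all of $\mathcal{E}$, and taking an infimum over all policies, gives the minimax regret defined above; neither operation can be carried out by brute force, which is one reason lower bound arguments are needed.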

Minimax optimality is not a property of a policy alone. It is a property of a
policy together with a set of environments and a horizon.

Finding a minimax policy is generally too computationally expensive to be
practical. For this reason we almost always settle for a policy that is nearly
minimax optimal.
One of the main results of this part is a proof of the following theorem, which
together with Theorem 9.1 shows that Algorithm 7 from Chapter 9 is minimax
optimal up to constant factors for 1-subgaussian bandits with suboptimality gaps
in [0,1].

Theorem 13.1. Let $\mathcal{E}_k$ be the set of $k$-armed Gaussian bandits with unit variance
and means $\mu \in [0,1]^k$. Then there exists a universal constant $c > 0$ such that for
all $k > 1$ and $n \geq k$ it holds that
\[
R_n^*(\mathcal{E}_k) \geq c\sqrt{kn}\,.
\]
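
As a rough empirical companion, not part of the text, the snippet below reuses the hypothetical run_ucb sketch from above (which is not Algorithm 7 from Chapter 9). It plays instances in which a single arm is raised by $\Delta = \sqrt{k/n}$, a gap too small for the policy to reliably identify the best arm within the horizon, so the measured regret comes out on the order of $n\Delta = \sqrt{kn}$, the same scale as the lower bound in the theorem.

```python
# A rough illustration, assuming the run_ucb sketch defined above: on instances
# with one arm raised by Delta = sqrt(k/n), the best arm is hard to identify
# within n rounds, and the measured regret is on the order of
# n * Delta = sqrt(k * n), matching the scale of Theorem 13.1.
import numpy as np

rng = np.random.default_rng(1)
for k, n in [(5, 2000), (10, 4000), (20, 8000)]:
    delta = np.sqrt(k / n)
    means = np.zeros(k)
    means[0] = min(delta, 1.0)          # keep the mean vector inside [0, 1]^k
    regret = np.mean([run_ucb(means, n, rng) for _ in range(50)])
    print(f"k={k:3d}  n={n:5d}  regret ~ {regret:7.1f}   sqrt(kn) = {np.sqrt(k * n):7.1f}")
```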

We will prove this theorem in Chapter 15, but first we give an informal
justification.