Bandit Algorithms

34.7 Notes 406

3 The relationship between admissibility, Bayesian optimality and minimax
optimality is one of the main topics of statistical decision theory, which intersects
heavily with game theory. In many classical statistical settings all Bayesian
optimal policies are admissible (Exercise 34.12) and all admissible policies
are either Bayesian optimal for some prior or limit point of a sequence of
Bayesian optimal policies (Exercise 34.11). Be warned, however, there are
counterexamples. A nice book with many examples is by Berger [1985]. In
Exercise 34.13 you will prove that all admissible policies for stochastic Bernoulli
bandits are Bayesian optimal for some prior.
4 While admissibility and related notions of optimality are helpful in being clear
about the goals of algorithm design, we must recognize that these concepts are
too binary for most purposes. One problem with the classic decision theory
literature is that it puts too much emphasis on these narrow concepts. Who
would argue that a policy that is dominated, but just barely, is worth nothing?
Especially since the optimal policy is often intractable. Meaningful ways of
defining slightly worse usually consider a bigger picture when a policy design
approach (policy schema) is evaluated across many problem classes. In the
Bayesian setting, one may for example consider allk-armed stochastic Bayesian
bandits with (say) bounded rewards and consider policy schema that work no
matter what the environment is. An example of a policy schema is Thompson
sampling since it can be instantiated for any of the environments. One may ask
whether such a policy schema is near Bayesian- (or minimax-) optimal across
all of the considered environments. In fact, most bandit algorithm design is
better viewed as designing policy schema.
5 Sion’s minimax theorem provides a connection between minimax optimal regret
and the maximum Bayesian optimal regret over all priors. In Exercise 34.7
you will show that all policies can be viewed as probability measures over
deterministic policies. LetPbe a space of probability measures on deterministic
policies andQbe a space of probability measures on (E,G), which are subsets
of the topological vector spaces of all signed measures on deterministic policies
and environments respectively. Defineg:P ×Q→Rby

g(μ,Q) =

∫

ΠD

∫

E

Rn(π,ν)dQ(ν)dμ(π),

which is linear in both arguments because integrals are linear as a function of the measure. Suppose thatgis continuous in both arguments andPandQare convex and at least one of them is compact. Then by Sion’s minimax theorem (Theorem 28.12), sup Q∈Q

inf μ∈P

g(μ,Q) = inf μ∈P

sup Q∈Q

g(μ,Q). (34.9)

UsuallyPandQinclude all Dirac measures: {δπ:π∈ΠD}⊆P and {δν:ν∈E}⊆Q.

Then the left-hand side of Eq. (34.9) issupQ∈QBR∗n(Q) and the right-hand

Bandit Algorithms

∫

∫

Get our desktop app

Company

Features

Documentation

Resources