Bandit Algorithms


34 Foundations of Bayesian Learning


Bayesian methods have been used for bandits from the beginning and dominated
research in the fifties, sixties and seventies. This chapter introduces the Bayesian
viewpoint and develops the technical tools necessary for applications in bandits.
Readers who are already familiar with measure-theoretic Bayesian analysis
can skim Sections 34.4 and 34.6 for the notation used in subsequent chapters.

34.1 Statistical decision theory and Bayesian learning


The fundamental challenge in learning problems is that the true environment is
unknown and policies that are optimal in one environment are usually not optimal
in another. This forces the user to make tradeoffs, balancing performance between
environments. We have already discussed this in the context of finite-armed
bandits in Part IV. Here we take a step back and consider a more general setup.
Let E be a set of environments and Π a set of policies. These could be bandit
environments/policies, but for now an abstract view is sufficient. A loss function
is a mapping ℓ : E × Π → R, with ℓ(ν, π) representing the loss suffered by policy π
in environment ν. Of course you should choose a policy that makes the loss small,
but most choices are incomparable because the loss depends on the environment.
Fig. 34.1 illustrates a typical situation with four policies. Some policies can be
eliminated from consideration because they are dominated, which means they
suffer at least as much loss as some other policy on all environments and more
loss on at least one. A policy that is not dominated is called admissible or
Pareto optimal, and choosing between these policies is nontrivial. One canonical
choice of admissible policy (assuming it exists) is a minimax optimal policy
π ∈ argmin_{π′ ∈ Π} sup_{ν ∈ E} ℓ(ν, π′). Minimax optimal policies enjoy robustness, but the
price may be quite large on average. Would you choose the minimax optimal
policy in the example in Fig. 34.1?
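
To make these definitions concrete, here is a small illustrative sketch (the loss table, its dimensions and all numbers are invented for this example, not taken from the text): it encodes ℓ(ν, π) as a matrix over finite sets of environments and policies, flags the dominated policies, and picks out a minimax optimal one.

```python
import numpy as np

# Hypothetical loss table: loss[i, j] = loss of policy j in environment i.
loss = np.array([
    [1.0, 2.0, 4.0, 1.5],
    [3.0, 2.5, 1.5, 3.0],
    [2.0, 2.0, 5.0, 2.5],
])

def is_dominated(loss, j):
    """Policy j is dominated if some other policy does at least as well in
    every environment and strictly better in at least one."""
    return any(
        k != j
        and np.all(loss[:, k] <= loss[:, j])
        and np.any(loss[:, k] < loss[:, j])
        for k in range(loss.shape[1])
    )

admissible = [j for j in range(loss.shape[1]) if not is_dominated(loss, j)]

# A minimax optimal policy minimises the worst-case loss over environments.
worst_case = loss.max(axis=0)
minimax = int(np.argmin(worst_case))

print("admissible policies:", admissible)           # here: [0, 1, 2]
print("minimax policy:", minimax)                   # here: 1
print("its worst-case loss:", worst_case[minimax])  # here: 2.5
```

In this invented table policy 3 is dominated by policy 0, so only the remaining three are admissible; among those, policy 1 has the smallest worst-case loss and is therefore minimax optimal, even though it is not the best policy in any single environment.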
In the Bayesian viewpoint the uncertainty in the environment is captured by
choosing a prior probability measure on E that reflects the user’s belief about
which environment is true. Having committed to a prior, the Bayesian optimal
policy simply minimizes the expected loss with respect to the prior. When E
is countable, a measure corresponds to a probability vector q ∈ P(E) and the
Bayesian optimal policy minimizes the expected loss ∑_{ν ∈ E} q(ν) ℓ(ν, π) over π ∈ Π.
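
Continuing the same invented loss table, the sketch below computes a Bayesian optimal policy for a countable E by weighting the rows by the prior q and minimising the resulting expected loss; the prior here is an arbitrary choice made purely for illustration.

```python
import numpy as np

# Same hypothetical loss table as above: loss[i, j] = loss of policy j in environment i.
loss = np.array([
    [1.0, 2.0, 4.0, 1.5],
    [3.0, 2.5, 1.5, 3.0],
    [2.0, 2.0, 5.0, 2.5],
])

# An assumed prior q over the three environments (non-negative, sums to one).
q = np.array([0.5, 0.3, 0.2])

# Expected loss of each policy: sum over environments of q(nu) * loss(nu, pi).
expected_loss = q @ loss
bayes_optimal = int(np.argmin(expected_loss))

print("expected losses:", expected_loss)          # here: [1.8, 2.15, 3.45, 2.15]
print("Bayesian optimal policy:", bayes_optimal)  # here: 0
```

With this particular prior the Bayesian optimal policy (policy 0) differs from the minimax optimal one (policy 1), illustrating the trade-off between average-case and worst-case performance discussed above.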