Bandit Algorithms


34 Foundations of Bayesian Learning


Bayesian methods have been used for bandits from the beginning and dominated
research in the fifties, sixties and seventies. This chapter introduces the Bayesian
viewpoint and develops the technical tools necessary for applications in bandits.
Readers who are already familiar with measure-theoretic Bayesian analysis
can skim Sections 34.4 and 34.6 for the notation used in subsequent chapters.

34.1 Statistical decision theory and Bayesian learning


The fundamental challenge in learning problems is that the true environment is
unknown and policies that are optimal in one environment are usually not optimal
in another. This forces the user to make tradeoffs, balancing performance between
environments. We have already discussed this in the context of finite-armed
bandits in Part IV. Here we take a step back and consider a more general setup.
Let E be a set of environments and Π a set of policies. These could be bandit
environments/policies, but for now an abstract view is sufficient. A loss function
is a mapping ℓ : E × Π → R, with ℓ(ν, π) representing the loss suffered by policy π
in environment ν. Of course you should choose a policy that makes the loss small,
but most choices are incomparable because the loss depends on the environment.
Fig. 34.1 illustrates a typical situation with four policies. Some policies can be
eliminated from consideration because they are dominated, which means they
suffer at least as much loss as some other policy on all environments and more
loss on at least one. A policy that is not dominated is called admissible or
Pareto optimal, and choosing between these policies is nontrivial. One canonical
choice of admissible policy (assuming it exists) is a minimax optimal policy
π ∈ argmin_{π′ ∈ Π} sup_{ν ∈ E} ℓ(ν, π′). Minimax optimal policies enjoy robustness, but the
price may be quite large on average. Would you choose the minimax optimal
policy in the example in Fig. 34.1?
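
To make these definitions concrete, here is a small illustrative sketch (the loss table, its dimensions and all numbers are invented for this example, not taken from the text): it encodes ℓ(ν, π) as a matrix over finite sets of environments and policies, flags the dominated policies, and picks out a minimax optimal one.

```python
import numpy as np

# Hypothetical loss table: loss[i, j] = loss of policy j in environment i.
loss = np.array([
    [1.0, 2.0, 4.0, 1.5],
    [3.0, 2.5, 1.5, 3.0],
    [2.0, 2.0, 5.0, 2.5],
])

def is_dominated(loss, j):
    """Policy j is dominated if some other policy does at least as well in
    every environment and strictly better in at least one."""
    return any(
        k != j
        and np.all(loss[:, k] <= loss[:, j])
        and np.any(loss[:, k] < loss[:, j])
        for k in range(loss.shape[1])
    )

admissible = [j for j in range(loss.shape[1]) if not is_dominated(loss, j)]

# A minimax optimal policy minimises the worst-case loss over environments.
worst_case = loss.max(axis=0)
minimax = int(np.argmin(worst_case))

print("admissible policies:", admissible)           # here: [0, 1, 2]
print("minimax policy:", minimax)                   # here: 1
print("its worst-case loss:", worst_case[minimax])  # here: 2.5
```

In this invented table policy 3 is dominated by policy 0, so only the remaining three are admissible; among those, policy 1 has the smallest worst-case loss and is therefore minimax optimal, even though it is not the best policy in any single environment.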
In the Bayesian viewpoint the uncertainty in the environment is captured by
choosing a prior probability measure on E that reflects the user’s belief about
which environment is true. Having committed to a prior, the Bayesian optimal
policy simply minimizes the expected loss with respect to the prior. When E
is countable, a measure corresponds to a probability vector q ∈ P(E) and the
Bayesian optimal policy minimizes the expected loss ∑_{ν ∈ E} q(ν) ℓ(ν, π) over π ∈ Π.
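
Continuing the same invented loss table, the sketch below computes a Bayesian optimal policy for a countable E by weighting the rows by the prior q and minimising the resulting expected loss; the prior here is an arbitrary choice made purely for illustration.

```python
import numpy as np

# Same hypothetical loss table as above: loss[i, j] = loss of policy j in environment i.
loss = np.array([
    [1.0, 2.0, 4.0, 1.5],
    [3.0, 2.5, 1.5, 3.0],
    [2.0, 2.0, 5.0, 2.5],
])

# An assumed prior q over the three environments (non-negative, sums to one).
q = np.array([0.5, 0.3, 0.2])

# Expected loss of each policy: sum over environments of q(nu) * loss(nu, pi).
expected_loss = q @ loss
bayes_optimal = int(np.argmin(expected_loss))

print("expected losses:", expected_loss)          # here: [1.8, 2.15, 3.45, 2.15]
print("Bayesian optimal policy:", bayes_optimal)  # here: 0
```

With this particular prior the Bayesian optimal policy (policy 0) differs from the minimax optimal one (policy 1), illustrating the trade-off between average-case and worst-case performance discussed above.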