Bandit Algorithms

This material will be published by Cambridge University Press as Bandit
Algorithms by Tor Lattimore and Csaba Szepesvari. This pre-publication version
is free to view and download for personal use only. Not for re-distribution, sale,
or use in derivative works. © Tor Lattimore and Csaba Szepesvari 2017.
The latest version is available at http://banditalgs.com. Feedback on any
aspect is very welcome: [email protected]

4 Stochastic Bandits


The goal of this chapter is to formally introduce stochastic bandits. The model
introduced here will provide the basis for the other chapters in the book that
treat stochastic bandits. While the topic may seem a bit mundane, it is important
to be clear about one's assumptions, and thus we spend quite a bit of time
explaining what is behind this model. The chapter also introduces the learning
objective and concepts like regret, explaining why we should care about these.
The one important (though simple) result in this chapter is the so-called regret
decomposition, which is presented in Section 4.5.

4.1 Core assumptions


A stochastic bandit is a collection of distributions ν = (P_a : a ∈ A), where
A is the set of available actions. The learner and the environment interact
sequentially over n rounds. In each round t ∈ {1, ..., n} the learner chooses an
action A_t ∈ A, which is fed to the environment. The environment then samples
a reward X_t ∈ R from distribution P_{A_t} and reveals X_t to the learner. The
interaction between the learner (or policy) and the environment induces a measure
on the sequence of outcomes A_1, X_1, A_2, X_2, ..., A_n, X_n. Usually the horizon n
is finite, but sometimes we allow the interaction to continue indefinitely (n = ∞).
The sequence of outcomes should satisfy the following assumptions:

(a) The conditional distribution of the reward X_t given A_1, X_1, ..., A_{t-1}, X_{t-1}, A_t
is P_{A_t}, which captures the intuition that the environment samples X_t from
P_{A_t} in round t.
(b) The conditional law of the action A_t given A_1, X_1, ..., A_{t-1}, X_{t-1} is
π_t(· | A_1, X_1, ..., A_{t-1}, X_{t-1}), where π_1, π_2, ... is a sequence of probability
kernels that characterize the learner. The most important element of this
assumption is the intuitive fact that the learner cannot use future
observations in current decisions.
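
The interaction protocol above can be sketched in code. The following is a minimal illustration, not from the book: `arms` stands in for the distributions (P_a : a ∈ A) as sampling functions, and `policy` stands in for the kernels π_t, here collapsed into one function of the observed history so that, by construction, A_t depends only on past actions and rewards.

```python
import random


def interact(policy, arms, n, rng=random.Random(0)):
    """Simulate n rounds of the stochastic bandit protocol.

    `arms` maps each action a in A to a function sampling a reward
    from P_a; `policy` maps the history (A_1, X_1, ..., A_{t-1}, X_{t-1})
    to the next action. Names are illustrative, not from the text.
    """
    history = []
    for t in range(n):
        a = policy(history)   # A_t is chosen from past observations only
        x = arms[a](rng)      # the environment samples X_t from P_{A_t}
        history.append((a, x))
    return history


# A two-armed Gaussian bandit and a policy that simply alternates arms.
arms = {0: lambda r: r.gauss(0.0, 1.0), 1: lambda r: r.gauss(0.5, 1.0)}
outcomes = interact(lambda h: len(h) % 2, arms, n=10)
```

A randomized learner would be modeled by letting `policy` draw its action from a distribution over A given the history, matching the probability-kernel formulation in (b).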

A mathematician might ask whether a probability space carrying these random
elements such that (a) and (b) are satisfied exists at all. Specific constructions
showing this in the affirmative are given in Section 4.6. These constructions are also