Bandit Algorithms

(Jeff_L) #1
This material will be published by Cambridge University Press as Bandit
Algorithms by Tor Lattimore and Csaba Szepesvari. This pre-publication version
is free to view and download for personal use only. Not for re-distribution, sale,
or use in derivative works.©Tor Lattimore and Csaba Szepesvari 2017.
The latest version is available athttp://banditalgs.com. Feedback on any
aspect is very welcome: [email protected]

35 Bayesian Bandits


The first section of this chapter provides simple bounds on the Bayesian optimal
regret, which are obtained by integrating the regret guarantees for frequentist
algorithms studied in Part II. This is followed by a short interlude on the basic
theory of optimal stopping. The final sections are devoted to special cases where
computing the Bayesian optimal policy is tractable. The highlight is a proof of
Gittins celebrated result characterizing the Bayesian optimal policy for finite-
armed bandits with an infinite-horizon and discounted rewards.

35.1 Bayesian optimal regret fork-armed stochastic bandits


Even in relatively benign setups, the computation of the Bayesian optimal policy
appears hopelessly intractable. Nevertheless, one can investigate the value of the
Bayesian optimal regret by proving upper and lower bounds.
For simplicity we restrict our attention to Bernoulli bandits, but the arguments
generalize more broadly. Let (E,G) = ([0,1]k,B([0,1]k)), forν ∈[0,1]klet
Pνj=B(νj). Choose some priorQon (E,G). The Bayesian optimal regret is
necessarily smaller than the minimax regret, which by Theorem 9.1 means that

BR∗n(Q)≤C


kn,

whereC >0 is a universal constant. The proof of the lower bound in Exercise 15.2
shows that for eachnthere exists a priorQfor which

BR∗n(Q)≥c


kn,

where c > 0 is a universal constant. These two together show that the
supQBR∗n(Q) = Θ(


kn).
Turning to the asymptotics for a fixed distribution, recall that that for any fixed
Bernoulli bandit environment, the asymptotic growth rate of regret is Θ(log(n)).
In stark contrast to this, the best we can say in the Bayesian case is that the
asymptotic growth rate ofBR∗n(Q) is slower than


n, but for some priors


nis
almost a lower bound on the growth rate. In particular, we ask you to prove the
following theorem in Exercise 35.1:
Free download pdf