Bandit Algorithms

This material will be published by Cambridge University Press as Bandit Algorithms by Tor Lattimore and Csaba Szepesvari. This pre-publication version is free to view and download for personal use only. Not for re-distribution, sale, or use in derivative works.©Tor Lattimore and Csaba Szepesvari 2017. The latest version is available athttp://banditalgs.com. Feedback on any aspect is very welcome: [email protected]

35 Bayesian Bandits

The first section of this chapter provides simple bounds on the Bayesian optimal regret, which are obtained by integrating the regret guarantees for frequentist algorithms studied in Part II. This is followed by a short interlude on the basic theory of optimal stopping. The final sections are devoted to special cases where computing the Bayesian optimal policy is tractable. The highlight is a proof of Gittins celebrated result characterizing the Bayesian optimal policy for finite- armed bandits with an infinite-horizon and discounted rewards.

35.1 Bayesian optimal regret fork-armed stochastic bandits

Even in relatively benign setups, the computation of the Bayesian optimal policy appears hopelessly intractable. Nevertheless, one can investigate the value of the Bayesian optimal regret by proving upper and lower bounds. For simplicity we restrict our attention to Bernoulli bandits, but the arguments generalize more broadly. Let (E,G) = ([0,1]k,B([0,1]k)), forν ∈[0,1]klet Pνj=B(νj). Choose some priorQon (E,G). The Bayesian optimal regret is necessarily smaller than the minimax regret, which by Theorem 9.1 means that

BR∗n(Q)≤C

√

kn,

whereC >0 is a universal constant. The proof of the lower bound in Exercise 15.2 shows that for eachnthere exists a priorQfor which

BR∗n(Q)≥c

√

kn,

where c > 0 is a universal constant. These two together show that the supQBR∗n(Q) = Θ(

√

kn). Turning to the asymptotics for a fixed distribution, recall that that for any fixed Bernoulli bandit environment, the asymptotic growth rate of regret is Θ(log(n)). In stark contrast to this, the best we can say in the Bayesian case is that the asymptotic growth rate ofBR∗n(Q) is slower than

√

n, but for some priors

√

nis almost a lower bound on the growth rate. In particular, we ask you to prove the following theorem in Exercise 35.1:

Bandit Algorithms

35 Bayesian Bandits

35.1 Bayesian optimal regret fork-armed stochastic bandits

√

√

√

√

√

Get our desktop app

Company

Features

Documentation

Resources