Bandit Algorithms


$\langle \hat P_k - P, v_k \rangle$, where the dependence on the state and action has been omitted.
Of course the algorithm must be changed to use the improved confidence
intervals. At first sight it seems that one could apply Hoeffding’s bound
directly to the inner product, but there is a subtle problem that has spoiled a
number of attempts: $v_k$ and $\hat P_k$ are not independent. This non-independence
is unfortunately quite pernicious and appears from many angles. We advise
extreme caution (some references for guidance are given in the bibliographic
remarks).
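
To make the difficulty concrete, here is a minimal sketch of the tempting (but flawed) argument; the visit count $N_k(s,a)$, the confidence level $\delta$ and the span seminorm are notation introduced here for illustration only and are not taken from the surrounding text. Writing the inner product coordinatewise,
\[
\langle \hat P_k - P, v_k \rangle \;=\; \sum_{s'} \bigl(\hat P_k(s' \mid s,a) - P(s' \mid s,a)\bigr)\, v_k(s')\,.
\]
If $v$ were a fixed vector chosen before the data were collected, then each of the $N_k(s,a)$ observed transitions would contribute a bounded, independent term, and Hoeffding's inequality would give, with probability at least $1-\delta$,
\[
\bigl|\langle \hat P_k - P, v \rangle\bigr| \;\le\; \operatorname{span}(v)\,\sqrt{\frac{\log(2/\delta)}{2\,N_k(s,a)}}\,.
\]
The catch is that $v_k$ is computed from the same data as $\hat P_k$, so this bound cannot be applied as stated; the dependence must be paid for, for instance by a union bound over a cover of candidate value functions or by bounding $\lVert \hat P_k(\cdot \mid s,a) - P(\cdot \mid s,a)\rVert_1$ uniformly over state-action pairs.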

38.9 Bibliographical remarks


[Figure: photograph of Richard Bellman]

The study of sequential decision making has a long
history and we recommend the introduction of the book
by Puterman [2009] as a good starting point. One of the
main architects in modern times is Richard Bellman,
who wrote an influential book [Bellman, 1954]. His
autobiography is so entertaining that reading it slowed
the writing of this chapter: “The Eye of the Hurricane”
[Bellman, 1984]. As a curiosity, Bellman learned about
bandit problems after accidentally encountering a
paper by Thompson [1935]; the tidbit is on page
260 of the autobiography.
Markov decision processes are studied by multiple
research communities, including control, operations
research and artificial intelligence. The two-volume book by Bertsekas [2012]
provides a thorough and formal introduction to the basics. The perspective
is quite interdisciplinary, but with a slight (good) bias towards the control
literature. The perspective of an operations researcher is most precisely conveyed
in the comprehensive book by Puterman [2009]. A very readable shorter
introductory book is by Ross [1983]. Arapostathis et al. [1993] surveyed existing
analytical results (existence, uniqueness of optimal policies, validity of the Bellman
optimality equation) for average-reward MDPs with an emphasis on continuous
state and action space models. The online lecture notes of Kallenberg [2016] are a
recent, comprehensive alternative account of the theory of discrete MDPs. There
are many texts on linear/convex optimization and the ellipsoid method. The
introductory book on linear optimization by Bertsimas and Tsitsiklis [1997] is a
pleasant read, while the ellipsoid method is explained in detail by Grötschel et al.
[2012].
The problem considered in this chapter is part of a broader field called
reinforcement learning (RL), which has recently seen a surge of interest. The
books by Sutton and Barto [2018] and Bertsekas and Tsitsiklis [1996] describe
the foundations. The first book provides an intuitive introduction aimed at
computer scientists, while the second book focuses on the theoretical results of