$\langle \hat P_k - P, v_k \rangle$, where the dependence on the state and action has been omitted.
Of course the algorithm must be changed to use the improved confidence
intervals. At first sight it seems that one could apply Hoeffding’s bound
directly to the inner product, but there is a subtle problem that has spoiled a
number of attempts:vkandPˆkare not independent. This non-independence
is unfortunately quite pernicious and appears from many angles. We advise
extreme caution (some references for guidance are given in the bibliographic
remarks).
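To see concretely why the dependence matters, here is a small toy simulation, not from the text; the uniform distribution, the sample sizes and the variable names are illustrative assumptions. It compares the fluctuations of $\langle \hat P - P, v\rangle$ when $v$ is fixed before the data are seen, where Hoeffding's bound applies, with the case where $v$ is chosen after seeing the empirical estimate, as happens when $v_k$ is computed from the same data as $\hat P_k$.

```python
import numpy as np

# Toy simulation (not from the book) of the dependence issue.  P_hat is the
# empirical estimate of a fixed distribution P from n samples.  For a vector v
# chosen independently of the data, <P_hat - P, v> is an average of bounded
# i.i.d. terms, so Hoeffding gives deviations of order sqrt(1/n).  If instead
# v is chosen after seeing P_hat (here the worst case v = sign(P_hat - P)),
# the inner product equals ||P_hat - P||_1, which is of order sqrt(S/n).
rng = np.random.default_rng(0)
S, n, trials = 50, 200, 2000
P = np.full(S, 1.0 / S)                      # true distribution (uniform)
v_fixed = rng.choice([-1.0, 1.0], size=S)    # chosen before seeing any data

dev_fixed, dev_dependent = [], []
for _ in range(trials):
    P_hat = rng.multinomial(n, P) / n        # empirical distribution
    diff = P_hat - P
    dev_fixed.append(abs(diff @ v_fixed))            # Hoeffding-sized fluctuation
    dev_dependent.append(abs(diff @ np.sign(diff)))  # = ||P_hat - P||_1

print("independent v   :", np.mean(dev_fixed))
print("data-dependent v:", np.mean(dev_dependent))
```

With these parameters the data-dependent choice of $v$ inflates the typical deviation by roughly a factor of $\sqrt{S}$, which is exactly the slack that a naive application of Hoeffding's bound to the inner product would miss.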
38.9 Bibliographical remarks
The study of sequential decision making has a long
history and we recommend the introduction of the book
by Puterman [2009] as a good starting point. One of the
main architects in modern times is Richard Bellman,
who wrote an influential book [Bellman, 1954]. His
autobiography, “The Eye of the Hurricane” [Bellman, 1984], is so entertaining
that reading it slowed the writing of this chapter. As a curiosity, Bellman
learned about bandit problems after accidentally encountering a paper by
Thompson [1935]; for the tidbit see page 260 of the autobiography.
Markov decision processes are studied by multiple
research communities, including control, operations
research and artificial intelligence. The two-volume book by Bertsekas [2012]
provides a thorough and formal introduction to the basics. The perspective
is quite interdisciplinary, but with a slight (good) bias towards the control
literature. The perspective of an operations researcher is most precisely conveyed
in the comprehensive book by Puterman [2009]. A very readable shorter
introductory book is by Ross [1983]. Arapostathis et al. [1993] surveyed existing
analytical results (existence, uniqueness of optimal policies, validity of the Bellman
optimality equation) for average-reward MDPs with an emphasis on continuous
state and action space models. The online lecture notes of Kallenberg [2016] are a
recent and comprehensive alternative account of the theory of discrete MDPs. There
are many texts on linear/convex optimization and the ellipsoid method. The
introductory book on linear optimization by Bertsimas and Tsitsiklis [1997] is a
pleasant read while the ellipsoid method is explained in detail by Grötschel et al.
[2012].
The problem considered in this chapter is part of a broader field called
reinforcement learning (RL), which has recently seen a surge of interest. The
books by Sutton and Barto [2018] and Bertsekas and Tsitsiklis [1996] describe
the foundations. The first book provides an intuitive introduction aimed at
computer scientists, while the second book focuses on the theoretical results of