38 Markov Decision Processes
Bandit environments are a sensible model for many simple problems, but they do
not model more complex environments where actions have long-term consequences.
A brewing company needs to plan ahead when ordering ingredients and the
decisions made today affect their position to brew the right amount of beer in
the future. A student learning mathematics benefits not only from the immediate
reward of learning an interesting topic, but also from their improved job prospects.
A Markov decision process is a simple way to incorporate long-term
planning into the bandit framework. Like in bandits, the learner chooses actions
and receives rewards. But they also observe a state, and the rewards for different
actions depend on the state. Furthermore, the actions chosen affect which state
will be observed next.
38.1 Problem setup
A Markov decision process (MDP) is defined by a tuple $M = (\mathcal{S}, \mathcal{A}, P, r, \mu)$. The
first two items $\mathcal{S}$ and $\mathcal{A}$ are sets called the state space and action space, and
$S = |\mathcal{S}|$ and $A = |\mathcal{A}|$ are their sizes, which may be infinite. An MDP is finite
if $S, A < \infty$. The quantity $P = (P_a : a \in \mathcal{A})$ is called the transition
function with $P_a : \mathcal{S} \times \mathcal{S} \to [0,1]$, so that $P_a(s, s')$ is the probability that the
learner transitions from state $s$ to state $s'$ when taking action $a$. The fourth element of
the tuple is $r = (r_a : a \in \mathcal{A})$, which is a collection of reward functions with
$r_a : \mathcal{S} \to [0,1]$. When the learner takes action $a$ in state $s$ it receives a deterministic
reward of $r_a(s)$. The last element is $\mu \in \mathcal{P}(\mathcal{S})$, which is a distribution over the
states that determines the starting state. The transition and reward functions are
often represented by vectors or matrices. When the state space is finite we may
assume without loss of generality that $\mathcal{S} = [S]$. We write $P_a(s) \in [0,1]^S$ for the
probability vector whose $s'$th coordinate is $P_a(s, s')$. In the same way we let
$P_a \in [0,1]^{S \times S}$ be the right stochastic matrix with $(P_a)_{s,s'} = P_a(s, s')$. Finally, we
view $r_a$ as a vector in $[0,1]^S$ in the natural way.
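For a finite MDP this representation translates directly into arrays. The following is a minimal sketch, not from the book: the sizes, the randomly generated transitions and rewards, and the helper step are illustrative assumptions. It stores each $P_a$ as a right stochastic $S \times S$ matrix, each $r_a$ as a vector in $[0,1]^S$, and draws a next state from the probability vector $P_a(s)$.

import numpy as np

rng = np.random.default_rng(0)

S, A = 3, 2  # illustrative sizes of the state and action spaces

# P[a] is the right stochastic S x S matrix with P[a][s, s'] = P_a(s, s');
# each row is a probability vector over next states.
P = rng.dirichlet(np.ones(S), size=(A, S))

# r[a] is the reward vector in [0, 1]^S with r[a][s] = r_a(s).
r = rng.uniform(0.0, 1.0, size=(A, S))

# mu is the initial-state distribution over [S].
mu = np.full(S, 1.0 / S)

def step(s: int, a: int) -> tuple[float, int]:
    """Take action a in state s: receive the deterministic reward r_a(s)
    and move to a next state drawn from the probability vector P_a(s)."""
    reward = r[a, s]
    s_next = rng.choice(S, p=P[a, s])
    return reward, s_next

# Sample the starting state from mu and take one action.
s = rng.choice(S, p=mu)
reward, s = step(s, a=0)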
The interaction protocol is similar to bandits. Before the game starts, the
initial state $S_1$ is sampled from $\mu$. In each round $t$ the learner observes the state
$S_t \in \mathcal{S}$, chooses an action $A_t \in \mathcal{A}$ and receives reward $r_{A_t}(S_t)$. The environment