then samples $S_{t+1}$ from the probability vector $P_{A_t}(S_t)$ and the next round begins
(Fig. 38.1).
Figure 38.1 Interaction protocol for Markov decision processes: set $t = 1$ and sample $S_1 \sim \mu$; then, in each round, observe state $S_t$, choose action $A_t \in \mathcal{A}$, receive reward $r_{A_t}(S_t)$, update $S_{t+1} \sim P_{A_t}(S_t)$, and increment $t$.
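To make the protocol concrete, here is a minimal Python sketch of one way to simulate this loop. Everything in it is our own illustration rather than anything from the text: `P[a][s]` is assumed to be a probability vector over next states, `r[a][s]` a deterministic reward, `mu` the initial state distribution, and `policy` an arbitrary rule mapping the observed state to an action.

```python
import numpy as np

rng = np.random.default_rng(0)

def interact(P, r, mu, policy, horizon):
    """Simulate the interaction protocol of Fig. 38.1 for `horizon` rounds."""
    n_states = len(mu)
    s = rng.choice(n_states, p=mu)           # t = 1: sample S_1 ~ mu
    for t in range(1, horizon + 1):
        a = policy(s)                        # observe S_t, choose A_t in A
        reward = r[a][s]                     # receive reward r_{A_t}(S_t)
        s = rng.choice(n_states, p=P[a][s])  # sample S_{t+1} ~ P_{A_t}(S_t)
        yield t, a, reward                   # increment t and begin next round
```

Writing the loop as a generator simply makes it easy to inspect the sequence of (round, action, reward) triples as the interaction unfolds.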
Although the action set is the same in all states, this does not mean that
$P_a(s)$ or $r_a(s)$ has any relationship to $P_a(s')$ or $r_a(s')$ for states $s \neq s'$. In
this sense it might be better to use an entirely different set of actions for
each state, which would not change the results we present. And while we
are at it, of course one could also allow the number of actions to vary over
the state space.
Histories and policies
Before considering the learning problem, we explain how to act in a known MDP.
Because there is no learning going on, we call our protagonist the 'agent' rather
than the 'learner'. In a stochastic bandit the optimal policy given knowledge of the
bandit is to choose the action with the largest expected reward in every round.
In a Markov decision process the definition of optimality is less clear.
The history $H_t = (S_1, A_1, \ldots, S_{t-1}, A_{t-1}, S_t)$ in round $t$ contains the
information available before the action for the round is to be chosen. Note
that state $S_t$ is included in $H_t$. The actions are also included because the agent
may randomize. For simplicity the rewards are omitted because the all-knowing
agent can recompute them if needed from the state-action pairs.
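For instance, in round $t = 3$ the history is $H_3 = (S_1, A_1, S_2, A_2, S_3)$: the two earlier states and actions together with the current state $S_3$, on the basis of which $A_3$ is chosen.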
A policy is a (possibly randomized) map from the set of possible histories
to actions. Simple policies include memoryless policies, which choose actions
based only on the current state, possibly in a randomized manner. The set
of such policies is denoted by $\Pi_{\mathrm{M}}$ and its elements are identified with maps
$\pi : \mathcal{A} \times \mathcal{S} \to [0,1]$ with $\sum_{a \in \mathcal{A}} \pi(a \mid s) = 1$ for any $s \in \mathcal{S}$, so that $\pi(a \mid s)$ is
interpreted as the probability that policy $\pi$ takes action $a$ in state $s$.
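As an illustrative sketch under our own assumptions (none of the names below come from the text), a memoryless policy over finite state and action sets can be stored as a row-stochastic array with `pi[s, a]` playing the role of $\pi(a \mid s)$; acting with the policy then amounts to sampling from the row of the current state.

```python
import numpy as np

rng = np.random.default_rng(0)

# pi[s, a] = probability of playing action a in state s; every row sums to
# one, matching the constraint sum_a pi(a | s) = 1 for all s in S.
pi = np.array([[0.7, 0.3],   # state 0
               [0.1, 0.9]])  # state 1
assert np.allclose(pi.sum(axis=1), 1.0)

def act(pi, s):
    """Sample an action A ~ pi(. | s) for a memoryless (stochastic) policy."""
    return rng.choice(pi.shape[1], p=pi[s])

# A deterministic memoryless policy is the special case in which each row
# of pi places all of its probability mass on a single action.
```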