38.2 Optimal policies and the Bellman optimality equation
Theorem 38.2. The following hold:
(a) There exists a pair $(\rho, v)$ that satisfies the Bellman optimality equation.
(b) If $(\rho, v)$ satisfies the Bellman optimality equation, then $\rho = \rho^*$ and $\pi_v$ is optimal.
(c) There exists a deterministic memoryless optimal policy.
Proof sketch. The proof of Part (a) is too long to include here, but we guide you through it in Exercise 38.10. For Part (b), let $(\rho, v)$ satisfy the Bellman optimality equation and $\pi^* = \pi_v$ be the greedy policy with respect to $v$. Then, by Eq. (38.2),
\[
\rho^{\pi^*}
= \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^n P_{\pi^*}^{t-1} r_{\pi^*}
= \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^n P_{\pi^*}^{t-1} \bigl(\rho \mathbf{1} + v - P_{\pi^*} v\bigr)
= \rho \mathbf{1}\,.
\]
Next let $\pi$ be an arbitrary Markov policy. We show that $\bar\rho_\pi \le \rho \mathbf{1}$. The result is then completed using the result of Exercise 38.2, where you will prove that for any policy $\pi$ there exists a Markov policy with the same expected rewards. Denote by $\pi_t$ the memoryless policy used at time $t = 1, 2, \ldots$ when following the Markov policy $\pi$, and for $t \ge 1$ let $P_\pi(t) = P_{\pi_1} \cdots P_{\pi_t}$, while for $t = 0$ let $P_\pi(0) = I$. Thus, $P_\pi(t)(s, s')$ is the probability of ending up in state $s'$ while following $\pi$ from state $s$ for $t$ time steps. It follows that $\bar\rho_\pi = \limsup_{n\to\infty} \frac{1}{n} \sum_{t=1}^n P_\pi(t-1)\, r_{\pi_t}$. Fix $t \ge 1$. Using the fact that $\pi^*$ is the greedy policy with respect to $v$ gives
\[
\begin{aligned}
P_\pi(t-1)\, r_{\pi_t}
&= P_\pi(t-1)\bigl(r_{\pi_t} + P_{\pi_t} v - P_{\pi_t} v\bigr) \\
&\le P_\pi(t-1)\bigl(r_{\pi^*} + P_{\pi^*} v - P_{\pi_t} v\bigr) \\
&= P_\pi(t-1)\bigl(\rho \mathbf{1} + v - P_{\pi_t} v\bigr) \\
&= \rho \mathbf{1} + P_\pi(t-1)\, v - P_\pi(t)\, v\,.
\end{aligned}
\]
Taking the average of both sides over $t \in [n]$, the value terms on the right-hand side telescope to $\rho \mathbf{1} + \frac{1}{n}\bigl(v - P_\pi(n) v\bigr)$, and taking the limit shows that $\bar\rho_\pi \le \rho \mathbf{1}$, finishing the proof. Part (c) follows immediately from the first two parts.
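The following minimal sketch illustrates Parts (b) and (c) numerically on an invented two-state, two-action MDP; everything in it (`P`, `r`, `evaluate`, `greedy`, and the numbers themselves) is a hypothetical example chosen for illustration, not material from the text. It enumerates the deterministic memoryless policies, solves each policy's evaluation equation, and singles out the policy whose pair $(\rho, v)$ also solves the Bellman optimality equation, namely the one that is greedy with respect to its own $v$.

```python
import numpy as np
from itertools import product

# Hypothetical two-state, two-action MDP (invented for illustration):
# P[a, s, s'] are transition probabilities, r[a, s] expected rewards in [0, 1].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[1.0, 0.0],
              [0.5, 0.8]])
A, S, _ = P.shape

def evaluate(pi):
    """Solve the policy evaluation equation rho*1 + v = r_pi + P_pi v,
    normalised so that v(0) = 0, for a deterministic memoryless policy pi."""
    P_pi = P[pi, np.arange(S)]            # P_pi[s, s'] = P_{pi(s)}(s, s')
    r_pi = r[pi, np.arange(S)]            # r_pi[s]    = r_{pi(s)}(s)
    M = np.zeros((S, S))                  # unknowns: (rho, v(1), ..., v(S-1))
    M[:, 0] = 1.0
    M[:, 1:] = np.eye(S)[:, 1:] - P_pi[:, 1:]
    x = np.linalg.solve(M, r_pi)
    return x[0], np.concatenate(([0.0], x[1:]))   # (rho, v)

def greedy(v):
    """Deterministic memoryless policy that is greedy with respect to v."""
    return (r + P @ v).argmax(axis=0)

gains = {}
for pi in product(range(A), repeat=S):
    pi = np.array(pi)
    rho, v = evaluate(pi)
    gains[tuple(pi)] = rho
    # If pi is greedy with respect to its own v, the evaluation pair
    # (rho, v) also satisfies the Bellman *optimality* equation.
    if np.array_equal(greedy(v), pi):
        print("optimality equation solved by rho =", rho, ", v =", v)
        print("greedy policy pi_v =", pi)

print("gains of all deterministic memoryless policies:", gains)
```

On this particular example exactly one of the four deterministic memoryless policies passes the greedy self-consistency check, and its gain is the largest of the printed gains, in line with Parts (b) and (c).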
The theorem shows that there exist solutions to the Bellman optimality equation
and that the greedy policy with respect to the resulting value function is an
optimal policy. We need one more result about solutions to the Bellman optimality
equation, the proof of which you will provide in Exercise 38.13.
Lemma 38.3. Suppose that $(\rho, v)$ satisfies the Bellman optimality equation. Then $\operatorname{span}(v) \le D(M)$.
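As a quick numerical sanity check of the lemma (not a proof), here is a sketch on the same invented MDP used above. The diameter $D(M)$ is computed as the worst-case minimal expected hitting time between distinct states; `min_expected_hitting_time` is a hypothetical helper, and the value of $v$ is the one found in the earlier sketch for this example.

```python
import numpy as np

# Same hypothetical two-state, two-action MDP as in the earlier sketch.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
S = P.shape[1]

# Solution of the Bellman optimality equation for this MDP, found in the
# earlier sketch (normalised so that v(0) = 0): rho = 34/35, v = (0, -2/7).
v = np.array([0.0, -2.0 / 7.0])

def min_expected_hitting_time(target):
    """Value iteration for the smallest expected time to reach `target`
    over all policies (a stochastic shortest path with unit costs)."""
    h = np.zeros(S)
    for _ in range(100_000):
        h_new = 1.0 + (P @ h).min(axis=0)
        h_new[target] = 0.0
        if np.max(np.abs(h_new - h)) < 1e-12:
            break
        h = h_new
    return h

# Diameter: worst case over ordered pairs of distinct states.
D = max(min_expected_hitting_time(g)[s]
        for g in range(S) for s in range(S) if s != g)

print("span(v) =", v.max() - v.min())   # 2/7 ~ 0.29 for this example
print("D(M)    =", D)                   # ~1.67 here, so span(v) <= D(M)
```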
The map $T : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}$ defined by $(Tv)(s) = \max_{a \in \mathcal{A}} r_a(s) + \langle P_a(s), v\rangle$ is called the Bellman operator. The Bellman optimality equation can be written as $\rho \mathbf{1} + v = Tv$.
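For a tabular model the operator is one line of array code. The sketch below (same invented MDP as before; `bellman_operator` and the relative value iteration loop are illustrative choices, not something prescribed by the text) iterates $v \leftarrow Tv - (Tv)(0)\mathbf{1}$ and checks that the resulting pair satisfies $\rho \mathbf{1} + v = Tv$ up to numerical tolerance.

```python
import numpy as np

# Same hypothetical tabular MDP: P[a, s, s'] transition kernel, r[a, s] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[1.0, 0.0],
              [0.5, 0.8]])

def bellman_operator(v):
    """(Tv)(s) = max_a r_a(s) + <P_a(s), v>, computed for all states at once."""
    return (r + P @ v).max(axis=0)

# Relative value iteration: v <- Tv - (Tv)(0) * 1.  At a fixed point,
# rho = (Tv)(0) and the pair (rho, v) satisfies rho * 1 + v = Tv.
v = np.zeros(P.shape[1])
rho = 0.0
for _ in range(100_000):
    Tv = bellman_operator(v)
    rho, v_new = Tv[0], Tv - Tv[0]
    if np.max(np.abs(v_new - v)) < 1e-12:
        break
    v = v_new

print("rho =", rho, ", v =", v)
print("Bellman residual:", np.max(np.abs(rho + v - bellman_operator(v))))
```

Relative value iteration is only one way to produce such a pair; it is used here because it is short, and on this small ergodic example its convergence is unproblematic. The greedy policy $\pi_v$ can then be read off as the maximising action in each state.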