38.3 Finding an optimal policy
There are many ways to find an optimal policy, including value iteration, policy
iteration and enumeration. These ideas are briefly discussed in Note 12. Here
we describe a two-step approach based on linear programming. Consider the
following constrained linear optimization problem:

\begin{align}
\text{minimize}_{\rho \in \mathbb{R},\, v \in \mathbb{R}^S} \quad & \rho \tag{38.6} \\
\text{subject to} \quad & \rho + v(s) \ge r_a(s) + \langle P_a(s), v \rangle \quad \text{for all } s, a\,.
\end{align}

Recall that a constrained optimization problem is said to be feasible if the set of values that satisfy the constraints is non-empty.
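To make the first step concrete, here is a minimal sketch of how Eq. (38.6) could be handed to an off-the-shelf LP solver. It assumes the MDP is given as NumPy arrays r[s, a] (rewards) and P[s, a, s'] (transition probabilities) and uses scipy.optimize.linprog; these array conventions and the SciPy dependency are illustrative choices, not part of the text.

```python
import numpy as np
from scipy.optimize import linprog


def optimal_gain(r, P):
    """Solve the LP in Eq. (38.6).

    r : array of shape (S, A),  r[s, a] is the reward for action a in state s.
    P : array of shape (S, A, S), P[s, a, s'] is the transition probability.
    Returns the optimal gain rho* and a feasible v (which, as noted below,
    need not satisfy the Bellman optimality equation).
    """
    S, A = r.shape
    # Decision variables x = (rho, v(0), ..., v(S-1)); objective: minimize rho.
    c = np.zeros(1 + S)
    c[0] = 1.0
    # The constraint rho + v(s) >= r_a(s) + <P_a(s), v> is rewritten as
    # -rho - v(s) + sum_{s'} P[s, a, s'] v(s') <= -r[s, a].
    A_ub = np.zeros((S * A, 1 + S))
    b_ub = np.zeros(S * A)
    for s in range(S):
        for a in range(A):
            row = s * A + a
            A_ub[row, 0] = -1.0        # coefficient of rho
            A_ub[row, 1 + s] -= 1.0    # coefficient of v(s)
            A_ub[row, 1:] += P[s, a]   # coefficients of v via <P_a(s), v>
            b_ub[row] = -r[s, a]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (1 + S))
    assert res.success, res.message
    return res.x[0], res.x[1:]
```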
Theorem 38.4. The optimization problem in Eq. (38.6) is feasible, and if (ρ, v) is a solution, then ρ = ρ∗ is the optimal gain.

Solutions (ρ, v) to the optimization problem in Eq. (38.6) need not satisfy the Bellman optimality equation (Exercise 38.12).

Proof of Theorem 38.4. Theorem 38.2 guarantees the existence of a pair (ρ∗, v∗) that satisfies the Bellman optimality equation:

\[
\rho^* + v^*(s) = \max_{a \in \mathcal{A}} \left( r_a(s) + \langle P_a(s), v^* \rangle \right) \quad \text{for all } s\,.
\]

Hence the pair (ρ∗, v∗) satisfies the constraints in Eq. (38.6) and witnesses
feasibility. Next let (ρ, v) be a solution of Eq. (38.6). Since (ρ∗, v∗) satisfies the constraints, ρ ≤ ρ∗ is immediate. It remains to prove that ρ ≥ ρ∗. Let π = π_v be the greedy policy with respect to v and π∗ be greedy with respect to v∗. By Theorem 38.2, ρ∗ = ρ^{π∗}. Furthermore,

\[
P_{\pi^*}^t r_{\pi^*}
\le P_{\pi^*}^t \left( r_\pi + P_\pi v - P_{\pi^*} v \right)
\le P_{\pi^*}^t \left( \rho \mathbf{1} + v - P_{\pi^*} v \right)
= \rho \mathbf{1} + P_{\pi^*}^t v - P_{\pi^*}^{t+1} v\,,
\]

where the first inequality holds because π is greedy with respect to v, the second because (ρ, v) satisfies the constraints in Eq. (38.6) with a = π(s), and the equality because P_{\pi^*}^t \mathbf{1} = \mathbf{1}. Summing over t from 0 to n − 1, dividing by n and letting n → ∞, the rightmost terms telescope and vanish, which shows that

\[
\rho^* \mathbf{1} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} P_{\pi^*}^t r_{\pi^*} \le \rho \mathbf{1}\,,
\]

which completes the proof.

Having found the optimal gain, the next step is to find a value function that
satisfies the Bellman optimality equation. Let s̃ ∈ S and consider the following linear program:

\begin{align}
\text{minimize}_{v \in \mathbb{R}^S} \quad & \langle v, \mathbf{1} \rangle \tag{38.7} \\
\text{subject to} \quad & \rho^* + v(s) \ge r_a(s) + \langle P_a(s), v \rangle \quad \text{for all } s, a\,, \\
& v(\tilde s) = 0\,.
\end{align}

The second constraint is crucial in order for the minimum to exist, since otherwise the value function can be arbitrarily small.
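Continuing the sketch above under the same assumed array conventions, Eq. (38.7) can be solved in the same way once ρ∗ is known. The function name and the default reference state s̃ = 0 are illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog


def optimal_value_function(r, P, rho_star, s_tilde=0):
    """Solve the LP in Eq. (38.7) for a fixed reference state s_tilde.

    Minimizes <v, 1> subject to rho* + v(s) >= r_a(s) + <P_a(s), v> for all
    s, a, together with the normalization v(s_tilde) = 0.
    """
    S, A = r.shape
    c = np.ones(S)                           # objective <v, 1>
    A_ub = np.zeros((S * A, S))
    b_ub = np.zeros(S * A)
    for s in range(S):
        for a in range(A):
            row = s * A + a
            A_ub[row, s] -= 1.0              # -v(s)
            A_ub[row, :] += P[s, a]          # + <P_a(s), v>
            b_ub[row] = rho_star - r[s, a]   # <= rho* - r_a(s)
    # Normalization constraint v(s_tilde) = 0.
    A_eq = np.zeros((1, S))
    A_eq[0, s_tilde] = 1.0
    b_eq = np.zeros(1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(None, None)] * S)
    assert res.success, res.message
    return res.x
```

Putting the two steps together on a toy two-state, two-action MDP (the numbers below are made up purely for illustration, and optimal_gain is the function from the earlier sketch), a greedy policy with respect to the returned value function can then be read off:

```python
r = np.array([[1.0, 0.0],
              [0.0, 0.5]])
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.2, 0.8]]])
rho_star, _ = optimal_gain(r, P)              # step one: Eq. (38.6)
v = optimal_value_function(r, P, rho_star)    # step two: Eq. (38.7)
greedy = np.argmax(r + P @ v, axis=1)         # greedy policy with respect to v
```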