Bandit Algorithms

38.3 Finding an optimal policy


There are many ways to find an optimal policy, including value iteration, policy
iteration and enumeration. These ideas are briefly discussed in Note 12. Here
we describe a two-step approach based on linear programming. Consider the
following constrained linear optimization problem:

    minimize_{ρ ∈ R, v ∈ R^S}   ρ                                        (38.6)

    subject to   ρ + v(s) ≥ r_a(s) + ⟨P_a(s), v⟩   for all s, a.

Recall that a constrained optimization problem is said to be feasible if the set
of values that satisfy the constraints is non-empty.

Theorem 38.4. The optimization problem in Eq. (38.6) is feasible, and if (ρ, v)
is a solution, then ρ = ρ∗ is the optimal gain.

Solutions (ρ,v) to the optimization problem in Eq. (38.6) need not satisfy
the Bellman optimality equation (Exercise 38.12).

Proof of Theorem 38.4  Theorem 38.2 guarantees the existence of a pair (ρ∗, v∗)
that satisfies the Bellman optimality equation:

    ρ∗ + v∗(s) = max_{a ∈ A} ( r_a(s) + ⟨P_a(s), v∗⟩ )   for all s.

Hence the pair (ρ∗, v∗) satisfies the constraints in Eq. (38.6), which witnesses
feasibility. Next let (ρ, v) be a solution of Eq. (38.6). Since (ρ∗, v∗) satisfies
the constraints and ρ is minimized, ρ ≤ ρ∗ is immediate. It remains to prove
that ρ ≥ ρ∗. Let π = π_v be the greedy policy with respect to v and π∗ be
greedy with respect to v∗. By Theorem 38.2, ρ∗ = ρ^{π∗}. Furthermore,

    P_{π∗}^t r_{π∗} ≤ P_{π∗}^t (r_π + P_π v − P_{π∗} v)
                    ≤ P_{π∗}^t (ρ1 + v − P_{π∗} v)
                    = ρ1 + P_{π∗}^t v − P_{π∗}^{t+1} v,

where the first inequality holds because π is greedy with respect to v and the
second because (ρ, v) satisfies the constraints in Eq. (38.6). Summing over t,
the right-hand side telescopes, and hence

    ρ∗1 = lim_{n→∞} (1/n) Σ_{t=0}^{n−1} P_{π∗}^t r_{π∗} ≤ ρ1,

which completes the proof.
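As a minimal sketch of how this first step might be carried out numerically, the
program in Eq. (38.6) can be handed to a generic LP solver such as
scipy.optimize.linprog. The MDP data below (transition tensor P and rewards r)
are hypothetical placeholders, and the encoding of the constraints as
A_ub x ≤ b_ub is one of several equivalent choices, not something the text
prescribes.

    import numpy as np
    from scipy.optimize import linprog

    # Hypothetical MDP data: P[a, s] is the distribution P_a(s) over next
    # states and r[a, s] the reward r_a(s); any valid arrays would do.
    S, A = 3, 2
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(S), size=(A, S))
    r = rng.uniform(size=(A, S))

    # Decision variables x = (ρ, v(0), ..., v(S-1)); the objective is ρ.
    c = np.zeros(S + 1)
    c[0] = 1.0

    # Each constraint ρ + v(s) ≥ r_a(s) + ⟨P_a(s), v⟩ is rewritten as
    # -ρ - v(s) + ⟨P_a(s), v⟩ ≤ -r_a(s), one row per (s, a) pair.
    A_ub, b_ub = [], []
    for a in range(A):
        for s in range(S):
            row = np.zeros(S + 1)
            row[0] = -1.0
            row[1:] = P[a, s]
            row[1 + s] -= 1.0
            A_ub.append(row)
            b_ub.append(-r[a, s])

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * (S + 1), method="highs")
    rho_star = res.x[0]  # optimal gain ρ* by Theorem 38.4
    # The v-part of the solution is only determined up to an additive constant.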

Having found the optimal gain, the next step is to find a value function that
satisfies the Bellman optimality equation. Let s̃ ∈ S and consider the following
linear program:

    minimize_{v ∈ R^S}   ⟨v, 1⟩                                          (38.7)

    subject to   ρ∗ + v(s) ≥ r_a(s) + ⟨P_a(s), v⟩   for all s, a,
                 v(s̃) = 0.

The second constraint is crucial for the minimum to exist: since each P_a(s)
sums to one, subtracting a constant from v preserves the remaining constraints,
so without it the objective ⟨v, 1⟩ would be unbounded below.
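Continuing the sketch above (and reusing P, r and rho_star from it), the second
program can be set up in the same way. The reference state s̃ = 0 is an
arbitrary illustrative choice, and in finite precision one may want to add a
tiny slack to ρ∗ so that rounding in the first solve cannot make the
constraints infeasible.

    # Continuation of the previous sketch: solve Eq. (38.7) given rho_star.
    s_tilde = 0                     # reference state s̃, chosen arbitrarily
    c_v = np.ones(S)                # objective ⟨v, 1⟩

    # Constraint ρ* + v(s) ≥ r_a(s) + ⟨P_a(s), v⟩ rewritten as
    # ⟨P_a(s), v⟩ - v(s) ≤ ρ* - r_a(s), one row per (s, a) pair.
    A_ub_v, b_ub_v = [], []
    for a in range(A):
        for s in range(S):
            row = P[a, s].copy()
            row[s] -= 1.0
            A_ub_v.append(row)
            b_ub_v.append(rho_star - r[a, s])

    # Equality constraint v(s̃) = 0 pins down the additive constant.
    A_eq = np.zeros((1, S))
    A_eq[0, s_tilde] = 1.0

    res_v = linprog(c_v, A_ub=np.array(A_ub_v), b_ub=np.array(b_ub_v),
                    A_eq=A_eq, b_eq=np.array([0.0]),
                    bounds=[(None, None)] * S, method="highs")
    v = res_v.x                     # a minimiser of Eq. (38.7)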