38.3 Finding an optimal policy
There are many ways to find an optimal policy, including value iteration, policy
iteration and enumeration. These ideas are briefly discussed in Note 12. Here
we describe a two-step approach based on linear programming. Consider the
following constrained linear optimization problem:
$$
\begin{aligned}
\underset{\rho \in \mathbb{R},\, v \in \mathbb{R}^{\mathcal{S}}}{\text{minimize}} \quad & \rho \\
\text{subject to} \quad & \rho + v(s) \ge r_a(s) + \langle P_a(s), v \rangle \quad \text{for all } s, a\,.
\end{aligned} \tag{38.6}
$$
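To make the construction concrete, here is a minimal sketch of how Eq. (38.6) might be assembled and solved with an off-the-shelf LP solver. The two-state, two-action MDP, the ordering of the decision variables as $(\rho, v(1), \dots, v(S))$ and the use of scipy.optimize.linprog are assumptions made purely for illustration and are not part of the text.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical two-state, two-action MDP: P[a] is the transition matrix
# under action a and r[a] the corresponding reward vector.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.5, 0.5], [0.7, 0.3]])]
r = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
S, A = 2, 2

# Decision variables x = (rho, v(1), ..., v(S)); objective: minimise rho.
c = np.zeros(1 + S)
c[0] = 1.0

# One inequality per (s, a): rho + v(s) >= r_a(s) + <P_a(s), v>,
# rewritten in linprog's form as  -rho - v(s) + <P_a(s), v> <= -r_a(s).
A_ub, b_ub = [], []
for a in range(A):
    for s in range(S):
        row = np.zeros(1 + S)
        row[0] = -1.0
        row[1:] = P[a][s]
        row[1 + s] -= 1.0
        A_ub.append(row)
        b_ub.append(-r[a][s])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * (1 + S), method="highs")
rho_star = res.x[0]  # the optimal gain rho*
print("optimal gain:", rho_star)
```

Note that the program has $1 + |\mathcal{S}|$ variables and $|\mathcal{S}||\mathcal{A}|$ inequality constraints, one for each state-action pair.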
Recall that a constrained optimization problem is said to be feasible if the set
of values that satisfy the constraints is non-empty.
Theorem 38.4. The optimization problem in Eq. (38.6) is feasible and if $(\rho, v)$
is a solution, then $\rho = \rho^*$ is the optimal gain.
Solutions (ρ,v) to the optimization problem in Eq. (38.6) need not satisfy
the Bellman optimality equation (Exercise 38.12).
Proof of Theorem 38.4  Theorem 38.2 guarantees the existence of a pair $(\rho^*, v^*)$
that satisfies the Bellman optimality equation:
$$\rho^* + v^*(s) = \max_{a \in \mathcal{A}} \left( r_a(s) + \langle P_a(s), v^* \rangle \right) \quad \text{for all } s\,.$$
Hence the pair $(\rho^*, v^*)$ satisfies the constraints in Eq. (38.6) and witnesses
feasibility. Next let $(\rho, v)$ be a solution of Eq. (38.6). Since $(\rho^*, v^*)$ satisfies the
constraints, $\rho \le \rho^*$ is immediate. It remains to prove that $\rho \ge \rho^*$. Let $\pi = \pi_v$
be the greedy policy with respect to $v$ and $\pi^*$ be greedy with respect to $v^*$. By
Theorem 38.2, $\rho^* = \rho^{\pi^*}$. Furthermore,
$$P_{\pi^*}^t r_{\pi^*} \le P_{\pi^*}^t \left( r_\pi + P_\pi v - P_{\pi^*} v \right) \le P_{\pi^*}^t \left( \rho \mathbf{1} + v - P_{\pi^*} v \right) = \rho \mathbf{1} + P_{\pi^*}^t v - P_{\pi^*}^{t+1} v\,,$$
where the first inequality holds because $\pi$ is greedy with respect to $v$ and the second
follows from the constraints in Eq. (38.6).
Summing over $t$ and dividing by $n$ shows that
$$\rho^* \mathbf{1} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} P_{\pi^*}^t r_{\pi^*} \le \rho \mathbf{1} + \lim_{n \to \infty} \frac{v - P_{\pi^*}^n v}{n} = \rho \mathbf{1}\,,$$
where the telescoping term $(v - P_{\pi^*}^n v)/n$ vanishes in the limit. This completes
the proof.
Having found the optimal gain, the next step is to find a value function that
satisfies the Bellman optimality equation. Let $\tilde{s} \in \mathcal{S}$ and consider the following
linear program:
$$
\begin{aligned}
\underset{v \in \mathbb{R}^{\mathcal{S}}}{\text{minimize}} \quad & \langle v, \mathbf{1} \rangle \\
\text{subject to} \quad & \rho^* + v(s) \ge r_a(s) + \langle P_a(s), v \rangle \quad \text{for all } s, a \\
& v(\tilde{s}) = 0\,.
\end{aligned} \tag{38.7}
$$
The second constraint is crucial for the minimum to exist: subtracting a constant
from $v$ leaves the inequality constraints unchanged (since $\langle P_a(s), \mathbf{1} \rangle = 1$) but
decreases the objective without bound, so without pinning $v(\tilde{s})$ the value function
could be made arbitrarily small.
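Continuing the hypothetical sketch given after Eq. (38.6) (and reusing P, r, S, A and rho_star from it), Eq. (38.7) can be solved in the same way, after which a policy that is greedy with respect to the recovered value function can be read off. The reference state $\tilde{s} = 0$ and the small feasibility slack are again illustrative choices, not part of the text.

```python
# Second step: solve Eq. (38.7) for a value function, then act greedily.
s_tilde = 0              # reference state with v(s_tilde) = 0
rho = rho_star + 1e-9    # tiny slack to guard against floating-point infeasibility

c2 = np.ones(S)          # objective <v, 1>

# Inequalities: <P_a(s), v> - v(s) <= rho - r_a(s) for all s, a.
A_ub2, b_ub2 = [], []
for a in range(A):
    for s in range(S):
        row = P[a][s].copy()
        row[s] -= 1.0
        A_ub2.append(row)
        b_ub2.append(rho - r[a][s])

# Equality constraint v(s_tilde) = 0.
A_eq = np.zeros((1, S))
A_eq[0, s_tilde] = 1.0

res2 = linprog(c2, A_ub=np.array(A_ub2), b_ub=np.array(b_ub2),
               A_eq=A_eq, b_eq=np.array([0.0]),
               bounds=[(None, None)] * S, method="highs")
v = res2.x

# Greedy policy with respect to v: argmax_a  r_a(s) + <P_a(s), v>.
policy = [max(range(A), key=lambda a: r[a][s] + P[a][s] @ v) for s in range(S)]
print("value function:", v, "greedy policy:", policy)
```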