Bandit Algorithms

(Jeff_L) #1
38.3 Finding an optimal policy ( ) 484

Theorem38.5.There exists a state ̃s∈Ssuch that the solutionvof Eq.(38.7)
satisfies the Bellman optimality equation.

Proof The result follows by showing thatε=v+ρ∗ 1 −Tv= 0. The first
constraint in Eq. (38.7) ensures thatε≥ 0. It remains to show thatε≤ 0. Let
π∗be an optimal policy andπbe the greedy policy with respect tov. Then

Pπt∗rπ∗≤Pπt∗(rπ+Pπv−Pπ∗v) =Pπt∗(ρ∗ 1 +v−ε−Pπ∗v).

Henceρ∗ 1 =ρπ


1 ≤ρ∗ 1 −Pπ∗∗εandPπ∗∗ε≤0. Sinceε≥ 0 andPπ∗∗is right
stochastic,Pπ∗∗ε= 0. Chooses ̃to be a state such thatPπ∗∗(s, ̃s)>0 for somes∈S,
which exists becausePπ∗∗is right stochastic. Then 0 = (Pπ∗∗ε)(s)≥Pπ∗∗(s, ̃s)ε(s ̃)
and henceε(s ̃) = 0. It follows thatv ̃=v−εalso satisfies the constraints in
Eq. (38.7). Becausevis a solution to Eq. (38.7),〈v, ̃ 1 〉 ≥ 〈v, 1 〉, implying that
〈ε, 1 〉≤0. Since we already showed thatε≥ 0 it follows thatε= 0.

The theorem only demonstrates the existence of a state ̃sfor which the solution
of Eq. (38.7) satisfies the Bellman optimality equation. There is a relatively
simple procedure for finding such a state using the solution to Eq. (38.6), but its
analysis depends on the basic theory of duality from linear programming, which
is beyond the scope of this text. More details are in Note 11 at the end of the
chapter. Instead we observe that one can simply solve Eq. (38.7) for all choices
of ̃sand take the first solution that satisfies the Bellman optimality equation.

38.3.1 Efficient computation


The linear programs in Eq. (38.6) and Eq. (38.7) can be solved efficiently under
assumptions that will be satisfied in subsequent applications.

The algorithm proposed in this subsection is guaranteed to run in polynomial
time, which is a standard objective in theoretical computer science. Its
practical performance, however, is usually much worse than alternatives that
suffer from exponential running time in the worst case. These issues are
discussed in Note 12 at the end of the chapter.

The general form of a linear program is an optimization problem of the form

minimizex∈Rn 〈c,x〉

subject to Ax≥b,

wherec∈RnandA∈Rm×nandb∈Rmare parameters of the problem. This
general problem can be solved in time that depends polynomially onnandm.
Whenmis very large or infinite these algorithms may become impractical, but
nevertheless one can often still solve the optimization problem in time polynomial
Free download pdf