Bandit Algorithms

38.3 Finding an optimal policy ( ) 484

Theorem38.5.There exists a state ̃s∈Ssuch that the solutionvof Eq.(38.7) satisfies the Bellman optimality equation.

Proof The result follows by showing thatε=v+ρ∗ 1 −Tv= 0. The first constraint in Eq. (38.7) ensures thatε≥ 0. It remains to show thatε≤ 0. Let π∗be an optimal policy andπbe the greedy policy with respect tov. Then

Pπt∗rπ∗≤Pπt∗(rπ+Pπv−Pπ∗v) =Pπt∗(ρ∗ 1 +v−ε−Pπ∗v).

Henceρ∗ 1 =ρπ

∗ 1 ≤ρ∗ 1 −Pπ∗∗εandPπ∗∗ε≤0. Sinceε≥ 0 andPπ∗∗is right stochastic,Pπ∗∗ε= 0. Chooses ̃to be a state such thatPπ∗∗(s, ̃s)>0 for somes∈S, which exists becausePπ∗∗is right stochastic. Then 0 = (Pπ∗∗ε)(s)≥Pπ∗∗(s, ̃s)ε(s ̃) and henceε(s ̃) = 0. It follows thatv ̃=v−εalso satisfies the constraints in Eq. (38.7). Becausevis a solution to Eq. (38.7),〈v, ̃ 1 〉 ≥ 〈v, 1 〉, implying that 〈ε, 1 〉≤0. Since we already showed thatε≥ 0 it follows thatε= 0.

The theorem only demonstrates the existence of a state ̃sfor which the solution of Eq. (38.7) satisfies the Bellman optimality equation. There is a relatively simple procedure for finding such a state using the solution to Eq. (38.6), but its analysis depends on the basic theory of duality from linear programming, which is beyond the scope of this text. More details are in Note 11 at the end of the chapter. Instead we observe that one can simply solve Eq. (38.7) for all choices of ̃sand take the first solution that satisfies the Bellman optimality equation.

38.3.1 Efficient computation

The linear programs in Eq. (38.6) and Eq. (38.7) can be solved efficiently under assumptions that will be satisfied in subsequent applications.

The algorithm proposed in this subsection is guaranteed to run in polynomial time, which is a standard objective in theoretical computer science. Its practical performance, however, is usually much worse than alternatives that suffer from exponential running time in the worst case. These issues are discussed in Note 12 at the end of the chapter.

The general form of a linear program is an optimization problem of the form

minimizex∈Rn 〈c,x〉

subject to Ax≥b,

wherec∈RnandA∈Rm×nandb∈Rmare parameters of the problem. This general problem can be solved in time that depends polynomially onnandm. Whenmis very large or infinite these algorithms may become impractical, but nevertheless one can often still solve the optimization problem in time polynomial

Bandit Algorithms

Get our desktop app

Company

Features

Documentation

Resources