Bandit Algorithms

(Jeff_L) #1
38.10 Exercises 506

are exactly the elements of the set
{
(ρ,v)∈R×R^2 :ρ= 0, v(1)− 1 ≤v(2)≤v(1) + 1

}


.


1 2

r=− 1

r=− 1

r= 0 r= 0

38.12(Dangers of linear program relaxation) Give an example of a
Markov decision process and a solution (ρ,v) to the linear program in Eq. (38.6)
such thatvdoes not satisfy the Bellman optimality equation and the greedy
policy with respect tovis not optimal.

38.13(Bound on span in terms of diameter) LetMbe a strongly connected
MDP and (ρ,v) be a solution to the Bellman optimality equation. Show that
span(v)≤(ρ∗−mins,ara(s))D(M).

Hint Note that by Theorem 38.2,ρ=ρ∗. Fix some statess 16 =s 2 and a
memoryless policyπ. Show that

v(s 2 )−v(s 1 )≤(ρ∗−mins,ara(s))Eπ[τs 2 |S 1 =s 1 ].

Note for the sake of curiosity that the above display continues to hold for weakly
communicating MDPs.

The proof of Theorem 4 in the paper by Bartlett and Tewari [2009] is
incorrect. The problem is that the statement needs to hold for any solution
vof the Bellman optimality equation. The proof uses an argument that
hinges on the fact that in an aperiodic strongly connected MDP,vis in the
set{c 1 +limn→∞Tn 0 −nρ∗:c∈R}. However, Exercise 38.11 shows that
there exist strongly connected MDPs where this does not hold.

38.14(Separation oracles) Solve the following problems:

(a)Prove that Algorithm 26 provides a separation oracle for convex setKdefined
in Eq. (38.10).
(b)Assuming that Algorithm 26 can be implemented efficiently, explain how to
find an approximate solution to Eq. (38.7).


38.15(Combining separation oracles) LetK⊂Rdbe a convex set andφbe
a separation oracle forK. Suppose thata 1 ,...,anis a collection of vectors with
ak∈Rdandb 1 ,...,bkbe a collection of scalars. LetHk={x∈Rd:〈ak,x〉≥bk}.
Devise an efficient separation oracle for

⋂n
k=1K∩Hk.
Free download pdf