Bandit Algorithms

bought by adding this slack is that now the linear program in Eq. (38.9) satisfies the conditions (a) and (c) above. The final step is to give a condition under which a separation oracle exists for the convex set determined by the constraints in the above program. Define the convex set $K$ by

$$K = \{(\rho, v) \in \mathbb{R}^{d+1} : \varepsilon + \rho + v(s) \geq r_a(s) + \langle P_a(s), v\rangle \text{ for all } s, a\}. \tag{38.10}$$
Assuming that

$$\operatorname{argmax}_{a \in \mathcal{A}} \big( r_a(s) + \langle P_a(s), v\rangle \big) \tag{38.11}$$

can be solved efficiently, Algorithm 26 provides a separation oracle for $K$. For the specialized case considered later, Eq. (38.11) is trivial to compute efficiently. The feasible region defined by the constraints in Eq. (38.9) is the intersection of $K$ with a small number of half-spaces. In Exercise 38.15 you will show how to efficiently extend a separation oracle for an arbitrary convex set $K$ to $\bigcap_{k=1}^n H_k \cap K$, where $(H_k)_{k=1}^n$ are half-spaces. In Exercise 38.14 you will show that approximately solving Eq. (38.7) works in the same way as above, and also establish the correctness of Algorithm 26.
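The extension in Exercise 38.15 can be sketched in a few lines. The following Python fragment is only a sketch of one natural construction, assuming each half-space is given as a pair $(w, b)$ with $H = \{y : \langle w, y\rangle \leq b\}$ and that the supplied oracle for $K$ returns true for members and otherwise a vector $g$ with $\langle g, x\rangle < \langle g, y\rangle$ for every $y \in K$, which is the convention followed by the vector reported in Algorithm 26. All names here are illustrative and not from the book.

import numpy as np

def separation_oracle_intersection(x, halfspaces, oracle_K):
    # Separation oracle for (H_1 ∩ ... ∩ H_n) ∩ K, built from an oracle for K.
    #   halfspaces : list of pairs (w, b) representing H = {y : <w, y> <= b}
    #   oracle_K(x): True if x is in K, otherwise a vector g with
    #                <g, x> < <g, y> for every y in K
    x = np.asarray(x, dtype=float)
    for w, b in halfspaces:
        w = np.asarray(w, dtype=float)
        if w @ x > b:
            # x violates this half-space, so -w separates x from H and hence
            # from the intersection: <-w, x> < -b <= <-w, y> whenever <w, y> <= b.
            return -w
    # All half-space constraints hold, so any separating hyperplane must come from K.
    return oracle_K(x)

Both branches report a hyperplane in the same format, so the combined routine can be queried wherever a separation oracle for the full feasible region of Eq. (38.9) is needed.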

In Theorem 38.2 we assumed an exact solution of the Bellman optimality
equation, which may not be possible in practice. Fortunately, approximate
solutions to the Bellman optimality equation with approximately greedy
policies yield approximately optimal policies. Details are deferred to
Exercise 38.16.

1: function SeparationOracle($\rho$, $v$)
2:   For each $s \in \mathcal{S}$ find $a^*_s \in \operatorname{argmax}_a \big(r_a(s) + \langle P_a(s), v\rangle\big)$
3:   if $\varepsilon + \rho + v(s) \geq r_{a^*_s}(s) + \langle P_{a^*_s}(s), v\rangle$ for all $s \in \mathcal{S}$ then
4:     return true
5:   else
6:     Find state $s$ with $\varepsilon + \rho + v(s) < r_{a^*_s}(s) + \langle P_{a^*_s}(s), v\rangle$
7:     return $(1, e_s - P_{a^*_s}(s))$
8:   end if
9: end function
Algorithm 26: Separation oracle for Eq. (38.6).
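
For concreteness, a direct translation of Algorithm 26 into Python might look as follows. This is a sketch under the assumption that states and actions are indexed by integers, with r[a, s] $= r_a(s)$ and P[a, s] $= P_a(s)$; the function and variable names are illustrative rather than from the book.

import numpy as np

def separation_oracle(rho, v, r, P, eps):
    # Membership/separation test for the set K of Eq. (38.10).
    #   rho : float, v : array of shape (S,)
    #   r   : array of shape (A, S), with r[a, s] = r_a(s)
    #   P   : array of shape (A, S, S), with P[a, s] = P_a(s)
    #   eps : the slack parameter epsilon
    # Returns True if (rho, v) lies in K; otherwise the vector
    # (1, e_s - P_{a*_s}(s)) in R^{S+1} separating (rho, v) from K.
    A, S = r.shape
    for s in range(S):
        # Line 2 of Algorithm 26: maximizing action for state s, as in Eq. (38.11).
        values = r[:, s] + P[:, s] @ v
        a_star = int(np.argmax(values))
        if eps + rho + v[s] < values[a_star]:
            # The constraint for (s, a*_s) is violated; return its normal vector.
            e_s = np.zeros(S)
            e_s[s] = 1.0
            return np.concatenate(([1.0], e_s - P[a_star, s]))
    # Every constraint holds with the maximizing action, hence for all actions.
    return True

Checking only the maximizing action $a^*_s$ suffices because the constraint at $(s, a)$ holds for every $a$ whenever it holds for the maximizer, which is why line 3 of Algorithm 26 only quantifies over states.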

38.4 Learning in Markov decision processes


The problem of finding an optimal policy in an unknown Markov decision process is no longer just an optimization problem, and the notion of regret is introduced to measure the price of uncertainty. For simplicity we assume that only the