Bandit Algorithms

bought by adding this slack is that now the linear program in Eq. (38.9) satisfies the conditions (a) and (c) above. The final step is to give a condition under which a separation oracle exists for the convex set determined by the constraints in the above program. Define the convex set $K$ by

$$K = \{(\rho, v) \in \mathbb{R}^{d+1} : \varepsilon + \rho + v(s) \geq r_a(s) + \langle P_a(s), v\rangle \text{ for all } s, a\}. \tag{38.10}$$
Assuming that

$$\operatorname{argmax}_{a \in \mathcal{A}} \big( r_a(s) + \langle P_a(s), v\rangle \big) \tag{38.11}$$

can be solved efficiently, Algorithm 26 provides a separation oracle for $K$. For the specialized case considered later, Eq. (38.11) is trivial to compute efficiently. The feasible region defined by the constraints in Eq. (38.9) is the intersection of $K$ with a small number of half-spaces. In Exercise 38.15 you will show how to efficiently extend a separation oracle for an arbitrary convex set $K$ to $\bigcap_{k=1}^n H_k \cap K$, where $(H_k)_{k=1}^n$ are half-spaces. In Exercise 38.14 you will show that approximately solving Eq. (38.7) works in the same way as above, and also establish the correctness of Algorithm 26.
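The extension in Exercise 38.15 can be sketched in a few lines. The following Python fragment is only a sketch of one natural construction, assuming each half-space is given as a pair $(w, b)$ with $H = \{y : \langle w, y\rangle \leq b\}$ and that the supplied oracle for $K$ returns true for members and otherwise a vector $g$ with $\langle g, x\rangle < \langle g, y\rangle$ for every $y \in K$, which is the convention followed by the vector reported in Algorithm 26. All names here are illustrative and not from the book.

import numpy as np

def separation_oracle_intersection(x, halfspaces, oracle_K):
    # Separation oracle for (H_1 ∩ ... ∩ H_n) ∩ K, built from an oracle for K.
    #   halfspaces : list of pairs (w, b) representing H = {y : <w, y> <= b}
    #   oracle_K(x): True if x is in K, otherwise a vector g with
    #                <g, x> < <g, y> for every y in K
    x = np.asarray(x, dtype=float)
    for w, b in halfspaces:
        w = np.asarray(w, dtype=float)
        if w @ x > b:
            # x violates this half-space, so -w separates x from H and hence
            # from the intersection: <-w, x> < -b <= <-w, y> whenever <w, y> <= b.
            return -w
    # All half-space constraints hold, so any separating hyperplane must come from K.
    return oracle_K(x)

Both branches report a hyperplane in the same format, so the combined routine can be queried wherever a separation oracle for the full feasible region of Eq. (38.9) is needed.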

In Theorem 38.2 we assumed an exact solution of the Bellman optimality
equation, which may not be possible in practice. Fortunately, approximate
solutions to the Bellman optimality equation with approximately greedy
policies yield approximately optimal policies. Details are deferred to
Exercise 38.16.

1: function SeparationOracle($\rho$, $v$)
2:   For each $s \in \mathcal{S}$ find $a^*_s \in \operatorname{argmax}_a \big(r_a(s) + \langle P_a(s), v\rangle\big)$
3:   if $\varepsilon + \rho + v(s) \geq r_{a^*_s}(s) + \langle P_{a^*_s}(s), v\rangle$ for all $s \in \mathcal{S}$ then
4:     return true
5:   else
6:     Find state $s$ with $\varepsilon + \rho + v(s) < r_{a^*_s}(s) + \langle P_{a^*_s}(s), v\rangle$
7:     return $(1, e_s - P_{a^*_s}(s))$
8:   end if
9: end function
Algorithm 26: Separation oracle for Eq. (38.6).
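
For concreteness, a direct translation of Algorithm 26 into Python might look as follows. This is a sketch under the assumption that states and actions are indexed by integers, with r[a, s] $= r_a(s)$ and P[a, s] $= P_a(s)$; the function and variable names are illustrative rather than from the book.

import numpy as np

def separation_oracle(rho, v, r, P, eps):
    # Membership/separation test for the set K of Eq. (38.10).
    #   rho : float, v : array of shape (S,)
    #   r   : array of shape (A, S), with r[a, s] = r_a(s)
    #   P   : array of shape (A, S, S), with P[a, s] = P_a(s)
    #   eps : the slack parameter epsilon
    # Returns True if (rho, v) lies in K; otherwise the vector
    # (1, e_s - P_{a*_s}(s)) in R^{S+1} separating (rho, v) from K.
    A, S = r.shape
    for s in range(S):
        # Line 2 of Algorithm 26: maximizing action for state s, as in Eq. (38.11).
        values = r[:, s] + P[:, s] @ v
        a_star = int(np.argmax(values))
        if eps + rho + v[s] < values[a_star]:
            # The constraint for (s, a*_s) is violated; return its normal vector.
            e_s = np.zeros(S)
            e_s[s] = 1.0
            return np.concatenate(([1.0], e_s - P[a_star, s]))
    # Every constraint holds with the maximizing action, hence for all actions.
    return True

Checking only the maximizing action $a^*_s$ suffices because the constraint at $(s, a)$ holds for every $a$ whenever it holds for the maximizer, which is why line 3 of Algorithm 26 only quantifies over states.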

38.4 Learning in Markov decision processes


The problem of finding an optimal policy in an unknown Markov decision process is no longer just an optimization problem, and the notion of regret is introduced to measure the price of uncertainty. For simplicity we assume that only the