Bandit Algorithms

38.3 Finding an optimal policy ( ) 485

innonly, provided that the constraints satisfy certain structural properties. Let K⊂Rnbe convex and consider the optimization problem

minimizex∈Rn 〈c,x〉 (38.8)

subject tox∈K.

Algorithms for this problem generally have a slightly different flavor becauseK
may have no corners. Suppose the following holds:

(a) There exists a knownR >0 such thatK⊂{x∈Rn:‖x‖ 2 ≤R}.
(b)There exists a separation oracle, which we recall from Chapter 27, is a
computational procedure to evaluate some functionφonRnwithφ(x) =true
forx∈ Kand otherwiseφ(x) =uwith〈y,u〉>〈x,u〉for ally∈ K(see
Fig. 27.1).
(c) There exists aδ >0 andx 0 ∈Rdsuch that{x∈Rn:‖x−x 0 ‖ 2 ≤δ}⊂K.

Under these circumstances the ellipsoid method accepts as input the size of the bounding sphereR, the separation oracle and an accuracy parameterε >0. Its output is a pointxin time polynomial innandlog(R/(δε)) such thatx∈Kand 〈c,x〉≤〈c,x∗〉+εwherex∗is the minimizer of Eq. (38.8). The reader can find references to this method at the end of the chapter. The linear programs in Eq. (38.6) and Eq. (38.7) do not have bounded feasible regions because ifvis feasible, thenv+c 1 is also feasible for anyc∈R. For strongly connected MDPs with diameterD, however, Lemma 38.3 allows us to add the constraint that‖v‖∞≤D. If the rewards are bounded in [0,1], then we may also add the constraint that 0≤ρ≤1. Together these imply that for (ν,ρ) in the feasible region,

‖(ρ,v)‖^22 =ρ^2 +‖v‖^22 ≤1 +S‖v‖^2 ∞≤1 +SD^2.

Then setR=

√

1 +D^2 S. When the diameter is unknown one may use a doubling procedure. In order to guarantee the feasible region contains a small ball we add some slack to the constraints. Letε >0 and consider the following linear program.

minimize ρ∈R,v∈RS

ρ (38.9)

subject to ε+ρ+v(s)≥ra(s) +〈Pa(s),v〉for alls,a. v(s)≥−Dfor alls v(s)≤Dfor alls ρ≤1 +εfor alls ρ≥−εfor alls.

Note that for anyxin the feasible region of Eq. (38.9) there exists aythat is feasible for Eq. (38.6) with‖x−y‖∞≤ε. Furthermore, the solution to the above linear program is at mostεaway from the solution to Eq. (38.6). What we have

Bandit Algorithms

√

Get our desktop app

Company

Features

Documentation

Resources