equation for M̃_k yields a value function v_k and constant gain ρ_k ∈ R that satisfy Eq. (38.16). A minor detail is that the extended action sets are infinite, while the analysis in previous sections only demonstrated existence of solutions to the Bellman optimality equation for finite MDPs. You should convince yourself that C_t(s,a) is convex and has finitely many extremal points. Restricting the confidence sets to these points makes the extended MDP finite without changing the optimal policy.
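One way to see this, assuming the confidence sets are ℓ1-balls about the empirical transition estimates intersected with the probability simplex as in the construction of the previous section (the radius β_t(s,a) is a label used here only for illustration), is that C_t(s,a) is then a polytope, and a linear objective over a polytope attains its maximum at one of its finitely many vertices:

\[
C_t(s,a) = \Big\{ P \in \mathbb{R}^{\mathcal S} : P \ge 0,\; \textstyle\sum_{s'} P(s') = 1,\; \|P - \hat P_t(s,a)\|_1 \le \beta_t(s,a) \Big\},
\qquad
\max_{P \in C_t(s,a)} \langle P, v \rangle = \max_{P \in \operatorname{ext}(C_t(s,a))} \langle P, v \rangle .
\]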
38.5.2 Computing the optimistic policy
Here we explain how to efficiently solve the Bellman optimality equation for the extended MDP. The results in Section 38.3 show that the Bellman optimality equation for M̃_k can be solved efficiently provided that for any value function v ∈ R^S, computing

    argmax_{a ∈ A} ( r_a(s) + max_{P ∈ C_{τ_k}(s,a)} ⟨P, v⟩ )                (38.17)
can be carried out in an efficient manner. The inner optimization is another linear program with S variables and O(S) constraints, and so can be solved in polynomial time. This procedure is repeated for each a ∈ A to compute the outcome of (38.17). In fact, the inner optimization can be solved more directly by sorting the entries of v and then allocating the coordinates of P one by one, making each as large as the constraints allow in decreasing order of v. The total computational cost of solving Eq. (38.17) in this way is O(S(A + log S)). Combining this with Algorithm 26 gives the required separation oracle.
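As an illustration, here is a minimal sketch of this sorting-based computation of Eq. (38.17) at a fixed state, assuming that C_{τ_k}(s,a) is the intersection of the probability simplex with an ℓ1-ball of radius beta[a] about the empirical estimate p_hat[a]; the names r, p_hat and beta are placeholders for the quantities maintained by the algorithm, not notation from the text.

```python
import numpy as np

def inner_max(p_hat, v, beta, order):
    """Greedy solution of max <P, v> over {P in the simplex : ||P - p_hat||_1 <= beta}.

    Put as much extra mass as allowed on the state with the largest v, then
    shrink the mass of the worst states until P sums to one.  `order` lists
    the states by decreasing v and is computed once by the caller.
    """
    p = p_hat.copy()
    p[order[0]] = min(1.0, p_hat[order[0]] + beta / 2)
    total = p.sum()
    i = len(p) - 1
    while total > 1.0 and i > 0:
        s = order[i]
        reduced = max(0.0, p[s] - (total - 1.0))  # remove the surplus from the worst state
        total += reduced - p[s]
        p[s] = reduced
        i -= 1
    return float(p @ v)

def optimistic_action(r, p_hat, v, beta):
    """Evaluate Eq. (38.17) at a fixed state.

    r[a] is the reward of action a, p_hat[a] its empirical transition vector
    and beta[a] the l1 radius of its confidence set.
    """
    order = np.argsort(-v)  # sort the entries of v once: O(S log S)
    values = [r[a] + inner_max(p_hat[a], v, beta[a], order)  # O(S) per action
              for a in range(len(r))]
    return int(np.argmax(values))
```

Sorting once and reusing the order across all actions is what gives the O(S(A + log S)) cost quoted above.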
The next problem is to find an R such that the sets of feasible solutions to the linear programs in Eq. (38.6) and Eq. (38.7) are contained in the set {x : ‖x‖ ≤ R}. As discussed in Section 38.3.1, a suitable value is R = √(1 + D²S), where D is an upper bound on the diameter of the MDP. It turns out that D = √n works because for each pair of states s, s′ there exists an action a and P ∈ C_{τ_k}(s,a) such that P(s, s′) ≥ 1 ∧ (1/√n), so D(M̃_k) ≤ √n. Combining this with the tools developed in Section 38.3 shows that the Bellman optimality equation for M̃_k may be solved using linear programming in polynomial time. Note that the additional constraints require a minor adaptation of the separation oracle, which we leave to the reader.
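For intuition, here is a sketch of the hitting-time argument behind D(M̃_k) ≤ √n. Write q = 1 ∧ (1/√n) for the one-step reach probability guaranteed above, fix a target state s′ and let π play in every state s an extended action (a, P) with P(s, s′) ≥ q. The time τ_{s′} to hit s′ is then stochastically dominated by a geometric random variable with success probability q, so

\[
\mathbb{E}^{\pi}_{s}[\tau_{s'}] \le \frac{1}{q} \le \sqrt{n} \quad\text{for all } s,
\qquad\text{and hence}\qquad
D(\tilde M_k) = \max_{s \ne s'} \min_{\pi} \mathbb{E}^{\pi}_{s}[\tau_{s'}] \le \sqrt{n} .
\]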
38.6 Proof of upper bound
The proof is developed in three steps. First we decompose the regret into phases
and define a failure event where the confidence intervals fail. In the second step
we bound the regret in each phase and in the third step we sum over the phases.
Recall that M = (S, A, P, r) is the true Markov decision process with diameter D = D(M). The initial state distribution is μ ∈ P(S), which is arbitrary.