equation for M̃_k yields a value function v_k and constant gain ρ_k ∈ R that satisfy Eq. (38.16). A minor detail is that the extended action sets are infinite, while the analysis in previous sections only demonstrated existence of solutions to the Bellman optimality equation for finite MDPs. You should convince yourself that C_t(s,a) is convex and has finitely many extremal points. Restricting the confidence sets to these points makes the extended MDP finite without changing the optimal policy.
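One way to see this, assuming the confidence sets are ℓ1-balls about the empirical transition estimates intersected with the probability simplex as in the construction of the previous section (the radius β_t(s,a) is a label used here only for illustration), is that C_t(s,a) is then a polytope, and a linear objective over a polytope attains its maximum at one of its finitely many vertices:

\[
C_t(s,a) = \Big\{ P \in \mathbb{R}^{\mathcal S} : P \ge 0,\; \textstyle\sum_{s'} P(s') = 1,\; \|P - \hat P_t(s,a)\|_1 \le \beta_t(s,a) \Big\},
\qquad
\max_{P \in C_t(s,a)} \langle P, v \rangle = \max_{P \in \operatorname{ext}(C_t(s,a))} \langle P, v \rangle .
\]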
38.5.2 Computing the optimistic policy
Here we explain how to efficiently solve the Bellman optimality equation for the extended MDP. The results in Section 38.3 show that the Bellman optimality equation for M̃_k can be solved efficiently provided that for any value function v ∈ R^S, computing

    argmax_{a ∈ A} ( r_a(s) + max_{P ∈ C_{τ_k}(s,a)} ⟨P, v⟩ )                (38.17)
can be carried out in an efficient manner. The inner optimization is another linear program with S variables and O(S) constraints, and so can be solved in polynomial time. This procedure is repeated for each a ∈ A to compute the outcome of (38.17). In fact, the inner optimization can be solved more directly by sorting the entries of v and then allocating the coordinates of P one by one, making each as large as the constraints allow in decreasing order of v. The total computational cost of solving Eq. (38.17) in this way is O(S(A + log S)). Combining this with Algorithm 26 gives the required separation oracle.
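As an illustration, here is a minimal sketch of this sorting-based computation of Eq. (38.17) at a fixed state, assuming that C_{τ_k}(s,a) is the intersection of the probability simplex with an ℓ1-ball of radius beta[a] about the empirical estimate p_hat[a]; the names r, p_hat and beta are placeholders for the quantities maintained by the algorithm, not notation from the text.

```python
import numpy as np

def inner_max(p_hat, v, beta, order):
    """Greedy solution of max <P, v> over {P in the simplex : ||P - p_hat||_1 <= beta}.

    Put as much extra mass as allowed on the state with the largest v, then
    shrink the mass of the worst states until P sums to one.  `order` lists
    the states by decreasing v and is computed once by the caller.
    """
    p = p_hat.copy()
    p[order[0]] = min(1.0, p_hat[order[0]] + beta / 2)
    total = p.sum()
    i = len(p) - 1
    while total > 1.0 and i > 0:
        s = order[i]
        reduced = max(0.0, p[s] - (total - 1.0))  # remove the surplus from the worst state
        total += reduced - p[s]
        p[s] = reduced
        i -= 1
    return float(p @ v)

def optimistic_action(r, p_hat, v, beta):
    """Evaluate Eq. (38.17) at a fixed state.

    r[a] is the reward of action a, p_hat[a] its empirical transition vector
    and beta[a] the l1 radius of its confidence set.
    """
    order = np.argsort(-v)  # sort the entries of v once: O(S log S)
    values = [r[a] + inner_max(p_hat[a], v, beta[a], order)  # O(S) per action
              for a in range(len(r))]
    return int(np.argmax(values))
```

Sorting once and reusing the order across all actions is what gives the O(S(A + log S)) cost quoted above.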
The next problem is to find an R such that the sets of feasible solutions to the linear programs in Eq. (38.6) and Eq. (38.7) are contained in the set {x : ‖x‖ ≤ R}. As discussed in Section 38.3.1, a suitable value is R = √(1 + D²S), where D is an upper bound on the diameter of the MDP. It turns out that D = √n works because for each pair of states s, s′ there exists an action a and P ∈ C_{τ_k}(s,a) such that P(s, s′) ≥ 1 ∧ (1/√n), so D(M̃_k) ≤ √n. Combining this with the tools developed in Section 38.3 shows that the Bellman optimality equation for M̃_k may be solved using linear programming in polynomial time. Note that the additional constraints require a minor adaptation of the separation oracle, which we leave to the reader.
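For intuition, here is a sketch of the hitting-time argument behind D(M̃_k) ≤ √n. Write q = 1 ∧ (1/√n) for the one-step reach probability guaranteed above, fix a target state s′ and let π play in every state s an extended action (a, P) with P(s, s′) ≥ q. The time τ_{s′} to hit s′ is then stochastically dominated by a geometric random variable with success probability q, so

\[
\mathbb{E}^{\pi}_{s}[\tau_{s'}] \le \frac{1}{q} \le \sqrt{n} \quad\text{for all } s,
\qquad\text{and hence}\qquad
D(\tilde M_k) = \max_{s \ne s'} \min_{\pi} \mathbb{E}^{\pi}_{s}[\tau_{s'}] \le \sqrt{n} .
\]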
38.6 Proof of upper bound
The proof is developed in three steps. First we decompose the regret into phases
and define a failure event where the confidence intervals fail. In the second step
we bound the regret in each phase and in the third step we sum over the phases.
Recall that M = (S, A, P, r) is the true Markov decision process with diameter D = D(M). The initial state distribution is μ ∈ P(S), which is arbitrary.