Bandit Algorithms

38.10 Exercises

38.16 (Approximate solutions to Bellman equation) Consider a strongly
connected MDP and suppose that ρ and v approximately satisfy the Bellman
optimality equation in the sense that there exists an ε > 0 such that

    |ρ + v(s) − max_{a∈A} (r_a(s) + ⟨P_a(s), v⟩)| ≤ ε   for all states s ∈ S.    (38.26)

(a) Show that ρ ≥ ρ∗ − ε.
(b) Let π̃ be the greedy policy with respect to v. Assume that π̃ is ε′-greedy
with respect to v in the sense that r_{π̃(s)}(s) + ⟨P_{π̃(s)}(s), v⟩ ≥ max_{a∈A} (r_a(s) +
⟨P_a(s), v⟩) − ε′ holds for all s ∈ S. Show that π̃ is (2ε + ε′)-optimal:
ρ^{π̃} ≥ ρ∗ − (2ε + ε′).
(c) Suppose that ρ∗ in Eq. (38.7) is replaced by ρ ∈ [ρ∗, ρ∗ + δ]. Show that the
linear program remains feasible and that its solution (ρ, v) satisfies Eq. (38.26)
with ε ≤ |S|^2 δ.
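
As a quick numerical companion to parts (a) and (b), here is a minimal Python sketch (mine, not from the text): it draws a small random strongly connected MDP, runs a few sweeps of relative value iteration to obtain an approximate pair (ρ, v), measures the residual ε of (38.26) directly, and checks that ρ ≥ ρ∗ − ε and that the greedy policy's gain is at least ρ∗ − 2ε. The MDP size, the number of sweeps and the use of relative value iteration are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 4, 3                                  # illustrative: 4 states, 3 actions
r = rng.uniform(size=(S, A))                 # rewards r_a(s) in [0, 1]
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is the next-state distribution


def rvi(P, r, sweeps):
    """Relative value iteration; returns (rho, v, eps), where eps is the measured
    residual max_s |rho + v(s) - max_a (r_a(s) + <P_a(s), v>)| from (38.26)."""
    v = np.zeros(r.shape[0])
    rho = 0.0
    for _ in range(sweeps):
        Tv = (r + P @ v).max(axis=1)   # (Tv)(s) = max_a (r_a(s) + <P_a(s), v>)
        rho = Tv[0]                    # gain estimate (v is kept with v(0) = 0)
        v = Tv - Tv[0]
    eps = np.max(np.abs(rho + v - (r + P @ v).max(axis=1)))
    return rho, v, eps


rho, v, eps = rvi(P, r, sweeps=5)           # a crude approximate solution
rho_star, _, _ = rvi(P, r, sweeps=10_000)   # near-exact optimal gain rho*

# Gain of the greedy policy pi~ w.r.t. v, via the stationary distribution of P_pi~.
pi = np.argmax(r + P @ v, axis=1)
P_pi = P[np.arange(S), pi]                           # S x S transition matrix of pi~
evals, evecs = np.linalg.eig(P_pi.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])    # Perron eigenvector
mu = mu / mu.sum()                                   # stationary distribution of pi~
rho_pi = float(mu @ r[np.arange(S), pi])

print(f"rho* = {rho_star:.4f}   rho = {rho:.4f}   eps = {eps:.4f}")
print("(a) rho >= rho* - eps:          ", bool(rho >= rho_star - eps))
print("(b) gain(pi~) >= rho* - 2*eps:  ", bool(rho_pi >= rho_star - 2 * eps))
```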


38.17 (Average-optimal is nearly finite-time optimal) Let M be a
strongly connected MDP with rewards in [0, 1], diameter D < ∞ and optimal
gain ρ∗. Let v_n^∗(s) be the maximum total expected reward in n steps when the
process starts in state s. Prove that v_n^∗(s) ≤ nρ∗ + D.
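
To see this bound in action, the following sketch (an illustration under assumed choices, not part of the text) computes v_n^∗ by backward induction on a small random MDP, estimates ρ∗ by relative value iteration, computes the diameter D from minimal expected hitting times, and compares max_s v_n^∗(s) with nρ∗ + D.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, n = 4, 3, 200        # illustrative sizes: 4 states, 3 actions, horizon 200
r = rng.uniform(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is the next-state distribution

# Optimal gain rho* via (many sweeps of) relative value iteration.
v = np.zeros(S)
for _ in range(10_000):
    Tv = (r + P @ v).max(axis=1)
    v = Tv - Tv[0]
rho_star = Tv[0]

# Diameter D: the minimal expected hitting time of each target state solves the
# stochastic-shortest-path fixed point  h(s) = 1 + min_a <P_a(s), h>,  h(target) = 0.
D = 0.0
for target in range(S):
    h = np.zeros(S)
    for _ in range(10_000):
        h = 1.0 + (P @ h).min(axis=1)
        h[target] = 0.0
    D = max(D, h.max())

# Finite-horizon optimal values v_n^*(s) by backward induction over n steps.
vn = np.zeros(S)
for _ in range(n):
    vn = (r + P @ vn).max(axis=1)

print(f"max_s v_n^*(s) = {vn.max():.3f}")
print(f"n rho* + D     = {n * rho_star + D:.3f}")
print("bound v_n^*(s) <= n rho* + D holds:", bool(np.all(vn <= n * rho_star + D + 1e-8)))
```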

38.18 (High probability ⇒ expected regret) Prove that (38.12) follows
from Theorem 38.6.

38.19 (Necessity of phases) The purpose of this exercise is to show that
without phases UCRL2 may suffer linear regret. For convenience we consider the
modified version of UCRL2 in Exercise 38.24 that does not know the reward.
Now suppose we further modify this algorithm to re-solve the optimistic MDP
in every round (τ_k = k for all k). We make use of the following deterministic
Markov decision process with two actions A = {stay, go}, represented by dashed
and solid arrows respectively (Figure 38.4).


Figure 38.4 Transitions and rewards are deterministic; numbers indicate the rewards:
the dashed stay action at each state yields reward 1/2, and the solid go action moves
to the other state with reward 0.

(a) Find all memoryless optimal policies for the MDP in Fig. 38.4.


(b) Prove that the version of UCRL2 given in Exercise 38.24, modified to re-solve
the optimistic MDP in every round, suffers linear regret on this MDP.
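
For intuition about part (b), here is a simplified simulation. It is not the algorithm of Exercise 38.24 (which is not reproduced here) but an optimistic re-planning sketch under assumed simplifications: the deterministic transitions are taken as known, the optimism bonus on the empirical reward is assumed to be sqrt(2 log t / n(s, a)), and the optimistic average-reward problem is re-solved every round by enumerating the three recurrent behaviours of this two-state MDP (stay in one state, stay in the other, or cycle between them). It illustrates how re-planning every round makes the agent keep switching states to chase the larger stay bonus, so its average reward stays bounded away from ρ∗ = 1/2.

```python
import math

# States 0 and 1 (the figure's states 1 and 2).  STAY earns 1/2 and keeps the
# state; GO earns 0 and moves to the other state.  The optimal gain is rho* = 1/2.
STAY, GO = 0, 1
TRUE_REWARD = {STAY: 0.5, GO: 0.0}

n = {(s, a): 0 for s in (0, 1) for a in (STAY, GO)}        # visit counts
rsum = {(s, a): 0.0 for s in (0, 1) for a in (STAY, GO)}   # accumulated rewards


def optimistic_reward(s, a, t):
    """Empirical mean plus an assumed bonus sqrt(2 log t / n), clipped to [0, 1]."""
    if n[(s, a)] == 0:
        return 1.0
    bonus = math.sqrt(2.0 * math.log(t) / n[(s, a)])
    return min(1.0, rsum[(s, a)] / n[(s, a)] + bonus)


state, total_reward, T = 0, 0.0, 100_000
for t in range(1, T + 1):
    # Re-solve the optimistic average-reward problem: the optimal gain is achieved
    # by staying in the current state, staying in the other state, or cycling.
    # Average reward ignores the transient cost of reaching the chosen behaviour.
    gains = {
        "stay_here": optimistic_reward(state, STAY, t),
        "stay_there": optimistic_reward(1 - state, STAY, t),
        "cycle": 0.5 * (optimistic_reward(0, GO, t) + optimistic_reward(1, GO, t)),
    }
    best = max(gains, key=gains.get)
    action = STAY if best == "stay_here" else GO   # first step of the optimistic plan

    reward = TRUE_REWARD[action]                   # environment is deterministic
    n[(state, action)] += 1
    rsum[(state, action)] += reward
    total_reward += reward
    if action == GO:
        state = 1 - state

print(f"average reward over {T} rounds: {total_reward / T:.3f}   (rho* = 0.5)")
print(f"regret ~ {0.5 * T - total_reward:.0f}, i.e. a constant fraction of T")
```

In runs of this sketch the behaviour settles into roughly two stay steps followed by one go step, so the average reward hovers near 1/3 rather than 1/2; the go action is played a constant fraction of the time, which is the linear-regret phenomenon part (b) asks you to establish rigorously for the algorithm of Exercise 38.24.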
