Bandit Algorithms

38.10 Exercises

(b) Prove that there exists a $v \in \mathbb{R}^{\mathcal{S}}$ such that $T_\gamma v = v$.
(c) Let $\pi$ be the greedy policy with respect to $v$. Show that $v = r_\pi + \gamma P_\pi v$.


(d) Prove that $v = (I - \gamma P_\pi)^{-1} r_\pi$.
(e) Define the $\gamma$-discounted value function $v^\pi_\gamma$ of a policy $\pi$ as the function that for any given state $s \in \mathcal{S}$ gives the total expected discounted reward of the policy when it is started from state $s$. Let $v^*_\gamma \in \mathbb{R}^{\mathcal{S}}$ be defined by $v^*_\gamma(s) = \max_\pi v^\pi_\gamma(s)$ for $s \in \mathcal{S}$. We call $\pi$ $\gamma$-discount optimal if $v^*_\gamma = v^\pi_\gamma$. Show that if $\pi$ is greedy with respect to $v$ from Part (b), then $\pi$ is a $\gamma$-discount optimal policy.


Hint: For (b) you should use the contraction mapping theorem (or Banach fixed point theorem), which says that if $(\mathcal{X}, d)$ is a complete metric space and $T : \mathcal{X} \to \mathcal{X}$ satisfies $d(T(x), T(y)) \le \gamma\, d(x, y)$ for some $\gamma \in [0, 1)$, then there exists an $x \in \mathcal{X}$ such that $T(x) = x$. For (e) use (d) and Exercise 38.2 to show that it suffices to check that $v^\pi_\gamma \le v$ for any Markov policy $\pi$. Verify this by using the fact that $T_\gamma$ is monotone ($f \le g$ implies $T_\gamma f \le T_\gamma g$) and showing that $v^\pi_{\gamma,n} \le T^n_\gamma 0$ holds for any $n$, where $v^\pi_{\gamma,n}(s)$ is the total expected discounted reward of the policy when it is started from state $s$ and is followed for $n$ steps.
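
As a quick numerical companion to Parts (b)-(d), the sketch below (a minimal illustration on a made-up random MDP, with hypothetical variable names, not anything from the text) runs value iteration to approximate the fixed point $v$ of $T_\gamma$, extracts a greedy policy $\pi$, and checks that $v = (I - \gamma P_\pi)^{-1} r_\pi$.

```python
# Minimal sketch for Exercise 38.8 (b)-(d); the MDP below is synthetic.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
r = rng.uniform(0.0, 1.0, size=(S, A))       # r[s, a]: expected reward
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']: transition probabilities

def T(v):
    # Bellman optimality operator: (T_gamma v)(s) = max_a [ r(s,a) + gamma * <P(s,a,.), v> ]
    return (r + gamma * P @ v).max(axis=1)

v = np.zeros(S)
for _ in range(1000):                        # T_gamma is a gamma-contraction, so this converges
    v = T(v)

pi = (r + gamma * P @ v).argmax(axis=1)      # greedy policy with respect to v
P_pi = P[np.arange(S), pi]                   # transition matrix P_pi of the greedy policy
r_pi = r[np.arange(S), pi]                   # reward vector r_pi
v_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # (I - gamma P_pi)^{-1} r_pi

print(np.max(np.abs(v - v_pi)))              # essentially zero, matching Parts (c) and (d)
```
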
38.9 (From discounting to average reward) Recall that $H = (I - P + P^*)^{-1}$ and $U = H - P^*$, and let $P^*_\gamma = (1 - \gamma)(I - \gamma P)^{-1}$. Show that

(a) $\lim_{\gamma \to 1^-} P^*_\gamma = P^*$.

(b) $\lim_{\gamma \to 1^-} \dfrac{P^*_\gamma - P^*}{1 - \gamma} = U$.

Hint: For (a) start by manipulating the expressions $P^*_\gamma P$ and $(P^*_\gamma)^{-1} P^*$. For (b) consider $H^{-1}(P^*_\gamma - P^*)$.
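
Both limits can also be checked numerically. The sketch below uses an arbitrary ergodic chain chosen only for illustration (the matrix `P` and the way $P^*$ is computed are my assumptions, not taken from the text) and evaluates the two expressions as $\gamma \to 1^-$.

```python
# Numerical check of Exercise 38.9 on a synthetic ergodic chain.
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
I = np.eye(3)

# For an ergodic chain, P* has every row equal to the stationary distribution mu.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()
P_star = np.tile(mu, (3, 1))

H = np.linalg.inv(I - P + P_star)   # H = (I - P + P*)^{-1}
U = H - P_star                      # U = H - P*

for gamma in (0.9, 0.99, 0.999):
    P_gamma = (1.0 - gamma) * np.linalg.inv(I - gamma * P)
    err_a = np.max(np.abs(P_gamma - P_star))                          # Part (a): -> 0
    err_b = np.max(np.abs((P_gamma - P_star) / (1.0 - gamma) - U))    # Part (b): -> 0
    print(f"gamma={gamma}: {err_a:.2e} {err_b:.2e}")
```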


38.10 (Solution to Bellman optimality equation) In this exercise you will prove Part (a) of Theorem 38.2.

(a) Prove that there exist a deterministic stationary policy $\pi$ and an increasing sequence of discount rates $(\gamma_n)$ with $\gamma_n < 1$ and $\lim_{n \to \infty} \gamma_n = 1$ such that $\pi$ is a greedy policy with respect to the fixed point $v_n$ of $T_{\gamma_n}$ for all $n$.


(b) For the remainder of the exercise, fix a policy $\pi$ whose existence is guaranteed by Part (a). Show that the gain $\rho_\pi$ is constant: $\rho_\pi = \rho \mathbf{1}$ for some $\rho \in \mathbb{R}$.
(c) Let $v = v_\pi$ be the value function and $\rho = \rho_\pi$ the gain of policy $\pi$. Show that $(\rho, v)$ satisfies the Bellman optimality equation.


Hint: For (a) use the fact that for finite MDPs there are only finitely many memoryless deterministic policies. For (b) and (c) use Exercise 38.9.
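
The finiteness argument behind Part (a) can also be observed empirically. In the sketch below (a made-up MDP and a hypothetical helper `greedy_policy`, not from the text), policy iteration returns a policy that is greedy with respect to the fixed point of $T_{\gamma_n}$, and as $\gamma_n$ increases towards 1 the printed policy typically stops changing, which is the stabilization the exercise asks you to establish rigorously.

```python
# Illustration for Exercise 38.10 (a): greedy policies for gamma_n -> 1 on a synthetic MDP.
import numpy as np

rng = np.random.default_rng(1)
S, A = 4, 3
r = rng.uniform(0.0, 1.0, size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))

def greedy_policy(gamma):
    # Policy iteration; at convergence pi is greedy w.r.t. the fixed point of T_gamma.
    pi = np.zeros(S, dtype=int)
    while True:
        P_pi = P[np.arange(S), pi]
        r_pi = r[np.arange(S), pi]
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # discounted value of pi
        new_pi = (r + gamma * P @ v).argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return tuple(pi)
        pi = new_pi

for n in range(1, 7):
    gamma_n = 1.0 - 10.0 ** (-n)        # increasing sequence with gamma_n -> 1
    print(f"gamma_{n} = {gamma_n}: greedy policy {greedy_policy(gamma_n)}")
```
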
38.11 (Counterintuitive solutions to the Bellman equation) Consider the deterministic Markov decision process shown below with two states and two actions. The first action, stay, keeps the state the same, and the second action, go, moves the learner to the other state while incurring a reward of $-1$. Show that in this example solutions $(\rho, v)$ to the Bellman optimality equations (Eq. (38.5))
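
For orientation, it can help to write out the optimality equations for this example. The block below assumes the stay action incurs zero reward and that Eq. (38.5) takes the generic form $\rho + v(s) = \max_a \{r_a(s) + \langle P_a(s), v\rangle\}$; both details are assumptions here, since the figure and Eq. (38.5) are not reproduced in this excerpt.

```latex
% States 1 and 2; stay keeps the current state (reward 0, assumed),
% go moves to the other state with reward -1.
\begin{align*}
  \rho + v(1) &= \max\{\, v(1),\ -1 + v(2) \,\}, \\
  \rho + v(2) &= \max\{\, v(2),\ -1 + v(1) \,\}.
\end{align*}
```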