Bandit Algorithms

38.10 Exercises

(b) Prove that there exists a $v \in \mathbb{R}^{\mathcal{S}}$ such that $T_\gamma v = v$.
(c) Let $\pi$ be the greedy policy with respect to $v$. Show that $v = r_\pi + \gamma P_\pi v$.


(d) Prove that $v = (I - \gamma P_\pi)^{-1} r_\pi$.
(e) Define the $\gamma$-discounted value function $v^\pi_\gamma$ of a policy $\pi$ as the function that for any given state $s \in \mathcal{S}$ gives the total expected discounted reward of the policy when it is started from state $s$. Let $v^*_\gamma \in \mathbb{R}^{\mathcal{S}}$ be defined by $v^*_\gamma(s) = \max_\pi v^\pi_\gamma(s)$ for $s \in \mathcal{S}$. We call $\pi$ $\gamma$-discount optimal if $v^*_\gamma = v^\pi_\gamma$. Show that if $\pi$ is greedy with respect to $v$ from Part (b), then $\pi$ is a $\gamma$-discount optimal policy.


Hint: For (b) you should use the contraction mapping theorem (or Banach fixed point theorem), which says that if $(\mathcal{X}, d)$ is a complete metric space and $T : \mathcal{X} \to \mathcal{X}$ satisfies $d(T(x), T(y)) \le \gamma\, d(x, y)$ for some $\gamma \in [0, 1)$, then there exists an $x \in \mathcal{X}$ such that $T(x) = x$. For (e) use (d) and Exercise 38.2 to show that it suffices to check that $v^\pi_\gamma \le v$ for any Markov policy $\pi$. Verify this by using the fact that $T_\gamma$ is monotone ($f \le g$ implies $T_\gamma f \le T_\gamma g$) and showing that $v^\pi_{\gamma,n} \le T^n_\gamma 0$ holds for any $n$, where $v^\pi_{\gamma,n}(s)$ is the total expected discounted reward of the policy when it is started from state $s$ and is followed for $n$ steps.
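
As a quick numerical companion to Parts (b)-(d), the sketch below (a minimal illustration on a made-up random MDP, with hypothetical variable names, not anything from the text) runs value iteration to approximate the fixed point $v$ of $T_\gamma$, extracts a greedy policy $\pi$, and checks that $v = (I - \gamma P_\pi)^{-1} r_\pi$.

```python
# Minimal sketch for Exercise 38.8 (b)-(d); the MDP below is synthetic.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
r = rng.uniform(0.0, 1.0, size=(S, A))       # r[s, a]: expected reward
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']: transition probabilities

def T(v):
    # Bellman optimality operator: (T_gamma v)(s) = max_a [ r(s,a) + gamma * <P(s,a,.), v> ]
    return (r + gamma * P @ v).max(axis=1)

v = np.zeros(S)
for _ in range(1000):                        # T_gamma is a gamma-contraction, so this converges
    v = T(v)

pi = (r + gamma * P @ v).argmax(axis=1)      # greedy policy with respect to v
P_pi = P[np.arange(S), pi]                   # transition matrix P_pi of the greedy policy
r_pi = r[np.arange(S), pi]                   # reward vector r_pi
v_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # (I - gamma P_pi)^{-1} r_pi

print(np.max(np.abs(v - v_pi)))              # essentially zero, matching Parts (c) and (d)
```
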
38.9 (From discounting to average reward) Recall that $H = (I - P + P^*)^{-1}$ and $U = H - P^*$, and let $P^*_\gamma = (1 - \gamma)(I - \gamma P)^{-1}$. Show that

(a) $\lim_{\gamma \to 1^-} P^*_\gamma = P^*$.

(b) $\lim_{\gamma \to 1^-} \dfrac{P^*_\gamma - P^*}{1 - \gamma} = U$.

Hint: For (a) start by manipulating the expressions $P^*_\gamma P$ and $(P^*_\gamma)^{-1} P^*$. For (b) consider $H^{-1}(P^*_\gamma - P^*)$.
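
Both limits can also be checked numerically. The sketch below uses an arbitrary ergodic chain chosen only for illustration (the matrix `P` and the way $P^*$ is computed are my assumptions, not taken from the text) and evaluates the two expressions as $\gamma \to 1^-$.

```python
# Numerical check of Exercise 38.9 on a synthetic ergodic chain.
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
I = np.eye(3)

# For an ergodic chain, P* has every row equal to the stationary distribution mu.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()
P_star = np.tile(mu, (3, 1))

H = np.linalg.inv(I - P + P_star)   # H = (I - P + P*)^{-1}
U = H - P_star                      # U = H - P*

for gamma in (0.9, 0.99, 0.999):
    P_gamma = (1.0 - gamma) * np.linalg.inv(I - gamma * P)
    err_a = np.max(np.abs(P_gamma - P_star))                          # Part (a): -> 0
    err_b = np.max(np.abs((P_gamma - P_star) / (1.0 - gamma) - U))    # Part (b): -> 0
    print(f"gamma={gamma}: {err_a:.2e} {err_b:.2e}")
```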


38.10 (Solution to Bellman optimality equation) In this exercise you will prove Part (a) of Theorem 38.2.

(a) Prove that there exist a deterministic stationary policy $\pi$ and an increasing sequence of discount rates $(\gamma_n)$ with $\gamma_n < 1$ and $\lim_{n \to \infty} \gamma_n = 1$ such that $\pi$ is a greedy policy with respect to the fixed point $v_n$ of $T_{\gamma_n}$ for all $n$.


(b) For the remainder of the exercise, fix a policy $\pi$ whose existence is guaranteed by Part (a). Show that the gain $\rho_\pi$ is constant: $\rho_\pi = \rho \mathbf{1}$ for some $\rho \in \mathbb{R}$.
(c) Let $v = v_\pi$ be the value function and $\rho = \rho_\pi$ the gain of policy $\pi$. Show that $(\rho, v)$ satisfies the Bellman optimality equation.


Hint: For (a) use the fact that for finite MDPs there are only finitely many memoryless deterministic policies. For (b) and (c) use Exercise 38.9.
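
The finiteness argument behind Part (a) can also be observed empirically. In the sketch below (a made-up MDP and a hypothetical helper `greedy_policy`, not from the text), policy iteration returns a policy that is greedy with respect to the fixed point of $T_{\gamma_n}$, and as $\gamma_n$ increases towards 1 the printed policy typically stops changing, which is the stabilization the exercise asks you to establish rigorously.

```python
# Illustration for Exercise 38.10 (a): greedy policies for gamma_n -> 1 on a synthetic MDP.
import numpy as np

rng = np.random.default_rng(1)
S, A = 4, 3
r = rng.uniform(0.0, 1.0, size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))

def greedy_policy(gamma):
    # Policy iteration; at convergence pi is greedy w.r.t. the fixed point of T_gamma.
    pi = np.zeros(S, dtype=int)
    while True:
        P_pi = P[np.arange(S), pi]
        r_pi = r[np.arange(S), pi]
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # discounted value of pi
        new_pi = (r + gamma * P @ v).argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return tuple(pi)
        pi = new_pi

for n in range(1, 7):
    gamma_n = 1.0 - 10.0 ** (-n)        # increasing sequence with gamma_n -> 1
    print(f"gamma_{n} = {gamma_n}: greedy policy {greedy_policy(gamma_n)}")
```
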
38.11 (Counterintuitive solutions to the Bellman equation) Consider the deterministic Markov decision process shown below with two states and two actions. The first action, stay, keeps the state the same, and the second action, go, moves the learner to the other state while incurring a reward of $-1$. Show that in this example solutions $(\rho, v)$ to the Bellman optimality equations (Eq. (38.5))
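
For orientation, it can help to write out the optimality equations for this example. The block below assumes the stay action incurs zero reward and that Eq. (38.5) takes the generic form $\rho + v(s) = \max_a \{r_a(s) + \langle P_a(s), v\rangle\}$; both details are assumptions here, since the figure and Eq. (38.5) are not reproduced in this excerpt.

```latex
% States 1 and 2; stay keeps the current state (reward 0, assumed),
% go moves to the other state with reward -1.
\begin{align*}
  \rho + v(1) &= \max\{\, v(1),\ -1 + v(2) \,\}, \\
  \rho + v(2) &= \max\{\, v(2),\ -1 + v(1) \,\}.
\end{align*}
```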