38.10 Exercises

$d^*(\mu_0, U) = \sum_{s} \mu_0(s) \sum_{s' \in U} d^*(s, s')$. Prove by induction on the size of $U$ that
\[
d^*(\mu_0, U) \ge \min\left\{ \sum_{k \ge 0} k\, n_k \;\middle|\; 0 \le n_k \le A^k,\ k \ge 0,\ \sum_{k \ge 0} n_k = |U| \right\}
\tag{38.25}
\]
and then conclude that the proposition holds by choosing $U = S$ [Jaksch et al., 2010, Cor. 15].
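
The right-hand side of (38.25) is easy to evaluate: since the weight of $n_k$ is $k$ and the weights increase with $k$, the minimum is attained by filling the levels $k = 0, 1, 2, \dots$ up to their capacities $A^k$, starting from $k = 0$. The following Python sketch is my own illustration (the function name and parameter values are arbitrary, not from the book); it computes this value and the average-travel-time bound it suggests.

```python
# A minimal sketch (not from the book): evaluating the right-hand side of (38.25).
# Because the cost of a unit of n_k is k, the minimum subject to 0 <= n_k <= A^k
# and sum_k n_k = |U| is obtained by filling the cheapest levels k = 0, 1, 2, ... first.

def min_level_sum(U_size: int, A: int) -> int:
    """Greedy evaluation of min{ sum_k k*n_k : 0 <= n_k <= A^k, sum_k n_k = U_size }."""
    remaining, total, k = U_size, 0, 0
    while remaining > 0:
        n_k = min(remaining, A ** k)  # put as much mass as possible at the cheapest level k
        total += k * n_k
        remaining -= n_k
        k += 1
    return total

if __name__ == "__main__":
    S, A = 1000, 2  # arbitrary example sizes
    # With U = S, dividing by S lower-bounds the average travel time to a target state,
    # and hence the diameter; the value grows like log_A(S) minus a small constant.
    print(min_level_sum(S, A) / S)
```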

38.6 (State visitation probabilities and cumulative reward) Let $M = (S, A, P, r)$ be an MDP, $\pi$ a memoryless policy and $i, j \in [S]$.

(a) Show that $e_i^\top P_\pi^t e_j$ is the probability of arriving in state $j$ from state $i$ in $t$ rounds using policy $\pi$.
(b) Show that $e_i^\top \sum_{t=1}^{n} P_\pi^t r_\pi$ is the expected cumulative reward collected by policy $\pi$ over $n$ rounds when starting in state $i$.
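
Both identities can be sanity-checked numerically on a small chain. The sketch below is my own illustration (the three-state transition matrix, reward vector and indices are arbitrary, not from the book): it evaluates $e_i^\top P_\pi^t e_j$ and $e_i^\top \sum_{t=1}^n P_\pi^t r_\pi$ directly and compares the former against a Monte Carlo simulation of the chain.

```python
# A minimal numerical sketch (not from the book) of parts (a) and (b) of Exercise 38.6.
import numpy as np

rng = np.random.default_rng(0)

S = 3
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.0, 0.8]])   # rows sum to one (right stochastic)
r_pi = np.array([1.0, 0.0, 0.5])     # expected reward in each state under pi

i, j, t, n = 0, 2, 4, 20

# (a) e_i^T P_pi^t e_j is the (i, j) entry of the t-step transition matrix.
prob_ij = np.linalg.matrix_power(P_pi, t)[i, j]

# (b) e_i^T sum_{t=1}^n P_pi^t r_pi, with the same indexing as in the exercise statement.
cum_reward = sum(np.linalg.matrix_power(P_pi, s) @ r_pi for s in range(1, n + 1))[i]

# Monte Carlo check of (a) by simulating the chain for t steps from state i.
trials, hits = 100_000, 0
for _ in range(trials):
    state = i
    for _ in range(t):
        state = rng.choice(S, p=P_pi[state])
    hits += (state == j)

print(prob_ij, hits / trials, cum_reward)
```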

38.7 (Stochastic matrices) Let $P$ be any $S \times S$ right stochastic matrix. Show that the following hold:

(a) $A_n = \frac{1}{n} \sum_{t=0}^{n-1} P^t$ is right stochastic.
(b) $A_n + \frac{1}{n}(P^n - I) = A_n P = P A_n$.
(c) $P^* = \lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} P^t$ exists and is right stochastic.
(d) $P^* P = P P^* = P^* P^* = P^*$.
(e) The matrix $H = (I - P + P^*)^{-1}$ is well defined.
(f) Let $U = H - P^*$. Then $U = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \sum_{k=0}^{i-1} (P^k - P^*)$.
(g) Let $r \in \mathbb{R}^S$ and $\rho = P^* r$. Then $v = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \sum_{k=0}^{i-1} P^k (r - \rho)$ is well defined and satisfies (38.3).
(h) With the notation of the previous part, $v + \rho = r + P v$.


Hint: Note that the first four parts of this exercise are the same as in Chapter 37. For Parts (c) and (d) you will likely find it useful that the space of right stochastic matrices is compact. Then show that all cluster points of $(A_n)$ are the same. For (g) show that $v = U r$.
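
For intuition, the first four parts can be observed on a concrete chain before proving them. The sketch below is my own (the three-state right stochastic matrix is arbitrary, not from the book); it uses a long Cesàro average $A_n$ as a numerical stand-in for the limit $P^*$.

```python
# A minimal numerical check (not from the book) of parts (a)-(d): the Cesaro averages
# A_n of the powers of a right stochastic matrix P stay right stochastic and approach
# a matrix P* satisfying P*P = PP* = P*.
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])      # rows sum to one

def cesaro_average(P: np.ndarray, n: int) -> np.ndarray:
    """A_n = (1/n) * sum_{t=0}^{n-1} P^t."""
    S = P.shape[0]
    total, Pt = np.zeros((S, S)), np.eye(S)
    for _ in range(n):
        total += Pt
        Pt = Pt @ P
    return total / n

A_n = cesaro_average(P, 20_000)       # large n as a proxy for P* = lim A_n
print(A_n.sum(axis=1))                # (a), (c): each row still sums to one
print(np.max(np.abs(A_n @ P - A_n)))  # (d): A_n P ~ A_n, error of order 1/n by part (b)
print(np.max(np.abs(P @ A_n - A_n)))  # (d): P A_n ~ A_n
```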


The previous exercise shows that the gain and differential value function of any memoryless policy in any MDP are well defined. The matrix $H$ is called the fundamental matrix and $U$ is called the deviation matrix.
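
Continuing the numerical sketch from Exercise 38.7 (still my own illustration with an arbitrary chain and reward vector), the fundamental and deviation matrices can be formed explicitly once $P^*$ is known. The sketch assumes the chain is irreducible, so that $P^* = \mathbf{1}\pi^\top$ with $\pi$ the stationary distribution, and then checks the equation $v + \rho = r + P v$ from part (h).

```python
# A minimal sketch (not from the book): fundamental matrix H, deviation matrix U,
# gain rho and differential value v for an arbitrary irreducible chain.
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])
r = np.array([1.0, 0.0, 0.5])
S = P.shape[0]

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalised to sum to one.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi /= pi.sum()
P_star = np.outer(np.ones(S), pi)             # P* = 1 pi^T for an irreducible chain

H = np.linalg.inv(np.eye(S) - P + P_star)     # fundamental matrix, part (e)
U = H - P_star                                # deviation matrix, part (f)
rho = P_star @ r                              # gain
v = U @ r                                     # differential value, part (g): v = U r
print(np.max(np.abs(v + rho - (r + P @ v))))  # part (h): residual should be ~ machine precision
```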

38.8 (Discounted MDPs) Let $\gamma \in (0, 1)$ and define the operator $T_\gamma : \mathbb{R}^S \to \mathbb{R}^S$ by
\[
(T_\gamma v)(s) = \max_{a \in A} r_a(s) + \gamma \langle P_a(s), v \rangle.
\]

(a) Prove that $T_\gamma$ is a contraction with respect to the supremum norm: $\|T_\gamma v - T_\gamma w\|_\infty \le \gamma \|v - w\|_\infty$ for any $v, w \in \mathbb{R}^S$.
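
The contraction can also be observed numerically. The sketch below is my own illustration (the randomly generated two-action, three-state MDP and the discount $\gamma = 0.9$ are arbitrary, not from the book): it applies $T_\gamma$ to two vectors, checks the claimed inequality, and then iterates $T_\gamma$ to its fixed point, whose existence and uniqueness follow from Banach's fixed point theorem.

```python
# A minimal sketch (not from the book) of T_gamma and the contraction in part (a)
# for an arbitrary randomly generated MDP with S = 3 states and A = 2 actions.
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 3, 2, 0.9

r = rng.uniform(size=(A, S))                  # r[a, s]: reward of action a in state s
P = rng.uniform(size=(A, S, S))
P /= P.sum(axis=2, keepdims=True)             # normalise so each P[a] is right stochastic

def T(v: np.ndarray) -> np.ndarray:
    """(T_gamma v)(s) = max_a r_a(s) + gamma * <P_a(s), v>."""
    return np.max(r + gamma * (P @ v), axis=0)

# Contraction check on two random vectors: ||T v - T w||_inf <= gamma ||v - w||_inf.
v, w = rng.normal(size=S), rng.normal(size=S)
print(np.max(np.abs(T(v) - T(w))) <= gamma * np.max(np.abs(v - w)) + 1e-12)

# Iterating T (value iteration) converges to the unique fixed point of T_gamma
# at a geometric rate gamma.
v = np.zeros(S)
for _ in range(500):
    v = T(v)
print(np.max(np.abs(T(v) - v)))               # residual should be tiny
```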