Bandit Algorithms

38.10 Exercises 503

distributionμ∈ P(S) show there exists a probability space (Ω,F,P) and an infinite sequence of random elementsS 1 ,A 1 ,S 2 ,A 2 ,...such that for anys∈S, a∈Aandt∈N,

(a)P(S 1 =s) =μ(s).
(b)P(St+1=s|S 1 ,A 1 ,...,St,At) =PAt(St,s).
(c)P(At=a|S 1 ,A 1 ,...,St) =πt(a|S 1 ,A 1 ,...,St− 1 ,At− 1 ,St).

Hint Use Theorem 3.3.

38.2(Sufficiency of Markov policies) LetM= (S,A,P) be a finite controlled Markov environment,πbe an arbitrary policy andμ∈ P(S) an arbitrary initial state distribution. Denote byPπμthe probability distribution that results from the interconnection ofπandMwhile the initial state distribution is μ.

(a) Show there exists a Markov policyπ′such that

Pπμ(St=s,At=a) =Pπ

′
μ(St=a,At=a).
for allt≥1 ands,a∈S×A.
(b)Conclude that for any policyπthere exists a Markov policyπ′such that for
anys∈S, ̄ρπs= ̄ρπ

′ s. Hint Defineπ′in an inductively starting att= 1. Puterman [Theorem 5.5.1 2009] proves this result and credits Strauch [1966].

38.3(Deterministic policies minimize travel time) LetP be some transition structure over some finite state spaceSand some finite action space A. Show that the expected travel time between two statess,s′ofSis minimized by a deterministic policy.

Hint Letτ∗(s,s′) be the shortest expected travel time between some arbitrary pairs of states, which fors=s′is defined to be zero. Show thatτ∗satisfies the fixed point equation

τ∗(s,s′) =

{

0 , ifs=s′; 1 + mina

∑

s′′Pa(s,s

′′)τ∗(s′′,s′), otherwise.

38.4(Strongly connected⇔finite diameter) LetMbe a finite MDP. Prove thatD(M)<∞is equivalent toMbeing strongly connected.

38.5(Diameter lower bound) LetM= (S,A,P,r) be any MDP. Show that D(M)≥logA(S)−3. Hint Denote by d∗(s,s′) the minimum expected time it takes to reach states′when starting from states. The definition ofd∗can be extended to arbitrary initial distributionsμ 0 over states and setsU⊂ Sof target states:

Bandit Algorithms

{

∑

Get our desktop app

Company

Features

Documentation

Resources