38.10 Exercises
(e) Run simulations in RiverSwim instances of various sizes to compare the regret of UCRL2 and ε-greedy. What do you conclude?
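A minimal simulation sketch for the ε-greedy side of this comparison is below. The RiverSwim parameters (success probability 0.6, slip probability 0.05, rewards 0.005 and 1.0) are one common parameterisation and are assumed here for illustration, as is the use of tabular Q-learning as the ε-greedy learner; the function names are my own.

```python
import random

def riverswim(n):
    """Transition kernel and rewards for an n-state RiverSwim chain.
    P[a][s] is a list of (next_state, prob) pairs; r[a][s] is a reward.
    The probabilities (0.6 success / 0.05 slip) and the rewards (0.005 at
    the left bank, 1.0 at the right bank) are one common parameterisation,
    used here only for illustration."""
    P = {0: {}, 1: {}}
    r = {0: [0.0] * n, 1: [0.0] * n}
    for s in range(n):
        P[0][s] = [(max(s - 1, 0), 1.0)]  # swim left: deterministic
        P[1][s] = [(min(s + 1, n - 1), 0.6), (s, 0.35), (max(s - 1, 0), 0.05)]
    r[0][0] = 0.005      # small reward for idling at the left bank
    r[1][n - 1] = 1.0    # large reward for swimming at the right bank
    return P, r

def step(P, s, a):
    """Sample the next state from P[a][s]."""
    nexts, probs = zip(*P[a][s])
    return random.choices(nexts, probs)[0]

def eps_greedy_reward(P, r, n, T, eps=0.1, alpha=0.1, gamma=0.99):
    """Tabular Q-learning with eps-greedy exploration; returns the total
    reward collected over T steps (compare with T times the optimal gain
    to estimate the regret)."""
    Q = [[0.0] * n, [0.0] * n]
    s, total = 0, 0.0
    for _ in range(T):
        greedy = 0 if Q[0][s] >= Q[1][s] else 1
        a = random.randrange(2) if random.random() < eps else greedy
        total += r[a][s]
        s2 = step(P, s, a)
        Q[a][s] += alpha * (r[a][s] + gamma * max(Q[0][s2], Q[1][s2]) - Q[a][s])
        s = s2
    return total

random.seed(0)
P, r = riverswim(6)
print(eps_greedy_reward(P, r, n=6, T=10000))
```

Varying `n` and `T` and averaging over seeds gives the regret curves the exercise asks about; the UCRL2 runs would replace `eps_greedy_reward` with an implementation of Algorithm 38.2.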
38.28 (MDPs with traps) Fix the state space S, action space A and reward function r. Let π be a policy with sublinear regret in all strongly connected MDPs (S, A, r, P). Now suppose that (S, A, r, P) is an MDP that is not strongly connected such that for all s ∈ S there exists a state s′ that is reachable from s under some policy and where ρ∗_{s′} < max_u ρ∗_u. Finally, assume that ρ∗_{S_1} = max_u ρ∗_u almost surely. Prove that π has linear regret on this MDP.
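For intuition, here is one minimal instance satisfying the hypotheses (an illustrative construction of my own, not part of the exercise): the start state attains the maximal gain, but one action leads irrevocably to a zero-gain trap.

```python
# A minimal trap MDP with S = {0, 1} and A = {0, 1}: state 0 has optimal
# gain 1, state 1 is an absorbing trap with gain 0.  Every state can reach
# a state of suboptimal gain (the trap reaches itself), and the start
# state S_1 = 0 attains the maximal gain, as the exercise requires.
S, A = 2, 2
P = [[[0.0] * S for _ in range(S)] for _ in range(A)]  # P[a][s][s2]
r = [[0.0] * S for _ in range(A)]                      # r[a][s]
P[0][0][0] = 1.0; r[0][0] = 1.0   # action 0 at state 0: stay, reward 1
P[1][0][1] = 1.0                  # action 1 at state 0: fall into the trap
P[0][1][1] = P[1][1][1] = 1.0     # the trap is absorbing with reward 0
```

On such an instance a policy that never tries action 1 cannot have sublinear regret on every strongly connected MDP with this state and action space, which is the tension the proof exploits.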
38.29 This exercise develops the ideas mentioned in Note 13. First, we need some definitions: Fix S and A and define Π_0 as the set of policies (learner strategies) for MDPs with state space S and action space A that achieve sublinear regret in any strongly connected MDP with state space S and action space A. Now consider an arbitrary finite MDP M = (S, A, P, r) that has S as state space and A as action space. A state s ∈ S is reachable from state s′ ∈ S if there is a policy that, when started in s′, reaches state s with positive probability after one or more steps. A set of states C ⊂ S is a strongly connected component (SCC) if every state s ∈ C is reachable from every other state s′ ∈ C (allowing for the possibility that s = s′). Call C maximal if we cannot add more states to C and still maintain the SCC property. A maximal SCC is called a maximal end-component (MEC). Show the following:
(a) Two MECs C_1 and C_2 are either equal or disjoint.
(b) Let C_1, ..., C_k be all the distinct MECs of an MDP. The MDP structure defines a connectivity over C_1, ..., C_k as follows: For i ≠ j, we say that C_i is connected to C_j if from some state in C_i it is possible to reach some state of C_j with positive probability under some policy. Show that this connectivity structure defines a directed graph, which must be acyclic.
(c) Let C_1, ..., C_m with m ≤ k be the sinks (the nodes with no out-edges) of this graph. Show that if M is strongly connected then m = 1 and C_1 = S.
(d) Show that for any i ∈ [m] and for any policy π ∈ Π_0 it holds that π will reach C_i in finite time with positive probability if the initial state distribution assigns positive mass to the non-trap states S \ ∪_{i∈[m]} C_i.
(e) Show that for i ≤ m, for any s ∈ C_i and any action a ∈ A, P_a(s, s′) = 0 for any s′ ∈ S \ C_i, i.e., C_i is closed.
(f) Show that the restriction of M to C_i defined as
M_i = (C_i, A, (P_a(s))_{s∈C_i, a∈A}, (r_a(s))_{s∈C_i, a∈A})
is an MDP.
(g) Show that M_i is strongly connected.
(h) Let τ be the time when the learner enters one of C_1, ..., C_m and let I ∈ [m] be the index of the class that is entered at time τ. That is, S_τ ∈ C_I. Show that if M is strongly connected then τ = 1 with probability one.
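The decomposition in parts (a)–(c) can also be checked computationally. The sketch below (my own illustration; it assumes transitions are given as nested lists P[a][s][s′]) computes the SCCs of the positive-probability transition graph with Kosaraju's algorithm, builds the condensation, and returns its sinks, which by part (e) are exactly the closed classes.

```python
from collections import defaultdict

def sccs(adj, n):
    """Kosaraju's algorithm: returns (comp, c) where comp[s] is the index
    of the strongly connected component of state s and c is their count."""
    order, seen = [], [False] * n
    def dfs(u):                      # iterative post-order DFS
        stack = [(u, iter(adj[u]))]
        seen[u] = True
        while stack:
            v, it = stack[-1]
            for w in it:
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, iter(adj[w])))
                    break
            else:
                order.append(v)
                stack.pop()
    for u in range(n):
        if not seen[u]:
            dfs(u)
    radj = [[] for _ in range(n)]    # reversed graph
    for u in range(n):
        for v in adj[u]:
            radj[v].append(u)
    comp, c = [-1] * n, 0
    for u in reversed(order):        # second pass in decreasing finish time
        if comp[u] == -1:
            stack, comp[u] = [u], c
            while stack:
                v = stack.pop()
                for w in radj[v]:
                    if comp[w] == -1:
                        comp[w] = c
                        stack.append(w)
            c += 1
    return comp, c

def sink_components(P):
    """P[a][s][s2] are transition probabilities.  Returns (comp, sinks):
    the component index of each state and the set of sink components of
    the condensation, i.e. the closed classes of parts (c) and (e)."""
    n = len(P[0])
    adj = [sorted({s2 for a in range(len(P)) for s2 in range(n)
                   if P[a][s][s2] > 0 and s2 != s}) for s in range(n)]
    comp, c = sccs(adj, n)
    out = defaultdict(set)           # condensation edges; acyclic by (b)
    for s in range(n):
        for s2 in adj[s]:
            if comp[s] != comp[s2]:
                out[comp[s]].add(comp[s2])
    sinks = {i for i in range(c) if not out[i]}
    return comp, sinks

# Example: states {0, 1} communicate; state 2 is an absorbing trap.
P = [[[0.0, 1.0, 0.0],
      [0.5, 0.0, 0.5],
      [0.0, 0.0, 1.0]]]             # a single action suffices here
comp, sinks = sink_components(P)
```

In the example, {0, 1} forms one class with an edge into the singleton class {2}, which is the unique sink; an MDP is strongly connected precisely when this computation returns a single component covering all of S.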