Bandit Algorithms

38.10 Exercises

(i) We redefine the regret as follows: $R'_n = \mathbb{E}\left[n\rho^*(M_I) - \sum_{t=\tau}^{\tau+n-1} r_{A_t}(S_t)\right]$.
Show that if $M$ is strongly connected, then $R_n = R'_n$.
(j) Show that there exists a learner such that (38.12) continues to hold in the
sense that $R'_n \le 1 + C\,\mathbb{E}\left[D(M_I)\,|C_I|\sqrt{2 A n \log(n)}\right]$.


The logic of the regret definition in Part (i) is that, by Part (d), reasonable
policies cannot control which trap they fall into in an MDP that has more
than one trap. As such, policies should not be penalized for which trap they
fall into. However, once a policy falls into some "trap", we expect it to start
behaving near-optimally. What this definition still lacks is any sensitivity
to how fast a policy gets trapped.
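
As a concrete illustration (this example is not from the original text and uses made-up numbers), suppose the state space splits into a transient start state and two absorbing communicating classes $C_1$ and $C_2$ with optimal gains $\rho^*(M_1) = 1$ and $\rho^*(M_2) = 1/2$, each reachable from the start by a single action. From the start state the optimal gain is $\rho^* = 1$, so a policy that happens to enter $C_2$ and then plays optimally inside it collects roughly $n/2$ reward and suffers $R_n \approx n/2$ under the original definition: a linear penalty for the choice of trap, which by Part (d) it could not reliably avoid. Under the redefinition, $\tau$ is the entry time, $M_I = M_2$, and the benchmark is $n\rho^*(M_2) = n/2$, so $R'_n$ stays bounded for the same policy. On the other hand, a policy that wanders outside both classes for a very long time before entering one is not charged for the delay, which is exactly the insensitivity mentioned above.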

38.30 (Chain rule for relative entropy) Prove the claim in Eq. (38.22).


Hint Make use of the result in Exercise 14.12.
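
For orientation (this is a standard statement of the chain rule, which may differ in notation or detail from Eq. (38.22)): for measures $P$ and $Q$ on a product space,
\[
\mathrm{D}(P_{XY}, Q_{XY}) = \mathrm{D}(P_X, Q_X) + \mathbb{E}_{X \sim P_X}\big[\mathrm{D}(P_{Y \mid X}, Q_{Y \mid X})\big],
\]
and iterating this identity over a sequence $X_1, \dots, X_n$ gives
\[
\mathrm{D}(P, Q) = \sum_{t=1}^{n} \mathbb{E}_P\big[\mathrm{D}(P_{X_t \mid X_{1:t-1}}, Q_{X_t \mid X_{1:t-1}})\big].
\]
If Exercise 14.12 supplies the two-variable identity, the sequential form follows by induction on $n$.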
