Bandit Algorithms

38.10 Exercises

(i) We redefine the regret as follows: $R'_n = \mathbb{E}\left[n\rho^*(M_I) - \sum_{t=\tau}^{\tau+n-1} r_{A_t}(S_t)\right]$.
Show that if $M$ is strongly connected, then $R_n = R'_n$.
(j) Show that there exists a learner such that (38.12) continues to hold in the
sense that $R'_n \le 1 + C\,\mathbb{E}\left[D(M_I)\,|C_I|\sqrt{2 A n \log(n)}\right]$.


The logic of the regret definition in Part (i) is that, by Part (d), reasonable
policies cannot control which trap they fall into in an MDP that has more
than one trap. As such, policies should not be penalized for which trap they
fall into. However, once a policy falls into some "trap", we expect it to start
behaving near-optimally. What this definition still lacks is any sensitivity
to how fast a policy gets trapped.
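
As a concrete illustration (this example is not from the original text and uses made-up numbers), suppose the state space splits into a transient start state and two absorbing communicating classes $C_1$ and $C_2$ with optimal gains $\rho^*(M_1) = 1$ and $\rho^*(M_2) = 1/2$, each reachable from the start by a single action. From the start state the optimal gain is $\rho^* = 1$, so a policy that happens to enter $C_2$ and then plays optimally inside it collects roughly $n/2$ reward and suffers $R_n \approx n/2$ under the original definition: a linear penalty for the choice of trap, which by Part (d) it could not reliably avoid. Under the redefinition, $\tau$ is the entry time, $M_I = M_2$, and the benchmark is $n\rho^*(M_2) = n/2$, so $R'_n$ stays bounded for the same policy. On the other hand, a policy that wanders outside both classes for a very long time before entering one is not charged for the delay, which is exactly the insensitivity mentioned above.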

38.30 (Chain rule for relative entropy) Prove the claim in Eq. (38.22).


Hint Make use of the result in Exercise 14.12.
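
For orientation (this is a standard statement of the chain rule, which may differ in notation or detail from Eq. (38.22)): for measures $P$ and $Q$ on a product space,
\[
\mathrm{D}(P_{XY}, Q_{XY}) = \mathrm{D}(P_X, Q_X) + \mathbb{E}_{X \sim P_X}\big[\mathrm{D}(P_{Y \mid X}, Q_{Y \mid X})\big],
\]
and iterating this identity over a sequence $X_1, \dots, X_n$ gives
\[
\mathrm{D}(P, Q) = \sum_{t=1}^{n} \mathbb{E}_P\big[\mathrm{D}(P_{X_t \mid X_{1:t-1}}, Q_{X_t \mid X_{1:t-1}})\big].
\]
If Exercise 14.12 supplies the two-variable identity, the sequential form follows by induction on $n$.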
