Bandit Algorithms
38.6 Proof of upper bound 492 Rearranging and substituting yields $\tilde R_k = \sum_{t \in E_k} \big(-v_k(S_t) + \langle P_{k,A_t}(S_t), v_k \rangle\big) = \sum_{t \in E_k} \big(-v_k(S_t) + \langle P_{A_t}\ldots$
38.7 Proof of lower bound 493 The first sum is bounded using a version of Hoeffding–Azuma (Exercise 20.8): $\mathbb{P}\Big(F^c \text{ and } \sum_{t=1}^{n} \big(\mathbb{E}_t[\ldots$
38.7 Proof of lower bound 494 number of leaves and label these states $s_1, \ldots, s_L$. The last two states are $s_g$ and $s_b$ ('good' and 'ba ...
38.7 Proof of lower bound 495 One could almost claim victory here and not bother with the proof. As usual, however, there are so ...
38.8 Notes 496 where $d(p,q)$ is the relative entropy between Bernoulli distributions with biases $p$ and $q$ respectively. Now $\Delta$ will be ...
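For reference, since the truncated note above only names the quantity, the relative entropy between Bernoulli distributions with biases $p, q \in (0, 1)$ is the standard expression
$$d(p, q) = p \log\frac{p}{q} + (1 - p) \log\frac{1 - p}{1 - q}\,,$$
with the usual convention $0 \log 0 = 0$ when $p \in \{0, 1\}$.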
38.8 Notes 497 2 The key to choosing the state space is that the state must be observable and sufficiently informative that the ...
38.8 Notes 498 optimal if it achieves the supremum in this definition and such a policy always exists as long as the MDP is fini ...
38.8 Notes 499 algorithm halts with some non-empty $I_k$ and that these states are recurrent under some optimal policy. For more deta ...
38.9 Bibliographical remarks 500 $\langle \hat P_k - P, v_k \rangle$, where the dependence on the state and action has been omitted. Of course the algori ...
38.9 Bibliographical remarks 501 the fundamental algorithms. A book by one of the present authors focuses on cataloging the rang ...
38.10 Exercises 502 Szepesvári [2011] and Abbasi-Yadkori [2012] give algorithms with $O(\sqrt{n})$ regret for linearly parameterized ...
38.10 Exercises 503 distribution $\mu \in \mathcal{P}(\mathcal{S})$ show there exists a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and an infinite sequence of random element ...
38.10 Exercises 504 $d^*(\mu_0, U) = \sum_{s} \mu_0(s) \sum_{s' \in U} d^*(s, s')$. Prove by induction on the size of $U$ that $d^*(\mu_0, U) \ge \min \sum_{k \ge 0}\ldots$
38.10 Exercises 505 (b) Prove that there exists a $v \in \mathbb{R}^{\mathcal{S}}$ such that $T_\gamma v = v$. (c) Let $\pi$ be the greedy policy with respect to $v$. Show $v = r_\pi + \gamma P_\pi\ldots$
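A minimal numerical companion to parts (b) and (c): since $T_\gamma$ is a $\gamma$-contraction in the supremum norm, iterating it converges to the fixed point $v$, and a greedy policy $\pi$ can be read off at the end. The sketch below assumes a tabular MDP with a reward array r of shape (S, A) and a transition array P of shape (S, A, S); the names and the tolerance are illustrative, not the book's notation.

```python
import numpy as np

def value_iteration(r, P, gamma, tol=1e-10):
    """Iterate the discounted Bellman operator T_gamma until (approximate)
    convergence; return the fixed point and a greedy policy.

    r : rewards, shape (S, A);  P : transition kernel, shape (S, A, S).
    """
    S, A = r.shape
    v = np.zeros(S)
    while True:
        # (T_gamma v)(s) = max_a [ r_a(s) + gamma * <P_a(s, .), v> ]
        q = r + gamma * (P @ v)        # shape (S, A)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    pi = q.argmax(axis=1)              # greedy policy with respect to v
    return v, pi
```

With this in hand, part (c) amounts to checking that $v = r_\pi + \gamma P_\pi v$ holds at the fixed point, e.g. `np.allclose(v, r[np.arange(S), pi] + gamma * P[np.arange(S), pi] @ v)` up to the chosen tolerance.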
38.10 Exercises 506 are exactly the elements of the set $\{(\rho, v) \in \mathbb{R} \times \mathbb{R}^2 : \rho = 0,\ v(1) - 1 \le v(2) \le v(1) + 1\}$. [Figure: two-state MDP with transitions labelled by rewards $r = -1$, $r = -1$, $r = 0$, $r = \ldots$]
38.10 Exercises 507 38.16 (Approximate solutions to Bellman equation) Consider a strongly connected MDP and suppose that $\rho$ and $v$ appr ...
38.10 Exercises 508 Hint: Since UCRL2 and the environment are both deterministic, you can examine the behavior of the algorithm on ...
38.10 Exercises 509 following way. First define the empirical reward at the start of the $k$th phase by $\hat r_{k,a}(s) = \sum_{u=1}^{\tau_k - 1} \mathbb{I}\{S_u =\ldots$
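The definition above is cut off, so the following is only a sketch of the usual UCRL2-style empirical estimates from the first $\tau_k - 1$ time steps; the normalisation by visit counts and all variable names are assumptions, not taken from the exercise.

```python
import numpy as np

def empirical_estimates(states, actions, rewards, S, A):
    """Empirical reward and transition estimates at the start of a phase.

    states  : visited states S_1, ..., S_{tau_k} (one entry longer than actions)
    actions : actions A_1, ..., A_{tau_k - 1}
    rewards : rewards R_1, ..., R_{tau_k - 1}
    """
    counts = np.zeros((S, A))
    reward_sum = np.zeros((S, A))
    next_counts = np.zeros((S, A, S))
    for u in range(len(actions)):
        s, a = states[u], actions[u]
        counts[s, a] += 1
        reward_sum[s, a] += rewards[u]
        next_counts[s, a, states[u + 1]] += 1
    n = np.maximum(counts, 1)                # avoid division by zero for unvisited pairs
    r_hat = reward_sum / n                   # empirical rewards
    p_hat = next_counts / n[:, :, None]      # empirical transition kernel
    return r_hat, p_hat
```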
38.10 Exercises 510 $\mathcal{A} = \{\text{left}, \text{right}\}$ and $\mathcal{S} = [S]$ with $S \ge 2$. In all states $s > 1$, action left deterministically leads to state $s - 1$ and p ...
38.10 Exercises 511 (e) Run simulations in RiverSwim instances of various sizes to compare the regret of UCRL2 and $\varepsilon$-greedy. What ...
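For part (e), a RiverSwim-style environment can be simulated directly. The snippet above is truncated, so the slip probabilities (0.6/0.35/0.05) and the rewards (0.005 at the left end, 1 at the right end) below are one common parameterisation from the literature, not necessarily the one intended by the exercise; states are indexed 0, ..., S-1 rather than 1, ..., S.

```python
import numpy as np

class RiverSwim:
    """A RiverSwim-style chain MDP (sketch under the assumptions stated above)."""

    LEFT, RIGHT = 0, 1

    def __init__(self, S, rng=None):
        assert S >= 2
        self.S = S
        self.rng = rng if rng is not None else np.random.default_rng()
        self.state = 0

    def step(self, action):
        s, S = self.state, self.S
        if action == self.LEFT:
            next_s = max(s - 1, 0)              # swimming downstream always succeeds
        else:
            u = self.rng.random()
            if u < 0.6:
                next_s = min(s + 1, S - 1)      # advance against the current
            elif u < 0.95 or s == 0:
                next_s = s                      # held in place
            else:
                next_s = s - 1                  # swept one state downstream
        if s == 0 and action == self.LEFT:
            reward = 0.005
        elif s == S - 1 and action == self.RIGHT:
            reward = 1.0
        else:
            reward = 0.0
        self.state = next_s
        return next_s, reward
```

An $\varepsilon$-greedy baseline can be run directly on `step`, while UCRL2 additionally needs per-phase empirical estimates and confidence sets as in the earlier sketch.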