Bandit Algorithms
35.2 Optimal stopping ( ) 412 Theorem35.1.For any priorQ, lim sup n→∞ BR∗n(Q) n^1 /^2 = 0. Furthermore, there exists a priorQsuc ...
35.3 1 -armed bandits 413 This means that (St)tis a Markov chain over the state spaceSwith kernel (Px)x∈SandUt=u(St) is the util ...
35.3 1 -armed bandits 414 These facts allow us to frame the Bayesian 1-armed bandit problem in terms of optimal stopping. Define ...
35.3 1 -armed bandits 415 This should make intuitive sense. It is optimal to continue only if the expected future reward from do ...
35.3 1 -armed bandits 416 conservative confidence interval in Eq. (35.4), which makes it stop consistently later than its Bayesi ...
35.4 Gittins index 417 Lemma35.5.The following hold: (a) The functionwtis increasing. (b) The functionwtis convex. (c)limμ→∞wt(μ ...
35.4 Gittins index 418 35.4.1 A discounted retirement game We start by describing the discounted setting with one action and the ...
35.4 Gittins index 419 The form in(35.8)will be useful for computation. It is not immediately clear that a stopping time attaini ...
35.4 Gittins index 420 Observe statesS 1 (t),... , Sk(t) Choose actionAt ∈ [k] Receive rewardr(SAt(t)) UpdateSi(t+ 1) =Si(t) for ...
35.4 Gittins index 421 Note that this is the same as the ‘stacking model’ of bandits mentioned on page 63 in Chapter 4 except th ...
35.4 Gittins index 422 By definition we have Eπ [∞ ∑ t=1 αt(r(Si(t))−Gi(t))I{At=i} ] = ∑∞ j=1 Eπ ∑ t∈Tj αt(r(Si(t))−γj) . ...
35.5 Computing the Gittins index 423 any policyπ, Eπ [∞ ∑ t=1 αtr(SAt(t)) ] ≤Eπ [∞ ∑ t=1 αtGAt(t) ] =Eπ [n ∑ t=1 αtIt(H,A) ] ≤Eπ ...
35.6 Notes 424 All this suggests an induction approach where the Gittins index is calculated for each state in decreasing order ...
35.6 Notes 425 discounted Markov decision processes on which there is a large literature. More details are in the bibliographic ...
35.7 Bibliographical remarks 426 9 The algorithm in Section 35.5 for computing Gittins index is calledVaraiya’s algorithm. In th ...
35.8 Exercises 427 Although it is now more than thirty years old, this book is still a worthwhile read and presents many curious ...
35.8 Exercises 428 35.2(Finite horizon optimal stopping) Prove Theorem 35.2. Hint Prove that (Et)nt=1is aF-adapted supermartinga ...
This material will be published by Cambridge University Press as Bandit Algorithms by Tor Lattimore and Csaba Szepesvari. This p ...
36.1 Finite-armed bandits 430 36.1 Finite-armed bandits Recalling the notation from Section 34.5, letk >1 and (E,B(E),Q,P) be ...
36.2 Frequentist analysis 431 σ(A 1 ,X 1 ,...,At,Xt) be theσ-algebra generated by the interaction sequence by the end of roundt. ...
«
17
18
19
20
21
22
23
24
25
26
»
Free download pdf