Bandit Algorithms

(Jeff_L) #1
35.4 Gittins index 421

Note that this is the same as the ‘stacking model’ of bandits mentioned on page 63
in Chapter 4 except that here we have fixed sequences. The next lemma follows
from the Hardy–Littlewood inequality, a generalization of the trivial observation
that the identical ordering of two sequences of numbers maximizes their inner
product. We leave the proof to Exercise 35.7.
Lemma35.10. Suppose thatgiis decreasing for al li∈[k]and(a∗t)∞t=1is defined
inductively bya∗t=argmaxigi,1+ni(a∗,t−1)andI∗(g) =I(g,a∗). Then for any
α∈(0,1),
∑∞


t=1

αtIt∗(g) = sup
a

∑∞


t=1

αtIt(g,a).

Proof of Theorem 35.9 Given a policy π = (πt)∞t=1, let (Ω,F,Pπ) be a
probability space carrying random elementsS 1 ,...,Sk whereSi= (Siu)∞u=1
is a sequence of states and (At)∞t=1is a sequence of actions such that:

(a)Pπ(Si,t+1∈·|Sit) =PSit(·).
(b) The sequencesSiandSjare independent for alli 6 =j.
(c)Pπ(At∈ ·|S(1),A 1 ,...,At− 1 ,S(t)) =πt(·|S(1),A 1 ,...,At− 1 ,S(t)), where
Si(t) =Si(1+Ti(t−1))is the state of machineiobserved by the learner at the
start of roundt.
Let Ft = σ(S(1),A 1 ,S(2),...,At− 1 ,S(t),At) be the σ-algebra containing
information available to the learner after choosing their action in roundt. As
usual,Eπdenotes the expectation with respect toPπ.
Theprevailing chargeof armiin roundtis a random variableGi(t) =
mins≤tg(Si(s)). The proof is decomposed into two steps. In the first step, we
relate the prevailing charge to the discounted cumulative reward. The second
step completes the proof by combining the first with an interleaving argument
using Lemma 35.10.


Part 1: The prevailing charge
Fix an armi. We claim that


[∞



t=1

αtr(Si(t))I{At=i}

]


≤Eπ

[∞



t=1

αtGi(t)I{At=i}

]


.


Furthermore, equality holds for theπ=π∗. Letτ 1 ,τ 2 ,...be a sequence of
stopping times defined inductively by

τ 1 = min{t:At=i} and
τj+1= min{t > τj:At=iandg(Si(t))≤Gi(t−1)},

where the minimum of the empty set is defined to be infinite. The definition
ensures on the event{τj<∞}thatg(Si(τj)) =Gi(τj). Next let


Tj={t:At=iandτj≤t < τj+1} and γj=Gi(τj).
Free download pdf