Bandit Algorithms

35.4 Gittins index 421

Note that this is the same as the ‘stacking model’ of bandits mentioned on page 63
in Chapter 4 except that here we have fixed sequences. The next lemma follows
from the Hardy–Littlewood inequality, a generalization of the trivial observation
that the identical ordering of two sequences of numbers maximizes their inner
product. We leave the proof to Exercise 35.7.
Lemma35.10. Suppose thatgiis decreasing for al li∈[k]and(a∗t)∞t=1is defined
inductively bya∗t=argmaxigi,1+ni(a∗,t−1)andI∗(g) =I(g,a∗). Then for any
α∈(0,1),
∑∞

t=1

αtIt∗(g) = sup a

∑∞

t=1

αtIt(g,a).

Proof of Theorem 35.9 Given a policy π = (πt)∞t=1, let (Ω,F,Pπ) be a probability space carrying random elementsS 1 ,...,Sk whereSi= (Siu)∞u=1 is a sequence of states and (At)∞t=1is a sequence of actions such that:

(a)Pπ(Si,t+1∈·|Sit) =PSit(·).
(b) The sequencesSiandSjare independent for alli 6 =j.
(c)Pπ(At∈ ·|S(1),A 1 ,...,At− 1 ,S(t)) =πt(·|S(1),A 1 ,...,At− 1 ,S(t)), where
Si(t) =Si(1+Ti(t−1))is the state of machineiobserved by the learner at the
start of roundt.
Let Ft = σ(S(1),A 1 ,S(2),...,At− 1 ,S(t),At) be the σ-algebra containing
information available to the learner after choosing their action in roundt. As
usual,Eπdenotes the expectation with respect toPπ.
Theprevailing chargeof armiin roundtis a random variableGi(t) =
mins≤tg(Si(s)). The proof is decomposed into two steps. In the first step, we
relate the prevailing charge to the discounted cumulative reward. The second
step completes the proof by combining the first with an interleaving argument
using Lemma 35.10.

Part 1: The prevailing charge Fix an armi. We claim that

Eπ

[∞

∑

t=1

αtr(Si(t))I{At=i}

]

≤Eπ

[∞

∑

t=1

αtGi(t)I{At=i}

]

.

Furthermore, equality holds for theπ=π∗. Letτ 1 ,τ 2 ,...be a sequence of stopping times defined inductively by

τ 1 = min{t:At=i} and τj+1= min{t > τj:At=iandg(Si(t))≤Gi(t−1)},

where the minimum of the empty set is defined to be infinite. The definition
ensures on the event{τj<∞}thatg(Si(τj)) =Gi(τj). Next let

Tj={t:At=iandτj≤t < τj+1} and γj=Gi(τj).

Bandit Algorithms

∑∞

[∞

∑

]

[∞

∑

]

.

Get our desktop app

Company

Features

Documentation

Resources