Bandit Algorithms

35.5 Computing the Gittins index 423

any policyπ,

Eπ

[∞

∑

t=1

αtr(SAt(t))

]

≤Eπ

[∞

∑

t=1

αtGAt(t)

]

=Eπ

[n ∑

t=1

αtIt(H,A)

]

≤Eπ

[n ∑

t=1

αtI∗t(H)

]

,

where the last line follows from Lemma 35.10. Finally note that the law ofH
underPπdoes not depend onπand hence

Eπ

[n ∑

t=1

αtIt∗(H)

]

=Eπ∗

[n ∑

t=1

αtIt∗(H)

]

Therefore for allπ,

Eπ∗

[∞

∑

t=1

αtr(SAt(t))

]

≥Eπ

[∞

∑

t=1

αtr(SAt(t))

]

,

which completes the proof.

35.5 Computing the Gittins index

We describe a simple approach that depends on the state space being finite.
References to more general methods are given in the bibliographic remarks.
Assume without loss of generality thatS={ 1 , 2 ,...,|S|}andG= 2S. The matrix
form of the transition kernel isP∈[0,1]|S|×|S|and is defined byPij=μ(i,{j}).
We also letr∈[0,1]|S|be the vector of rewards so thatri=r(i). The standard
basis vector isei∈R|S|and 1 ∈R|S|is the vector with 1 in every coordinate.
ForC⊂Swe letQCbe the transition matrix with (QC)ij=PijIC(j). For each
i∈Sour goal is to find

g(i) = sup τ≥ 2

Ei

[∑τ− 1 t=1αtr(St)

]

Ei

[∑τ− 1 t=1αt

] ,

whereEiis the expectation with respect the measurePifor which the initial state
isS 1 =i. Lemma 35.7 shows that the stopping timeτ=min{t≥2 :g(St)≤g(i)}
attains the supremum in the above display. The setCi={j:g(j)> g(i)}is
called the continuation region andSi=S\Ciis the stopping region. Then the
Gittins index can be calculated as

g(i) =

Ei

[∑τ− 1 t=1αtr(St)

]

Ei

[∑τ− 1 t=1αt

] =

∑∞

t=1α

te> iQ

t− 1 ∑ Cir ∞ t=1αte>iQ

t− 1 Ci^1

=

e>i(I−αQCi)−^1 r e>i(I−αQCi)−^11

.

Bandit Algorithms

[∞

∑

]

[∞

∑

]

]

]

,

]

]

[∞

∑

]

[∞

∑

]

,

35.5 Computing the Gittins index

]

] ,

]

] =

∑∞

=

.

Get our desktop app

Company

Features

Documentation

Resources