Bandit Algorithms

(Jeff_L) #1
35.5 Computing the Gittins index 423

any policyπ,


[∞



t=1

αtr(SAt(t))

]


≤Eπ

[∞



t=1

αtGAt(t)

]


=Eπ

[n

t=1

αtIt(H,A)

]


≤Eπ

[n

t=1

αtI∗t(H)

]


,


where the last line follows from Lemma 35.10. Finally note that the law ofH
underPπdoes not depend onπand hence



[n

t=1

αtIt∗(H)

]


=Eπ∗

[n

t=1

αtIt∗(H)

]


Therefore for allπ,


Eπ∗

[∞



t=1

αtr(SAt(t))

]


≥Eπ

[∞



t=1

αtr(SAt(t))

]


,


which completes the proof.


35.5 Computing the Gittins index


We describe a simple approach that depends on the state space being finite.
References to more general methods are given in the bibliographic remarks.
Assume without loss of generality thatS={ 1 , 2 ,...,|S|}andG= 2S. The matrix
form of the transition kernel isP∈[0,1]|S|×|S|and is defined byPij=μ(i,{j}).
We also letr∈[0,1]|S|be the vector of rewards so thatri=r(i). The standard
basis vector isei∈R|S|and 1 ∈R|S|is the vector with 1 in every coordinate.
ForC⊂Swe letQCbe the transition matrix with (QC)ij=PijIC(j). For each
i∈Sour goal is to find


g(i) = sup
τ≥ 2

Ei

[∑τ− 1
t=1αtr(St)

]


Ei

[∑τ− 1
t=1αt

] ,


whereEiis the expectation with respect the measurePifor which the initial state
isS 1 =i. Lemma 35.7 shows that the stopping timeτ=min{t≥2 :g(St)≤g(i)}
attains the supremum in the above display. The setCi={j:g(j)> g(i)}is
called the continuation region andSi=S\Ciis the stopping region. Then the
Gittins index can be calculated as


g(i) =

Ei

[∑τ− 1
t=1αtr(St)

]


Ei

[∑τ− 1
t=1αt

] =


∑∞


t=1α

te>
iQ

t− 1
∑ Cir

t=1αte>iQ

t− 1
Ci^1

=


e>i(I−αQCi)−^1 r
e>i(I−αQCi)−^11

.

Free download pdf