35.5 Computing the Gittins index 423
any policyπ,
Eπ
[∞
∑
t=1
αtr(SAt(t))
]
≤Eπ
[∞
∑
t=1
αtGAt(t)
]
=Eπ
[n
∑
t=1
αtIt(H,A)
]
≤Eπ
[n
∑
t=1
αtI∗t(H)
]
,
where the last line follows from Lemma 35.10. Finally note that the law ofH
underPπdoes not depend onπand hence
Eπ
[n
∑
t=1
αtIt∗(H)
]
=Eπ∗
[n
∑
t=1
αtIt∗(H)
]
Therefore for allπ,
Eπ∗
[∞
∑
t=1
αtr(SAt(t))
]
≥Eπ
[∞
∑
t=1
αtr(SAt(t))
]
,
which completes the proof.
35.5 Computing the Gittins index
We describe a simple approach that depends on the state space being finite.
References to more general methods are given in the bibliographic remarks.
Assume without loss of generality thatS={ 1 , 2 ,...,|S|}andG= 2S. The matrix
form of the transition kernel isP∈[0,1]|S|×|S|and is defined byPij=μ(i,{j}).
We also letr∈[0,1]|S|be the vector of rewards so thatri=r(i). The standard
basis vector isei∈R|S|and 1 ∈R|S|is the vector with 1 in every coordinate.
ForC⊂Swe letQCbe the transition matrix with (QC)ij=PijIC(j). For each
i∈Sour goal is to find
g(i) = sup
τ≥ 2
Ei
[∑τ− 1
t=1αtr(St)
]
Ei
[∑τ− 1
t=1αt
] ,
whereEiis the expectation with respect the measurePifor which the initial state
isS 1 =i. Lemma 35.7 shows that the stopping timeτ=min{t≥2 :g(St)≤g(i)}
attains the supremum in the above display. The setCi={j:g(j)> g(i)}is
called the continuation region andSi=S\Ciis the stopping region. Then the
Gittins index can be calculated as
g(i) =
Ei
[∑τ− 1
t=1αtr(St)
]
Ei
[∑τ− 1
t=1αt
] =
∑∞
t=1α
te>
iQ
t− 1
∑ Cir
∞
t=1αte>iQ
t− 1
Ci^1
=
e>i(I−αQCi)−^1 r
e>i(I−αQCi)−^11