Bandit Algorithms

(Jeff_L) #1
35.4 Gittins index 422

By definition we have


[∞



t=1

αt(r(Si(t))−Gi(t))I{At=i}

]


=


∑∞


j=1





t∈Tj

αt(r(Si(t))−γj)


.


The claim follows by showing the term inside the sum on the right-hand side
vanishes for the Gittins index policy and is not positive for any other policy. By
definition, fort∈Tjit holds thatgi(Si(t))≥Gi(t) =γj. Combining this with
Part (b) of Lemma 35.7,


vγj(Si(t)) +γj−r(Si(t)) =α


S

vγj(y)PSi(t)(dy) =αEπ[vγj(Si(t+ 1))|Ft].

From this it follows that





t∈Tj

αt(r(Si(t))−γj)


≤Eπ




t∈Tj

αt

(


vγj(Si(t))−αvγj(Si(t+ 1))

)



≤ 0 ,


where the final inequality holds sincevγjis nonnegative andvγj(Si(τj)) = 0.
We now argue that the inequality is replaced by an equality for the Gittins
index policy. The key observation is that having playedAτj =ithe Gittins
index policy continues playing armiuntilg(Si(t))≤γj, which means that
Tj={τj,τj+ 1,...,κj− 1 }whereκj=min{t > τj:g(Si(t))≤γj}, which by
Part (c) of Lemma 35.7 means that


Eπ∗




t∈Tj

αt(r(Si(t))−γj)


= 0.


Part 2: Interleaving prevailing charges
LetHiu=minv≤ug(Siv). The key point is that the distribution ofH= (Hiu)
does not depend on the choice of policy and clearlyHiuis decreasing inufor
eachi. Suppose thatπis the Gittins index policy. Then

Eπ∗

[∞



t=1

αtr(SAt(t))

]


=Eπ∗

[∞



t=1

αtGAt(t)

]


=Eπ∗

[∞



t=1

αtIt(H,A)

]


=Eπ∗

[∞



t=1

αtI∗t(H)

]


,


where the first equality follows from Part 1, the second by the definition ofIt
andHand the third by the definitions of the Gittings index policy, which always
chooses an action that maximizes the prevailing charge. On the other hand, for

Free download pdf