Bandit Algorithms

35.4 Gittins index 422

By definition we have

Eπ

[∞

∑

t=1

αt(r(Si(t))−Gi(t))I{At=i}

]

=

∑∞

j=1

Eπ





∑

t∈Tj

αt(r(Si(t))−γj)



.

The claim follows by showing the term inside the sum on the right-hand side
vanishes for the Gittins index policy and is not positive for any other policy. By
definition, fort∈Tjit holds thatgi(Si(t))≥Gi(t) =γj. Combining this with
Part (b) of Lemma 35.7,

vγj(Si(t)) +γj−r(Si(t)) =α

∫

S

vγj(y)PSi(t)(dy) =αEπ[vγj(Si(t+ 1))|Ft].

From this it follows that

Eπ





∑

t∈Tj



≤Eπ





∑

t∈Tj

αt

(

vγj(Si(t))−αvγj(Si(t+ 1))

)



≤ 0 ,

where the final inequality holds sincevγjis nonnegative andvγj(Si(τj)) = 0.
We now argue that the inequality is replaced by an equality for the Gittins
index policy. The key observation is that having playedAτj =ithe Gittins
index policy continues playing armiuntilg(Si(t))≤γj, which means that
Tj={τj,τj+ 1,...,κj− 1 }whereκj=min{t > τj:g(Si(t))≤γj}, which by
Part (c) of Lemma 35.7 means that

Eπ∗





∑

t∈Tj



= 0.

Part 2: Interleaving prevailing charges LetHiu=minv≤ug(Siv). The key point is that the distribution ofH= (Hiu) does not depend on the choice of policy and clearlyHiuis decreasing inufor eachi. Suppose thatπis the Gittins index policy. Then

Eπ∗

[∞

∑

t=1

αtr(SAt(t))

]

=Eπ∗

[∞

∑

t=1

αtGAt(t)

]

=Eπ∗

[∞

∑

t=1

αtIt(H,A)

]

=Eπ∗

[∞

∑

t=1

αtI∗t(H)

]

,

where the first equality follows from Part 1, the second by the definition ofIt
andHand the third by the definitions of the Gittings index policy, which always
chooses an action that maximizes the prevailing charge. On the other hand, for

Bandit Algorithms

[∞

∑

]

=

∑∞





∑



.

∫





∑







∑

(

)



≤ 0 ,





∑



= 0.

[∞

∑

]

[∞

∑

]

[∞

∑

]

[∞

∑

]

,

Get our desktop app

Company

Features

Documentation

Resources