Bandit Algorithms

35.4 Gittins index

35.4.1 A discounted retirement game


We start by describing the discounted setting with one action and then
generalize to multiple actions. Besides discounting, another change is that the
reward generating process is made into a Markov reward process, a strict
generalization of the previous case. The state space of the Markov reward
process is chosen to be a Borel space (S, G) and we let {P_x : x ∈ S} be a Markov
kernel from (S, G) to itself. Given x ∈ S, let (Ω, F, P_x) be a probability space
carrying a sequence of random elements (S_n)_{n=1}^∞ with S_n ∈ S such that

(a) P_x(S_1 = x) = 1.
(b) P_x(S_{n+1} ∈ · | S_n) = P_{S_n}(·) with P_x-probability one.

Expectations with respect to P_x are denoted by E_x. Next let γ ∈ R and r : S → R
be a G/B(R)-measurable function, both of which are known to the learner. In each round
t = 1, 2, ... the learner observes the state S_t and chooses one of two options: (a)
to retire and end the game or (b) to pay the fixed cost γ to receive a reward of r(S_t)
and continue for another round. The policy of a learner in this game corresponds
to choosing a stopping time τ with respect to the filtration F = (F_t)_t with
F_t = σ(S_1, ..., S_t), where τ = t means that the learner retires after observing
S_t at the start of round t. The α-discounted value of the game when starting in
state S_1 = x is

\[
v_\gamma(x) = \sup_{\tau \ge 1} \mathbb{E}_x\left[ \sum_{t=1}^{\tau-1} \alpha^t \left( r(S_t) - \gamma \right) \right],
\]


where α ∈ (0, 1) is the discount factor. To ensure that this is well defined we
need an assumption.

Assumption 35.6. For all x ∈ S it holds that

\[
\mathbb{E}_x\left[ \sum_{t=1}^{\infty} \alpha^t |r(S_t)| \right] < \infty.
\]
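
To make the retirement game concrete, here is a minimal sketch, not from the book, of how v_γ could be approximated when the state space is finite, with the kernel {P_x} given by a transition matrix P and the rewards collected in a vector r. It runs value iteration on the optimality equation v_γ(x) = max{0, α(r(x) − γ + Σ_y P(x, y) v_γ(y))}, which follows from the definition above under the α^t discounting convention used here; all names (retirement_value, P, r, alpha, gamma) are illustrative.

import numpy as np

def retirement_value(P, r, alpha, gamma, tol=1e-10, max_iter=100_000):
    """Approximate v_gamma for a finite-state retirement game.

    Value iteration on v(x) = max(0, alpha * (r(x) - gamma + (P v)(x))),
    a contraction for alpha in (0, 1), so the iterates converge.
    """
    v = np.zeros(len(r))
    for _ in range(max_iter):
        v_new = np.maximum(0.0, alpha * (r - gamma + P @ v))
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# Example: two states, a rewarding one (reward 1) and an unrewarding one (reward 0).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
print(retirement_value(P, r, alpha=0.9, gamma=0.5))

Retiring corresponds to the zero branch of the maximum; continuing pays γ, collects r(x) and adds the expected value of the next state, all discounted by α because the sum in the definition starts at α^1.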


The presence of discounting encourages the learner to obtain large rewards
earlier rather than later and is one distinction between this model and the finite-
horizon model studied for most of this book. A brief discussion of discounting is
left for the notes.
Fix a state x ∈ S. The map γ ↦ v_γ(x) is decreasing and is always nonnegative.
In fact, if γ is large enough, it is easy to see that retiring immediately (τ = 1)
achieves the supremum in the definition of v_γ(x) and thus v_γ(x) = 0. The Gittins
index or fair charge of a state x is the smallest value of γ for which the learner
is indifferent between retiring immediately and playing for at least one round:

\[
g(x) = \inf\{ \gamma \in \mathbb{R} : v_\gamma(x) = 0 \}. \tag{35.7}
\]
Straightforward manipulation (Exercise 35.4) shows that

\[
g(x) = \sup_{\tau \ge 2} \frac{\mathbb{E}_x\left[ \sum_{t=1}^{\tau-1} \alpha^t r(S_t) \right]}{\mathbb{E}_x\left[ \sum_{t=1}^{\tau-1} \alpha^t \right]}. \tag{35.8}
\]
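
The characterization (35.7), together with the monotonicity of γ ↦ v_γ(x), also suggests a simple numerical recipe: locate the fair charge by bisection over γ, recomputing the retirement value at each candidate charge. The sketch below is only an illustration for the finite-state case and repeats the value iteration from the earlier snippet so that it runs on its own; the function name gittins_index and the bracketing interval are choices made here, not notation from the book.

import numpy as np

def gittins_index(P, r, alpha, x, tol=1e-8):
    """Bisection for the fair charge g(x) = inf{gamma : v_gamma(x) = 0} of (35.7)."""

    def value_at_x(gamma):
        # Value iteration for the retirement game, as in the previous sketch.
        v = np.zeros(len(r))
        for _ in range(10_000):
            v_new = np.maximum(0.0, alpha * (r - gamma + P @ v))
            if np.max(np.abs(v_new - v)) < 1e-12:
                return v_new[x]
            v = v_new
        return v[x]

    # g(x) is bracketed by [r(x), max(r)]: charging less than r(x) makes a single
    # round profitable, and by (35.8) the index never exceeds the largest reward.
    lo, hi = float(r[x]), float(np.max(r))
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if value_at_x(mid) > 0:
            lo = mid   # charge too small: the learner still prefers to play
        else:
            hi = mid   # retiring immediately is optimal: the charge is at least fair
    return (lo + hi) / 2

# Example reusing the two-state chain from the previous sketch.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
print(gittins_index(P, r, alpha=0.9, x=0))  # equals r(0) = 1, the largest reward
print(gittins_index(P, r, alpha=0.9, x=1))  # strictly positive: state 1 can drift to state 0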
