Bandit Algorithms

35.4 Gittins index

35.4.1 A discounted retirement game


We start by describing the discounted setting with one action and then
generalize to multiple actions. Besides discounting, another change is that the
reward generating process is made into a Markov reward process, a strict
generalization of the previous case. The state space of the Markov reward
process is chosen to be a Borel space (S, G) and we let {P_x : x ∈ S} be a Markov
kernel from (S, G) to itself. Given x ∈ S, let (Ω, F, P_x) be a probability space
carrying a sequence of random elements (S_n)_{n=1}^∞ with S_n ∈ S such that

(a) P_x(S_1 = x) = 1.
(b) P_x(S_{n+1} ∈ · | S_n) = P_{S_n}(·) with P_x-probability one.

Expectations with respect to P_x are denoted by E_x. Next let γ ∈ R and r : S → R
be a G/B(R)-measurable function, both of which are known to the learner. In each round
t = 1, 2, ... the learner observes the state S_t and chooses one of two options: (a)
to retire and end the game or (b) to pay the fixed cost γ to receive a reward of r(S_t)
and continue for another round. The policy of a learner in this game corresponds
to choosing a stopping time τ with respect to the filtration F = (F_t)_t with
F_t = σ(S_1, ..., S_t), where τ = t means that the learner retires after observing
S_t at the start of round t. The α-discounted value of the game when starting in
state S_1 = x is

\[
v_\gamma(x) = \sup_{\tau \ge 1} \mathbb{E}_x\left[ \sum_{t=1}^{\tau-1} \alpha^t \left( r(S_t) - \gamma \right) \right],
\]


where α ∈ (0, 1) is the discount factor. To ensure that this is well defined we
need an assumption.

Assumption 35.6. For all x ∈ S it holds that

\[
\mathbb{E}_x\left[ \sum_{t=1}^{\infty} \alpha^t |r(S_t)| \right] < \infty.
\]
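
To make the retirement game concrete, here is a minimal sketch, not from the book, of how v_γ could be approximated when the state space is finite, with the kernel {P_x} given by a transition matrix P and the rewards collected in a vector r. It runs value iteration on the optimality equation v_γ(x) = max{0, α(r(x) − γ + Σ_y P(x, y) v_γ(y))}, which follows from the definition above under the α^t discounting convention used here; all names (retirement_value, P, r, alpha, gamma) are illustrative.

import numpy as np

def retirement_value(P, r, alpha, gamma, tol=1e-10, max_iter=100_000):
    """Approximate v_gamma for a finite-state retirement game.

    Value iteration on v(x) = max(0, alpha * (r(x) - gamma + (P v)(x))),
    a contraction for alpha in (0, 1), so the iterates converge.
    """
    v = np.zeros(len(r))
    for _ in range(max_iter):
        v_new = np.maximum(0.0, alpha * (r - gamma + P @ v))
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# Example: two states, a rewarding one (reward 1) and an unrewarding one (reward 0).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
print(retirement_value(P, r, alpha=0.9, gamma=0.5))

Retiring corresponds to the zero branch of the maximum; continuing pays γ, collects r(x) and adds the expected value of the next state, all discounted by α because the sum in the definition starts at α^1.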


The presence of discounting encourages the learner to obtain large rewards
earlier rather than later and is one distinction between this model and the finite-
horizon model studied for most of this book. A brief discussion of discounting is
left for the notes.
Fix a state x ∈ S. The map γ ↦ v_γ(x) is decreasing and is always nonnegative.
In fact, if γ is large enough, it is easy to see that retiring immediately (τ = 1)
achieves the supremum in the definition of v_γ(x) and thus v_γ(x) = 0. The Gittins
index or fair charge of a state x is the smallest value of γ for which the learner
is indifferent between retiring immediately and playing for at least one round:

\[
g(x) = \inf\{ \gamma \in \mathbb{R} : v_\gamma(x) = 0 \}. \tag{35.7}
\]
Straightforward manipulation (Exercise 35.4) shows that

\[
g(x) = \sup_{\tau \ge 2} \frac{\mathbb{E}_x\left[ \sum_{t=1}^{\tau-1} \alpha^t r(S_t) \right]}{\mathbb{E}_x\left[ \sum_{t=1}^{\tau-1} \alpha^t \right]}. \tag{35.8}
\]
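
The characterization (35.7), together with the monotonicity of γ ↦ v_γ(x), also suggests a simple numerical recipe: locate the fair charge by bisection over γ, recomputing the retirement value at each candidate charge. The sketch below is only an illustration for the finite-state case and repeats the value iteration from the earlier snippet so that it runs on its own; the function name gittins_index and the bracketing interval are choices made here, not notation from the book.

import numpy as np

def gittins_index(P, r, alpha, x, tol=1e-8):
    """Bisection for the fair charge g(x) = inf{gamma : v_gamma(x) = 0} of (35.7)."""

    def value_at_x(gamma):
        # Value iteration for the retirement game, as in the previous sketch.
        v = np.zeros(len(r))
        for _ in range(10_000):
            v_new = np.maximum(0.0, alpha * (r - gamma + P @ v))
            if np.max(np.abs(v_new - v)) < 1e-12:
                return v_new[x]
            v = v_new
        return v[x]

    # g(x) is bracketed by [r(x), max(r)]: charging less than r(x) makes a single
    # round profitable, and by (35.8) the index never exceeds the largest reward.
    lo, hi = float(r[x]), float(np.max(r))
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if value_at_x(mid) > 0:
            lo = mid   # charge too small: the learner still prefers to play
        else:
            hi = mid   # retiring immediately is optimal: the charge is at least fair
    return (lo + hi) / 2

# Example reusing the two-state chain from the previous sketch.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
print(gittins_index(P, r, alpha=0.9, x=0))  # equals r(0) = 1, the largest reward
print(gittins_index(P, r, alpha=0.9, x=1))  # strictly positive: state 1 can drift to state 0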
