
The form in (35.8) will be useful for computation. It is not immediately clear that
a stopping time attaining the supremum in (35.8) exists. The following lemma
shows that it does and gives an explicit form.
Lemma 35.7. The following hold:

(a) v_γ(x) = max{0, r(x) − γ + α ∫_S v_γ(y) P_x(dy)} for all γ ∈ ℝ.
(b) If γ ≤ g(x), then v_γ(x) = r(x) − γ + α ∫_S v_γ(y) P_x(dy).
(c) The stopping time τ = min{t ≥ 2 : g(S_t) ≤ γ} attains the supremum in Eq. (35.8).

The result is relatively intuitive. The Gittins index represents the price the
learner should be willing to pay for the privilege of continuing to play. The optimal
policy continues to play as long as the actual value of the game is not smaller
than this price was at the start. The proof of Lemma 35.7 uses Theorem 35.3
and is left for the reader in Exercise 35.5.
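For a finite-state chain, parts (a) and (b) also suggest a simple numerical procedure: compute v_γ by iterating the fixed-point map in (a), which is an α-contraction, and then search over γ for the index. The sketch below is a minimal Python illustration of this idea, not the book's algorithm; it assumes the characterisation that g(x) is the largest γ for which v_γ(x) > 0, and the function names, tolerances and toy three-state chain are invented for illustration.

```python
import numpy as np

def retirement_value(r, P, alpha, gamma, tol=1e-10, max_iter=10_000):
    # Fixed point of v(x) = max(0, r(x) - gamma + alpha * (P v)(x)),
    # computed by value iteration; the map is an alpha-contraction, so it converges.
    v = np.zeros(len(r))
    for _ in range(max_iter):
        v_new = np.maximum(0.0, r - gamma + alpha * (P @ v))
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

def gittins_index(r, P, alpha, x, tol=1e-8):
    # Bisection on gamma: v_gamma(x) is nonincreasing in gamma and vanishes
    # once gamma exceeds max(r), so the index lies in [r[x], max(r)].
    lo, hi = r[x], np.max(r)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if retirement_value(r, P, alpha, mid)[x] > 0.0:
            lo = mid  # continuing is still profitable, so the index is larger
        else:
            hi = mid
    return (lo + hi) / 2.0

# Toy three-state chain (made up for illustration); state 2 is absorbing.
r = np.array([1.0, 0.5, 0.1])
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
alpha = 0.9
print([round(gittins_index(r, P, alpha, x), 4) for x in range(3)])
```

The bisection is justified because v_γ(x) is nonincreasing in γ, is strictly positive when γ < r(x), and is identically zero once γ ≥ max_y r(y), so the index always lies in the bracketed interval.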

35.4.2 Discounted bandits and the index theorem


The generalization of the discounted retirement game to multiple arms is quite
straightforward. There are now k independent Markov chains on the same state-space,
and in each round t the learner first observes the states of all chains
S_1(t), ..., S_k(t) and chooses an action A_t ∈ [k]. Because the learner observes the
states of all chains in each round, a policy π is a collection (π_t)_{t=1}^∞ where π_t is
a probability kernel from (S^k × [k])^{t−1} × S^k to [k]. After choosing their action
A_t, the learner receives a reward r(S_{A_t}(t)) from the corresponding chain, which
then evolves randomly to a new state sampled from the Markov kernel. The
states of unplayed actions do not change, and we assume that all chains evolve
according to the same Markov kernel; see Fig. 35.2 and the note following it.
Given a discount rate α ∈ (0, 1) the objective is to find the policy maximizing
the cumulative discounted reward:

argmax_π E_π [ ∑_{t=1}^∞ α^t r(S_{A_t}(t)) ],


where the expectation is taken with respect to the distribution on state/action
sequences induced by the interaction of π and the k Markov chains.

The assumption that the Markov chains evolve on the same state-space with
the same transition kernel is non-restrictive, since the state-space can always
be taken to be the union of k state-spaces and the transition kernel defined
with k disconnected components.
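The index theorem towards which this subsection builds asserts that an optimal policy simply plays an arm whose current state has the largest Gittins index. As a rough illustration under the same assumptions as the earlier sketch, the following Monte Carlo snippet estimates the discounted return of this index policy; it reuses the hypothetical gittins_index function and toy chain from above, and the truncation horizon and number of runs are arbitrary.

```python
def simulate_index_policy(r, P, alpha, k, init_states, horizon=300, runs=2000, seed=0):
    # Monte Carlo estimate of E[sum_{t>=1} alpha^t r(S_{A_t}(t))] for the policy
    # that plays an arm whose current state has the largest Gittins index.
    # Only the played arm's state changes; all arms share the same kernel P.
    rng = np.random.default_rng(seed)
    g = np.array([gittins_index(r, P, alpha, x) for x in range(len(r))])  # index per state
    total = 0.0
    for _ in range(runs):
        states = list(init_states)
        ret = 0.0
        for t in range(1, horizon + 1):  # truncated horizon; the tail is O(alpha^horizon)
            a = max(range(k), key=lambda i: g[states[i]])  # index policy
            ret += alpha ** t * r[states[a]]
            states[a] = rng.choice(len(r), p=P[states[a]])  # only the played arm evolves
        total += ret
    return total / runs

# Two arms of the toy chain above, started in different states.
print(simulate_index_policy(r, P, alpha, k=2, init_states=[0, 1]))
```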

Example 35.8. To see the relation to Bayesian bandits with discounted rewards,
consider the following setup. Let S = [0, ∞) × [0, ∞) and G = B(S). Then let