
The form in (35.8) will be useful for computation. It is not immediately clear that
a stopping time attaining the supremum in (35.8) exists. The following lemma
shows that it does and gives an explicit form.
Lemma 35.7. The following hold:

(a) v_γ(x) = max{0, r(x) − γ + α ∫_S v_γ(y) P_x(dy)} for all γ ∈ ℝ.
(b) If γ ≤ g(x), then v_γ(x) = r(x) − γ + α ∫_S v_γ(y) P_x(dy).
(c) The stopping time τ = min{t ≥ 2 : g(S_t) ≤ γ} attains the supremum in Eq. (35.8).

The result is relatively intuitive. The Gittins index represents the price the
learner should be willing to pay for the privilege of continuing to play. The optimal
policy continues to play as long as the actual value of the game is not smaller
than this price was at the start. The proof of Lemma 35.7 uses Theorem 35.3
and is left for the reader in Exercise 35.5.
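For a finite-state chain, parts (a) and (b) also suggest a simple numerical procedure: compute v_γ by iterating the fixed-point map in (a), which is an α-contraction, and then search over γ for the index. The sketch below is a minimal Python illustration of this idea, not the book's algorithm; it assumes the characterisation that g(x) is the largest γ for which v_γ(x) > 0, and the function names, tolerances and toy three-state chain are invented for illustration.

```python
import numpy as np

def retirement_value(r, P, alpha, gamma, tol=1e-10, max_iter=10_000):
    # Fixed point of v(x) = max(0, r(x) - gamma + alpha * (P v)(x)),
    # computed by value iteration; the map is an alpha-contraction, so it converges.
    v = np.zeros(len(r))
    for _ in range(max_iter):
        v_new = np.maximum(0.0, r - gamma + alpha * (P @ v))
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

def gittins_index(r, P, alpha, x, tol=1e-8):
    # Bisection on gamma: v_gamma(x) is nonincreasing in gamma and vanishes
    # once gamma exceeds max(r), so the index lies in [r[x], max(r)].
    lo, hi = r[x], np.max(r)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if retirement_value(r, P, alpha, mid)[x] > 0.0:
            lo = mid  # continuing is still profitable, so the index is larger
        else:
            hi = mid
    return (lo + hi) / 2.0

# Toy three-state chain (made up for illustration); state 2 is absorbing.
r = np.array([1.0, 0.5, 0.1])
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
alpha = 0.9
print([round(gittins_index(r, P, alpha, x), 4) for x in range(3)])
```

The bisection is justified because v_γ(x) is nonincreasing in γ, is strictly positive when γ < r(x), and is identically zero once γ ≥ max_y r(y), so the index always lies in the bracketed interval.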

35.4.2 Discounted bandits and the index theorem


The generalization of the discounted retirement game to multiple arms is quite
straightforward. There are now k independent Markov chains on the same state-space,
and in each round t the learner first observes the states of all chains
S_1(t), ..., S_k(t) and chooses an action A_t ∈ [k]. Because the learner observes the
states of all chains in each round, a policy π is a collection (π_t)_{t=1}^∞ where π_t is
a probability kernel from (S^k × [k])^{t−1} × S^k to [k]. After choosing their action
A_t, the learner receives a reward r(S_{A_t}(t)) from the corresponding chain, which
then evolves randomly to a new state sampled from the Markov kernel. The
states of unplayed actions do not change, and we assume that all chains evolve
according to the same Markov kernel; see Fig. 35.2 and the note following it.
Given a discount rate α ∈ (0, 1) the objective is to find the policy maximizing
the cumulative discounted reward:

argmax_π E_π [ ∑_{t=1}^∞ α^t r(S_{A_t}(t)) ],


where the expectation is taken with respect to the distribution on state/action
sequences induced by the interaction of π and the k Markov chains.

The assumption that the Markov chains evolve on the same state-space with
the same transition kernel is non-restrictive, since the state-space can always
be taken to be the union of k state-spaces and the transition kernel defined
with k disconnected components.
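The index theorem towards which this subsection builds asserts that an optimal policy simply plays an arm whose current state has the largest Gittins index. As a rough illustration under the same assumptions as the earlier sketch, the following Monte Carlo snippet estimates the discounted return of this index policy; it reuses the hypothetical gittins_index function and toy chain from above, and the truncation horizon and number of runs are arbitrary.

```python
def simulate_index_policy(r, P, alpha, k, init_states, horizon=300, runs=2000, seed=0):
    # Monte Carlo estimate of E[sum_{t>=1} alpha^t r(S_{A_t}(t))] for the policy
    # that plays an arm whose current state has the largest Gittins index.
    # Only the played arm's state changes; all arms share the same kernel P.
    rng = np.random.default_rng(seed)
    g = np.array([gittins_index(r, P, alpha, x) for x in range(len(r))])  # index per state
    total = 0.0
    for _ in range(runs):
        states = list(init_states)
        ret = 0.0
        for t in range(1, horizon + 1):  # truncated horizon; the tail is O(alpha^horizon)
            a = max(range(k), key=lambda i: g[states[i]])  # index policy
            ret += alpha ** t * r[states[a]]
            states[a] = rng.choice(len(r), p=P[states[a]])  # only the played arm evolves
        total += ret
    return total / runs

# Two arms of the toy chain above, started in different states.
print(simulate_index_policy(r, P, alpha, k=2, init_states=[0, 1]))
```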

Example 35.8. To see the relation to Bayesian bandits with discounted rewards,
consider the following setup. Let S = [0, ∞) × [0, ∞) and G = B(S). Then let