35.4 Gittins index
1. Set $t = 1$ and initialize $S_1(1), \ldots, S_k(1)$.
2. Observe states $S_1(t), \ldots, S_k(t)$.
3. Choose action $A_t \in [k]$.
4. Receive reward $r(S_{A_t}(t))$.
5. Update $S_i(t+1) = S_i(t)$ for $i \neq A_t$, and $S_{A_t}(t+1) \sim P_{S_{A_t}(t)}(\cdot)$.
6. Increment $t$ and repeat from step 2.

Figure 35.2 Interaction protocol for discounted bandits with Markov payoffs.
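It may help to view this protocol as a simulation loop. The following is a minimal Python sketch under stated assumptions: `kernels[i]` is a hypothetical callable that samples the next state of chain $i$, `reward` implements $r$, `policy` maps the current states and time to an arm, and the infinite horizon is truncated to `horizon` steps.

```python
import numpy as np

def run_discounted_markov_bandit(init_states, kernels, reward, policy,
                                 alpha, horizon, rng=None):
    """Simulate the protocol of Figure 35.2 for a finite number of steps."""
    rng = np.random.default_rng() if rng is None else rng
    states = list(init_states)                  # S_1(t), ..., S_k(t)
    total = 0.0
    for t in range(1, horizon + 1):
        a = policy(states, t)                   # choose action A_t in {0, ..., k-1}
        total += alpha**t * reward(states[a])   # discounted reward alpha^t r(S_{A_t}(t))
        states[a] = kernels[a](states[a], rng)  # only the played chain transitions
    return total
```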
the initial state of each Markov chain be $S_i(1) = (1,1)$ and define the Markov kernel $\{P_s : s \in \mathcal{S}\}$ from $(\mathcal{S}, \mathcal{G})$ to itself by
\[
P_{(x,y)}(A) = \frac{x}{x+y}\,\mathbb{I}_A((x+1, y)) + \frac{y}{x+y}\,\mathbb{I}_A((x, y+1))\,.
\]
The reward function is $r(x,y) = x/(x+y)$. The reader should check that this corresponds to a Bernoulli bandit with a $\mathrm{Beta}(1,1)$ prior on the mean reward of each arm (Exercise 35.6). The role of the state space is to maintain a sufficient statistic for the posterior, while the reward function is the expected reward given the posterior.
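To make the correspondence concrete, here is a small sketch of this particular kernel and reward function (the function names are illustrative), compatible with the simulation loop sketched after Figure 35.2:

```python
def bernoulli_kernel(state, rng):
    # State (x, y) holds the Beta posterior parameters.  A transition to
    # (x + 1, y) happens with probability x / (x + y), matching an observed
    # success, and to (x, y + 1) otherwise, matching an observed failure.
    x, y = state
    if rng.random() < x / (x + y):
        return (x + 1, y)
    return (x, y + 1)

def bernoulli_reward(state):
    # r(x, y) = x / (x + y), the posterior mean of the arm's success probability.
    x, y = state
    return x / (x + y)
```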
Returning to the general problem, a policy $\pi^*$ that chooses $A_t \in \operatorname{argmax}_{i \in [k]} g(S_i(t))$ is called a Gittins index policy. One of the most celebrated theorems in the study of bandits is that these policies are Bayesian optimal.
Theorem 35.9. Let $\pi^*$ be a policy choosing $A_t = \operatorname{argmax}_i g(S_i(t))$ with ties broken arbitrarily. Then
\[
\mathbb{E}_{\pi^*}\left[\sum_{t=1}^{\infty} \alpha^t r(S_{A_t}(t))\right]
= \sup_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=1}^{\infty} \alpha^t r(S_{A_t}(t))\right],
\]
where the supremum is taken over all policies.
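As an illustration only (not part of the proof), the index policy of Theorem 35.9 can be expressed on top of the simulation loop sketched earlier, assuming a routine `gittins_index` that computes $g(s)$ for a state $s$:

```python
import numpy as np

def make_gittins_policy(gittins_index):
    # Returns a policy choosing A_t in argmax_i g(S_i(t)); np.argmax breaks
    # ties by the smallest arm index, which is one admissible arbitrary rule.
    def policy(states, t):
        return int(np.argmax([gittins_index(s) for s in states]))
    return policy
```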
The remainder of the section is devoted to proving Theorem 35.9. The choice of actions produces an interleaving of the rewards generated by each Markov chain, and it will be useful to have a notation for these interleavings. For each $i \in [k]$ let $g_i = (g_{it})_{t=1}^{\infty}$ be a real-valued sequence and $g = (g_1, \ldots, g_k)$ be the tuple of these sequences. Given an infinite sequence $(a_t)_{t=1}^{\infty}$ taking values in $[k]$, define the interleaving sequence $I(g, a) = (I_t(g, a))_{t=1}^{\infty}$ by
\[
I_t(g, a) = g_{a_t,\, 1 + n_{a_t}(a, t-1)} \qquad \text{with} \qquad n_i(a, t-1) = \sum_{s=1}^{t-1} \mathbb{I}\{a_s = i\}\,.
\]
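For intuition, the interleaving can be computed as follows (a sketch with 0-based indexing, so `g[i][n]` stands for $g_{i,n+1}$):

```python
def interleave(g, a, horizon):
    # Returns I_1(g, a), ..., I_horizon(g, a): at time t the next unused
    # element of the sequence selected by a_t is emitted.
    counts = [0] * len(g)            # counts[i] = n_i(a, t - 1)
    out = []
    for t in range(horizon):
        i = a[t]
        out.append(g[i][counts[i]])  # g_{a_t, 1 + n_{a_t}(a, t-1)}
        counts[i] += 1
    return out
```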