35.4 Gittins index
1. Set $t = 1$ and initialize $S_1(1), \ldots, S_k(1)$.
2. Observe states $S_1(t), \ldots, S_k(t)$.
3. Choose action $A_t \in [k]$.
4. Receive reward $r(S_{A_t}(t))$.
5. Update $S_i(t+1) = S_i(t)$ for $i \neq A_t$, and $S_{A_t}(t+1) \sim P_{S_{A_t}(t)}(\cdot)$.
6. Increment $t$ and repeat from step 2.

Figure 35.2 Interaction protocol for discounted bandits with Markov payoffs.
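It may help to view this protocol as a simulation loop. The following is a minimal Python sketch under stated assumptions: `kernels[i]` is a hypothetical callable that samples the next state of chain $i$, `reward` implements $r$, `policy` maps the current states and time to an arm, and the infinite horizon is truncated to `horizon` steps.

```python
import numpy as np

def run_discounted_markov_bandit(init_states, kernels, reward, policy,
                                 alpha, horizon, rng=None):
    """Simulate the protocol of Figure 35.2 for a finite number of steps."""
    rng = np.random.default_rng() if rng is None else rng
    states = list(init_states)                  # S_1(t), ..., S_k(t)
    total = 0.0
    for t in range(1, horizon + 1):
        a = policy(states, t)                   # choose action A_t in {0, ..., k-1}
        total += alpha**t * reward(states[a])   # discounted reward alpha^t r(S_{A_t}(t))
        states[a] = kernels[a](states[a], rng)  # only the played chain transitions
    return total
```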
the initial state of each Markov chain be $S_i(1) = (1,1)$ and define the Markov kernel $\{P_s : s \in \mathcal{S}\}$ from $(\mathcal{S}, \mathcal{G})$ to itself by
\[
P_{(x,y)}(A) = \frac{x}{x+y}\,\mathbb{I}_A((x+1, y)) + \frac{y}{x+y}\,\mathbb{I}_A((x, y+1))\,.
\]
The reward function is $r(x,y) = x/(x+y)$. The reader should check that this corresponds to a Bernoulli bandit with a $\mathrm{Beta}(1,1)$ prior on the mean reward of each arm (Exercise 35.6). The role of the state space is to maintain a sufficient statistic for the posterior, while the reward function is the expected reward given the posterior.
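To make the correspondence concrete, here is a small sketch of this particular kernel and reward function (the function names are illustrative), compatible with the simulation loop sketched after Figure 35.2:

```python
def bernoulli_kernel(state, rng):
    # State (x, y) holds the Beta posterior parameters.  A transition to
    # (x + 1, y) happens with probability x / (x + y), matching an observed
    # success, and to (x, y + 1) otherwise, matching an observed failure.
    x, y = state
    if rng.random() < x / (x + y):
        return (x + 1, y)
    return (x, y + 1)

def bernoulli_reward(state):
    # r(x, y) = x / (x + y), the posterior mean of the arm's success probability.
    x, y = state
    return x / (x + y)
```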
Returning to the general problem, a policy $\pi^*$ that chooses $A_t \in \operatorname{argmax}_{i \in [k]} g(S_i(t))$ is called a Gittins index policy. One of the most celebrated theorems in the study of bandits is that these policies are Bayesian optimal.
Theorem 35.9. Let $\pi^*$ be a policy choosing $A_t = \operatorname{argmax}_i g(S_i(t))$ with ties broken arbitrarily. Then
\[
\mathbb{E}_{\pi^*}\left[\sum_{t=1}^{\infty} \alpha^t r(S_{A_t}(t))\right]
= \sup_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=1}^{\infty} \alpha^t r(S_{A_t}(t))\right],
\]
where the supremum is taken over all policies.
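As an illustration only (not part of the proof), the index policy of Theorem 35.9 can be expressed on top of the simulation loop sketched earlier, assuming a routine `gittins_index` that computes $g(s)$ for a state $s$:

```python
import numpy as np

def make_gittins_policy(gittins_index):
    # Returns a policy choosing A_t in argmax_i g(S_i(t)); np.argmax breaks
    # ties by the smallest arm index, which is one admissible arbitrary rule.
    def policy(states, t):
        return int(np.argmax([gittins_index(s) for s in states]))
    return policy
```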
The remainder of the section is devoted to proving Theorem 35.9. The choice of actions produces an interleaving of the rewards generated by each Markov chain, and it will be useful to have a notation for these interleavings. For each $i \in [k]$ let $g_i = (g_{it})_{t=1}^{\infty}$ be a real-valued sequence and $g = (g_1, \ldots, g_k)$ be the tuple of these sequences. Given an infinite sequence $(a_t)_{t=1}^{\infty}$ taking values in $[k]$, define the interleaving sequence $I(g, a) = (I_t(g, a))_{t=1}^{\infty}$ by
\[
I_t(g, a) = g_{a_t,\, 1 + n_{a_t}(a, t-1)} \qquad \text{with} \qquad n_i(a, t-1) = \sum_{s=1}^{t-1} \mathbb{I}\{a_s = i\}\,.
\]
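For intuition, the interleaving can be computed as follows (a sketch with 0-based indexing, so `g[i][n]` stands for $g_{i,n+1}$):

```python
def interleave(g, a, horizon):
    # Returns I_1(g, a), ..., I_horizon(g, a): at time t the next unused
    # element of the sequence selected by a_t is emitted.
    counts = [0] * len(g)            # counts[i] = n_i(a, t - 1)
    out = []
    for t in range(horizon):
        i = a[t]
        out.append(g[i][counts[i]])  # g_{a_t, 1 + n_{a_t}(a, t-1)}
        counts[i] += 1
    return out
```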