Lemma 35.5. The following hold:
(a) The function $w_t$ is increasing.
(b) The function $w_t$ is convex.
(c) $\lim_{\mu \to \infty} w_t(\mu)/\mu = n - t + 1$ and $\lim_{\mu \to -\infty} w_t(\mu) = (n - t + 1)\mu_2$.
There are many ways to approximate a function, but in order to propagate
the approximation using Eq. (35.6) it is convenient to choose a form for which
the integral in Eq. (35.6) can be computed analytically. Given the properties
in Lemma 35.5, a natural choice is to approximate $w_t$ using piecewise quadratic
functions.
functions. Let ̃wn+1(μ) = 0 and
w ̄t(μ) = max
{
(n−t+ 1)μ 2 , μ+
1
√
2 π
∫∞
−∞
exp
(
−
x^2
2 σt^2
)
w ̃t+1(μ+x)dx
}
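To see why this form propagates analytically (a standard Gaussian moment calculation, not spelled out in the text), note that integrating a quadratic against the Gaussian over the whole line returns another quadratic in $\mu$:
\[
\frac{1}{\sqrt{2\pi\sigma_t^2}} \int_{-\infty}^{\infty} \exp\left(-\frac{x^2}{2\sigma_t^2}\right) \left(a(\mu + x)^2 + b(\mu + x) + c\right) dx = a\left(\mu^2 + \sigma_t^2\right) + b\mu + c.
\]
Over a finite piece $[x_i, x_{i+1}]$ the corresponding truncated moments are expressible in terms of the Gaussian density and distribution function at the endpoints, so each backup can be evaluated in closed form.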
Then let $-\infty < x_1 \le x_2 \le \cdots \le x_N < \infty$ and for $\mu \in [x_i, x_{i+1}]$ define
$\tilde w_t(\mu) = a_i\mu^2 + b_i\mu + c_i$ to be the unique quadratic approximation of $\bar w_t(\mu)$ such
that
\[
\tilde w_t(x_i) = \bar w_t(x_i), \qquad
\tilde w_t(x_{i+1}) = \bar w_t(x_{i+1}), \qquad
\tilde w_t((x_i + x_{i+1})/2) = \bar w_t((x_i + x_{i+1})/2).
\]
For $\mu < x_1$ we use the approximation $\tilde w_t(\mu) = (n - t + 1)\mu_2$, and for $\mu > x_N$ the linear
approximation $\tilde w_t(\mu) = (n - t + 1)\mu$ is reasonable by Lemma 35.5. The computation
time for calculating the coefficients $a_i$, $b_i$, $c_i$ for all $t$ and $i \in [N]$ is $O(Nn)$. We
encourage the reader to implement this algorithm and compare it to its natural
frequentist competitors (Exercise 35.10).
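As a starting point for that exercise, here is a minimal sketch of the backward induction above. It assumes a standard Gaussian prior and unit-variance rewards (so the posterior-mean increment in round $t$ has variance $\sigma_t^2 = 1/(t(t+1))$), fixes $\mu_2 = 0$, and for brevity evaluates the integral by Gauss-Hermite quadrature rather than analytically; the horizon, the grid and all names are illustrative choices, not part of the text.

```python
import numpy as np

n = 50        # horizon (illustrative)
mu2 = 0.0     # known mean of the second arm (illustrative)
N = 200       # number of grid points x_1 <= ... <= x_N
xs = np.linspace(-5.0, 5.0, N)
z, wq = np.polynomial.hermite.hermgauss(30)  # Gauss-Hermite nodes/weights

def make_w_tilde(coeffs, t):
    """Evaluator for the piecewise-quadratic approximation of w_t."""
    def w_tilde(mu):
        mu = np.asarray(mu, dtype=float)
        out = np.empty_like(mu)
        lo, hi = mu < xs[0], mu > xs[-1]
        out[lo] = (n - t + 1) * mu2       # constant tail for mu < x_1
        out[hi] = (n - t + 1) * mu[hi]    # linear tail for mu > x_N
        mid = ~(lo | hi)
        i = np.clip(np.searchsorted(xs, mu[mid]) - 1, 0, N - 2)
        a, b, c = coeffs[i].T
        out[mid] = a * mu[mid] ** 2 + b * mu[mid] + c
        return out
    return w_tilde

w_next = lambda mu: np.zeros_like(np.asarray(mu, dtype=float))  # w_{n+1} = 0

for t in range(n, 0, -1):
    # With a standard Gaussian prior and unit-variance rewards, the posterior
    # mean after round t moves by a Gaussian step with this variance.
    sigma_t = np.sqrt(1.0 / (t * (t + 1)))

    def w_bar(mu):
        # E[w_{t+1}(mu + sigma_t Z)], Z ~ N(0,1), by Gauss-Hermite quadrature.
        cont = w_next(mu[:, None] + np.sqrt(2.0) * sigma_t * z) @ wq / np.sqrt(np.pi)
        return np.maximum((n - t + 1) * mu2, mu + cont)

    # Fit a quadratic on each [x_i, x_{i+1}] through the endpoints and midpoint.
    left, right = xs[:-1], xs[1:]
    mids = (left + right) / 2.0
    yl, ym, yr = w_bar(left), w_bar(mids), w_bar(right)
    coeffs = np.empty((N - 1, 3))
    for i in range(N - 1):
        coeffs[i] = np.polyfit([left[i], mids[i], right[i]],
                               [yl[i], ym[i], yr[i]], 2)
    w_next = make_w_tilde(coeffs, t)

print("approximate w_1(0):", w_next(np.array([0.0]))[0])
```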
35.4 Gittins index
Generalizing the analysis in the previous section to multiple actions
is mathematically straightforward, but computationally intractable. The
computational complexity of backwards induction increases exponentially with
the number of arms, which is impractical unless the number of arms and horizon
are both small.
An \emph{index policy} is a policy that in each round computes a real-valued \emph{index}
for each arm and plays the arm with the largest index. Furthermore, the index
of an arm should only depend on statistics collected for that arm (and perhaps
the time horizon). Many policies we met earlier are index policies. For example,
most variants of the upper confidence bound algorithm introduced in Part II
are index policies. Sadly, however, the Bayesian optimal policy for finite-horizon
bandits is not usually an index policy. John Gittins proved that if one is prepared
to modify the objective to a special kind of infinite horizon problem, then the
Bayesian optimal policy becomes an index policy. In the remainder of this chapter
we explore his ideas.
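To make the definition concrete, here is a minimal sketch of the index-policy template, using a UCB-style index of the kind met in Part II as the illustrative index function; the class and method names are our own, not from the text.

```python
import math

class IndexPolicy:
    """Plays the arm with the largest index; the index of an arm depends
    only on that arm's own statistics (and possibly the round number)."""

    def __init__(self, num_arms):
        self.counts = [0] * num_arms    # number of pulls of each arm
        self.means = [0.0] * num_arms   # empirical mean of each arm

    def index(self, a, t):
        # Illustrative UCB-style index; any per-arm index could be used here.
        if self.counts[a] == 0:
            return math.inf             # ensure each arm is tried once
        return self.means[a] + math.sqrt(2.0 * math.log(t) / self.counts[a])

    def act(self, t):
        indices = [self.index(a, t) for a in range(len(self.counts))]
        return max(range(len(indices)), key=indices.__getitem__)

    def update(self, a, reward):
        self.counts[a] += 1
        self.means[a] += (reward - self.means[a]) / self.counts[a]
```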