All this suggests an induction approach where the Gittins index is calculated
for each state in decreasing order of their indices. To get started note that the
maximum possible Gittins index is $\max_i r_i$ and that this is achievable for state
$i = \operatorname{argmax}_j r_j$ with deterministic stopping time $\tau = 2$. Assume that $g(i)$ is
known for the $j$ states $C = \{i_1, i_2, \ldots, i_j\}$ with the largest Gittins indices. Then
$i_{j+1}$ is given by
\[
i_{j+1} = \operatorname{argmax}_{i \notin C} \frac{e_i^\top (I - \alpha Q_C)^{-1} r}{e_i^\top (I - \alpha Q_C)^{-1} \mathbf{1}}\,.
\]
If Gauss–Jordan elimination is used for matrix inversion, then the computational
complexity of this algorithm is $O(|S|^4)$. A more sophisticated inversion algorithm
would reduce the complexity to $O(|S|^{3+\varepsilon})$ for some $\varepsilon \le 0.373$, but such
algorithms are seldom practical. When $\alpha$ is relatively small the inversion can be replaced by
directly calculating the sums to some truncated horizon with little loss in accuracy.
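The recursion above is straightforward to implement. The following is a minimal Python sketch in the notation of this section (the function name gittins_indices and the use of a generic matrix inverse are our choices, not prescribed by the text): at each step it forms $Q_C$ by zeroing the columns of $P$ outside $C$, computes the two quantities in the ratio, and adds the maximizing state to $C$. With a cubic-time inverse this matches the $O(|S|^4)$ bound above.

```python
import numpy as np

def gittins_indices(P, r, alpha):
    """Gittins indices of a Markov reward process, computed in decreasing order.

    P     : (n, n) transition matrix
    r     : (n,)  reward vector
    alpha : discount factor in (0, 1)
    """
    n = len(r)
    g = np.zeros(n)            # indices, filled in decreasing order
    C = []                     # states whose indices are already known
    remaining = set(range(n))
    ones = np.ones(n)

    for _ in range(n):
        # Q_C keeps only transitions into C; leaving C corresponds to stopping.
        QC = np.zeros((n, n))
        if C:
            QC[:, C] = P[:, C]
        M = np.linalg.inv(np.eye(n) - alpha * QC)   # (I - alpha Q_C)^{-1}
        ratio = (M @ r) / (M @ ones)                # discounted reward / discounted time
        i = max(remaining, key=lambda s: ratio[s])  # argmax over states not in C
        g[i] = ratio[i]
        C.append(i)
        remaining.remove(i)
    return g
```

As noted above, when $\alpha$ is small the inverse can be replaced by a truncated Neumann series $\sum_{t=0}^{T} (\alpha Q_C)^t$ with little loss in accuracy.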
35.6 Notes
1 An advantage of Bayesian methods is that they automatically and optimally
exploit the assumptions encoded in the prior. This blessing can also be a curse.
A policy that exploits its assumptions too heavily can be brittle when those
assumptions turn out to be wrong. This can have a devastating effect in bandits
where the cost of overly aggressive confidence intervals is large.
2 The solution to optimal stopping problems is essentially a form of dynamic
programming, which is a method that trades memory for computation by
introducing recursively defined value functions that suffice for reconstructing
an optimal policy. In the 1-armed bandit optimal stopping problem, thanks to
the factorization lemma (Lemma 2.5), for any $0 \le t \le n$ there exists a function
$w_t : \mathbb{R}^t \to \mathbb{R}$ such that $W_t = w_t(X_1,\ldots,X_t)$ almost surely. This function can
be seen as the value function that captures the optimal value-to-go from stage
$t$ on, and (35.3) gives a recursive construction for it: $w_n(x_1,\ldots,x_n) = 0$ and
for $t < n$,
\[
w_t(x_1,\ldots,x_t) = \max\left((n-t)\mu_2,\ \int \bigl(x_{t+1} + w_{t+1}(x_1,\ldots,x_t,x_{t+1})\bigr)\, dP_t(x_{t+1})\right),
\]
where $P_t$ is the distribution of $X_{t+1}$ given $X_1,\ldots,X_t$. The problem with
this general recursion is that the computation is prohibitive. The example
with Bernoulli rewards shows that sometimes a similar recursion holds on a
reduced ‘state space’ that avoids the combinatorial explosion that typically
arises. For Gaussian rewards even the reduced ‘state space’ was uncountably
large and a piecewise quadratic approximation was suggested. When this
kind of approximation is used, we get an instance of approximate dynamic
programming.
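For the Bernoulli case just mentioned, the reduced 'state space' consists of the pairs (number of pulls, number of successes). A minimal Python sketch of the backward recursion on this space is given below, assuming a Beta(a0, b0) prior on the unknown mean of arm 1 and a known mean mu2 for arm 2; the function name and this particular parameterization are illustrative choices, not taken from the text.

```python
import numpy as np

def one_armed_value(n, mu2, a0=1.0, b0=1.0):
    """Backward induction for the Bernoulli 1-armed bandit on the reduced
    state space (t, s) = (pulls of arm 1, successes), with a Beta(a0, b0)
    prior on arm 1 and a known mean mu2 for arm 2 over a horizon of n rounds."""
    # w[t][s] is the optimal value-to-go after t pulls with s successes.
    w = [np.zeros(t + 1) for t in range(n + 1)]       # terminal values w[n][s] = 0
    for t in range(n - 1, -1, -1):
        s = np.arange(t + 1)
        p = (a0 + s) / (a0 + b0 + t)                  # posterior mean of arm 1
        stop = (n - t) * mu2                          # commit to arm 2 for the rest
        cont = p * (1.0 + w[t + 1][1:]) + (1 - p) * w[t + 1][:-1]
        w[t] = np.maximum(stop, cont)                 # max of stopping and continuation, as in the recursion above
    return w

# Example: w = one_armed_value(20, 0.5); w[0][0] is the optimal expected reward.
```

The reduced state space has $O(n^2)$ elements, which avoids the combinatorial explosion of the general recursion over reward sequences.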
3 Discounted bandits with Markov payoffs (Fig. 35.2) are a special case of