All this suggests an induction approach where the Gittins index is calculated
for each state in decreasing order of their indices. To get started note that the
maximum possible Gittins index is $\max_i r_i$ and that this is achievable for state
$i = \operatorname{argmax}_j r_j$ with deterministic stopping time $\tau = 2$. Assume that $g(i)$ is
known for the $j$ states $C = \{i_1, i_2, \ldots, i_j\}$ with the largest Gittins indices. Then
$i_{j+1}$ is given by
\[
i_{j+1} = \operatorname{argmax}_{i \notin C} \frac{e_i^\top (I - \alpha Q_C)^{-1} r}{e_i^\top (I - \alpha Q_C)^{-1} \mathbf{1}}\,.
\]
If Gauss–Jordan elimination is used for matrix inversion, then the computational
complexity of this algorithm is $O(|S|^4)$. A more sophisticated inversion algorithm
would reduce the complexity to $O(|S|^{3+\varepsilon})$ for some $\varepsilon \le 0.373$, but such
algorithms are seldom practical. When $\alpha$ is relatively small the inversion can be replaced by
directly calculating the sums to some truncated horizon with little loss in accuracy.
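The recursion above is straightforward to implement. The following is a minimal Python sketch in the notation of this section (the function name gittins_indices and the use of a generic matrix inverse are our choices, not prescribed by the text): at each step it forms $Q_C$ by zeroing the columns of $P$ outside $C$, computes the two quantities in the ratio, and adds the maximizing state to $C$. With a cubic-time inverse this matches the $O(|S|^4)$ bound above.

```python
import numpy as np

def gittins_indices(P, r, alpha):
    """Gittins indices of a Markov reward process, computed in decreasing order.

    P     : (n, n) transition matrix
    r     : (n,)  reward vector
    alpha : discount factor in (0, 1)
    """
    n = len(r)
    g = np.zeros(n)            # indices, filled in decreasing order
    C = []                     # states whose indices are already known
    remaining = set(range(n))
    ones = np.ones(n)

    for _ in range(n):
        # Q_C keeps only transitions into C; leaving C corresponds to stopping.
        QC = np.zeros((n, n))
        if C:
            QC[:, C] = P[:, C]
        M = np.linalg.inv(np.eye(n) - alpha * QC)   # (I - alpha Q_C)^{-1}
        ratio = (M @ r) / (M @ ones)                # discounted reward / discounted time
        i = max(remaining, key=lambda s: ratio[s])  # argmax over states not in C
        g[i] = ratio[i]
        C.append(i)
        remaining.remove(i)
    return g
```

As noted above, when $\alpha$ is small the inverse can be replaced by a truncated Neumann series $\sum_{t=0}^{T} (\alpha Q_C)^t$ with little loss in accuracy.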
35.6 Notes
1 An advantage of Bayesian methods is that they automatically and optimally
exploit the assumptions encoded in the prior. This blessing can also be a curse.
A policy that exploits its assumptions too heavily can be brittle when those
assumptions turn out to be wrong. This can have a devastating effect in bandits
where the cost of overly aggressive confidence intervals is large.
2 The solution to optimal stopping problems is essentially a form of dynamic
programming, which is a method that trades memory for computation by
introducing recursively defined value functions that suffice for reconstructing
an optimal policy. In the 1-armed bandit optimal stopping problem, thanks to
the factorization lemma (Lemma 2.5), for any $0 \le t \le n$ there exists a function
$w_t : \mathbb{R}^t \to \mathbb{R}$ such that $W_t = w_t(X_1,\ldots,X_t)$ almost surely. This function can
be seen as the value function that captures the optimal value-to-go from stage
$t$ on, and (35.3) gives a recursive construction for it: $w_n(x_1,\ldots,x_n) = 0$ and
for $t < n$,
\[
w_t(x_1,\ldots,x_t) = \max\left((n-t)\mu_2,\ \int \bigl(x_{t+1} + w_{t+1}(x_1,\ldots,x_t,x_{t+1})\bigr)\, dP_t(x_{t+1})\right),
\]
where $P_t$ is the distribution of $X_{t+1}$ given $X_1,\ldots,X_t$. The problem with
this general recursion is that the computation is prohibitive. The example
with Bernoulli rewards shows that sometimes a similar recursion holds on a
reduced ‘state space’ that avoids the combinatorial explosion that typically
arises. For Gaussian rewards even the reduced ‘state space’ was uncountably
large and a piecewise quadratic approximation was suggested. When this
kind of approximation is used, we get an instance of approximate dynamic
programming.
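For the Bernoulli case just mentioned, the reduced 'state space' consists of the pairs (number of pulls, number of successes). A minimal Python sketch of the backward recursion on this space is given below, assuming a Beta(a0, b0) prior on the unknown mean of arm 1 and a known mean mu2 for arm 2; the function name and this particular parameterization are illustrative choices, not taken from the text.

```python
import numpy as np

def one_armed_value(n, mu2, a0=1.0, b0=1.0):
    """Backward induction for the Bernoulli 1-armed bandit on the reduced
    state space (t, s) = (pulls of arm 1, successes), with a Beta(a0, b0)
    prior on arm 1 and a known mean mu2 for arm 2 over a horizon of n rounds."""
    # w[t][s] is the optimal value-to-go after t pulls with s successes.
    w = [np.zeros(t + 1) for t in range(n + 1)]       # terminal values w[n][s] = 0
    for t in range(n - 1, -1, -1):
        s = np.arange(t + 1)
        p = (a0 + s) / (a0 + b0 + t)                  # posterior mean of arm 1
        stop = (n - t) * mu2                          # commit to arm 2 for the rest
        cont = p * (1.0 + w[t + 1][1:]) + (1 - p) * w[t + 1][:-1]
        w[t] = np.maximum(stop, cont)                 # max of stopping and continuation, as in the recursion above
    return w

# Example: w = one_armed_value(20, 0.5); w[0][0] is the optimal expected reward.
```

The reduced state space has $O(n^2)$ elements, which avoids the combinatorial explosion of the general recursion over reward sequences.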
3 Discounted bandits with Markov payoffs (Fig. 35.2) are a special case of