35.3 1-armed bandits
This should make intuitive sense. It is optimal to continue only if the expected future reward from doing so is at least as large as what can be obtained by stopping immediately. The difficulty is that $\mathbb{E}[Z_t + W_{t+1} \mid \mathcal{F}_t]$ can be quite a complicated object. We now give two examples where $\mathbb{E}[Z_t + W_{t+1} \mid \mathcal{F}_t]$ has a simple representation, which makes computing the optimal stopping rule practical.
35.3.1 Bernoulli rewards
Let $\mathcal{E} = [0,1]$, $\mathcal{G} = \mathfrak{B}([0,1])$ and for $\nu \in \mathcal{E}$ let $P_{\nu 1} = \mathcal{B}(\nu)$ and $P_{\nu 2} = \delta_{\mu_2}$. So the first arm is Bernoulli and the second is a Dirac at some fixed value $\mu_2 \in [0,1]$. For the prior choose $Q = \operatorname{Beta}(\alpha, \beta)$, a Beta prior. By the argument in Example 34.6 the posterior at the start of round $t$ is a Beta distribution $\operatorname{Beta}(\alpha + S_t, \beta + t - 1 - S_t)$ where $S_t = \sum_{s=1}^{t-1} Z_s$. Letting $p_t(s) = (\alpha + s)/(\alpha + \beta + t - 1)$ it follows that
\begin{align*}
\mathbb{E}[Z_t \mid \mathcal{F}_t] &= \frac{\alpha + S_t}{\alpha + \beta + t - 1} = p_t(S_t)\,, \\
\mathbb{P}(S_{t+1} = S_t + 1 \mid S_t) &= p_t(S_t)\,, \\
\mathbb{P}(S_{t+1} = S_t \mid S_t) &= 1 - p_t(S_t)\,.
\end{align*}
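As a quick concrete check, here is a minimal Python sketch (illustrative only, not from the text; the function name is an assumption) of the posterior mean $p_t(s)$, which serves both as the predicted reward of arm 1 in round $t$ and as the transition probability of the success count $S_t$.

```python
def posterior_mean(alpha: float, beta: float, t: int, s: int) -> float:
    """p_t(s) = (alpha + s) / (alpha + beta + t - 1): posterior mean of the
    Bernoulli parameter after s successes in rounds 1, ..., t - 1."""
    return (alpha + s) / (alpha + beta + t - 1)

# Example with a uniform Beta(1, 1) prior: round t = 4 after 2 successes.
print(posterior_mean(1.0, 1.0, 4, 2))  # (1 + 2) / (1 + 1 + 3) = 0.6
```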
Now let $w_{n+1}(s) = 0$ for all $s$ and
\[
w_t(s) = \max\bigl\{(n - t + 1)\mu_2,\; p_t(s) + p_t(s)\, w_{t+1}(s+1) + (1 - p_t(s))\, w_{t+1}(s)\bigr\}\,.
\]
Then $W_t = w_t(S_t)$ and hence the optimal policy can be computed by evaluating $w_t(s)$ for all $s \in \{0, \ldots, t\}$, starting with $t = n$, then $n - 1$ and so on until $t = 1$.
The total computation for this backwards induction is $O(n^2)$ and the output is a policy that can be implemented over all $n$ rounds. By contrast, the typical frequentist stopping rule requires only $O(n)$ computations, so the overhead is quite severe. The improvement in terms of the Bayes regret is not insignificant, however, as illustrated by the following experiment.
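The backwards induction is straightforward to implement. The sketch below is illustrative Python (not the authors' code; the function names are assumptions): it fills a triangular table of values $w_t(s)$ in $O(n^2)$ time, and the induced policy continues with arm 1 in round $t$ exactly when the 'continue' branch of the maximum exceeds the value of stopping.

```python
import numpy as np

def bayes_optimal_values(n: int, mu2: float, alpha: float = 1.0, beta: float = 1.0):
    """Backwards induction for the Bernoulli 1-armed bandit.

    Returns w with w[t][s] = w_t(s) for 1 <= t <= n + 1 and 0 <= s <= t - 1
    (the success count S_t is at most t - 1).  Total cost is O(n^2).
    """
    w = [np.zeros(t + 1) for t in range(n + 2)]  # w[n + 1] is identically zero
    for t in range(n, 0, -1):
        for s in range(t):
            p = (alpha + s) / (alpha + beta + t - 1)          # p_t(s)
            stop = (n - t + 1) * mu2                          # commit to arm 2
            cont = p + p * w[t + 1][s + 1] + (1 - p) * w[t + 1][s]
            w[t][s] = max(stop, cont)
    return w

def keep_playing(w, n: int, mu2: float, t: int, s: int) -> bool:
    """Continue with arm 1 in round t (given s successes so far) iff the value
    of continuing strictly exceeds the value of stopping now."""
    return w[t][s] > (n - t + 1) * mu2
```

With $n = 500$ the table has $\sum_{t=1}^{n} t = 125{,}250$ entries, which is cheap in absolute terms but still quadratically more work than the $O(n)$ frequentist rule below.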
Experiment 35.1 The horizon is set to $n = 500$ and $\mu_2 = 1/2$. The stopping rules we compare are the Bayesian optimal policy with a $\operatorname{Beta}(1,1)$ prior and the 'frequentist' stopping rule given by
\[
\tau = \min\left\{ t \geq 2 : \hat{\mu}_{t-1} < \mu_2 \text{ and } d(\hat{\mu}_{t-1}, \mu_2) \geq \frac{\log(n/t)}{t - 1} \right\}\,, \tag{35.4}
\]
where $d(p, q) = D(\mathcal{B}(p), \mathcal{B}(q))$ is the relative entropy between Bernoulli distributions with parameters $p$ and $q$ respectively and $\hat{\mu}_t = \sum_{s=1}^{t} X_s / t$ is the empirical estimate of $\mu_1$ based on the first $t$ observations.
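A minimal sketch of this frequentist rule in Python (illustrative, not the authors' code; `bernoulli_kl` and the clipping constant `eps` are assumptions to guard against division by zero) could look as follows.

```python
import math

def bernoulli_kl(p: float, q: float, eps: float = 1e-12) -> float:
    """d(p, q) = D(B(p), B(q)): relative entropy between Bernoulli distributions."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def frequentist_stop(t: int, mu_hat: float, mu2: float, n: int) -> bool:
    """Rule (35.4): in round t >= 2, stop (switch to arm 2 forever) when the
    empirical mean mu_hat of the first t - 1 plays of arm 1 lies below mu2 and
    d(mu_hat, mu2) crosses the threshold log(n / t) / (t - 1)."""
    if t < 2 or mu_hat >= mu2:
        return False
    return bernoulli_kl(mu_hat, mu2) >= math.log(n / t) / (t - 1)
```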
Fig. 35.1 shows the expected regret for different values of $\mu_1$, with horizontal dotted lines indicating the expected regret averaged over the prior. Note that although the prior is symmetric, the 1-armed bandit problem is not, which explains the asymmetric behavior of the Bayesian optimal algorithm. The frequentist algorithm is even more asymmetric, with very small regret for $\mu_1 > 1/2$. This is caused by a