35.3 1-armed bandits
This should make intuitive sense. It is optimal to continue only if the expected future reward from doing so is at least as large as what can be obtained by stopping immediately. The difficulty is that $\mathbb{E}[Z_t + W_{t+1} \mid \mathcal{F}_t]$ can be quite a complicated object. We now give two examples where $\mathbb{E}[Z_t + W_{t+1} \mid \mathcal{F}_t]$ has a simple representation, which makes computing the optimal stopping rule practical.
35.3.1 Bernoulli rewards
Let $\mathcal{E} = [0,1]$, $\mathcal{G} = \mathfrak{B}([0,1])$ and for $\nu \in \mathcal{E}$ let $P_{\nu 1} = \mathcal{B}(\nu)$ and $P_{\nu 2} = \delta_{\mu_2}$. So the first arm is Bernoulli and the second is a Dirac at some fixed value $\mu_2 \in [0,1]$. For the prior choose $Q = \operatorname{Beta}(\alpha, \beta)$, a Beta prior. By the argument in Example 34.6 the posterior at the start of round $t$ is a Beta distribution $\operatorname{Beta}(\alpha + S_t, \beta + t - 1 - S_t)$ where $S_t = \sum_{s=1}^{t-1} Z_s$. Letting $p_t(s) = (\alpha + s)/(\alpha + \beta + t - 1)$ it follows that
\begin{align*}
\mathbb{E}[Z_t \mid \mathcal{F}_t] &= \frac{\alpha + S_t}{\alpha + \beta + t - 1} = p_t(S_t)\,, \\
\mathbb{P}(S_{t+1} = S_t + 1 \mid S_t) &= p_t(S_t)\,, \\
\mathbb{P}(S_{t+1} = S_t \mid S_t) &= 1 - p_t(S_t)\,.
\end{align*}
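As a quick concrete check, here is a minimal Python sketch (illustrative only, not from the text; the function name is an assumption) of the posterior mean $p_t(s)$, which serves both as the predicted reward of arm 1 in round $t$ and as the transition probability of the success count $S_t$.

```python
def posterior_mean(alpha: float, beta: float, t: int, s: int) -> float:
    """p_t(s) = (alpha + s) / (alpha + beta + t - 1): posterior mean of the
    Bernoulli parameter after s successes in rounds 1, ..., t - 1."""
    return (alpha + s) / (alpha + beta + t - 1)

# Example with a uniform Beta(1, 1) prior: round t = 4 after 2 successes.
print(posterior_mean(1.0, 1.0, 4, 2))  # (1 + 2) / (1 + 1 + 3) = 0.6
```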
Now let $w_{n+1}(s) = 0$ for all $s$ and
\[
w_t(s) = \max\bigl\{(n - t + 1)\mu_2,\; p_t(s) + p_t(s)\, w_{t+1}(s+1) + (1 - p_t(s))\, w_{t+1}(s)\bigr\}\,.
\]
Then $W_t = w_t(S_t)$ and hence the optimal policy can be computed by evaluating $w_t(s)$ for all $s \in \{0, \ldots, t\}$, starting with $t = n$, then $n - 1$ and so on until $t = 1$.
The total computation for this backwards induction is $O(n^2)$ and the output is a policy that can be implemented over all $n$ rounds. By contrast, the typical frequentist stopping rule requires only $O(n)$ computations, so the overhead is quite severe. The improvement in terms of the Bayes regret is not insignificant, however, as illustrated by the following experiment.
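The backwards induction is straightforward to implement. The sketch below is illustrative Python (not the authors' code; the function names are assumptions): it fills a triangular table of values $w_t(s)$ in $O(n^2)$ time, and the induced policy continues with arm 1 in round $t$ exactly when the 'continue' branch of the maximum exceeds the value of stopping.

```python
import numpy as np

def bayes_optimal_values(n: int, mu2: float, alpha: float = 1.0, beta: float = 1.0):
    """Backwards induction for the Bernoulli 1-armed bandit.

    Returns w with w[t][s] = w_t(s) for 1 <= t <= n + 1 and 0 <= s <= t - 1
    (the success count S_t is at most t - 1).  Total cost is O(n^2).
    """
    w = [np.zeros(t + 1) for t in range(n + 2)]  # w[n + 1] is identically zero
    for t in range(n, 0, -1):
        for s in range(t):
            p = (alpha + s) / (alpha + beta + t - 1)          # p_t(s)
            stop = (n - t + 1) * mu2                          # commit to arm 2
            cont = p + p * w[t + 1][s + 1] + (1 - p) * w[t + 1][s]
            w[t][s] = max(stop, cont)
    return w

def keep_playing(w, n: int, mu2: float, t: int, s: int) -> bool:
    """Continue with arm 1 in round t (given s successes so far) iff the value
    of continuing strictly exceeds the value of stopping now."""
    return w[t][s] > (n - t + 1) * mu2
```

With $n = 500$ the table has $\sum_{t=1}^{n} t = 125{,}250$ entries, which is cheap in absolute terms but still quadratically more work than the $O(n)$ frequentist rule below.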
Experiment 35.1 The horizon is set to $n = 500$ and $\mu_2 = 1/2$. The stopping rules we compare are the Bayesian optimal policy with a $\operatorname{Beta}(1,1)$ prior and the 'frequentist' stopping rule given by
\[
\tau = \min\left\{ t \geq 2 : \hat{\mu}_{t-1} < \mu_2 \text{ and } d(\hat{\mu}_{t-1}, \mu_2) \geq \frac{\log(n/t)}{t - 1} \right\}\,, \tag{35.4}
\]
where $d(p, q) = D(\mathcal{B}(p), \mathcal{B}(q))$ is the relative entropy between Bernoulli distributions with parameters $p$ and $q$ respectively and $\hat{\mu}_t = \sum_{s=1}^{t} X_s / t$ is the empirical estimate of $\mu_1$ based on the first $t$ observations.
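A minimal sketch of this frequentist rule in Python (illustrative, not the authors' code; `bernoulli_kl` and the clipping constant `eps` are assumptions to guard against division by zero) could look as follows.

```python
import math

def bernoulli_kl(p: float, q: float, eps: float = 1e-12) -> float:
    """d(p, q) = D(B(p), B(q)): relative entropy between Bernoulli distributions."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def frequentist_stop(t: int, mu_hat: float, mu2: float, n: int) -> bool:
    """Rule (35.4): in round t >= 2, stop (switch to arm 2 forever) when the
    empirical mean mu_hat of the first t - 1 plays of arm 1 lies below mu2 and
    d(mu_hat, mu2) crosses the threshold log(n / t) / (t - 1)."""
    if t < 2 or mu_hat >= mu2:
        return False
    return bernoulli_kl(mu_hat, mu2) >= math.log(n / t) / (t - 1)
```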
Fig. 35.1 shows the expected regret for different values of $\mu_1$, with horizontal dotted lines indicating the expected regret averaged over the prior. Note that although the prior is symmetric, the 1-armed bandit problem is not, which explains the asymmetric behavior of the Bayesian optimal algorithm. The frequentist algorithm is even more asymmetric, with very small regret for $\mu_1 > 1/2$. This is caused by a