
conservative confidence interval in Eq. (35.4), which makes it stop consistently
later than its Bayesian counterpart; this may be thought of as an unpleasant
side-effect.

[Figure 35.1: expected regret (vertical axis) as a function of $\mu_1$ (horizontal axis) for the Bayesian optimal and frequentist algorithms.]

Figure 35.1 The plot shows the expected regret for the Bayesian optimal algorithm
compared to the ‘frequentist’ algorithm in Eq. (35.4) on the Bernoulli 1-armed bandit,
where $\mu_2 = 1/2$ and $\mu_1$ varies on the $x$-axis. The horizontal lines show the average
regret for each algorithm with respect to the prior, which is uniform.

35.3.2 Gaussian rewards


Let $(E, \mathcal{G}) = (\mathbb{R}, \mathfrak{B}(\mathbb{R}))$, where for $\nu \in \mathbb{R}$ we let $P_{\nu 1} = \mathcal{N}(\nu, 1)$ and $P_{\nu 2} = \delta_{\mu_2}$ for
$\mu_2 \in \mathbb{R}$ fixed. Then let $Q = \mathcal{N}(\mu_P, \sigma_P^2)$ be a Gaussian prior with mean $\mu_P \in \mathbb{R}$
and variance $\sigma_P^2 > 0$. By the results in Section 34.3, the posterior $Q(\cdot \mid x_1, \ldots, x_t)$
after observing rewards $x_1, \ldots, x_t$ from the first arm is almost surely Gaussian
with mean $\mu_t$ and variance $\sigma_t^2$ given by

\[
\mu_t = \frac{\mu_P/\sigma_P^2 + \sum_{s=1}^t x_s}{t + \sigma_P^{-2}}
\qquad \text{and} \qquad
\sigma_t^2 = \left( t + \frac{1}{\sigma_P^2} \right)^{-1} .
\tag{35.5}
\]
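For concreteness, the following Python sketch implements the update in Eq. (35.5); the function name `posterior_params` and the example numbers are illustrative choices, not part of the text.

```python
import numpy as np

def posterior_params(rewards, mu_prior, sigma2_prior):
    """Posterior mean and variance from Eq. (35.5) for unit-variance
    Gaussian rewards and a N(mu_prior, sigma2_prior) prior."""
    t = len(rewards)
    precision = t + 1.0 / sigma2_prior           # posterior precision t + sigma_P^{-2}
    mu_t = (mu_prior / sigma2_prior + np.sum(rewards)) / precision
    sigma2_t = 1.0 / precision                   # does not depend on the rewards
    return mu_t, sigma2_t

# Example: standard Gaussian prior and three rewards from the first arm.
mu_t, sigma2_t = posterior_params([0.4, 1.1, 0.2], mu_prior=0.0, sigma2_prior=1.0)
```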


The posterior variance is independent of the observations, so the posterior is
determined entirely by its mean. As in the Bernoulli case, there exist functions
$(w_t)_{t=1}^{n+1}$ such that $W_t = w_t(\mu_{t-1})$ almost surely for all $t \in [n]$. Precisely,
$w_{n+1}(\mu) = 0$ and for $t \le n$,

\[
w_t(\mu) = \max\left( (n - t + 1)\mu_2 ,\; \mu + \frac{1}{\sqrt{2\pi\sigma_{t-1}^2}} \int_{-\infty}^{\infty} \exp\left( -\frac{x^2}{2\sigma_{t-1}^2} \right) w_{t+1}(\mu + x)\, dx \right) .
\tag{35.6}
\]


The integral on the right-hand side does not have a closed-form solution, which
forces the use of approximate methods. Fortunately $w_t$ is a well-behaved function
and can be efficiently approximated. The favorable properties are summarized in
the next lemma, the proof of which is left for you in Exercise 35.8.
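Since the recursion has no closed form, one natural approach is to evaluate $w_t$ on a grid of posterior means and replace the integral by a truncated quadrature. The Python sketch below does this; the function name `approximate_w`, the grid and the truncation at six standard deviations are illustrative choices and not part of the text, and the kernel variance is taken directly from Eq. (35.6).

```python
import numpy as np

def approximate_w(n, mu2, sigma2_prior, grid=np.linspace(-5.0, 5.0, 501)):
    """Grid-based backward recursion for the functions w_t in Eq. (35.6).

    Sketch only: the integral is truncated and evaluated by a Riemann sum,
    and values of w_{t+1} off the grid are linearly interpolated.
    """
    sigma2 = 1.0 / (np.arange(n + 1) + 1.0 / sigma2_prior)   # sigma_t^2 from Eq. (35.5)
    w = np.zeros((n + 2, grid.size))                          # w_{n+1} = 0 everywhere
    for t in range(n, 0, -1):
        s2 = sigma2[t - 1]                                    # kernel variance in Eq. (35.6)
        xs = np.linspace(-6 * np.sqrt(s2), 6 * np.sqrt(s2), 201)
        dens = np.exp(-xs ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
        cont = np.array([np.dot(dens, np.interp(m + xs, grid, w[t + 1])) for m in grid])
        cont *= xs[1] - xs[0]                                 # Riemann-sum weights
        w[t] = np.maximum((n - t + 1) * mu2, grid + cont)     # stop vs. continue
    return w[1:]                                              # rows are w_1, ..., w_{n+1}
```

Evaluating the first returned row at the prior mean gives an approximation of the Bayesian optimal value $w_1(\mu_P)$, and comparing the two terms inside the maximum at the current posterior mean recovers the stopping decision.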