conservative confidence interval in Eq. (35.4), which makes it stop consistently
later than its Bayesian counterpart, which may be thought of as an unpleasant
side effect.
[Figure 35.1: expected regret as a function of $\mu_1$ for the Bayesian optimal and frequentist algorithms.]
Figure 35.1 The plot shows the expected regret for the Bayesian optimal algorithm
compared to the 'frequentist' algorithm in Eq. (35.4) on the Bernoulli 1-armed bandit
where $\mu_2 = 1/2$ and $\mu_1$ varies on the $x$-axis. The horizontal lines show the average
regret for each algorithm with respect to the prior, which is uniform.
35.3.2 Gaussian rewards
Let $(E, \mathcal{G}) = (\mathbb{R}, \mathfrak{B}(\mathbb{R}))$, where for $\nu \in \mathbb{R}$ we let $P_{\nu 1} = \mathcal{N}(\nu, 1)$ and $P_{\nu 2} = \delta_{\mu_2}$ for
$\mu_2 \in \mathbb{R}$ fixed. Then let $Q = \mathcal{N}(\mu_P, \sigma_P^2)$ be a Gaussian prior with mean $\mu_P \in \mathbb{R}$
and variance $\sigma_P^2 > 0$. By the results in Section 34.3, the posterior $Q(\cdot \mid x_1, \ldots, x_t)$
after observing rewards $x_1, \ldots, x_t$ from the first arm is almost surely Gaussian
with mean $\mu_t$ and variance $\sigma_t^2$ given by
\[
\mu_t = \frac{\frac{\mu_P}{\sigma_P^2} + \sum_{s=1}^t x_s}{t + \sigma_P^{-2}}
\qquad \text{and} \qquad
\sigma_t^2 = \left(t + \frac{1}{\sigma_P^2}\right)^{-1}. \tag{35.5}
\]
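For concreteness, the update in Eq. (35.5) takes only a couple of lines of code. The following is a minimal sketch (not from the text), assuming unit-variance rewards for the first arm; the function name and argument names are illustrative.

```python
import numpy as np


def gaussian_posterior(rewards, mu_p, sigma_p_sq):
    """Posterior mean and variance for arm 1 after observing `rewards`, as in Eq. (35.5)."""
    t = len(rewards)
    var = 1.0 / (t + 1.0 / sigma_p_sq)                  # sigma_t^2
    mean = var * (mu_p / sigma_p_sq + np.sum(rewards))  # mu_t
    return mean, var


# e.g. gaussian_posterior([0.3, -0.1, 0.7], mu_p=0.0, sigma_p_sq=1.0)
```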
The posterior variance is independent of the observations, so the posterior is
determined entirely by its mean. As in the Bernoulli case, there exist functions
$(w_t)_{t=1}^{n+1}$ such that $W_t = w_t(\mu_{t-1})$ almost surely for all $t \in [n]$. Precisely,
$w_{n+1}(\mu) = 0$ and for $t \le n$,
\[
w_t(\mu) = \max\left((n - t + 1)\mu_2,\; \mu + \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(-\frac{x^2}{2\sigma_{t-1}^2}\right) w_{t+1}(\mu + x)\, dx\right). \tag{35.6}
\]
The integral on the right-hand side does not have a closed-form solution, which
forces the use of approximate methods. Fortunately, $w_t$ is a well-behaved function
and can be efficiently approximated. The favorable properties are summarized in
the next lemma, the proof of which is left for you in Exercise 35.8.
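To make the approximation concrete, here is one possible discretization (a sketch, not the book's algorithm): tabulate $w_t$ on a grid of posterior means, evaluate the integral in Eq. (35.6) by numerical quadrature, and recurse backwards from $w_{n+1} = 0$. The function name, the grids, the linear interpolation and the trapezoidal rule are all illustrative choices.

```python
import numpy as np


def approximate_w(n, mu2, sigma_p_sq, mu_grid, x_grid):
    """Tabulate approximations of w_1, ..., w_{n+1} from Eq. (35.6) on `mu_grid`.

    Returns a list `tables` with tables[t - 1] approximating w_t.
    """
    w_next = np.zeros_like(mu_grid)                   # w_{n+1} = 0
    tables = [w_next]
    for t in range(n, 0, -1):                         # backward induction: t = n, ..., 1
        var_prev = 1.0 / (t - 1 + 1.0 / sigma_p_sq)   # sigma_{t-1}^2 from Eq. (35.5)
        weight = np.exp(-x_grid ** 2 / (2.0 * var_prev)) / np.sqrt(2.0 * np.pi)
        w_t = np.empty_like(mu_grid)
        for i, mu in enumerate(mu_grid):
            # w_{t+1}(mu + x) by linear interpolation; np.interp clamps outside the grid
            w_shift = np.interp(mu + x_grid, mu_grid, w_next)
            integral = np.trapz(weight * w_shift, x_grid)    # quadrature for the integral
            w_t[i] = max((n - t + 1) * mu2, mu + integral)   # the recursion in Eq. (35.6)
        tables.append(w_t)
        w_next = w_t
    return tables[::-1]


# Example usage (illustrative grids):
# mu_grid = np.linspace(-3.0, 3.0, 401)
# x_grid = np.linspace(-3.0, 3.0, 201)
# tables = approximate_w(n=20, mu2=0.0, sigma_p_sq=1.0, mu_grid=mu_grid, x_grid=x_grid)
```

Note that in this sketch the integral over $\mathbb{R}$ is truncated to the range of `x_grid`, so the grid should be wide relative to $\sigma_{t-1}$, and the cost is of order $n$ times the product of the two grid sizes.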