conservative confidence interval in Eq. (35.4), which makes it stop consistently later than its Bayesian counterpart, which may be thought of as an unpleasant side-effect.

Figure 35.1 The plot shows the expected regret for the Bayesian optimal algorithm compared to the 'frequentist' algorithm in Eq. (35.4) on the Bernoulli 1-armed bandit, where $\mu_2 = 1/2$ and $\mu_1$ varies on the $x$-axis. The horizontal lines show the average regret for each algorithm with respect to the prior, which is uniform.

35.3.2 Gaussian rewards
Let $(E, \mathcal{G}) = (\mathbb{R}, \mathfrak{B}(\mathbb{R}))$, where for $\nu \in \mathbb{R}$ we let $P_{\nu 1} = \mathcal{N}(\nu, 1)$ and $P_{\nu 2} = \delta_{\mu_2}$ for $\mu_2 \in \mathbb{R}$ fixed. Then let $Q = \mathcal{N}(\mu_P, \sigma_P^2)$ be a Gaussian prior with mean $\mu_P \in \mathbb{R}$ and variance $\sigma_P^2 > 0$. By the results in Section 34.3, the posterior $Q(\cdot \mid x_1, \ldots, x_t)$ after observing rewards $x_1, \ldots, x_t$ from the first arm is almost surely Gaussian with mean $\mu_t$ and variance $\sigma_t^2$ given by
$$
\mu_t = \frac{\mu_P / \sigma_P^2 + \sum_{s=1}^{t} x_s}{t + \sigma_P^{-2}}
\qquad \text{and} \qquad
\sigma_t^2 = \left( t + \frac{1}{\sigma_P^2} \right)^{-1} . \tag{35.5}
$$
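To make the update concrete, the following minimal sketch computes the posterior mean and variance of Eq. (35.5); the function name and the use of NumPy are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

def gaussian_posterior(rewards, mu_prior, var_prior):
    """Posterior mean and variance for arm 1 after observing `rewards`,
    assuming unit-variance Gaussian rewards and a N(mu_prior, var_prior)
    prior, as in Eq. (35.5)."""
    t = len(rewards)
    var_post = 1.0 / (t + 1.0 / var_prior)        # sigma_t^2 = (t + sigma_P^{-2})^{-1}
    mu_post = var_post * (mu_prior / var_prior + np.sum(rewards))
    return mu_post, var_post

# Example: prior N(0, 1) and three rewards observed from the first arm.
mu_t, var_t = gaussian_posterior([0.4, 1.1, 0.7], mu_prior=0.0, var_prior=1.0)
```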
The posterior variance is independent of the observations, so the posterior is
determined entirely by its mean. As in the Bernoulli case, there exist functions
$(w_t)_{t=1}^{n+1}$ such that $W_t = w_t(\mu_{t-1})$ almost surely for all $t \in [n]$. Precisely, $w_{n+1}(\mu) = 0$ and for $t \le n$,
$$
w_t(\mu) = \max\left( (n - t + 1)\mu_2,\; \mu + \frac{1}{\sqrt{2\pi \sigma_{t-1}^2}} \int_{-\infty}^{\infty} \exp\left( -\frac{x^2}{2\sigma_{t-1}^2} \right) w_{t+1}(\mu + x)\, dx \right) . \tag{35.6}
$$
The integral on the right-hand side does not have a closed-form solution, which forces the use of approximate methods. Fortunately, $w_t$ is a well-behaved function and can be approximated efficiently. The favorable properties are summarized in the next lemma, the proof of which is left to you in Exercise 35.8.
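As a rough illustration of such an approximation (a sketch under stated assumptions, not the scheme analyzed in the book), the recursion in Eq. (35.6) can be computed backwards in time on a finite grid of posterior means, with the Gaussian integral replaced by Gauss-Hermite quadrature and $w_{t+1}$ evaluated by linear interpolation between grid points. The grid, the number of quadrature nodes and the interpolation are arbitrary choices, and `sigma_post` is a hypothetical array holding $\sigma_0, \ldots, \sigma_n$ from Eq. (35.5).

```python
import numpy as np

def value_functions(n, mu2, sigma_post, mu_grid, num_nodes=64):
    """Approximate w_1, ..., w_{n+1} of Eq. (35.6) on the points of `mu_grid`.

    sigma_post[t] holds the posterior standard deviation sigma_t for
    t = 0, ..., n.  Returns an array w with w[t] approximating w_t on the grid.
    """
    mu_grid = np.asarray(mu_grid, dtype=float)
    nodes, weights = np.polynomial.hermite.hermgauss(num_nodes)
    w = np.zeros((n + 2, len(mu_grid)))              # w[n + 1] = 0 by definition
    for t in range(n, 0, -1):
        # E[w_{t+1}(mu + sigma_{t-1} Z)] for Z ~ N(0, 1) via Gauss-Hermite quadrature;
        # values outside the grid are clamped by np.interp, so the grid should be wide.
        shifted = mu_grid[None, :] + np.sqrt(2.0) * sigma_post[t - 1] * nodes[:, None]
        w_next = np.interp(shifted, mu_grid, w[t + 1])
        expectation = (weights[:, None] * w_next).sum(axis=0) / np.sqrt(np.pi)
        w[t] = np.maximum((n - t + 1) * mu2, mu_grid + expectation)
    return w
```

With, say, `mu_grid = np.linspace(-5, 5, 2001)` and `sigma_post` filled in from Eq. (35.5), the resulting table approximates $W_t = w_t(\mu_{t-1})$ by interpolation, and comparing the two arguments of the maximum recovers the decision between retiring to the second arm and continuing with the first.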