conservative confidence interval in Eq. (35.4), which makes it stop consistently
later than its Bayesian counterpart, which may be thought of as an unpleasant
side effect.
[Figure 35.1: expected regret as a function of $\mu_1$ for the Bayesian optimal and frequentist algorithms.]
Figure 35.1 The plot shows the expected regret for the Bayesian optimal algorithm
compared to the 'frequentist' algorithm in Eq. (35.4) on the Bernoulli 1-armed bandit
where $\mu_2 = 1/2$ and $\mu_1$ varies on the $x$-axis. The horizontal lines show the average
regret for each algorithm with respect to the prior, which is uniform.
35.3.2 Gaussian rewards
Let $(E, \mathcal{G}) = (\mathbb{R}, \mathfrak{B}(\mathbb{R}))$, where for $\nu \in \mathbb{R}$ we let $P_{\nu 1} = \mathcal{N}(\nu, 1)$ and $P_{\nu 2} = \delta_{\mu_2}$ for
$\mu_2 \in \mathbb{R}$ fixed. Then let $Q = \mathcal{N}(\mu_P, \sigma_P^2)$ be a Gaussian prior with mean $\mu_P \in \mathbb{R}$
and variance $\sigma_P^2 > 0$. By the results in Section 34.3, the posterior $Q(\cdot \mid x_1, \ldots, x_t)$
after observing rewards $x_1, \ldots, x_t$ from the first arm is almost surely Gaussian
with mean $\mu_t$ and variance $\sigma_t^2$ given by
\[
\mu_t = \frac{\frac{\mu_P}{\sigma_P^2} + \sum_{s=1}^t x_s}{t + \sigma_P^{-2}}
\qquad \text{and} \qquad
\sigma_t^2 = \left(t + \frac{1}{\sigma_P^2}\right)^{-1}. \tag{35.5}
\]
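For concreteness, the update in Eq. (35.5) takes only a couple of lines of code. The following is a minimal sketch (not from the text), assuming unit-variance rewards for the first arm; the function name and argument names are illustrative.

```python
import numpy as np


def gaussian_posterior(rewards, mu_p, sigma_p_sq):
    """Posterior mean and variance for arm 1 after observing `rewards`, as in Eq. (35.5)."""
    t = len(rewards)
    var = 1.0 / (t + 1.0 / sigma_p_sq)                  # sigma_t^2
    mean = var * (mu_p / sigma_p_sq + np.sum(rewards))  # mu_t
    return mean, var


# e.g. gaussian_posterior([0.3, -0.1, 0.7], mu_p=0.0, sigma_p_sq=1.0)
```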
The posterior variance is independent of the observations, so the posterior is
determined entirely by its mean. As in the Bernoulli case, there exist functions
$(w_t)_{t=1}^{n+1}$ such that $W_t = w_t(\mu_{t-1})$ almost surely for all $t \in [n]$. Precisely,
$w_{n+1}(\mu) = 0$ and for $t \le n$,
\[
w_t(\mu) = \max\left((n - t + 1)\mu_2,\; \mu + \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(-\frac{x^2}{2\sigma_{t-1}^2}\right) w_{t+1}(\mu + x)\, dx\right). \tag{35.6}
\]
The integral on the right-hand side does not have a closed-form solution, which
forces the use of approximate methods. Fortunately, $w_t$ is a well-behaved function
and can be efficiently approximated. The favorable properties are summarized in
the next lemma, the proof of which is left for you in Exercise 35.8.
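To make the approximation concrete, here is one possible discretization (a sketch, not the book's algorithm): tabulate $w_t$ on a grid of posterior means, evaluate the integral in Eq. (35.6) by numerical quadrature, and recurse backwards from $w_{n+1} = 0$. The function name, the grids, the linear interpolation and the trapezoidal rule are all illustrative choices.

```python
import numpy as np


def approximate_w(n, mu2, sigma_p_sq, mu_grid, x_grid):
    """Tabulate approximations of w_1, ..., w_{n+1} from Eq. (35.6) on `mu_grid`.

    Returns a list `tables` with tables[t - 1] approximating w_t.
    """
    w_next = np.zeros_like(mu_grid)                   # w_{n+1} = 0
    tables = [w_next]
    for t in range(n, 0, -1):                         # backward induction: t = n, ..., 1
        var_prev = 1.0 / (t - 1 + 1.0 / sigma_p_sq)   # sigma_{t-1}^2 from Eq. (35.5)
        weight = np.exp(-x_grid ** 2 / (2.0 * var_prev)) / np.sqrt(2.0 * np.pi)
        w_t = np.empty_like(mu_grid)
        for i, mu in enumerate(mu_grid):
            # w_{t+1}(mu + x) by linear interpolation; np.interp clamps outside the grid
            w_shift = np.interp(mu + x_grid, mu_grid, w_next)
            integral = np.trapz(weight * w_shift, x_grid)    # quadrature for the integral
            w_t[i] = max((n - t + 1) * mu2, mu + integral)   # the recursion in Eq. (35.6)
        tables.append(w_t)
        w_next = w_t
    return tables[::-1]


# Example usage (illustrative grids):
# mu_grid = np.linspace(-3.0, 3.0, 401)
# x_grid = np.linspace(-3.0, 3.0, 201)
# tables = approximate_w(n=20, mu2=0.0, sigma_p_sq=1.0, mu_grid=mu_grid, x_grid=x_grid)
```

Note that in this sketch the integral over $\mathbb{R}$ is truncated to the range of `x_grid`, so the grid should be wide relative to $\sigma_{t-1}$, and the cost is of order $n$ times the product of the two grid sizes.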