Bandit Algorithms

36.7 Exercises

The result continues to hold if $Q$ is the space of all probability measures
on $(E, \mathcal{B}(E))$. This formulation avoids some measurability annoyances that
arise because $x \mapsto R_n(\pi, x)$ need not be $\mathcal{B}(E)$-measurable for all adversarial
bandit policies. The exercise shows that a uniform bound on the Bayesian
regret for all priors with finite support implies the same bound on the
minimax regret.

36.13 (Binary is the worst case) Prove that $R_n^*(\{0,1\}^{n \times k}) = R_n^*([0,1]^{n \times k})$.

Hint: Explain how to use a minimax optimal policy for $\{0,1\}^{n \times k}$ for bandits
in $[0,1]^{n \times k}$.
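
One way to read the hint (not necessarily the intended solution) is sketched below: a policy built for $\{0,1\}$-valued rewards is fed Bernoulli surrogates of the observed $[0,1]$-valued rewards, which have the same mean. The `act`/`update` interface is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


class BernoulliRelaxation:
    """Wrap a policy designed for {0,1}-valued rewards so that it can be
    run on rewards in [0,1]: each observed reward x is replaced by a
    Bernoulli(x) sample, which has the same mean as x."""

    def __init__(self, binary_policy):
        self.policy = binary_policy  # assumed to expose act() and update()

    def act(self):
        return self.policy.act()

    def update(self, arm, reward):
        # reward lies in [0,1]; pass a {0,1}-valued surrogate with the same mean
        surrogate = float(rng.random() < reward)
        self.policy.update(arm, surrogate)
```
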
36.14 (Implementation) In this exercise you will reproduce the results in
Experiment 1.

(a) Implement Thompson sampling as described in Theorem 36.3, as well as UCB
and AdaUCB (a minimal sketch of one possible implementation appears after this exercise).
(b) Reproduce the figures in Experiment 1, including the results for UCB.
(c) How consistent are these results across different bandits? Run a few
experiments and report the results.
(d) Explain your findings. Which algorithm do you prefer, and why?
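
For part (a), a minimal sketch of Gaussian Thompson sampling and UCB is given below, assuming unit-variance Gaussian rewards and an improper flat prior; AdaUCB is omitted, and all names (`run`, `thompson`, `ucb`) are illustrative rather than taken from the text.

```python
import numpy as np


def run(policy, mu, n, rng):
    """Play a unit-variance Gaussian bandit with means mu for n rounds
    and return the (expected) cumulative regret."""
    k = len(mu)
    counts = np.zeros(k)
    sums = np.zeros(k)
    regret = 0.0
    for t in range(1, n + 1):
        arm = policy(t, counts, sums, rng)
        regret += np.max(mu) - mu[arm]
        counts[arm] += 1
        sums[arm] += rng.normal(mu[arm], 1.0)
    return regret


def thompson(t, counts, sums, rng):
    # Sample each arm's mean from the Gaussian posterior under a flat prior:
    # N(empirical mean, 1 / pull count). Unplayed arms are tried first.
    if np.any(counts == 0):
        return int(np.argmin(counts))
    theta = rng.normal(sums / counts, 1.0 / np.sqrt(counts))
    return int(np.argmax(theta))


def ucb(t, counts, sums, rng):
    # Standard UCB index for 1-subgaussian noise.
    if np.any(counts == 0):
        return int(np.argmin(counts))
    index = sums / counts + np.sqrt(2 * np.log(t) / counts)
    return int(np.argmax(index))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    mu = np.array([0.0, -0.1])
    for name, pol in [("TS", thompson), ("UCB", ucb)]:
        print(name, np.mean([run(pol, mu, 1000, rng) for _ in range(20)]))
```

The number of repetitions and the choice of means here are arbitrary; averaging over more runs and varying the gap is needed for parts (b)-(d).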


36.15 (Implementation (ii)) Implement linear Thompson sampling with a
Gaussian prior as defined in Note 6, as well as LinUCB from Chapter 19 and
Algorithm 12. Compare these algorithms in a variety of regimes, report your
results and tell an interesting story. Discuss the pros and cons of different choices
of $r$.
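
A minimal sketch of one common form of linear Thompson sampling follows, assuming a Gaussian prior $\mathcal{N}(0, rI)$ on the parameter, unit-variance noise and a finite action set; reading $r$ as the prior variance is an assumption, it may differ in details from Note 6, and the names and dimensions are illustrative only.

```python
import numpy as np


def lin_thompson(actions, theta_star, n, r=1.0, noise_sd=1.0, seed=0):
    """Linear Thompson sampling with Gaussian prior N(0, r*I) on theta.

    actions: (num_actions, d) array of feature vectors.
    Returns the cumulative regret over n rounds.
    """
    rng = np.random.default_rng(seed)
    d = actions.shape[1]
    V = np.eye(d) / r      # posterior precision matrix
    b = np.zeros(d)        # sum of reward-weighted features
    best = np.max(actions @ theta_star)
    regret = 0.0
    for _ in range(n):
        cov = np.linalg.inv(V)              # posterior covariance
        mean = cov @ b                      # posterior mean
        theta = rng.multivariate_normal(mean, cov)
        a = actions[np.argmax(actions @ theta)]
        reward = a @ theta_star + rng.normal(0.0, noise_sd)
        regret += best - a @ theta_star
        V += np.outer(a, a)
        b += reward * a
    return regret


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    actions = rng.normal(size=(20, 5))
    theta_star = rng.normal(size=5)
    print(lin_thompson(actions, theta_star, n=2000))
```
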


36.16 (Misspecified prior) Fix a Gaussian bandit with unit variance and
mean vector $\mu = (0, 1/10)$ and horizon $n = 1000$. Now consider Thompson
sampling with a Gaussian model with known unit covariance and a prior on the
unknown mean of each arm given by a Gaussian distribution with mean $\mu_P$ and
covariance $\sigma_P^2 I$.

(a) Let the prior mean be $\mu_P = (0, 0)$ and plot the regret of Thompson sampling
as a function of the prior variance $\sigma_P^2$ (a sketch of one way to set up this experiment appears after this list).
(b) Repeat the above with $\mu_P = (0, 1/10)$, $(0, -1/10)$ and $(2/10, 1/10)$.
(c) Explain your results.
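
A minimal sketch of the experiment in part (a), assuming the standard conjugate Gaussian posterior updates for known unit observation variance; the number of repetitions, the grid of prior variances and the plotting choices below are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt


def ts_regret(mu, mu_p, var_p, n, rng):
    """Regret of Gaussian Thompson sampling with prior N(mu_p, var_p*I)
    on a unit-variance Gaussian bandit with mean vector mu."""
    k = len(mu)
    prec = np.full(k, 1.0 / var_p)   # posterior precision per arm
    mean = np.array(mu_p, float)     # posterior mean per arm
    regret = 0.0
    for _ in range(n):
        theta = rng.normal(mean, 1.0 / np.sqrt(prec))
        a = int(np.argmax(theta))
        x = rng.normal(mu[a], 1.0)
        regret += np.max(mu) - mu[a]
        # conjugate update for a single observation with unit variance
        mean[a] = (prec[a] * mean[a] + x) / (prec[a] + 1.0)
        prec[a] += 1.0
    return regret


if __name__ == "__main__":
    rng = np.random.default_rng(3)
    mu, mu_p, n = (0.0, 0.1), (0.0, 0.0), 1000
    variances = np.logspace(-3, 1, 10)
    avg = [np.mean([ts_regret(mu, mu_p, v, n, rng) for _ in range(50)])
           for v in variances]
    plt.semilogx(variances, avg)
    plt.xlabel("prior variance")
    plt.ylabel("regret")
    plt.show()
```
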
