36.5 Notes

‘approximate’ Thompson sampling and approximating a sample from each of
the above three choices may lead to different algorithms.
2 Thompson sampling is known to be asymptotically optimal in a variety of settings, most notably when the noise model follows a single-parameter exponential family and the prior is chosen appropriately [Kaufmann et al., 2012b, Korda et al., 2013]. Unfortunately, Thompson sampling is not a silver bullet. The linear variant in Section 36.3 is not asymptotically optimal by the same argument we presented for optimism in Chapter 25. Characterizing the conditions under which Thompson sampling is close to optimal remains an open challenge.
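Below is a minimal sketch of Thompson sampling in the well-understood exponential family case: Bernoulli rewards with a conjugate Beta prior. The arm means, horizon and seed are hypothetical values chosen for illustration only.

```python
# Thompson sampling for Bernoulli bandits with Beta(1, 1) priors:
# sample one mean from each arm's posterior, play the argmax, update.
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])     # unknown arm means (simulation only)
k, n = len(means), 10_000
alpha, beta = np.ones(k), np.ones(k)  # Beta posterior parameters per arm

for t in range(n):
    theta = rng.beta(alpha, beta)     # one posterior sample per arm
    a = int(np.argmax(theta))         # play the arm with the largest sample
    x = rng.binomial(1, means[a])     # observe a Bernoulli reward
    alpha[a] += x                     # conjugate posterior update
    beta[a] += 1 - x
```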
3 For the Gaussian noise model it is known that Thompson sampling is not minimax optimal. Its worst-case regret is $R_n = \Theta(\sqrt{nk \log(k)})$ [Agrawal and Goyal, 2013a].
4 An alternative to sampling from the posterior is to choose in each round the arm that maximizes a Bayesian upper confidence bound, which is a quantile of the posterior. The resulting algorithm is called BayesUCB and has excellent empirical and theoretical guarantees [Kaufmann et al., 2012a, Kaufmann, 2018].
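As a concrete sketch of this rule for Bernoulli arms with Beta posteriors, one can play the arm with the largest posterior quantile at a level approaching one. The schedule $1 - 1/(t+1)$ below is a simplification of the level used by Kaufmann et al. [2012a], and the environment is hypothetical.

```python
# BayesUCB sketch: play the arm maximizing a posterior quantile.
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])      # unknown arm means (simulation only)
k, n = len(means), 10_000
a_par, b_par = np.ones(k), np.ones(k)  # Beta posterior parameters per arm

for t in range(1, n + 1):
    level = 1.0 - 1.0 / (t + 1)               # quantile level rising to one
    ucb = beta_dist.ppf(level, a_par, b_par)  # Bayesian upper confidence bounds
    arm = int(np.argmax(ucb))
    x = rng.binomial(1, means[arm])
    a_par[arm] += x                           # conjugate posterior update
    b_par[arm] += 1 - x
```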
5 The prior has a significant effect on the performance of Thompson sampling. In classical Bayesian statistics a poorly chosen prior is quickly washed away by data. This is not true in bandits: if the prior underestimates the quality of an arm, then with high probability Thompson sampling never plays that arm, so no data about it is ever observed. We ask you to explore this situation in Exercise 36.16.
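The small simulation below sketches the phenomenon with two Gaussian arms and unit observation noise: the prior on the second arm confidently underestimates its mean, and the arm is abandoned after a handful of plays despite being optimal. All numbers are hypothetical choices for illustration.

```python
# A misspecified, overconfident prior can freeze Thompson sampling.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.0, 1.0])  # arm 1 is actually the better arm
mu = np.array([0.0, -1.0])         # prior means: arm 1 badly underestimated
var = np.array([1.0, 1e-6])        # prior variances: arm 1 overconfident
pulls = np.zeros(2, dtype=int)

for t in range(10_000):
    theta = rng.normal(mu, np.sqrt(var))  # sample from Gaussian posteriors
    a = int(np.argmax(theta))
    pulls[a] += 1
    x = rng.normal(true_means[a], 1.0)    # unit-variance Gaussian reward
    # conjugate update for a Gaussian mean with known noise variance 1
    mu[a] = (mu[a] / var[a] + x) / (1.0 / var[a] + 1.0)
    var[a] = 1.0 / (1.0 / var[a] + 1.0)

print(pulls)  # arm 1 receives only a handful of the 10,000 plays
```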
6 An instantiation of Thompson sampling for linear bandits is known to enjoy near-optimal frequentist regret. In each round the algorithm samples $\theta_t \sim \mathcal{N}(\hat\theta_{t-1}, r V_{t-1}^{-1})$, where $r = \Theta(d)$ is a constant and
\[
V_t = I + \sum_{s=1}^t A_s A_s^\top \qquad \text{and} \qquad \hat\theta_t = V_t^{-1} \sum_{s=1}^t X_s A_s\,.
\]

Then $A_t = \operatorname{argmax}_{a \in \mathcal{A}} \langle \theta_t, a \rangle$. This corresponds to assuming the noise is Gaussian with variance $r$ and choosing prior $Q = \mathcal{N}(0, I)$. Provided the rewards are conditionally 1-subgaussian, the frequentist regret of this algorithm is $R_n = \tilde{O}(d^{3/2}\sqrt{n})$, which is worse than LinUCB by a factor of $\sqrt{d}$. The increased regret is caused by the choice of noise model, which assumes the variance is $r = \Theta(d)$ rather than $r = 1$. The reason to do this comes from the analysis, which works by showing the algorithm is ‘optimistic’ with reasonable probability. It is not known whether this is necessary or an artifact of the analysis. Empirically, $r = 1$ leads to superior performance, while $r = \Theta(d)$ results in quite a large regret even when compared with LinUCB.
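A minimal sketch of this rule under the stated Gaussian model follows. The action set, horizon and true parameter are hypothetical, and the scale $r$ can be set to $d$ (matching the analysis) or to $1$ (the empirically preferable choice).

```python
# Linear Thompson sampling: sample theta_t ~ N(theta_hat, r * V^{-1}),
# then play the action maximizing the sampled inner product.
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5_000
r = float(d)                               # r = Theta(d); try r = 1.0 as well
actions = rng.normal(size=(20, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)  # unit-norm actions
theta_star = rng.normal(size=d)            # unknown parameter (simulation only)

V = np.eye(d)                              # V_0 = I
b = np.zeros(d)                            # running sum of X_s * A_s

for t in range(n):
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                                  # regularized least squares
    theta = rng.multivariate_normal(theta_hat, r * V_inv)  # posterior sample
    a = actions[np.argmax(actions @ theta)]                # A_t = argmax <theta_t, a>
    x = float(a @ theta_star) + rng.normal()               # noisy reward
    V += np.outer(a, a)                                    # V_t = V_{t-1} + A_t A_t^T
    b += x * a                                             # accumulate X_t * A_t
```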
7 Let $X, Y$ be random elements over the probability space $(\Omega, \mathcal{F}, \mathbb{P})$. The standard definition of the mutual information between $X$ and $Y$ (relative to $\mathbb{P}$) is $I(X; Y) = D(P_{X,Y}, P_X \otimes P_Y)$, where $P_{X,Y}$ is the joint distribution of $(X, Y)$ and $P_X \otimes P_Y$ is the product of the marginal distributions of $X$ and $Y$.
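For discrete $X$ and $Y$ the relative entropy above unpacks into the familiar sum; this standard identity may help fix ideas.

```latex
% Mutual information for discrete X, Y, unpacking D(P_{X,Y}, P_X (x) P_Y):
\[
  I(X; Y) \;=\; \sum_{x, y} P_{X,Y}(x, y)
      \log \frac{P_{X,Y}(x, y)}{P_X(x)\, P_Y(y)} \,.
\]
```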
