Putting together the pieces shows that
\[
\mathrm{BR}_n \le 2 + 2\sqrt{2 d n \beta^2 \log\left(1 + \frac{n S^2 L^2}{d}\right)}\,.
\]
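For a feel of how this bound scales, here is a small sketch that simply evaluates it numerically; the function name and the example values of $n$, $d$, $\beta$, $S$ and $L$ are illustrative, not from the text.

```python
import numpy as np

def bayes_regret_bound(n, d, beta, S, L):
    """Evaluate the Bayesian regret bound above (symbols as in the text)."""
    return 2 + 2 * np.sqrt(2 * d * n * beta**2 * np.log(1 + n * S**2 * L**2 / d))

# For fixed beta, S and L the bound grows like O(sqrt(d n log n))
for n in (10**3, 10**4, 10**5):
    print(n, round(bayes_regret_bound(n, d=5, beta=1.0, S=1.0, L=1.0), 1))
```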
36.3.1 Computation
An implementation of Thompson sampling for linear bandits needs to sample $\theta_t$
from the posterior and then find the optimal action for the sampled parameter:
\[
A_t = \operatorname*{argmax}_{a \in \mathcal{A}} \langle a, \theta_t \rangle\,.
\]
For some priors and noise models, sampling from the posterior is straightforward.
The most notable case is when $Q$ is a multivariate Gaussian and the noise is
Gaussian with a known variance. More generally, there is a large literature devoted
to numerical methods for sampling from posterior distributions. Having sampled
$\theta_t$, finding $A_t$ is a linear optimization problem. By comparison, LinUCB needs to
solve
\[
A_t = \operatorname*{argmax}_{a \in \mathcal{A}} \max_{\tilde{\theta} \in \mathcal{C}} \langle a, \tilde{\theta} \rangle\,,
\]
which for large or continuous action sets is often intractable.
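To make the Gaussian case concrete, the following is a minimal sketch of one round of Thompson sampling for a linear bandit, assuming a standard Gaussian prior, unit-variance Gaussian noise and a finite action set; all names and bookkeeping conventions are illustrative.

```python
import numpy as np

def thompson_step(actions, V, b, rng):
    """One round of Thompson sampling for a linear bandit, assuming a
    standard Gaussian prior and unit-variance Gaussian noise, so that the
    posterior is N(V^{-1} b, V^{-1}) with V = I + sum_s A_s A_s^T and
    b = sum_s A_s X_s.

    actions : (k, d) array whose rows are the available actions
    V       : (d, d) posterior precision matrix
    b       : (d,)   sum of past actions weighted by their rewards
    """
    mean = np.linalg.solve(V, b)                              # posterior mean V^{-1} b
    theta = rng.multivariate_normal(mean, np.linalg.inv(V))   # sample theta_t from posterior
    # Finding A_t = argmax_a <a, theta_t> is a linear optimization,
    # here a single pass over a finite action set.
    return int(np.argmax(actions @ theta))

# After playing row i and observing reward x, update the posterior with
#   a = actions[i];  V += np.outer(a, a);  b += x * a
```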
36.4 Information theoretic analysis

In this section we examine a Bayesian version of adversarial $k$-armed bandits,
where the Bayesian environment corresponds to a distribution over reward
matrices in $[0, 1]^{n \times k}$ and the goal is to compete with the best action in hindsight.
As we will see, the natural generalization of Thompson sampling is still a
reasonable algorithm. This setup is also interesting because it allows us to
explore another approach that relies exclusively on information theory. We start
with a few more concepts from information theory.

36.4.1 Mutual information
This section introduces the minimum necessary definitions. Extensions and
references are given in the notes at the end of the chapter. Let $X$ be a discrete
random variable on the probability space $(\Omega, \mathcal{F}, \mathbb{P})$. The entropy of $X$ is
\[
H(X) = \sum_{x \in \operatorname{range}(X)} \mathbb{P}(X = x) \log\left(\frac{1}{\mathbb{P}(X = x)}\right)\,.
\]
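As a quick illustration of the definition, here is a minimal sketch that computes $H(X)$ from a probability vector; the function name is illustrative, and terms with $\mathbb{P}(X = x) = 0$ are dropped, following the usual convention $0 \log(1/0) = 0$.

```python
import numpy as np

def entropy(p):
    """Entropy H(X) of a discrete random variable with distribution p,
    using the natural logarithm; zero-probability outcomes contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log(1.0 / p)))

print(entropy([0.5, 0.5]))         # log(2) ~ 0.693 for a fair coin
print(entropy([0.9, 0.05, 0.05]))  # lower entropy for a peaked distribution
```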
This notation is standard, but note that $H(X)$ depends on $\mathbb{P}$ and is equal to
$H(X) = H(\mathbb{P}_X)$ (see Chapter 14). When the underlying measure is not clear