Bandit Algorithms


Putting together the pieces shows that

\[
\mathrm{BR}_n \le 2 + 2\sqrt{2 d n \beta^2 \log\!\left(1 + \frac{n S^2 L^2}{d}\right)}.
\]
36.3.1 Computation


An implementation of Thompson sampling for linear bandits needs to sample $\theta_t$
from the posterior and then find the optimal action for the sampled parameter:
\[
A_t = \operatorname*{argmax}_{a \in \mathcal{A}} \langle a, \theta_t \rangle\,.
\]

For some priors and noise models, sampling from the posterior is straightforward.
The most notable case is when $Q$ is a multivariate Gaussian and the noise is
Gaussian with a known variance. More generally, there is a large literature devoted
to numerical methods for sampling from posterior distributions. Having sampled
$\theta_t$, finding $A_t$ is a linear optimization problem. By comparison, LinUCB needs to
solve
\[
A_t = \operatorname*{argmax}_{a \in \mathcal{A}} \max_{\tilde{\theta} \in \mathcal{C}} \langle a, \tilde{\theta} \rangle\,,
\]

which for large or continuous action sets is often intractable.
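
To make the computational point concrete, below is a minimal sketch of one way to implement these two steps for a finite action set, assuming the Gaussian prior and known-variance Gaussian noise mentioned above. The function name, the ridge-style conjugate update and the default parameters are illustrative choices, not taken from the text.

```python
import numpy as np

def linear_thompson_sampling(actions, reward_fn, n_rounds, sigma=1.0,
                             prior_var=1.0, rng=None):
    """Sketch of Thompson sampling for a linear bandit with Gaussian prior
    N(0, prior_var * I) on theta and Gaussian noise of known variance sigma^2,
    so the posterior stays Gaussian and is easy to sample from. `actions` is a
    (k, d) array of actions and `reward_fn(a)` returns a noisy reward; all
    names and defaults are illustrative, not from the book."""
    rng = np.random.default_rng() if rng is None else rng
    d = actions.shape[1]
    precision = np.eye(d) / prior_var   # posterior precision (inverse covariance)
    b = np.zeros(d)                     # running sum of X_s * A_s
    rewards = []
    for _ in range(n_rounds):
        cov = np.linalg.inv(precision)
        mean = cov @ b / sigma ** 2
        theta_t = rng.multivariate_normal(mean, cov)  # sample theta_t from posterior
        a_t = actions[np.argmax(actions @ theta_t)]   # A_t = argmax_a <a, theta_t>
        x_t = reward_fn(a_t)
        precision += np.outer(a_t, a_t) / sigma ** 2  # conjugate Gaussian update
        b += x_t * a_t
        rewards.append(x_t)
    return np.array(rewards)
```

Each round needs only a single posterior sample followed by one linear maximization over the action set, whereas the LinUCB objective above couples the maximization over $a$ and $\tilde{\theta}$.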

36.4 Information theoretic analysis


In this section we examine a Bayesian version of adversarial $k$-armed bandits,
where the Bayesian environment corresponds to a distribution over reward
matrices in $[0,1]^{n \times k}$ and the goal is to compete with the best action in hindsight.
As we will see, the natural generalization of Thompson sampling is still a
reasonable algorithm. This setup is also interesting because it allows us to
explore another approach that relies exclusively on information theory. We start
with a few more concepts from information theory.

36.4.1 Mutual information


This section introduces the minimum necessary definitions. Extensions and
references are given in the notes at the end of the chapter. Let $X$ be a discrete
random variable on probability space $(\Omega, \mathcal{F}, \mathbb{P})$. The entropy of $X$ is

\[
H(X) = \sum_{x \in \operatorname{range}(X)} \mathbb{P}(X = x) \log\left(\frac{1}{\mathbb{P}(X = x)}\right).
\]

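As a quick illustration of the definition, the following sketch computes the entropy of a discrete random variable from its probability mass function. The natural logarithm and the convention that zero-probability outcomes contribute nothing are assumptions made here for concreteness.

```python
import numpy as np

def entropy(pmf):
    """Entropy H(X) = sum_x P(X = x) * log(1 / P(X = x)) of a discrete random
    variable, given its probability mass function as an array of probabilities.
    Natural log is assumed and zero-probability outcomes are dropped."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log(1.0 / p)))

# A uniform distribution on four outcomes has entropy log(4) ~= 1.386, the
# largest entropy of any distribution on four outcomes.
print(entropy([0.25, 0.25, 0.25, 0.25]))
```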

This notation is standard, but note that $H(X)$ depends on $\mathbb{P}$ and is equal to
$H(X) = H(\mathbb{P}_X)$ (see Chapter 14). When the underlying measure is not clear