Bandit Algorithms


Putting together the pieces shows that

\[
\mathrm{BR}_n \le 2 + 2\sqrt{2 d n \beta^2 \log\!\left(1 + \frac{n S^2 L^2}{d}\right)}.
\]
36.3.1 Computation


An implementation of Thompson sampling for linear bandits needs to sample $\theta_t$
from the posterior and then find the optimal action for the sampled parameter:
\[
A_t = \operatorname*{argmax}_{a \in \mathcal{A}} \langle a, \theta_t \rangle\,.
\]

For some priors and noise models, sampling from the posterior is straightforward.
The most notable case is when $Q$ is a multivariate Gaussian and the noise is
Gaussian with a known variance. More generally, there is a large literature devoted
to numerical methods for sampling from posterior distributions. Having sampled
$\theta_t$, finding $A_t$ is a linear optimization problem. By comparison, LinUCB needs to
solve
\[
A_t = \operatorname*{argmax}_{a \in \mathcal{A}} \max_{\tilde{\theta} \in \mathcal{C}} \langle a, \tilde{\theta} \rangle\,,
\]

which for large or continuous action sets is often intractable.
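
To make the computational point concrete, below is a minimal sketch of one way to implement these two steps for a finite action set, assuming the Gaussian prior and known-variance Gaussian noise mentioned above. The function name, the ridge-style conjugate update and the default parameters are illustrative choices, not taken from the text.

```python
import numpy as np

def linear_thompson_sampling(actions, reward_fn, n_rounds, sigma=1.0,
                             prior_var=1.0, rng=None):
    """Sketch of Thompson sampling for a linear bandit with Gaussian prior
    N(0, prior_var * I) on theta and Gaussian noise of known variance sigma^2,
    so the posterior stays Gaussian and is easy to sample from. `actions` is a
    (k, d) array of actions and `reward_fn(a)` returns a noisy reward; all
    names and defaults are illustrative, not from the book."""
    rng = np.random.default_rng() if rng is None else rng
    d = actions.shape[1]
    precision = np.eye(d) / prior_var   # posterior precision (inverse covariance)
    b = np.zeros(d)                     # running sum of X_s * A_s
    rewards = []
    for _ in range(n_rounds):
        cov = np.linalg.inv(precision)
        mean = cov @ b / sigma ** 2
        theta_t = rng.multivariate_normal(mean, cov)  # sample theta_t from posterior
        a_t = actions[np.argmax(actions @ theta_t)]   # A_t = argmax_a <a, theta_t>
        x_t = reward_fn(a_t)
        precision += np.outer(a_t, a_t) / sigma ** 2  # conjugate Gaussian update
        b += x_t * a_t
        rewards.append(x_t)
    return np.array(rewards)
```

Each round needs only a single posterior sample followed by one linear maximization over the action set, whereas the LinUCB objective above couples the maximization over $a$ and $\tilde{\theta}$.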

36.4 Information theoretic analysis


In this section we examine a Bayesian version of adversarial $k$-armed bandits,
where the Bayesian environment corresponds to a distribution over reward
matrices in $[0,1]^{n \times k}$ and the goal is to compete with the best action in hindsight.
As we will see, the natural generalization of Thompson sampling is still a
reasonable algorithm. This setup is also interesting because it allows us to
explore another approach that relies exclusively on information theory. We start
with a few more concepts from information theory.

36.4.1 Mutual information


This section introduces the minimum necessary definitions. Extensions and
references are given in the notes at the end of the chapter. Let $X$ be a discrete
random variable on probability space $(\Omega, \mathcal{F}, \mathbb{P})$. The entropy of $X$ is

\[
H(X) = \sum_{x \in \operatorname{range}(X)} \mathbb{P}(X = x) \log\left(\frac{1}{\mathbb{P}(X = x)}\right).
\]

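As a quick illustration of the definition, the following sketch computes the entropy of a discrete random variable from its probability mass function. The natural logarithm and the convention that zero-probability outcomes contribute nothing are assumptions made here for concreteness.

```python
import numpy as np

def entropy(pmf):
    """Entropy H(X) = sum_x P(X = x) * log(1 / P(X = x)) of a discrete random
    variable, given its probability mass function as an array of probabilities.
    Natural log is assumed and zero-probability outcomes are dropped."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log(1.0 / p)))

# A uniform distribution on four outcomes has entropy log(4) ~= 1.386, the
# largest entropy of any distribution on four outcomes.
print(entropy([0.25, 0.25, 0.25, 0.25]))
```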

This notation is standard, but note that $H(X)$ depends on $\mathbb{P}$ and is equal to
$H(X) = H(\mathbb{P}_X)$ (see Chapter 14). When the underlying measure is not clear