
[Figure: empirical regret distributions for Thompson sampling and AdaUCB; horizontal axis Regret, vertical axis Proportion.]
36.3 Linear bandits


While the advantages of Thompson sampling in finite-armed bandits are relatively
limited, in the linear setting there is much to be gained, both in terms of
computation and empirical performance. Let $\mathcal{A} \subset \mathbb{R}^d$ and $(\mathcal{E}, \mathcal{B}(\mathcal{E}), Q, P)$ be a
Bayesian bandit environment where $\mathcal{E} \subset \mathbb{R}^d$ and for $\theta \in \mathcal{E}$ and $a \in \mathcal{A}$, $P_{\theta a}$ is
$1$-subgaussian with mean $\langle \theta, a \rangle$. Let $\theta : \mathcal{E} \to \mathbb{R}^d$ be the identity map, which is a
random vector on $(\mathcal{E}, \mathcal{B}(\mathcal{E}), Q)$.

1: Input Bayesian bandit environment $(\mathcal{E}, \mathcal{B}(\mathcal{E}), Q, P)$
2: for $t \in 1, \ldots, n$ do
3:     Sample $\theta_t$ from the posterior
4:     Choose $A_t = \operatorname{argmax}_{a \in \mathcal{A}} \langle a, \theta_t \rangle$
5:     Observe $X_t$
6: end for

Algorithm 24: Thompson sampling for linear bandits.
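Line 3 is particularly convenient when the prior and the noise are Gaussian, since the posterior is then Gaussian with mean given by the regularized least-squares estimator. The following is a minimal sketch of Algorithm 24 under that conjugate assumption; the finite action set, unit-norm parameter, prior precision lam and standard Gaussian noise below are illustrative choices, not part of the general setting.

# A minimal sketch of Algorithm 24, assuming a Gaussian prior
# theta ~ N(0, I/lam) and rewards X_t = <theta, A_t> + N(0, 1),
# so the posterior after t rounds is N(V^{-1} b, V^{-1}) with
# V = lam*I + sum_s A_s A_s^T and b = sum_s X_s A_s.
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 5, 1000, 20

# Illustrative finite action set in R^d with ||a||_2 <= 1 (so L = 1).
actions = rng.normal(size=(k, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)

# "True" parameter drawn once and normalized so ||theta||_2 <= 1 (S = 1).
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

lam = 1.0
V = lam * np.eye(d)   # posterior precision matrix
b = np.zeros(d)       # running sum of X_s * A_s

for t in range(n):
    cov = np.linalg.inv(V)
    theta_t = rng.multivariate_normal(cov @ b, cov)  # line 3: sample posterior
    a = actions[np.argmax(actions @ theta_t)]        # line 4: maximize <a, theta_t>
    x = float(a @ theta_star) + rng.normal()         # line 5: observe reward
    V += np.outer(a, a)                              # conjugate update of precision
    b += x * a

Because only a single posterior sample is needed per round, each step costs one $d$-dimensional Gaussian draw and one maximization over $\mathcal{A}$, which is often cheaper than computing the upper confidence bounds required by optimistic algorithms.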

The Bayesian regret is controlled using the techniques from the previous section
in combination with the concentration analysis in Chapter 20. A frequentist
analysis is also possible under slightly unsatisfying assumptions, which we discuss
in the notes and bibliographic remarks.

Theorem 36.4. Assume that $\|\theta\|_2 \le S$ with $Q$-probability one, and that $\sup_{a \in \mathcal{A}} \|a\|_2 \le L$ and $\sup_{a \in \mathcal{A}} |\langle a, \theta \rangle| \le 1$ with $Q$-probability one. Then the Bayesian regret of Algorithm 24 is bounded by

$$\mathrm{BR}_n \le 2 + 2\sqrt{2 d n \beta^2 \log\left(1 + \frac{n S^2 L^2}{d}\right)}\,,$$

where $\beta = 1 + \sqrt{2\log(n) + d \log\left(1 + \frac{n S^2 L^2}{d}\right)}$.


For fixed $S$ and $L$, the upper bound obtained here is of order $d\sqrt{n}\log(n)$.
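To see this, note that for fixed $S$ and $L$ we have $\log(1 + nS^2L^2/d) = O(\log(n))$ and hence $\beta = O(\sqrt{d\log(n)})$, so that

$$\mathrm{BR}_n = O\left(\sqrt{dn \cdot d\log(n) \cdot \log(n)}\right) = O\left(d\sqrt{n}\log(n)\right)\,.$$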