[Figure: empirical distribution of the regret of Thompson sampling (left) and AdaUCB (right); horizontal axis shows regret, vertical axis shows proportion.]
36.3 Linear bandits
While the advantages of Thompson sampling in finite-armed bandits are relatively
limited, in the linear setting there is much to be gained, both in terms of
computation and empirical performance. Let A ⊂ R^d and (E, B(E), Q, P) be a Bayesian bandit environment where E ⊂ R^d and, for θ ∈ E and a ∈ A, P_{θa} is 1-subgaussian with mean ⟨θ, a⟩. Let θ : E → R^d be the identity map, which is a random vector on (E, B(E), Q).
1: Input Bayesian bandit environment (E, B(E), Q, P)
2: for t ∈ 1, . . . , n do
3:     Sample θ_t from the posterior
4:     Choose A_t = argmax_{a∈A} ⟨a, θ_t⟩
5:     Observe X_t
6: end for
Algorithm 24: Thompson sampling for linear bandits.
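To make Algorithm 24 concrete, the following is a minimal Python sketch for the conjugate Gaussian case: a N(0, I) prior on θ, unit-variance Gaussian noise and a finite action set, so that the posterior in step 3 is Gaussian with a closed-form update. The prior, noise model and all names here are illustrative assumptions, not part of the text.

import numpy as np

def linear_thompson_sampling(actions, theta_star, n, rng=None):
    """Thompson sampling for a linear bandit with a N(0, I) prior on theta
    and unit-variance Gaussian noise, so the posterior stays Gaussian and
    admits the closed-form update of Bayesian linear regression."""
    rng = np.random.default_rng() if rng is None else rng
    d = actions.shape[1]
    precision = np.eye(d)   # posterior precision: I + sum_s A_s A_s^T
    b = np.zeros(d)         # sum_s X_s A_s
    total_reward = 0.0
    for t in range(n):
        cov = np.linalg.inv(precision)
        mean = cov @ b
        theta_t = rng.multivariate_normal(mean, cov)       # step 3: sample from posterior
        a = actions[np.argmax(actions @ theta_t)]          # step 4: A_t = argmax_a <a, theta_t>
        x = float(a @ theta_star) + rng.standard_normal()  # step 5: observe X_t
        precision += np.outer(a, a)                        # conjugate posterior update
        b += x * a
        total_reward += x
    return total_reward

With a conjugate prior each round costs one d × d inversion (or a rank-one update), which illustrates the computational advantage mentioned above: no optimization over confidence ellipsoids is needed, only a single posterior sample and a linear maximization over A.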
The Bayesian regret is controlled using the techniques from the previous section
in combination with the concentration analysis in Chapter 20. A frequentist
analysis is also possible under slightly unsatisfying assumptions, which we discuss
in the notes and bibliographic remarks.
Theorem 36.4. Assume that ‖θ‖₂ ≤ S with Q-probability one, and that sup_{a∈A} ‖a‖₂ ≤ L and sup_{a∈A} |⟨a, θ⟩| ≤ 1 with Q-probability one. Then the Bayesian regret of Algorithm 24 is bounded by

    BR_n ≤ 2 + 2 √( 2dnβ² log(1 + nS²L²/d) ),

where β = 1 + √( 2 log(n) + d log(1 + nS²L²/d) ).
For fixed S and L, the upper bound obtained here is of order d√n log(n).
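As an illustrative check (an assumed computation, not from the text), the bound of Theorem 36.4 can be evaluated numerically; the ratio to d√n log(n) should stay roughly constant as n grows, confirming the stated order.

import numpy as np

def bayes_regret_bound(n, d, S=1.0, L=1.0):
    # Bayesian regret bound of Theorem 36.4.
    log_term = np.log(1 + n * S**2 * L**2 / d)
    beta = 1 + np.sqrt(2 * np.log(n) + d * log_term)
    return 2 + 2 * np.sqrt(2 * d * n * beta**2 * log_term)

for n in (10**3, 10**4, 10**5):
    bound = bayes_regret_bound(n, d=5)
    print(n, round(bound), round(bound / (5 * np.sqrt(n) * np.log(n)), 2))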