
[Figure: empirical regret distributions for Thompson sampling and AdaUCB; horizontal axis Regret, vertical axis Proportion.]
36.3 Linear bandits


While the advantages of Thompson sampling in finite-armed bandits are relatively
limited, in the linear setting there is much to be gained, both in terms of
computation and empirical performance. Let $\mathcal{A} \subset \mathbb{R}^d$ and $(\mathcal{E}, \mathcal{B}(\mathcal{E}), Q, P)$ be a
Bayesian bandit environment where $\mathcal{E} \subset \mathbb{R}^d$ and for $\theta \in \mathcal{E}$ and $a \in \mathcal{A}$, $P_{\theta a}$ is
$1$-subgaussian with mean $\langle \theta, a \rangle$. Let $\theta : \mathcal{E} \to \mathbb{R}^d$ be the identity map, which is a
random vector on $(\mathcal{E}, \mathcal{B}(\mathcal{E}), Q)$.

1: Input Bayesian bandit environment $(\mathcal{E}, \mathcal{B}(\mathcal{E}), Q, P)$
2: for $t \in 1, \ldots, n$ do
3:     Sample $\theta_t$ from the posterior
4:     Choose $A_t = \operatorname{argmax}_{a \in \mathcal{A}} \langle a, \theta_t \rangle$
5:     Observe $X_t$
6: end for

Algorithm 24: Thompson sampling for linear bandits.
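Line 3 is particularly convenient when the prior and the noise are Gaussian, since the posterior is then Gaussian with mean given by the regularized least-squares estimator. The following is a minimal sketch of Algorithm 24 under that conjugate assumption; the finite action set, unit-norm parameter, prior precision lam and standard Gaussian noise below are illustrative choices, not part of the general setting.

# A minimal sketch of Algorithm 24, assuming a Gaussian prior
# theta ~ N(0, I/lam) and rewards X_t = <theta, A_t> + N(0, 1),
# so the posterior after t rounds is N(V^{-1} b, V^{-1}) with
# V = lam*I + sum_s A_s A_s^T and b = sum_s X_s A_s.
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 5, 1000, 20

# Illustrative finite action set in R^d with ||a||_2 <= 1 (so L = 1).
actions = rng.normal(size=(k, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)

# "True" parameter drawn once and normalized so ||theta||_2 <= 1 (S = 1).
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

lam = 1.0
V = lam * np.eye(d)   # posterior precision matrix
b = np.zeros(d)       # running sum of X_s * A_s

for t in range(n):
    cov = np.linalg.inv(V)
    theta_t = rng.multivariate_normal(cov @ b, cov)  # line 3: sample posterior
    a = actions[np.argmax(actions @ theta_t)]        # line 4: maximize <a, theta_t>
    x = float(a @ theta_star) + rng.normal()         # line 5: observe reward
    V += np.outer(a, a)                              # conjugate update of precision
    b += x * a

Because only a single posterior sample is needed per round, each step costs one $d$-dimensional Gaussian draw and one maximization over $\mathcal{A}$, which is often cheaper than computing the upper confidence bounds required by optimistic algorithms.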

The Bayesian regret is controlled using the techniques from the previous section
in combination with the concentration analysis in Chapter 20. A frequentist
analysis is also possible under slightly unsatisfying assumptions, which we discuss
in the notes and bibliographic remarks.

Theorem 36.4. Assume that $\|\theta\|_2 \le S$ with $Q$-probability one, and that $\sup_{a \in \mathcal{A}} \|a\|_2 \le L$ and $\sup_{a \in \mathcal{A}} |\langle a, \theta \rangle| \le 1$ with $Q$-probability one. Then the Bayesian regret of Algorithm 24 is bounded by

$$\mathrm{BR}_n \le 2 + 2\sqrt{2 d n \beta^2 \log\left(1 + \frac{n S^2 L^2}{d}\right)}\,,$$

where $\beta = 1 + \sqrt{2\log(n) + d \log\left(1 + \frac{n S^2 L^2}{d}\right)}$.


For fixed $S$ and $L$, the upper bound obtained here is of order $d\sqrt{n}\log(n)$.
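To see this, note that for fixed $S$ and $L$ we have $\log(1 + nS^2L^2/d) = O(\log(n))$ and hence $\beta = O(\sqrt{d\log(n)})$, so that

$$\mathrm{BR}_n = O\left(\sqrt{dn \cdot d\log(n) \cdot \log(n)}\right) = O\left(d\sqrt{n}\log(n)\right)\,.$$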