36.1 Finite-armed bandits
Recalling the notation from Section 34.5, let $k > 1$ and $(E, \mathcal{B}(E), Q, P)$ be a $k$-armed Bayesian bandit environment. The learner chooses actions $(A_t)_{t=1}^n$ and receives rewards $(X_t)_{t=1}^n$, and the posterior after $t$ observations is a probability kernel $Q(\cdot \mid \cdot)$ from $([k] \times \mathbb{R})^t$ to $(E, \mathcal{B}(E))$. Denote the mean of the $i$th arm in bandit $\nu \in E$ by $\mu_i(\nu) = \int_{\mathbb{R}} x \, dP_{\nu i}(x)$. In round $t$, Thompson sampling samples a bandit environment $\nu_t$ from the posterior of $Q$ given $A_1, X_1, \ldots, A_{t-1}, X_{t-1}$ and then chooses the arm with the largest mean (Algorithm 22). We assume that ties are resolved in an arbitrary, but systematic, fashion.
1: Input Bayesian bandit environment $(E, \mathcal{B}(E), Q, P)$
2: for $t = 1, 2, \ldots, n$ do
3:     Sample $\nu_t \sim Q(\cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1})$
4:     Choose $A_t = \operatorname{argmax}_{i \in [k]} \mu_i(\nu_t)$
5: end for

Algorithm 22: Thompson sampling.
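As a concrete illustration of Algorithm 22, the following is a minimal sketch of Thompson sampling for a Bernoulli bandit with independent Beta(1, 1) priors on the arm means, a standard special case in which the posterior sampled in line 3 has a closed form. The Bernoulli/Beta model and all names below are illustrative choices, not part of the general setting above.

    import numpy as np

    def thompson_sampling_bernoulli(bandit_means, n, rng=None):
        """Run Thompson sampling for n rounds on a Bernoulli bandit.

        bandit_means -- true (unknown) success probabilities, one per arm.
        Returns the sequence of chosen arms and received rewards.
        """
        rng = np.random.default_rng() if rng is None else rng
        k = len(bandit_means)
        alpha = np.ones(k)  # Beta posterior parameters (1 + successes)
        beta = np.ones(k)   # Beta posterior parameters (1 + failures)
        actions, rewards = [], []
        for t in range(n):
            # Line 3: sample an environment (vector of means) from the posterior
            sampled_means = rng.beta(alpha, beta)
            # Line 4: play the arm with the largest sampled mean
            a = int(np.argmax(sampled_means))
            x = float(rng.random() < bandit_means[a])
            # Conjugate posterior update for the chosen arm only
            alpha[a] += x
            beta[a] += 1 - x
            actions.append(a)
            rewards.append(x)
        return actions, rewards

For instance, thompson_sampling_bernoulli([0.4, 0.5, 0.6], n=1000) will, with high probability, concentrate most of its plays on the third arm after a modest number of rounds.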
Thompson sampling has been analyzed in both the frequentist and the Bayesian settings. We start with the latter, where the result requires almost no assumptions on the prior. In fact, after one small observation about Thompson sampling, the analysis is almost the same as that of UCB.
Theorem 36.1. Let $(E, \mathcal{B}(E), Q, P)$ be a $k$-armed Bayesian bandit environment such that for all $\nu \in E$ and $i \in [k]$ the distribution $P_{\nu i}$ is $1$-subgaussian with mean in $[0, 1]$. Then the policy $\pi$ of Thompson sampling satisfies
$$\mathrm{BR}_n(\pi, Q) \le C \sqrt{k n \log(n)},$$
where $C > 0$ is a universal constant.
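Here $\mathrm{BR}_n(\pi, Q)$ is the Bayesian regret of Section 34.5; assuming the usual definition there, it is the regret $R_n(\pi, \nu)$ averaged over the prior,
$$\mathrm{BR}_n(\pi, Q) = \int_E R_n(\pi, \nu) \, dQ(\nu) = \mathbb{E}\left[\sum_{t=1}^n \bigl(\mu_{A^*}(\nu) - \mu_{A_t}(\nu)\bigr)\right],$$
where $A^* = \operatorname{argmax}_{i \in [k]} \mu_i(\nu)$ is the optimal arm of the sampled environment and the expectation integrates over both $\nu \sim Q$ and the interaction of the learner with $\nu$.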
Proof Abbreviate $\mu_i = \mu_i(\nu)$ and let $A^* = \operatorname{argmax}_{i \in [k]} \mu_i$ be the optimal arm, which depends on $\nu$ and is a random variable. When there are ties, the definition of $A^*$ uses the same tie-breaking rule as the algorithm. For each $t \in [n]$ and $i \in [k]$ let
$$U_t(i) = \operatorname{clip}_{[0,1]}\left( \hat{\mu}_i(t-1) + \sqrt{\frac{2 \log(1/\delta)}{1 \vee T_i(t-1)}} \right),$$
where $\hat{\mu}_i(t-1)$ is the empirical estimate of the reward of arm $i$ after $t-1$ rounds and we assume $\hat{\mu}_i(t-1) = 0$ if $T_i(t-1) = 0$. Let $E$ be the event that for all $t \in [n]$ and $i \in [k]$,
$$\left|\hat{\mu}_i(t-1) - \mu_i\right| < \sqrt{\frac{2 \log(1/\delta)}{1 \vee T_i(t-1)}}.$$
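Note that both the index $U_t(i)$ and the event $E$ are built from the same confidence width $\sqrt{2\log(1/\delta)/(1 \vee T_i(t-1))}$; the index is exactly the clipped UCB index, which is what makes the comparison with the UCB analysis possible. Purely as an illustration (the function names and array inputs are hypothetical), these quantities could be computed from running counts and empirical means as follows.

    import numpy as np

    def confidence_width(counts, delta):
        # Width sqrt(2 log(1/delta) / (1 v T_i(t-1))) shared by U_t(i) and the event E.
        return np.sqrt(2.0 * np.log(1.0 / delta) / np.maximum(counts, 1))

    def ucb_index(emp_means, counts, delta):
        # Clipped index U_t(i); arms never played use empirical mean 0 by convention.
        means = np.where(counts > 0, emp_means, 0.0)
        return np.clip(means + confidence_width(counts, delta), 0.0, 1.0)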
In Exercise 36.2 we ask you to prove that $\mathbb{P}(E^c) \le nk\delta$. Let $\mathcal{F}_t =$