36.4 Information theoretic analysis
Let $\pi = (\pi_t)_{t=1}^n$ be a policy and $X \in [0,1]^{n \times k}$ and $(A_t)_{t=1}^n$ be random elements on some probability space $(\Omega, \mathcal{F}, \mathbb{P})$ such that:
(a) $\mathbb{P}(X \in \cdot) = Q(\cdot)$.
(b) $\mathbb{P}(A_t \in \cdot \mid X, H_{t-1}) = \pi_t(\cdot \mid H_{t-1})$ with $H_t = (A_1, X_{1A_1}, \ldots, A_t, X_{tA_t})$.
The inclusion of $X$ in the conditional probability in Part (b) implies that
\[
\mathbb{P}(A_t \in \cdot \mid X, H_{t-1}) = \mathbb{P}(A_t \in \cdot \mid H_{t-1})\,,
\]
which means that $A_t$ and $X$ are conditionally independent given $H_{t-1}$. This is consistent with our definition of the model, where $X$ is sampled first from $Q$ and then $A_t$ depends on $X$ only through the history $H_{t-1}$. The optimal action is
\[
A^* = \operatorname*{argmax}_{a \in [k]} \sum_{t=1}^n X_{ta}
\]
with ties broken arbitrarily. The Bayesian regret is
\[
\mathrm{BR}_n = \mathbb{E}\left[\sum_{t=1}^n \left(X_{tA^*} - X_{tA_t}\right)\right].
\]
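To make the definition concrete, the following is a minimal Python sketch (ours, not from the source) that estimates $\mathrm{BR}_n$ by Monte Carlo under an illustrative prior $Q$: each arm's mean is drawn uniformly from $[0,1]$ and the table $X$ is Bernoulli given the means. The names bayesian_regret and uniform_policy are hypothetical.

import numpy as np

def bayesian_regret(policy, n=1000, k=5, sims=100, seed=0):
    """Monte Carlo estimate of BR_n = E[sum_t (X_{t,A*} - X_{t,A_t})]."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(sims):
        means = rng.uniform(size=k)              # environment sampled from the prior Q
        X = rng.binomial(1, means, size=(n, k))  # reward table X in {0,1}^{n x k}
        a_star = X.sum(axis=0).argmax()          # optimal action for this realisation
        history = []
        for t in range(n):
            a = policy(history, k, rng)          # A_t depends on X only via the history
            history.append((a, X[t, a]))
            total += X[t, a_star] - X[t, a]
    return total / sims

def uniform_policy(history, k, rng):
    """Baseline for comparison: play a uniformly random action."""
    return int(rng.integers(k))

print(bayesian_regret(uniform_policy))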
As in the previous sections, Thompson sampling is a policy $\pi = (\pi_t)_{t=1}^n$ that plays each action according to the conditional probability that it is optimal, which means the following holds almost surely:
\[
\pi_t(\cdot \mid A_1, X_{1A_1}, \ldots, A_{t-1}, X_{t-1,A_{t-1}}) = \mathbb{P}(A^* \in \cdot \mid A_1, X_{1A_1}, \ldots, A_{t-1}, X_{t-1,A_{t-1}})\,.
\]
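For Bernoulli rewards with independent uniform (Beta(1,1)) priors on the arm means, this sampling rule is usually implemented by drawing one mean per arm from its Beta posterior and playing the argmax, which draws $A_t$ from the posterior probability that each arm has the largest mean. Below is a minimal self-contained sketch under these assumptions; strictly speaking, $A^*$ above is defined through the realised reward sums rather than the means, so the mean-based version shown here is the standard implementation rather than an exact match.

import numpy as np

def thompson_sampling(n, k, rng):
    """Beta-Bernoulli Thompson sampling with uniform Beta(1, 1) priors."""
    successes = np.zeros(k)
    failures = np.zeros(k)
    actions = np.empty(n, dtype=int)
    means = rng.uniform(size=k)                        # environment drawn from the prior
    for t in range(n):
        theta = rng.beta(1 + successes, 1 + failures)  # one posterior sample per arm
        a = int(theta.argmax())                        # play the sample's best arm
        x = rng.binomial(1, means[a])                  # Bernoulli reward X_{t, A_t}
        successes[a] += x
        failures[a] += 1 - x
        actions[t] = a
    return actions, means

rng = np.random.default_rng(0)
actions, means = thompson_sampling(n=2000, k=5, rng=rng)
print("best arm:", int(means.argmax()), "play counts:", np.bincount(actions, minlength=5))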
Theorem 36.7. The Bayesian regret of Thompson sampling satisfies
\[
\mathrm{BR}_n \le \sqrt{kn\log(k)/2}\,.
\]
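For a sense of scale (our own arithmetic, not from the source), with $k = 10$ and $n = 10^4$ the bound evaluates to
\[
\mathrm{BR}_n \le \sqrt{10 \cdot 10^4 \cdot \log(10)/2} \approx 339\,,
\]
which matches the minimax rate $\sqrt{kn}$ for this problem up to the factor $\sqrt{\log(k)/2}$.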
The proof requires a little setup. Let $\mathcal{F}_t = \sigma(A_1, X_{1A_1}, \ldots, A_t, X_{tA_t})$ be the $\sigma$-algebra containing information after round $t$. Let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \mathcal{F}_t]$ and $\mathbb{P}_t(\cdot) = \mathbb{P}(\cdot \mid \mathcal{F}_t)$ and abbreviate $I_t(\cdot\,;\cdot) = I_{\mathbb{P}_t}(\cdot\,;\cdot)$. Define random variables
\[
\Delta_t = X_{tA^*} - X_{tA_t} \qquad \text{and} \qquad \Gamma_t = \frac{\mathbb{E}_{t-1}[\Delta_t]^2}{I_{t-1}(A^*; \mathcal{F}_t)}\,. \tag{36.8}
\]
The latter of these is the ratio of the squared expected Bayesian instantaneous regret and the information gain about the optimal arm in round $t$.
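Both quantities in this ratio can be computed exactly when the prior is supported on finitely many environments. The sketch below is ours: the environments in mu, the posterior q, and the identification of the optimal action with the arm of the largest mean are all illustrative simplifications. It evaluates $\mathbb{E}_{t-1}[\Delta_t]$, the information gain about $A^*$ from one round of Thompson sampling, and their ratio $\Gamma_t$; conditionally on the history, $I_{t-1}(A^*; \mathcal{F}_t)$ reduces to the mutual information between $A^*$ and the new observation $(A_t, X_{tA_t})$.

import numpy as np

# Each row of mu is a candidate environment (vector of Bernoulli means) and
# q is the posterior over environments at the start of the round.
mu = np.array([[0.9, 0.1],
               [0.2, 0.8],
               [0.3, 0.7]])
q = np.array([0.5, 0.3, 0.2])

a_star = mu.argmax(axis=1)                   # optimal action per environment
k = mu.shape[1]
p = np.array([q[a_star == a].sum() for a in range(k)])  # TS plays a with prob P(A* = a)

# Expected instantaneous regret E_{t-1}[Delta_t]: A_t is independent of the
# environment given the history, so the expectation factorises over (env, action).
regret = sum(q[i] * (mu[i, a_star[i]] - p @ mu[i]) for i in range(len(q)))

# Joint law of (A*, A_t, X_{t,A_t}): the environment is drawn from q, the
# action independently from p, and the reward is Bernoulli(mu[env, a]).
joint = np.zeros((k, k, 2))                  # indices: (a*, a, x)
for i in range(len(q)):
    for a in range(k):
        joint[a_star[i], a, 1] += q[i] * p[a] * mu[i, a]
        joint[a_star[i], a, 0] += q[i] * p[a] * (1 - mu[i, a])

p_astar = joint.sum(axis=(1, 2))             # marginal of A*
p_obs = joint.sum(axis=0)                    # marginal of (A_t, X_{t,A_t})
denom = p_astar[:, None, None] * p_obs[None, :, :]
mask = joint > 0
info = (joint[mask] * np.log(joint[mask] / denom[mask])).sum()

gamma = regret ** 2 / info                   # the information ratio of Eq. (36.8)
print(f"E[Delta] = {regret:.4f}, info gain = {info:.4f}, Gamma = {gamma:.4f}")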
Theorem 36.8. Fix a real number $\bar{\Gamma} \ge 0$ and suppose that $\Gamma_t \le \bar{\Gamma}$ holds almost surely for all $t \in [n]$. Then
\[
\mathrm{BR}_n \le \sqrt{n \bar{\Gamma} H(A^*)}\,.
\]
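To preview how Theorem 36.7 follows: granting a bound of $\Gamma_t \le k/2$ on the information ratio of Thompson sampling, which the subsequent analysis establishes, Theorem 36.8 gives
\[
\mathrm{BR}_n \le \sqrt{n \cdot \tfrac{k}{2} \cdot H(A^*)} \le \sqrt{kn\log(k)/2}\,,
\]
where $H(A^*) \le \log(k)$ because $A^*$ takes values in $[k]$ and entropy is maximised by the uniform distribution.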
Proof. By the definitions of the regret and $\Gamma_t$ in Eq. (36.8) and Cauchy-Schwarz