36.4 Information theoretic analysis
Let $\pi = (\pi_t)_{t=1}^n$ be a policy and $X \in [0,1]^{n \times k}$ and $(A_t)_{t=1}^n$ be random elements on some probability space $(\Omega, \mathcal{F}, \mathbb{P})$ such that:
(a) $\mathbb{P}(X \in \cdot) = Q(\cdot)$.
(b) $\mathbb{P}(A_t \in \cdot \mid X, H_{t-1}) = \pi_t(\cdot \mid H_{t-1})$ with $H_t = (A_1, X_{1A_1}, \ldots, A_t, X_{tA_t})$.
The inclusion of $X$ in the conditional probability in Part (b) implies that
\[
\mathbb{P}(A_t \in \cdot \mid X, H_{t-1}) = \mathbb{P}(A_t \in \cdot \mid H_{t-1})\,,
\]
which means that $A_t$ and $X$ are conditionally independent given $H_{t-1}$. This is consistent with our definition of the model, where $X$ is sampled first from $Q$ and then $A_t$ depends on $X$ only through the history $H_{t-1}$. The optimal action is
\[
A^* = \operatorname*{argmax}_{a \in [k]} \sum_{t=1}^n X_{ta}
\]
with ties broken arbitrarily. The Bayesian regret is
\[
\mathrm{BR}_n = \mathbb{E}\left[\sum_{t=1}^n \left(X_{tA^*} - X_{tA_t}\right)\right].
\]
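To make the definition concrete, the following is a minimal Python sketch (ours, not from the source) that estimates $\mathrm{BR}_n$ by Monte Carlo under an illustrative prior $Q$: each arm's mean is drawn uniformly from $[0,1]$ and the table $X$ is Bernoulli given the means. The names bayesian_regret and uniform_policy are hypothetical.

import numpy as np

def bayesian_regret(policy, n=1000, k=5, sims=100, seed=0):
    """Monte Carlo estimate of BR_n = E[sum_t (X_{t,A*} - X_{t,A_t})]."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(sims):
        means = rng.uniform(size=k)              # environment sampled from the prior Q
        X = rng.binomial(1, means, size=(n, k))  # reward table X in {0,1}^{n x k}
        a_star = X.sum(axis=0).argmax()          # optimal action for this realisation
        history = []
        for t in range(n):
            a = policy(history, k, rng)          # A_t depends on X only via the history
            history.append((a, X[t, a]))
            total += X[t, a_star] - X[t, a]
    return total / sims

def uniform_policy(history, k, rng):
    """Baseline for comparison: play a uniformly random action."""
    return int(rng.integers(k))

print(bayesian_regret(uniform_policy))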
As in the previous sections, Thompson sampling is a policy $\pi = (\pi_t)_{t=1}^n$ that plays each action according to the conditional probability that it is optimal, which means the following holds almost surely:
\[
\pi_t(\cdot \mid A_1, X_{1A_1}, \ldots, A_{t-1}, X_{t-1,A_{t-1}}) = \mathbb{P}(A^* \in \cdot \mid A_1, X_{1A_1}, \ldots, A_{t-1}, X_{t-1,A_{t-1}})\,.
\]
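For Bernoulli rewards with independent uniform (Beta(1,1)) priors on the arm means, this sampling rule is usually implemented by drawing one mean per arm from its Beta posterior and playing the argmax, which draws $A_t$ from the posterior probability that each arm has the largest mean. Below is a minimal self-contained sketch under these assumptions; strictly speaking, $A^*$ above is defined through the realised reward sums rather than the means, so the mean-based version shown here is the standard implementation rather than an exact match.

import numpy as np

def thompson_sampling(n, k, rng):
    """Beta-Bernoulli Thompson sampling with uniform Beta(1, 1) priors."""
    successes = np.zeros(k)
    failures = np.zeros(k)
    actions = np.empty(n, dtype=int)
    means = rng.uniform(size=k)                        # environment drawn from the prior
    for t in range(n):
        theta = rng.beta(1 + successes, 1 + failures)  # one posterior sample per arm
        a = int(theta.argmax())                        # play the sample's best arm
        x = rng.binomial(1, means[a])                  # Bernoulli reward X_{t, A_t}
        successes[a] += x
        failures[a] += 1 - x
        actions[t] = a
    return actions, means

rng = np.random.default_rng(0)
actions, means = thompson_sampling(n=2000, k=5, rng=rng)
print("best arm:", int(means.argmax()), "play counts:", np.bincount(actions, minlength=5))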
Theorem 36.7. The Bayesian regret of Thompson sampling satisfies
\[
\mathrm{BR}_n \le \sqrt{kn\log(k)/2}\,.
\]
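For a sense of scale (our own arithmetic, not from the source), with $k = 10$ and $n = 10^4$ the bound evaluates to
\[
\mathrm{BR}_n \le \sqrt{10 \cdot 10^4 \cdot \log(10)/2} \approx 339\,,
\]
which matches the minimax rate $\sqrt{kn}$ for this problem up to the factor $\sqrt{\log(k)/2}$.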
The proof requires a little setup. Let $\mathcal{F}_t = \sigma(A_1, X_{1A_1}, \ldots, A_t, X_{tA_t})$ be the $\sigma$-algebra containing information after round $t$. Let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \mathcal{F}_t]$ and $\mathbb{P}_t(\cdot) = \mathbb{P}(\cdot \mid \mathcal{F}_t)$ and abbreviate $I_t(\cdot\,;\cdot) = I_{\mathbb{P}_t}(\cdot\,;\cdot)$. Define random variables
\[
\Delta_t = X_{tA^*} - X_{tA_t} \qquad \text{and} \qquad \Gamma_t = \frac{\mathbb{E}_{t-1}[\Delta_t]^2}{I_{t-1}(A^*; \mathcal{F}_t)}\,. \tag{36.8}
\]
The latter of these is the ratio of the squared expected Bayesian instantaneous regret and the information gain about the optimal arm in round $t$.
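Both quantities in this ratio can be computed exactly when the prior is supported on finitely many environments. The sketch below is ours: the environments in mu, the posterior q, and the identification of the optimal action with the arm of the largest mean are all illustrative simplifications. It evaluates $\mathbb{E}_{t-1}[\Delta_t]$, the information gain about $A^*$ from one round of Thompson sampling, and their ratio $\Gamma_t$; conditionally on the history, $I_{t-1}(A^*; \mathcal{F}_t)$ reduces to the mutual information between $A^*$ and the new observation $(A_t, X_{tA_t})$.

import numpy as np

# Each row of mu is a candidate environment (vector of Bernoulli means) and
# q is the posterior over environments at the start of the round.
mu = np.array([[0.9, 0.1],
               [0.2, 0.8],
               [0.3, 0.7]])
q = np.array([0.5, 0.3, 0.2])

a_star = mu.argmax(axis=1)                   # optimal action per environment
k = mu.shape[1]
p = np.array([q[a_star == a].sum() for a in range(k)])  # TS plays a with prob P(A* = a)

# Expected instantaneous regret E_{t-1}[Delta_t]: A_t is independent of the
# environment given the history, so the expectation factorises over (env, action).
regret = sum(q[i] * (mu[i, a_star[i]] - p @ mu[i]) for i in range(len(q)))

# Joint law of (A*, A_t, X_{t,A_t}): the environment is drawn from q, the
# action independently from p, and the reward is Bernoulli(mu[env, a]).
joint = np.zeros((k, k, 2))                  # indices: (a*, a, x)
for i in range(len(q)):
    for a in range(k):
        joint[a_star[i], a, 1] += q[i] * p[a] * mu[i, a]
        joint[a_star[i], a, 0] += q[i] * p[a] * (1 - mu[i, a])

p_astar = joint.sum(axis=(1, 2))             # marginal of A*
p_obs = joint.sum(axis=0)                    # marginal of (A_t, X_{t,A_t})
denom = p_astar[:, None, None] * p_obs[None, :, :]
mask = joint > 0
info = (joint[mask] * np.log(joint[mask] / denom[mask])).sum()

gamma = regret ** 2 / info                   # the information ratio of Eq. (36.8)
print(f"E[Delta] = {regret:.4f}, info gain = {info:.4f}, Gamma = {gamma:.4f}")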
Theorem 36.8. Fix a real number $\bar{\Gamma} \ge 0$ and suppose that $\Gamma_t \le \bar{\Gamma}$ holds almost surely for all $t \in [n]$. Then
\[
\mathrm{BR}_n \le \sqrt{n \bar{\Gamma} H(A^*)}\,.
\]
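To preview how Theorem 36.7 follows: granting a bound of $\Gamma_t \le k/2$ on the information ratio of Thompson sampling, which the subsequent analysis establishes, Theorem 36.8 gives
\[
\mathrm{BR}_n \le \sqrt{n \cdot \tfrac{k}{2} \cdot H(A^*)} \le \sqrt{kn\log(k)/2}\,,
\]
where $H(A^*) \le \log(k)$ because $A^*$ takes values in $[k]$ and entropy is maximised by the uniform distribution.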
Proof. By the definitions of the regret and $\Gamma_t$ in Eq. (36.8) and Cauchy-Schwarz