Bandit Algorithms

36.4 Information theoretic analysis

Let $\pi = (\pi_t)_{t=1}^n$ be a policy and let $X \in [0,1]^{n \times k}$ and $(A_t)_{t=1}^n$ be random elements on some probability space $(\Omega, \mathcal{F}, \mathbb{P})$ such that:

(a) $\mathbb{P}(X \in \cdot) = Q(\cdot)$.

(b) $\mathbb{P}(A_t \in \cdot \mid X, H_{t-1}) = \pi_t(\cdot \mid H_{t-1})$ with $H_t = (A_1, X_{1A_1}, \ldots, A_t, X_{tA_t})$.


The inclusion of $X$ in the conditioning in Part (b), together with the fact that the right-hand side depends on $X$ only through $H_{t-1}$, implies that

\[
\mathbb{P}(A_t \in \cdot \mid X, H_{t-1}) = \mathbb{P}(A_t \in \cdot \mid H_{t-1}),
\]

which means that $A_t$ and $X$ are conditionally independent given $H_{t-1}$. This is consistent with our definition of the model, where $X$ is sampled first from $Q$ and then $A_t$ depends on $X$ only through the history $H_{t-1}$. The optimal action is $A^* = \operatorname{argmax}_{a \in [k]} \sum_{t=1}^n X_{ta}$, with ties broken arbitrarily. The Bayesian regret is

\[
\mathrm{BR}_n = \mathbb{E}\left[\sum_{t=1}^n \left(X_{tA^*} - X_{tA_t}\right)\right].
\]
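To make the model and the quantity $\mathrm{BR}_n$ concrete, here is a minimal simulation sketch. The specific prior and all names (sample_X, uniform_policy, bayesian_regret) are illustrative assumptions, not part of the text: $Q$ is taken to be the law of a reward table whose arm means are drawn uniformly from $[0,1]$ and whose entries are independent Bernoulli draws, and the Bayesian regret of a policy is estimated by Monte Carlo.

import numpy as np

def sample_X(n, k, rng):
    """Sample X ~ Q: arm means from a uniform prior on [0, 1],
    entries X[t, a] independent Bernoulli(theta[a]) (illustrative choice of Q)."""
    theta = rng.uniform(size=k)
    return rng.binomial(1, theta, size=(n, k)).astype(float)

def uniform_policy(history, k, rng):
    """Placeholder policy that ignores the history; any map from
    histories to actions can be plugged in here."""
    return int(rng.integers(k))

def bayesian_regret(policy, n, k, num_runs=1000, seed=0):
    """Monte Carlo estimate of BR_n = E[sum_t (X[t, A*] - X[t, A_t])]."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_runs):
        X = sample_X(n, k, rng)
        a_star = int(np.argmax(X.sum(axis=0)))  # A* = argmax_a sum_t X[t, a]
        history = []
        for t in range(n):
            a = policy(history, k, rng)
            history.append((a, X[t, a]))        # H_t grows by (A_t, X_{t, A_t})
            total += X[t, a_star] - X[t, a]
    return total / num_runs

# Example: print(bayesian_regret(uniform_policy, n=200, k=5))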


As in the previous sections, Thompson sampling is a policy $\pi = (\pi_t)_{t=1}^n$ that plays each action according to the conditional probability that it is optimal, which means that the following holds almost surely:

\[
\pi_t(\cdot \mid A_1, X_{1A_1}, \ldots, A_{t-1}, X_{(t-1)A_{t-1}}) = \mathbb{P}(A^* \in \cdot \mid A_1, X_{1A_1}, \ldots, A_{t-1}, X_{(t-1)A_{t-1}}).
\]
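As an illustration, a commonly used concrete instance in the Bernoulli setting of the sketch above (names again assumptions of the example, not the text) maintains a Beta posterior over each arm mean, draws one sample per arm and plays the maximizer; each arm is then played with the posterior probability that its sampled mean is the largest, the usual practical stand-in for the probability that it is optimal.

import numpy as np

def thompson_sampling_policy(history, k, rng):
    """Beta-Bernoulli Thompson sampling (illustrative sketch).
    history is a list of (action, reward) pairs with rewards in {0, 1}."""
    alpha = np.ones(k)                 # Beta(1, 1) prior on each arm mean
    beta = np.ones(k)
    for a, x in history:               # posterior update from the history
        alpha[a] += x
        beta[a] += 1.0 - x
    theta_hat = rng.beta(alpha, beta)  # one posterior sample per arm
    return int(np.argmax(theta_hat))   # play the arm with the largest sample

# Plugged into the estimator above:
# print(bayesian_regret(thompson_sampling_policy, n=200, k=5))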

Theorem 36.7. The Bayesian regret of Thompson sampling satisfies

\[
\mathrm{BR}_n \leq \sqrt{\frac{kn\log(k)}{2}}.
\]
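For a sense of scale (an illustrative calculation, not from the text), with $k = 10$ arms and horizon $n = 10^4$ the bound evaluates to
\[
\sqrt{\frac{10 \cdot 10^4 \cdot \log(10)}{2}} \approx 339,
\]
so the average per-round Bayesian regret is at most about $0.034$, for rewards in $[0,1]$ and an arbitrary prior $Q$.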

The proof requires a little setup. Let $\mathcal{F}_t = \sigma(A_1, X_{1A_1}, \ldots, A_t, X_{tA_t})$ be the $\sigma$-algebra containing the information available after round $t$. Let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \mathcal{F}_t]$ and $\mathbb{P}_t(\cdot) = \mathbb{P}(\cdot \mid \mathcal{F}_t)$, and abbreviate $I_t(\cdot\,;\cdot) = I_{\mathbb{P}_t}(\cdot\,;\cdot)$. Define random variables

\[
\Delta_t = X_{tA^*} - X_{tA_t} \qquad \text{and} \qquad \Gamma_t = \frac{\mathbb{E}_{t-1}[\Delta_t]^2}{I_{t-1}(A^*; \mathcal{F}_t)}. \tag{36.8}
\]

The latter of these is the ratio of the squared expected instantaneous Bayesian regret to the information gain about the optimal arm in round $t$.
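Since $\mathcal{F}_t$ is generated by $\mathcal{F}_{t-1}$ together with the pair $(A_t, X_{tA_t})$, the denominator can be read, by the standard entropy decomposition of mutual information, as the expected reduction in the entropy of the posterior distribution of $A^*$. Writing $H_t(A^*)$ for the entropy of $\mathbb{P}_t(A^* \in \cdot)$ (notation introduced here only for this remark),
\[
I_{t-1}(A^*; \mathcal{F}_t) = H_{t-1}(A^*) - \mathbb{E}_{t-1}\left[H_t(A^*)\right].
\]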


Theorem 36.8. Fix a real number $\bar{\Gamma} \geq 0$ and suppose that $\Gamma_t \leq \bar{\Gamma}$ holds almost surely for all $t \in [n]$. Then

\[
\mathrm{BR}_n \leq \sqrt{n \bar{\Gamma} H(A^*)}.
\]
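For orientation (a bridging remark): since $A^*$ takes values in $[k]$, we have $H(A^*) \leq \log(k)$, so Theorem 36.7 follows from Theorem 36.8 provided one can also show $\Gamma_t \leq k/2$ almost surely for Thompson sampling, which is the remaining step of the argument:
\[
\mathrm{BR}_n \leq \sqrt{n \cdot \frac{k}{2} \cdot \log(k)} = \sqrt{\frac{kn\log(k)}{2}}.
\]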

Proof By the definitions of the regret and $\Gamma_t$ in Eq. (36.8) and Cauchy-Schwarz