we have
\begin{align*}
\mathrm{BR}_n
&= \mathbb{E}\left[\sum_{t=1}^n \left(X_{tA^*} - X_{tA_t}\right)\right]
 = \mathbb{E}\left[\sum_{t=1}^n \mathbb{E}_{t-1}\left[X_{tA^*} - X_{tA_t}\right]\right]
 = \mathbb{E}\left[\sum_{t=1}^n \sqrt{I_{t-1}(A^*; F_t)\,\Gamma_t}\right] \\
&\leq \sqrt{\bar{\Gamma}}\,\mathbb{E}\left[\sum_{t=1}^n \sqrt{I_{t-1}(A^*; F_t)}\right]
 \leq \sqrt{n \bar{\Gamma}\,\mathbb{E}\left[\sum_{t=1}^n I_{t-1}(A^*; F_t)\right]}
 \leq \sqrt{n \bar{\Gamma}\, H(A^*)}\,,
\end{align*}
where the first inequality uses the assumption that $\Gamma_t \leq \bar{\Gamma}$ almost surely, the second follows from the Cauchy–Schwarz inequality and Jensen's inequality, and the last from Lemma 36.5.
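For ease of reference, the third equality above is just a rearrangement of the definition of the per-round information ratio; assuming the notation introduced with Theorem 36.8 earlier in the section, this reads
\begin{equation*}
\Gamma_t = \frac{\bigl(\mathbb{E}_{t-1}[X_{tA^*} - X_{tA_t}]\bigr)^2}{I_{t-1}(A^*; F_t)}\,,
\qquad \Gamma_t \leq \bar{\Gamma} \text{ almost surely for all } t \in [n]\,.
\end{equation*}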
Theorem 36.8 holds for any policy, and it holds even without any restriction
on the range of the rewards. The beauty of this result is that it makes
explicit the 'learn something' or 'suffer no regret' argument that appeared
in the analyses of so many algorithms: if the ratio of the (squared) per-round
regret to the information gained about $A^*$ is large, then a policy can
suffer high regret. By contrast, policies for which the regret/information
ratio is small enjoy strong regret guarantees.
A combination of Theorem 36.8 and an almost sure bound on $\Gamma_t$ can lead
to nearly optimal bounds on the Bayesian regret of Thompson sampling for
finite-armed and linear bandits. We present only the finite-armed case.
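To preview the finite-armed bound, here is a quick sketch of the combination (assuming Lemma 36.9 below and the standard fact that $H(A^*) \leq \log k$ when there are $k$ arms): substituting $\bar{\Gamma} = k/2$ into Theorem 36.8 gives
\begin{equation*}
\mathrm{BR}_n \leq \sqrt{\frac{n k\, H(A^*)}{2}} \leq \sqrt{\frac{n k \log k}{2}}\,,
\end{equation*}
which matches the minimax rate for finite-armed bandits up to a logarithmic factor.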
Lemma 36.9. If $X_{ti} \in [0,1]$ almost surely for all $t \in [n]$ and $i \in [k]$ and $A_t$ is
chosen by Thompson sampling using any prior, then $\Gamma_t \leq k/2$ almost surely.
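As a numerical sanity check on the lemma (this sketch is not part of the text; the environment grid, the uniform prior and all variable names are illustrative choices), the following computes the exact one-step information ratio of Thompson sampling under a finite prior over Bernoulli bandits and compares it with the bound $k/2$:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite prior: M candidate environments, each a vector of k
# Bernoulli mean rewards; the prior over environments is uniform.
k, M = 3, 50
means = rng.uniform(0.0, 1.0, size=(M, k))   # means[m, a] = mean of arm a in environment m
prior = np.full(M, 1.0 / M)
a_star = means.argmax(axis=1)                # optimal arm per environment (ties have probability zero)

# At t = 1 the posterior is the prior, so Thompson sampling plays arm a with
# probability P(A* = a).
p_opt = np.array([prior[a_star == a].sum() for a in range(k)])
play = p_opt

# One-step Bayesian regret E[X_{1 A*} - X_{1 A_1}]; A_1 is independent of the
# environment given the (empty) history.
exp_opt = np.sum(prior * means[np.arange(M), a_star])
exp_play = play @ (prior @ means)
regret = exp_opt - exp_play

# Mutual information I(A*; A_1, X_{1 A_1}) in nats, from the joint law of
# (A*, A_1, X) with X a Bernoulli reward.
joint = np.zeros((k, k, 2))                  # axes: optimal arm, played arm, reward in {0, 1}
for m in range(M):
    for a in range(k):
        joint[a_star[m], a, 1] += prior[m] * play[a] * means[m, a]
        joint[a_star[m], a, 0] += prior[m] * play[a] * (1.0 - means[m, a])
p_astar = joint.sum(axis=(1, 2))
p_obs = joint.sum(axis=0)
mask = joint > 0
info = np.sum(joint[mask] * np.log(joint[mask] / (p_astar[:, None, None] * p_obs[None, :, :])[mask]))

gamma = regret ** 2 / info                   # one-step information ratio
print(f"information ratio = {gamma:.3f}, Lemma 36.9 bound k/2 = {k / 2}")
\end{verbatim}
Running this with other seeds, priors or values of $k$ should always produce a ratio below $k/2$, in line with the lemma.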
Proof of Lemma 36.9. Expanding the definition of mutual information and then
using the tower rule, we have
\begin{align*}
I_{t-1}(A^*; F_t)
&= \mathbb{E}_{t-1}\left[\left.\log \frac{\mathbb{P}_{t-1}(A^* = a \mid F_t)}{\mathbb{P}_{t-1}(A^* = a)} \right|_{a = A^*}\right]
 = \mathbb{E}_{t-1}\left[\tilde{\mathbb{E}}_{t-1}\left[\left.\log \frac{\mathbb{P}_{t-1}(A^* = a \mid F_t)}{\mathbb{P}_{t-1}(A^* = a)} \right|_{a = A^*}\right]\right],
\end{align*}
where we introduce $\tilde{\mathbb{E}}_{t-1}(\cdot) = \mathbb{E}_{t-1}(\cdot \mid A_t) = \mathbb{E}[\cdot \mid H_{t-1}, A_t]$. Analogously defining
$\tilde{\mathbb{P}}_{t-1}(\cdot) = \mathbb{P}_{t-1}(\cdot \mid A_t) = \mathbb{P}(\cdot \mid H_{t-1}, A_t)$, notice that
\begin{align*}
\mathbb{P}_{t-1}(A^* = a \mid F_t) &= \mathbb{P}(A^* = a \mid H_t) = \tilde{\mathbb{P}}_{t-1}(A^* = a \mid X_{tA_t}) \quad \text{and} \\
\mathbb{P}_{t-1}(A^* = a) &= \mathbb{P}(A^* = a \mid H_{t-1}) = \mathbb{P}(A^* = a \mid H_{t-1}, A_t) = \tilde{\mathbb{P}}_{t-1}(A^* = a)\,,
\end{align*}
where the second-last equality holds because $A_t$ and $A^*$ are conditionally
independent given $H_{t-1}$, which follows since $A^*$ is $\sigma(X)$-measurable and $X$