we have
\begin{align*}
\mathrm{BR}_n
&= \mathbb{E}\left[\sum_{t=1}^n \left(X_{tA^*} - X_{tA_t}\right)\right]
 = \mathbb{E}\left[\sum_{t=1}^n \mathbb{E}_{t-1}\left[X_{tA^*} - X_{tA_t}\right]\right]
 = \mathbb{E}\left[\sum_{t=1}^n \sqrt{I_{t-1}(A^*; F_t)\,\Gamma_t}\right] \\
&\leq \sqrt{\bar{\Gamma}}\,\mathbb{E}\left[\sum_{t=1}^n \sqrt{I_{t-1}(A^*; F_t)}\right]
 \leq \sqrt{n \bar{\Gamma}\,\mathbb{E}\left[\sum_{t=1}^n I_{t-1}(A^*; F_t)\right]}
 \leq \sqrt{n \bar{\Gamma}\, H(A^*)}\,,
\end{align*}
where the first inequality uses the assumption that $\Gamma_t \leq \bar{\Gamma}$ almost surely, the second follows from the Cauchy–Schwarz inequality and Jensen's inequality, and the last from Lemma 36.5.
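For ease of reference, the third equality above is just a rearrangement of the definition of the per-round information ratio; assuming the notation introduced with Theorem 36.8 earlier in the section, this reads
\begin{equation*}
\Gamma_t = \frac{\bigl(\mathbb{E}_{t-1}[X_{tA^*} - X_{tA_t}]\bigr)^2}{I_{t-1}(A^*; F_t)}\,,
\qquad \Gamma_t \leq \bar{\Gamma} \text{ almost surely for all } t \in [n]\,.
\end{equation*}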
Theorem 36.8 holds for any policy, and it holds even without any restriction
on the range of the rewards. The beauty of this result is that it makes
explicit the 'learn something' or 'suffer no regret' argument that appeared
in the analyses of so many algorithms: if the ratio of the (squared) per-round
regret to the information gained about $A^*$ is large, then a policy can
suffer high regret. By contrast, policies for which the regret/information
ratio is small enjoy strong regret guarantees.
A combination of Theorem 36.8 and an almost sure bound on $\Gamma_t$ can lead
to nearly optimal bounds on the Bayesian regret of Thompson sampling for
finite-armed and linear bandits. We present only the finite-armed case.
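To preview the finite-armed bound, here is a quick sketch of the combination (assuming Lemma 36.9 below and the standard fact that $H(A^*) \leq \log k$ when there are $k$ arms): substituting $\bar{\Gamma} = k/2$ into Theorem 36.8 gives
\begin{equation*}
\mathrm{BR}_n \leq \sqrt{\frac{n k\, H(A^*)}{2}} \leq \sqrt{\frac{n k \log k}{2}}\,,
\end{equation*}
which matches the minimax rate for finite-armed bandits up to a logarithmic factor.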
Lemma 36.9. If $X_{ti} \in [0,1]$ almost surely for all $t \in [n]$ and $i \in [k]$ and $A_t$ is
chosen by Thompson sampling using any prior, then $\Gamma_t \leq k/2$ almost surely.
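As a numerical sanity check on the lemma (this sketch is not part of the text; the environment grid, the uniform prior and all variable names are illustrative choices), the following computes the exact one-step information ratio of Thompson sampling under a finite prior over Bernoulli bandits and compares it with the bound $k/2$:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite prior: M candidate environments, each a vector of k
# Bernoulli mean rewards; the prior over environments is uniform.
k, M = 3, 50
means = rng.uniform(0.0, 1.0, size=(M, k))   # means[m, a] = mean of arm a in environment m
prior = np.full(M, 1.0 / M)
a_star = means.argmax(axis=1)                # optimal arm per environment (ties have probability zero)

# At t = 1 the posterior is the prior, so Thompson sampling plays arm a with
# probability P(A* = a).
p_opt = np.array([prior[a_star == a].sum() for a in range(k)])
play = p_opt

# One-step Bayesian regret E[X_{1 A*} - X_{1 A_1}]; A_1 is independent of the
# environment given the (empty) history.
exp_opt = np.sum(prior * means[np.arange(M), a_star])
exp_play = play @ (prior @ means)
regret = exp_opt - exp_play

# Mutual information I(A*; A_1, X_{1 A_1}) in nats, from the joint law of
# (A*, A_1, X) with X a Bernoulli reward.
joint = np.zeros((k, k, 2))                  # axes: optimal arm, played arm, reward in {0, 1}
for m in range(M):
    for a in range(k):
        joint[a_star[m], a, 1] += prior[m] * play[a] * means[m, a]
        joint[a_star[m], a, 0] += prior[m] * play[a] * (1.0 - means[m, a])
p_astar = joint.sum(axis=(1, 2))
p_obs = joint.sum(axis=0)
mask = joint > 0
info = np.sum(joint[mask] * np.log(joint[mask] / (p_astar[:, None, None] * p_obs[None, :, :])[mask]))

gamma = regret ** 2 / info                   # one-step information ratio
print(f"information ratio = {gamma:.3f}, Lemma 36.9 bound k/2 = {k / 2}")
\end{verbatim}
Running this with other seeds, priors or values of $k$ should always produce a ratio below $k/2$, in line with the lemma.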
Proof of Lemma 36.9. Expanding the definition of mutual information and then
using the tower rule, we have
\begin{align*}
I_{t-1}(A^*; F_t)
&= \mathbb{E}_{t-1}\left[\left.\log \frac{\mathbb{P}_{t-1}(A^* = a \mid F_t)}{\mathbb{P}_{t-1}(A^* = a)} \right|_{a = A^*}\right]
 = \mathbb{E}_{t-1}\left[\tilde{\mathbb{E}}_{t-1}\left[\left.\log \frac{\mathbb{P}_{t-1}(A^* = a \mid F_t)}{\mathbb{P}_{t-1}(A^* = a)} \right|_{a = A^*}\right]\right],
\end{align*}
where we introduce $\tilde{\mathbb{E}}_{t-1}(\cdot) = \mathbb{E}_{t-1}(\cdot \mid A_t) = \mathbb{E}[\cdot \mid H_{t-1}, A_t]$. Analogously defining
$\tilde{\mathbb{P}}_{t-1}(\cdot) = \mathbb{P}_{t-1}(\cdot \mid A_t) = \mathbb{P}(\cdot \mid H_{t-1}, A_t)$, notice that
\begin{align*}
\mathbb{P}_{t-1}(A^* = a \mid F_t) &= \mathbb{P}(A^* = a \mid H_t) = \tilde{\mathbb{P}}_{t-1}(A^* = a \mid X_{tA_t}) \quad \text{and} \\
\mathbb{P}_{t-1}(A^* = a) &= \mathbb{P}(A^* = a \mid H_{t-1}) = \mathbb{P}(A^* = a \mid H_{t-1}, A_t) = \tilde{\mathbb{P}}_{t-1}(A^* = a)\,,
\end{align*}
where the second-last equality holds because $A_t$ and $A^*$ are conditionally
independent given $H_{t-1}$, which follows since $A^*$ is $\sigma(X)$-measurable and $X$