and $A_t$ are conditionally independent given $A^*$. Hence,
\begin{align*}
I_{t-1}(A^*; F_t)
&= \mathbb{E}_{t-1}\!\left[\tilde{\mathbb{E}}_{t-1}\!\left[\left.\log\frac{\tilde{\mathbb{P}}_{t-1}(A^* = a \mid X_{tA_t})}{\tilde{\mathbb{P}}_{t-1}(A^* = a)}\right|_{a = A^*}\right]\right] \\
&= \mathbb{E}_{t-1}\!\left[I_{\tilde{\mathbb{P}}_{t-1}}(A^*; X_{tA_t})\right] \\
&= \mathbb{E}_{t-1}\!\left[D\big(\tilde{\mathbb{P}}_{t-1}(X_{tA_t} \in \cdot \mid A^*),\, \tilde{\mathbb{P}}_{t-1}(X_{tA_t} \in \cdot)\big)\right] && \text{(Lemma 36.6)} \\
&\geq 2\,\mathbb{E}_{t-1}\!\left[\big(\tilde{\mathbb{E}}_{t-1}[X_{tA_t} \mid A^*] - \tilde{\mathbb{E}}_{t-1}[X_{tA_t}]\big)^2\right] \\
&= 2 \sum_{i,j=1}^k \mathbb{P}_{t-1}(A_t = i, A^* = j)\big(\mathbb{E}_{t-1}[X_{ti} \mid A^* = j] - \mathbb{E}_{t-1}[X_{ti}]\big)^2 \\
&\geq 2 \sum_{i=1}^k \mathbb{P}_{t-1}(A_t = i)^2\big(\mathbb{E}_{t-1}[X_{ti} \mid A^* = i] - \mathbb{E}_{t-1}[X_{ti}]\big)^2 \\
&\geq \frac{2}{k}\left(\sum_{i=1}^k \mathbb{P}_{t-1}(A_t = i)\big(\mathbb{E}_{t-1}[X_{ti} \mid A^* = i] - \mathbb{E}_{t-1}[X_{ti}]\big)\right)^2 \\
&= \frac{2}{k}\,\mathbb{E}_{t-1}[\Delta_t]^2\,,
\end{align*}
where the first inequality follows from Pinsker’s inequality (Eq. 14.11), the result
in Exercise 14.4 and the assumption that the rewards lie in [0,1]. The second
inequality follows by dropping cross terms and using again the independence of
$A^*$ and $A_t$ given $F_{t-1}$, while the third is by Cauchy-Schwarz. The result follows
by rearranging the above display.
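To spell out the first and third inequalities a little more explicitly (a sketch only, writing $\delta(P, Q)$ for the total variation distance and taking $P = \tilde{\mathbb{P}}_{t-1}(X_{tA_t} \in \cdot \mid A^*)$ and $Q = \tilde{\mathbb{P}}_{t-1}(X_{tA_t} \in \cdot)$): since the rewards take values in $[0,1]$,
\begin{align*}
\big|\tilde{\mathbb{E}}_{t-1}[X_{tA_t} \mid A^*] - \tilde{\mathbb{E}}_{t-1}[X_{tA_t}]\big|
\leq \delta(P, Q)
\leq \sqrt{\tfrac{1}{2} D(P, Q)}\,,
\end{align*}
where the second inequality is Pinsker's; squaring and rearranging gives the first inequality in the display. The third inequality is Cauchy-Schwarz in the form $(\sum_{i=1}^k a_i)^2 \leq k \sum_{i=1}^k a_i^2$ applied with $a_i = \mathbb{P}_{t-1}(A_t = i)\big(\mathbb{E}_{t-1}[X_{ti} \mid A^* = i] - \mathbb{E}_{t-1}[X_{ti}]\big)$.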
Proof of Theorem 36.7 The result follows by combining Lemma 36.9 and
Theorem 36.8.
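In outline (a sketch, constants aside): a uniform bound $\Gamma_t \leq \overline{\Gamma}$ on the information ratio yields, via Theorem 36.8, a Bayesian regret bound on the order of $\sqrt{\overline{\Gamma}\, n\, H(A^*)}$, and Lemma 36.9 shows that $\overline{\Gamma} = k/2$ suffices for Thompson sampling. Since $H(A^*) \leq \log k$, the combination gives a bound of the form
\begin{align*}
\mathrm{BR}_n \lesssim \sqrt{\tfrac{1}{2}\, k\, n \log k}\,.
\end{align*}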
36.5 Notes
1 There are several equivalent ways to view Thompson sampling in stationary
stochastic multi-armed bandits: (a) select an arm according to the posterior
probability that the arm is optimal, (b) sample an environment from the
posterior and play the optimal action in that environment. When the mean
rewards for each arm are independent under the posterior, then also equivalent
is (c) sample the mean reward for each arm and choose the arm with the largest
mean (Exercise 36.1). The algorithms in this chapter are based on (b), but all
are equivalent and simply correspond to sampling from different pushforward
measures of the posterior. Historically it seems that Thompson [1933] had the
form in (a) in mind, but there are reasons to remember the alternative
views. Though we are not aware of an example, in some instances beyond
finite-armed bandits it might be more computationally efficient to sample
from a pushforward of the posterior than from the posterior itself. Furthermore, in
more complicated situations like reinforcement learning it may be desirable to