calculation (Exercise 36.5) we get
\begin{align*}
\mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\{A_t = i,\, E_i^c(t) \text{ occurs}\}\right]
&\leq \mathbb{E}\left[\sum_{t \in T} \mathbb{I}\{A_t = i\}\right] + \mathbb{E}\left[\sum_{t \notin T} \mathbb{I}\{E_i^c(t)\}\right] \\
&\leq \mathbb{E}\left[\sum_{s=0}^{n-1} \mathbb{I}\{1 - F_{is}(\mu_1 - \varepsilon) > 1/n\}\right] + \mathbb{E}\left[\sum_{t \notin T} \frac{1}{n}\right] \\
&\leq \mathbb{E}\left[\sum_{s=0}^{n-1} \mathbb{I}\{G_{is} > 1/n\}\right] + 1\,.
\end{align*}
Putting together the pieces completes the proof.
By instantiating Algorithm 23 with different choices of perturbations one
can prove that Thompson sampling enjoys frequentist guarantees in a number
of settings. The following theorem shows that Thompson sampling with an
appropriate prior is asymptotically optimal for the set of Gaussian bandits.
The reader is invited to prove this result by following the steps suggested in
Exercise 36.6.
Theorem 36.3. Suppose that $F_i(1) = \delta_\infty$ is the Dirac at infinity and $\mathrm{Update}(F_i(t), A_t, X_t)$ is the cumulative distribution function of the Gaussian $\mathcal{N}(\hat\mu_i(t), 1/t)$. Then the regret of Algorithm 23 on any Gaussian bandit $\nu \in \mathcal{E}^k_{\mathcal{N}}(1)$ satisfies
\[
\lim_{n \to \infty} \frac{R_n}{\log(n)} = \sum_{i : \Delta_i > 0} \frac{2}{\Delta_i}\,.
\]
Furthermore, there exists a universal constant $C > 0$ such that $R_n \leq C\sqrt{nk\log(n)}$.
The choice of update and initial distributions in Theorem 36.3 corresponds to
Thompson sampling when the prior mean and variance are sent to infinity at
appropriate rates. For this choice a finite-time analysis is also possible (see the
exercise).
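To make this concrete, here is a minimal simulation sketch in Python of the algorithm from Theorem 36.3. The function name gaussian_thompson and the convention of playing each arm once to mimic sampling from the Dirac at infinity are our own choices, and we read the variance $1/t$ in the theorem as one over the number of plays of the arm; this is a sketch under those assumptions, not a definitive implementation.

```python
import numpy as np

def gaussian_thompson(means, n, rng):
    """Gaussian Thompson sampling in the spirit of Theorem 36.3 (a sketch).

    Each arm starts from the Dirac at infinity, so every arm is played
    once before the first real comparison.  After an arm has been played
    s times with empirical mean mu_hat, its perturbation is drawn from
    N(mu_hat, 1/s).
    """
    k = len(means)
    counts = np.zeros(k)   # T_i(t), the number of plays of arm i
    sums = np.zeros(k)     # running sum of rewards from arm i
    regret = 0.0
    for t in range(n):
        if t < k:
            a = t          # Dirac-at-infinity prior: try each arm once
        else:
            theta = rng.normal(sums / counts, np.sqrt(1.0 / counts))
            a = int(np.argmax(theta))
        x = rng.normal(means[a], 1.0)  # unit-variance Gaussian reward
        counts[a] += 1
        sums[a] += x
        regret += max(means) - means[a]
    return regret
```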
Experiment 36.1 Empirically, the algorithm described in Theorem 36.3 has a smaller expected regret than the version of UCB analyzed in Chapter 7. Compared to more sophisticated algorithms, however, it has larger regret and larger variance. AdaUCB (which we briefly met in Section 9.3) and Thompson sampling were simulated on a two-armed Gaussian bandit with mean vector $\mu = (1/5, 0)$, unit variance and a horizon of $n = 2000$. The expected regret as estimated over 100,000 independent runs was 23.8 for AdaUCB and 29.9 for Thompson sampling. The figure below shows the contribution of the second moment of $\hat{R}_n = \sum_i \Delta_i T_i(n)$ for each algorithm, which shows that Thompson sampling has a much larger variance than AdaUCB, in addition to its larger expected regret.
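A rough version of this experiment can be run with the sketch below, which reuses gaussian_thompson from the previous listing. AdaUCB is not implemented here; a generic UCB index $\hat\mu_i + \sqrt{4\log(t)/T_i}$ stands in as the baseline. Both this index and the much smaller number of runs are assumptions, so the numbers will only roughly match those reported above.

```python
def ucb(means, n, rng):
    """A generic UCB baseline (an assumed index, not the exact Chapter 7 one)."""
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    regret = 0.0
    for t in range(n):
        if t < k:
            a = t  # play each arm once before using the index
        else:
            index = sums / counts + np.sqrt(4.0 * np.log(t) / counts)
            a = int(np.argmax(index))
        x = rng.normal(means[a], 1.0)
        counts[a] += 1
        sums[a] += x
        regret += max(means) - means[a]
    return regret

rng = np.random.default_rng(0)
means, n, runs = [1 / 5, 0.0], 2000, 1000  # far fewer runs than the 100,000 above
ts = np.array([gaussian_thompson(means, n, rng) for _ in range(runs)])
ub = np.array([ucb(means, n, rng) for _ in range(runs)])
for name, r in [("Thompson", ts), ("UCB", ub)]:
    print(f"{name}: mean regret {r.mean():.1f}, second moment {np.mean(r**2):.1f}")
```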