15.3 Notes

… $\mu$ to show that
\begin{align*}
D(P_\mu, P_{\mu+\Delta})
&\approx \frac{\partial}{\partial\Delta} D(P_\mu, P_{\mu+\Delta})\,\Big|_{\Delta=0}\,\Delta
  + \frac{1}{2}\,\frac{\partial^2}{\partial\Delta^2} D(P_\mu, P_{\mu+\Delta})\,\Big|_{\Delta=0}\,\Delta^2 \\
&= \frac{\partial}{\partial\Delta} \int_\Omega \log\!\left(\frac{dP_\mu}{dP_{\mu+\Delta}}\right) dP_\mu\,\Big|_{\Delta=0}\,\Delta + \frac{1}{2}\,I(\mu)\,\Delta^2 \\
&= -\int_\Omega \frac{\partial}{\partial\Delta} \log\!\left(\frac{dP_{\mu+\Delta}}{dP_\mu}\right)\Big|_{\Delta=0}\, dP_\mu\,\Delta + \frac{1}{2}\,I(\mu)\,\Delta^2 \\
&= -\int_\Omega \frac{\partial}{\partial\Delta} \frac{dP_{\mu+\Delta}}{dP_\mu}\,\Big|_{\Delta=0}\, dP_\mu\,\Delta + \frac{1}{2}\,I(\mu)\,\Delta^2 \\
&= -\frac{\partial}{\partial\Delta} \int_\Omega \frac{dP_{\mu+\Delta}}{dP_\mu}\, dP_\mu\,\Big|_{\Delta=0}\,\Delta + \frac{1}{2}\,I(\mu)\,\Delta^2 \\
&= -\frac{\partial}{\partial\Delta} \int_\Omega dP_{\mu+\Delta}\,\Big|_{\Delta=0}\,\Delta + \frac{1}{2}\,I(\mu)\,\Delta^2
  = \frac{1}{2}\,I(\mu)\,\Delta^2\,,
\end{align*}
where $I(\mu)$, introduced in the second line, is called the Fisher information of the family $(P_\mu)_\mu$ at $\mu$. The last equality holds because $\int_\Omega dP_{\mu+\Delta} = 1$ for all $\Delta$, so the derivative vanishes. Note that if $\lambda$ is a common dominating measure for $(P_{\mu+\Delta})$ for $\Delta$ small, then $dP_{\mu+\Delta} = p_{\mu+\Delta}\,d\lambda$ and we can write
\[
I(\mu) = -\int \frac{\partial^2}{\partial\Delta^2} \log p_{\mu+\Delta}\,\Big|_{\Delta=0}\, p_\mu\, d\lambda\,,
\]
which is the form that is usually given in elementary texts. The upshot of all this is that $D(P_\mu, P_{\mu+\Delta})$ for $\Delta$ small is indeed quadratic in $\Delta$, with the scaling provided by $I(\mu)$, and as a result the worst-case regret is always $O(\sqrt{nk})$, provided the class of distributions considered is sufficiently rich and not too bizarre.
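To make the quadratic behaviour concrete, here is a minimal numerical sketch, not from the text, for the Bernoulli family, where the relative entropy has a closed form and the Fisher information is the standard $I(\mu) = 1/(\mu(1-\mu))$; the function names are illustrative only.

```python
import math

def kl_bernoulli(p: float, q: float) -> float:
    """Relative entropy D(Ber(p), Ber(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def fisher_bernoulli(mu: float) -> float:
    """Fisher information of the Bernoulli family at mu: 1 / (mu (1 - mu))."""
    return 1.0 / (mu * (1.0 - mu))

mu = 0.3
for delta in [0.1, 0.01, 0.001]:
    exact = kl_bernoulli(mu, mu + delta)
    approx = 0.5 * fisher_bernoulli(mu) * delta ** 2
    print(f"delta={delta:6.3f}  D={exact:.3e}  "
          f"I(mu) delta^2 / 2={approx:.3e}  ratio={exact / approx:.4f}")
```

The printed ratio tends to one as $\Delta \to 0$, as the expansion predicts.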
2 We have now shown a lower bound that is $\Omega(\sqrt{nk})$, while many of the upper bounds were $O(\log(n))$. There is no contradiction because the logarithmic bounds depended on the inverse suboptimality gaps, which may be very large.
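To see how the two regimes fit together, consider a rough calculation that is not in the text: suppose there are $k-1$ suboptimal arms, each with the same gap $\Delta$, and drop constant factors. The regret never exceeds $n\Delta$, while the gap-dependent bounds scale as $(k-1)\log(n)/\Delta$, so the worst case over $\Delta$ is
\[
\max_{\Delta > 0} \min\!\left\{ n\Delta,\; \frac{(k-1)\log n}{\Delta} \right\}
= \sqrt{(k-1)\, n \log n}\,,
\quad \text{attained at } \Delta = \sqrt{\frac{(k-1)\log n}{n}}\,.
\]
Up to the $\sqrt{\log n}$ factor this matches the $\Omega(\sqrt{nk})$ lower bound: the logarithmic bounds are only small when the gaps are large.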
3 Our lower bound was only proven for $n \ge k - 1$. In Exercise 15.3 we ask you to show that when $n < k - 1$ there exists a bandit such that
\[
R_n \ge \frac{n(2k - n - 1)}{2k} > \frac{n}{2}\,.
\]
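The final inequality is a quick check, sketched here and not part of the exercise: since $n < k - 1$ implies $n + 1 < k$,
\[
2k - n - 1 = 2k - (n + 1) > 2k - k = k\,,
\qquad\text{so}\qquad
\frac{n(2k - n - 1)}{2k} > \frac{nk}{2k} = \frac{n}{2}\,.
\]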
4 The method used to prove Theorem 15.2 can be viewed as a generalization and strengthening of Le Cam's method in statistics. Recall that Eq. (15.2) establishes that for any $\mu$ and $\mu'$,
\[
\inf_\pi \sup_\nu R_n(\pi, \nu) \ge \frac{n\Delta}{8} \exp\left(-D(P_\mu, P_{\mu'})\right)\,.
\]
To explain Le Cam's method we need a little notation. Let $\mathcal{X}$ be an outcome space, $\mathcal{P}$ a set of measures on $\mathcal{X}$ and $\theta : \mathcal{P} \to \Theta$, where $(\Theta, d)$ is a metric space.
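To illustrate how a bound of the shape in Eq. (15.2) yields a $\sqrt{n}$ rate, here is a minimal numerical sketch. It rests on an assumption not stated in this excerpt: that the two environments can be chosen so that $D(P_\mu, P_{\mu'}) \le n\Delta^2/2$, as when the learner sees $n$ unit-variance Gaussian observations whose means differ by $\Delta$. The function name `lecam_bound` is illustrative only.

```python
import math

def lecam_bound(n: int, delta: float) -> float:
    """Right-hand side of Eq. (15.2) under the assumed bound D <= n * delta^2 / 2."""
    return n * delta / 8 * math.exp(-n * delta ** 2 / 2)

n = 10_000
# Calculus gives the maximiser: the derivative of delta * exp(-n delta^2 / 2)
# vanishes at delta = 1 / sqrt(n), where the bound equals sqrt(n) / (8 sqrt(e)).
delta_star = 1 / math.sqrt(n)
print(lecam_bound(n, delta_star))              # ~ 7.58 for n = 10_000
print(math.sqrt(n) / (8 * math.sqrt(math.e)))  # closed form: ~ 0.0758 * sqrt(n)
```

Maximising over $\Delta$ therefore gives a lower bound of order $\sqrt{n}$; the proof of Theorem 15.2 refines the choice of environments to also extract the dependence on $k$.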