15.3 Notes
μ to show that
\begin{align*}
D(P_\mu, P_{\mu+\Delta})
&\approx \frac{\partial}{\partial \Delta} D(P_\mu, P_{\mu+\Delta}) \bigg|_{\Delta=0} \Delta
  + \frac{1}{2} \frac{\partial^2}{\partial \Delta^2} D(P_\mu, P_{\mu+\Delta}) \bigg|_{\Delta=0} \Delta^2 \\
&= \frac{\partial}{\partial \Delta} \int_\Omega \log\left(\frac{dP_\mu}{dP_{\mu+\Delta}}\right) dP_\mu \bigg|_{\Delta=0} \Delta + \frac{1}{2} I(\mu) \Delta^2 \\
&= -\int_\Omega \frac{\partial}{\partial \Delta} \log\left(\frac{dP_{\mu+\Delta}}{dP_\mu}\right) \bigg|_{\Delta=0} dP_\mu \,\Delta + \frac{1}{2} I(\mu) \Delta^2 \\
&= -\int_\Omega \frac{\partial}{\partial \Delta} \frac{dP_{\mu+\Delta}}{dP_\mu} \bigg|_{\Delta=0} dP_\mu \,\Delta + \frac{1}{2} I(\mu) \Delta^2 \\
&= -\frac{\partial}{\partial \Delta} \int_\Omega \frac{dP_{\mu+\Delta}}{dP_\mu} \, dP_\mu \bigg|_{\Delta=0} \Delta + \frac{1}{2} I(\mu) \Delta^2 \\
&= -\frac{\partial}{\partial \Delta} \int_\Omega dP_{\mu+\Delta} \bigg|_{\Delta=0} \Delta + \frac{1}{2} I(\mu) \Delta^2 \\
&= \frac{1}{2} I(\mu) \Delta^2 \,,
\end{align*}
where I(μ), introduced in the second line, is called the Fisher information of the family (P_μ)_μ at μ; the final equality holds because ∫_Ω dP_{μ+∆} = 1 for all ∆, so the first-order term vanishes. Note that if λ is a common dominating measure for (P_{μ+∆}) for ∆ small, then dP_{μ+∆} = p_{μ+∆} dλ and we can write
\[
I(\mu) = -\int \frac{\partial^2}{\partial \Delta^2} \log p_{\mu+\Delta} \bigg|_{\Delta=0} p_\mu \, d\lambda\,,
\]
which is the form that is usually given in elementary texts. The upshot of all this is that D(P_μ, P_{μ+∆}) for ∆ small is indeed quadratic in ∆, with the scaling provided by I(μ), and as a result the worst-case regret is always O(√nk), provided the class of distributions considered is sufficiently rich and not too bizarre.
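As a concrete worked example (ours; the Gaussian family is the one used throughout this chapter), take P_μ = N(μ, σ²), so that
\[
\log p_{\mu+\Delta}(x) = -\frac{(x - \mu - \Delta)^2}{2\sigma^2} - \log\left(\sigma\sqrt{2\pi}\right),
\qquad
\frac{\partial^2}{\partial \Delta^2} \log p_{\mu+\Delta}(x) = -\frac{1}{\sigma^2}.
\]
Hence I(μ) = 1/σ², and for this family the quadratic approximation is exact: D(P_μ, P_{μ+∆}) = ∆²/(2σ²) = (1/2) I(μ) ∆². For a family where the approximation holds only in the limit ∆ → 0, the following minimal numerical sketch (assuming a Bernoulli(p) family, for which I(p) = 1/(p(1 − p))) checks the claim:

```python
import numpy as np

def kl_bernoulli(p, q):
    """D(Ber(p), Ber(q)), computed directly from the definition."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p = 0.3
fisher = 1.0 / (p * (1.0 - p))  # Fisher information of the Bernoulli family at p

# As Delta -> 0, the ratio of the exact divergence to (1/2) I(p) Delta^2 tends to 1.
for delta in [0.1, 0.01, 0.001]:
    exact = kl_bernoulli(p, p + delta)
    quad = 0.5 * fisher * delta**2
    print(f"Delta={delta}: exact={exact:.3e}  quadratic={quad:.3e}  ratio={exact / quad:.4f}")
```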
2 We have now shown a lower bound that is Ω(√nk), while many of the upper bounds were O(log(n)). There is no contradiction because the logarithmic bounds depended on the inverse suboptimality gaps, which may be very large, as the rough calculation below illustrates.
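As a hedged back-of-the-envelope comparison (ours, not from the text): if every suboptimality gap equals ∆, the gap-dependent upper bounds of earlier chapters scale as
\[
\sum_{i : \Delta_i > 0} \frac{\log n}{\Delta_i} \approx \frac{k \log n}{\Delta}\,,
\]
which exceeds √nk as soon as ∆ ≤ √(k/n) log n. For gaps of this order or smaller, the minimax bound Ω(√nk) is therefore the binding one.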
3 Our lower bound was only proven for n ≥ k − 1. In Exercise 15.3 we ask you to show that when n < k − 1 there exists a bandit such that
\[
R_n \geq \frac{n(2k - n - 1)}{2k} > \frac{n}{2}\,.
\]
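To see that the second inequality indeed holds in this regime (a one-line check of ours): for integers n < k − 1 we have n ≤ k − 2, so 2k − n − 1 ≥ k + 1 > k, and hence
\[
\frac{n(2k - n - 1)}{2k} > \frac{nk}{2k} = \frac{n}{2}\,.
\]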
4 The method used to prove Theorem 15.2 can be viewed as a generalization and strengthening of Le Cam's method in statistics. Recall that Eq. (15.2) establishes that for any μ and μ′,
\[
\inf_\pi \sup_\nu R_n(\pi, \nu) \geq \frac{n\Delta}{8} \exp\left(-D(P_\mu, P_{\mu'})\right).
\]
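For intuition, a rough instantiation (ours): if μ′ can be chosen at separation ∆ from μ while keeping D(P_μ, P_{μ′}) ≤ 1, then the right-hand side is at least
\[
\frac{n\Delta}{8e}\,,
\]
and taking ∆ of order √(k/n), as in the proof of Theorem 15.2, recovers the Ω(√nk) rate.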
To explain Le Cam's method we need a little notation. Let X be an outcome space, P a set of measures on X, and θ : P → Θ, where (Θ, d) is a metric space.