Bandit Algorithms

15.3 Notes

μ to show that

\[
\begin{aligned}
D(P_\mu, P_{\mu+\Delta})
&\approx \frac{\partial}{\partial\Delta} D(P_\mu, P_{\mu+\Delta}) \bigg|_{\Delta=0} \Delta
 + \frac{1}{2}\,\frac{\partial^2}{\partial\Delta^2} D(P_\mu, P_{\mu+\Delta}) \bigg|_{\Delta=0} \Delta^2 \\
&= \int \frac{\partial}{\partial\Delta} \log\!\left(\frac{dP_\mu}{dP_{\mu+\Delta}}\right) \bigg|_{\Delta=0} dP_\mu\, \Delta + \frac{1}{2} I(\mu)\Delta^2 \\
&= -\int \frac{\partial}{\partial\Delta} \log\!\left(\frac{dP_{\mu+\Delta}}{dP_\mu}\right) \bigg|_{\Delta=0} dP_\mu\, \Delta + \frac{1}{2} I(\mu)\Delta^2 \\
&= -\int \frac{\partial}{\partial\Delta} \frac{dP_{\mu+\Delta}}{dP_\mu} \bigg|_{\Delta=0} dP_\mu\, \Delta + \frac{1}{2} I(\mu)\Delta^2 \\
&= -\frac{\partial}{\partial\Delta} \int \frac{dP_{\mu+\Delta}}{dP_\mu}\, dP_\mu \bigg|_{\Delta=0} \Delta + \frac{1}{2} I(\mu)\Delta^2 \\
&= -\frac{\partial}{\partial\Delta} \int dP_{\mu+\Delta} \bigg|_{\Delta=0} \Delta + \frac{1}{2} I(\mu)\Delta^2 \\
&= \frac{1}{2} I(\mu)\Delta^2\,,
\end{aligned}
\]

where I(μ), introduced in the second line, is called the Fisher information
of the family (P_μ)_μ at μ. Note that if λ is a common dominating measure for
(P_{μ+∆}) for ∆ small, then dP_{μ+∆} = p_{μ+∆} dλ and we can write

\[
I(\mu) = -\int \frac{\partial^2}{\partial\Delta^2} \log p_{\mu+\Delta} \bigg|_{\Delta=0} p_\mu\, d\lambda\,,
\]

which is the form that is usually given in elementary texts. The upshot of all
this is that D(P_μ, P_{μ+∆}) for ∆ small is indeed quadratic in ∆, with the
scaling provided by I(μ), and as a result the worst-case regret is always
O(√(nk)), provided the class of distributions considered is sufficiently rich
and not too bizarre.
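As a quick numerical sanity check (an illustration, not from the text), the quadratic approximation can be verified for the Bernoulli family, whose Fisher information at μ is I(μ) = 1/(μ(1 − μ)):

```python
import math

def kl_bernoulli(p: float, q: float) -> float:
    """Relative entropy D(Ber(p), Ber(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

mu = 0.3
fisher = 1.0 / (mu * (1.0 - mu))  # Fisher information of Ber(mu)

# The ratio D(P_mu, P_{mu+delta}) / ((1/2) I(mu) delta^2) tends to 1
# as delta shrinks, as the Taylor expansion predicts.
for delta in [0.1, 0.01, 0.001]:
    exact = kl_bernoulli(mu, mu + delta)
    approx = 0.5 * fisher * delta ** 2
    print(f"delta={delta:g}  D={exact:.3e}  approx={approx:.3e}  ratio={exact / approx:.4f}")
```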
2 We have now shown a lower bound that is Ω(√(nk)), while many of the upper
bounds were O(log(n)). There is no contradiction because the logarithmic
bounds depended on the inverse suboptimality gaps, which may be very large.
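To see the scales involved (an illustrative computation with hypothetical values of n and k): if every gap is as small as √(k/n), the gap-dependent bound, which scales like the sum over suboptimal arms of log(n)/∆_i, already exceeds the worst-case √(nk) rate:

```python
import math

n, k = 100_000, 10
delta = math.sqrt(k / n)  # gaps at the worst-case scale

gap_bound = (k - 1) * math.log(n) / delta  # ~ sum_i log(n) / Delta_i
worst_case = math.sqrt(n * k)

print(f"gap-dependent bound ~ {gap_bound:,.0f}")  # larger by roughly a log(n) factor
print(f"worst-case bound    ~ {worst_case:,.0f}")
```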
3 Our lower bound was only proven for n ≥ k − 1. In Exercise 15.3 we ask you
to show that when n < k − 1 there exists a bandit such that

\[
R_n \geq \frac{n(2k-n-1)}{2k} > \frac{n}{2}\,.
\]
4 The method used to prove Theorem 15.2 can be viewed as a generalization
and strengthening of Le Cam's method in statistics. Recall that Eq. (15.2)
establishes that for any μ and μ′,

\[
\inf_\pi \sup_\nu R_n(\pi,\nu) \geq \frac{n\Delta}{8} \exp\left(-D(P_\mu, P_{\mu'})\right).
\]

To explain Le Cam's method we need a little notation. Let X be an outcome
space, P a set of measures on X, and θ : P → Θ, where (Θ, d) is a metric space.