Bandit Algorithms


as
\[
p_{\nu\pi}(a_1, x_1, \ldots, a_n, x_n) = \prod_{t=1}^n \pi_t(a_t \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1})\, p_{a_t}(x_t)\,.
\]
The density of $P_{\nu'}$ is identical except that $p_{a_t}$ is replaced by $p'_{a_t}$. Then
\[
\log \frac{dP_\nu}{dP_{\nu'}}(a_1, x_1, \ldots, a_n, x_n) = \sum_{t=1}^n \log \frac{p_{a_t}(x_t)}{p'_{a_t}(x_t)}\,,
\]


where we used the chain rule for Radon-Nikodym derivatives and the fact that
the terms involving the policy cancel. Taking expectations of both sides:


[


logdPν
dPν′

(At,Xt)

]


=


∑n

t=1


[


logpAt(Xt)
p′At(Xt)

]


,


and


[


logpAt(Xt)
p′At(Xt)

]


=Eν

[



[


logpAt(Xt)
p′At(Xt)

∣∣


∣∣



At

]]


=Eν

[


D(PAt,PA′t)

]


,


where in the second equality we used that under $P_\nu(\cdot \mid A_t)$ the distribution of $X_t$ is $dP_{A_t} = p_{A_t}\, d\lambda$. Plugging back into the previous display,


[


logdPν
dPν′

(A 1 ,X 1 ,...,An,Xn)

]


=


∑n

t=1


[


logpAt(Xt)
p′At(Xt)

]


=


∑n

t=1


[


D(PAt,PA′t)

]


=


∑k

i=1


[n

t=1

I{At=i}D(PAt,PA′t)

]


=


∑k

i=1

Eν[Ti(n)] D(Pi,Pi′).

When the right-hand side of (15.1) is infinite, it is not hard to see from the calculation above that the left-hand side is infinite as well.

We note in passing that the divergence decomposition holds regardless of
whether the action set is discrete or not. In its more general form, the sum
over the actions must be replaced by an integral with respect to an appropriate
nonnegative measure, which generalizes the expected number of pulls of arms.
For details, see Exercise 15.8.
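The decomposition can also be checked numerically. The sketch below is my own illustration, not part of the text: it runs a two-armed Gaussian bandit under a uniform-random policy (the particular means, horizon, and policy are arbitrary choices) and compares Monte Carlo estimates of the two sides of the identity, using the standard fact that $D(\mathcal{N}(m,1), \mathcal{N}(m',1)) = (m - m')^2/2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: nu and nu' are two-armed unit-variance Gaussian
# bandits that differ only in the mean of arm 1.
mu   = np.array([0.5, 0.0])   # arm means under nu
mu_p = np.array([0.5, 1.0])   # arm means under nu'
n, runs = 200, 10_000

# Uniform-random policy: actions drawn independently of past rewards.
A = rng.integers(2, size=(runs, n))
X = rng.normal(mu[A], 1.0)    # rewards generated under nu

# Left-hand side: E_nu of the per-run log-likelihood ratio,
# sum_t log p_{A_t}(X_t) / p'_{A_t}(X_t), for unit-variance Gaussians.
llr = 0.5 * ((X - mu_p[A]) ** 2 - (X - mu[A]) ** 2)
lhs = llr.sum(axis=1).mean()

# Right-hand side: sum_i E_nu[T_i(n)] * D(P_i, P_i'), with
# D(N(m,1), N(m',1)) = (m - m')^2 / 2.
exp_pulls = np.array([(A == i).sum(axis=1).mean() for i in range(2)])
rhs = (exp_pulls * 0.5 * (mu - mu_p) ** 2).sum()

print(lhs, rhs)  # the two estimates should agree up to Monte Carlo noise
```

A uniform policy keeps the example short, but the identity holds for any policy, including ones that adapt to past rewards; only the joint law of the action-reward sequence matters.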

15.2 Minimax lower bounds


Recall that $\mathcal{E}^k_{\mathcal{N}}(1)$ is the class of Gaussian bandits with unit variance, which can be parameterized by their mean vector $\mu \in \mathbb{R}^k$. Given $\mu \in \mathbb{R}^k$, let $\nu_\mu$ be the Gaussian bandit for which the $i$th arm has reward distribution $\mathcal{N}(\mu_i, 1)$.
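For this class the divergence terms in the decomposition take a simple closed form. The computation below is a standard one (not spelled out at this point in the text): the log-density ratio of two unit-variance Gaussians is integrated against the first density.

```latex
% KL divergence between unit-variance Gaussians, as used with the
% divergence decomposition for the class \mathcal{E}^k_{\mathcal{N}}(1):
D\bigl(\mathcal{N}(\mu_i, 1), \mathcal{N}(\mu'_i, 1)\bigr)
  = \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}}\, e^{-(x-\mu_i)^2/2}
    \cdot \frac{(x-\mu'_i)^2 - (x-\mu_i)^2}{2}\, dx
  = \frac{(\mu_i - \mu'_i)^2}{2}\,.
```

The last equality follows by expanding $(x-\mu'_i)^2 - (x-\mu_i)^2 = 2(x-\mu_i)(\mu_i-\mu'_i) + (\mu_i-\mu'_i)^2$ and noting that the first term has mean zero under $\mathcal{N}(\mu_i, 1)$.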