Bandit Algorithms

(Jeff_L) #1
21.1 The Kiefer–Wolfowitz theorem 253

(b)⇒(a): Suppose thatπ∗is a maximizer off. By the first-order optimality
criterion (see Section 26.5), for anyπdistribution onA,


0 ≥〈∇f(π∗),π−π∗〉
=


a∈A

π(a)‖a‖^2 V(π∗)− 1 −


a∈A

π∗(a)‖a‖^2 V(π∗)− 1

=



a∈A

π(a)‖a‖^2 V(π∗)− 1 −d.

For an arbitrarya ∈ A, choosing πto be the Dirac ata∈ Aproves that
‖a‖^2 V(π∗)− 1 ≤d. Henceg(π∗)≤d. Sinceg(π)≥dfor allπby Eq. (21.5), it follows
thatπ∗is a minimizer ofgand thatminπg(π) =d. (c) =⇒(b): Suppose that
g(π∗) =d. Then for anyπ,

〈∇f(π∗),π−π∗〉=


a∈A

π(a)‖a‖V(π∗)− 1 −d≤ 0.

And it follows thatπ∗is a maximizer offby the first order optimality conditions
and the concavity off. That (a) =⇒(c) is now trivial. To prove the second
part of the theorem letπ∗be a minimizer ofg, which by the previous part is a
maximizer off. LetS=Supp(π∗) and suppose that|S|> d(d+ 1)/2. Since the
dimension of the subspace ofd×dsymmetric matrices isd(d+ 1)/2, there must
be a nonzero functionv:A→Rwith Supp(v)⊆Ssuch that


a∈S

v(a)aa>= 0. (21.6)

Notice that for anya ∈Sthe first order optimality conditions ensure that
‖a‖^2 V(π∗)− 1 =d(Exercise 21.5). Hence

d


a∈S

v(a) =


a∈S

v(a)‖a‖^2 V(π∗)− 1 = 0,

where the last equality follows from Eq. (21.6). Letπ(t) =π∗+tvand assume
‖v‖ 2 = 1 and letT= max{t:π(t)∈PA}, which exists sincev 6 = 0. Then


f(π(T)) =f(π∗) +

∫T


0

d
dtf(π(t))dt

=f(π∗) +

∫T


0

〈∇f(π(t)),v〉dt

=f(π∗) +

∫T


0


a∈A

v(a)‖a‖^2 V(π(t))− 1 dt

=f(π∗),

where we used the fact thatV(π(t)) is constant and‖a‖^2 V(π(t))− 1 =dfor all
a∈Supp(v). Henceπ(T) also maximizes f. The claim follows by checking
that thanks to



a∈Sv(a) = 0,|Supp(π(T))|<|Supp(π

∗)|and then using
induction.
Free download pdf