Bandit Algorithms

21.1 The Kiefer–Wolfowitz theorem 253

(b)⇒(a): Suppose thatπ∗is a maximizer off. By the first-order optimality
criterion (see Section 26.5), for anyπdistribution onA,

0 ≥〈∇f(π∗),π−π∗〉 =

∑

a∈A

π(a)‖a‖^2 V(π∗)− 1 −

∑

a∈A

π∗(a)‖a‖^2 V(π∗)− 1

=

∑

a∈A

π(a)‖a‖^2 V(π∗)− 1 −d.

For an arbitrarya ∈ A, choosing πto be the Dirac ata∈ Aproves that ‖a‖^2 V(π∗)− 1 ≤d. Henceg(π∗)≤d. Sinceg(π)≥dfor allπby Eq. (21.5), it follows thatπ∗is a minimizer ofgand thatminπg(π) =d. (c) =⇒(b): Suppose that g(π∗) =d. Then for anyπ,

〈∇f(π∗),π−π∗〉=

∑

a∈A

π(a)‖a‖V(π∗)− 1 −d≤ 0.

And it follows thatπ∗is a maximizer offby the first order optimality conditions
and the concavity off. That (a) =⇒(c) is now trivial. To prove the second
part of the theorem letπ∗be a minimizer ofg, which by the previous part is a
maximizer off. LetS=Supp(π∗) and suppose that|S|> d(d+ 1)/2. Since the
dimension of the subspace ofd×dsymmetric matrices isd(d+ 1)/2, there must
be a nonzero functionv:A→Rwith Supp(v)⊆Ssuch that
∑

a∈S

v(a)aa>= 0. (21.6)

Notice that for anya ∈Sthe first order optimality conditions ensure that ‖a‖^2 V(π∗)− 1 =d(Exercise 21.5). Hence

d

∑

a∈S

v(a) =

∑

a∈S

v(a)‖a‖^2 V(π∗)− 1 = 0,

where the last equality follows from Eq. (21.6). Letπ(t) =π∗+tvand assume
‖v‖ 2 = 1 and letT= max{t:π(t)∈PA}, which exists sincev 6 = 0. Then

f(π(T)) =f(π∗) +

∫T

0

d dtf(π(t))dt

=f(π∗) +

∫T

0

〈∇f(π(t)),v〉dt

=f(π∗) +

∫T

0

∑

a∈A

v(a)‖a‖^2 V(π(t))− 1 dt

=f(π∗),

where we used the fact thatV(π(t)) is constant and‖a‖^2 V(π(t))− 1 =dfor all
a∈Supp(v). Henceπ(T) also maximizes f. The claim follows by checking
that thanks to

∑

a∈Sv(a) = 0,|Supp(π(T))|<|Supp(π

∗)|and then using induction.

Bandit Algorithms

∑

∑

=

∑

∑

∑

∑

∫T

∫T

∫T

∑

∑

Get our desktop app

Company

Features

Documentation

Resources