30.3 Semibandit feedback 343
wherePti=E[Ati|Ft− 1 ] withFt=σ(A 1 ,...,At). An easy calculation shows
thatE[Yˆti|Ft− 1 ] =yti.
1:Input A,η,F
2:A ̄ 1 = argmina∈co(A)F(a)
3:fort= 1,...,ndo
4: Choose distributionPtonAsuch that
∑
a∈APt(a)a=A ̄t
5: SampleAt∼Ptand observeAt 1 yt 1 ,...,Atdytd
6: ComputeYˆti=Atiyti/Ptifor alli∈[d]
7: UpdateA ̄t+1= argmina∈co(A)η〈a,Yˆt〉+DF(a,A ̄t)
8:end for
Algorithm 17:Online stochastic mirror descent for semibandits.
Theorem30.2.LetF:Rd→Rbe the unnormalized negentropy potential defined
by
F(a) =
∑d
i=1
(ailog(ai)−ai).
If Algorithm 17 is run withη=
√
2 m(1 + log(d/m))/(nd), then
Rn≤
√
2 nmd(1 + log(d/m)).
Proof Recall from Chapter 28 that for Legendre potentials the optimization
problem forA ̄t+1can be written in a two-step process:
∇F(A ̃t+1) =∇F(A ̄t)−ηYˆt A ̄t+1= argmina∈co(A)DF(a,A ̃t+1).
By Theorem 28.10,
Rn≤
diamF(co(A))
η
+
1
η
∑n
t=1
E
[
DF∗(∇F(A ̃t+1),∇F(A ̄t))
]
.
The Legendre-Fenchel dual isF∗(u) =
∑d
i=1exp(ui) and the Bregman divergence
with respect to this potential is
DF∗(u,v) =
∑d
i=1
(exp(ui)−exp(vi))−
∑d
i=1
(ui−vi) exp(vi).
Since∇F(a)i= log(ai),
DF∗(∇F(A ̃t+1),∇F(A ̄t)) =
∑d
i=1
(A ̃t+1,i−A ̄ti) +
∑d
i=1
ηA ̄tiYˆti
=
∑d
i=1
A ̄ti
(
exp(−ηYˆti)−1 +ηYˆti
)
≤
η^2
2
∑d
i=1
A ̄tiYˆti^2.