30.3 Semibandit feedback 343wherePti=E[Ati|Ft− 1 ] withFt=σ(A 1 ,...,At). An easy calculation shows
thatE[Yˆti|Ft− 1 ] =yti.
1:Input A,η,F
2:A ̄ 1 = argmina∈co(A)F(a)
3:fort= 1,...,ndo
4: Choose distributionPtonAsuch that∑
a∈APt(a)a=A ̄t
5: SampleAt∼Ptand observeAt 1 yt 1 ,...,Atdytd
6: ComputeYˆti=Atiyti/Ptifor alli∈[d]
7: UpdateA ̄t+1= argmina∈co(A)η〈a,Yˆt〉+DF(a,A ̄t)
8:end for
Algorithm 17:Online stochastic mirror descent for semibandits.Theorem30.2.LetF:Rd→Rbe the unnormalized negentropy potential defined
by
F(a) =∑di=1(ailog(ai)−ai).If Algorithm 17 is run withη=
√
2 m(1 + log(d/m))/(nd), then
Rn≤√
2 nmd(1 + log(d/m)).
Proof Recall from Chapter 28 that for Legendre potentials the optimization
problem forA ̄t+1can be written in a two-step process:
∇F(A ̃t+1) =∇F(A ̄t)−ηYˆt A ̄t+1= argmina∈co(A)DF(a,A ̃t+1).
By Theorem 28.10,Rn≤diamF(co(A))
η+
1
η∑nt=1E
[
DF∗(∇F(A ̃t+1),∇F(A ̄t))]
.
The Legendre-Fenchel dual isF∗(u) =
∑d
i=1exp(ui) and the Bregman divergence
with respect to this potential is
DF∗(u,v) =∑di=1(exp(ui)−exp(vi))−∑di=1(ui−vi) exp(vi).Since∇F(a)i= log(ai),DF∗(∇F(A ̃t+1),∇F(A ̄t)) =∑di=1(A ̃t+1,i−A ̄ti) +∑di=1ηA ̄tiYˆti=
∑di=1A ̄ti(
exp(−ηYˆti)−1 +ηYˆti)
≤
η^2
2∑di=1A ̄tiYˆti^2.