Bandit Algorithms

(Jeff_L) #1
30.3 Semibandit feedback 343

wherePti=E[Ati|Ft− 1 ] withFt=σ(A 1 ,...,At). An easy calculation shows
thatE[Yˆti|Ft− 1 ] =yti.


1:Input A,η,F
2:A ̄ 1 = argmina∈co(A)F(a)
3:fort= 1,...,ndo
4: Choose distributionPtonAsuch that


a∈APt(a)a=A ̄t
5: SampleAt∼Ptand observeAt 1 yt 1 ,...,Atdytd
6: ComputeYˆti=Atiyti/Ptifor alli∈[d]
7: UpdateA ̄t+1= argmina∈co(A)η〈a,Yˆt〉+DF(a,A ̄t)
8:end for
Algorithm 17:Online stochastic mirror descent for semibandits.

Theorem30.2.LetF:Rd→Rbe the unnormalized negentropy potential defined
by


F(a) =

∑d

i=1

(ailog(ai)−ai).

If Algorithm 17 is run withη=



2 m(1 + log(d/m))/(nd), then
Rn≤


2 nmd(1 + log(d/m)).
Proof Recall from Chapter 28 that for Legendre potentials the optimization
problem forA ̄t+1can be written in a two-step process:
∇F(A ̃t+1) =∇F(A ̄t)−ηYˆt A ̄t+1= argmina∈co(A)DF(a,A ̃t+1).
By Theorem 28.10,

Rn≤

diamF(co(A))
η

+


1


η

∑n

t=1

E


[


DF∗(∇F(A ̃t+1),∇F(A ̄t))

]


.


The Legendre-Fenchel dual isF∗(u) =


∑d
i=1exp(ui) and the Bregman divergence
with respect to this potential is


DF∗(u,v) =

∑d

i=1

(exp(ui)−exp(vi))−

∑d

i=1

(ui−vi) exp(vi).

Since∇F(a)i= log(ai),

DF∗(∇F(A ̃t+1),∇F(A ̄t)) =

∑d

i=1

(A ̃t+1,i−A ̄ti) +

∑d

i=1

ηA ̄tiYˆti

=


∑d

i=1

A ̄ti

(


exp(−ηYˆti)−1 +ηYˆti

)



η^2
2

∑d

i=1

A ̄tiYˆti^2.
Free download pdf