Bandit Algorithms

30.3 Semibandit feedback 343

wherePti=E[Ati|Ft− 1 ] withFt=σ(A 1 ,...,At). An easy calculation shows
thatE[Yˆti|Ft− 1 ] =yti.

1:Input A,η,F 2:A ̄ 1 = argmina∈co(A)F(a) 3:fort= 1,...,ndo 4: Choose distributionPtonAsuch that

∑

a∈APt(a)a=A ̄t 5: SampleAt∼Ptand observeAt 1 yt 1 ,...,Atdytd 6: ComputeYˆti=Atiyti/Ptifor alli∈[d] 7: UpdateA ̄t+1= argmina∈co(A)η〈a,Yˆt〉+DF(a,A ̄t) 8:end for Algorithm 17:Online stochastic mirror descent for semibandits.

Theorem30.2.LetF:Rd→Rbe the unnormalized negentropy potential defined
by

F(a) =

∑d

i=1

(ailog(ai)−ai).

If Algorithm 17 is run withη=

√

2 m(1 + log(d/m))/(nd), then Rn≤

√

2 nmd(1 + log(d/m)). Proof Recall from Chapter 28 that for Legendre potentials the optimization problem forA ̄t+1can be written in a two-step process: ∇F(A ̃t+1) =∇F(A ̄t)−ηYˆt A ̄t+1= argmina∈co(A)DF(a,A ̃t+1). By Theorem 28.10,

Rn≤

diamF(co(A)) η

+

1

η

∑n

t=1

E

[

DF∗(∇F(A ̃t+1),∇F(A ̄t))

]

.

The Legendre-Fenchel dual isF∗(u) =

∑d
i=1exp(ui) and the Bregman divergence
with respect to this potential is

DF∗(u,v) =

∑d

i=1

(exp(ui)−exp(vi))−

∑d

i=1

(ui−vi) exp(vi).

Since∇F(a)i= log(ai),

DF∗(∇F(A ̃t+1),∇F(A ̄t)) =

∑d

i=1

(A ̃t+1,i−A ̄ti) +

∑d

i=1

ηA ̄tiYˆti

=

∑d

i=1

A ̄ti

(

exp(−ηYˆti)−1 +ηYˆti

)

≤

η^2 2

∑d

i=1

A ̄tiYˆti^2.

Bandit Algorithms

∑

√

√

+

1

E

[

]

.

=

(

)

≤

Get our desktop app

Company

Features

Documentation

Resources