Bandit Algorithms

30.4 Follow the perturbed leader 344

where the inequality follows from the fact thatexp(−x)≤ 1 −x+x^2 /2 for all x≥0. Taking the expectation leads to

E

[d ∑

i=1

A ̄tiYˆti^2

]

=E

[d ∑

i=1

y^2 tiAti A ̄ti

]

≤d.

The diameter is easily bounded by noting thatFis negative inco(A) and using the Cauchy-Schwarz inequality:

diamF(co(A)) = sup a∈co(A)

∑d

i=1

(

ailog(ai)−ai+A ̄ 1 i+A ̄ 1 ilog

(

1

A ̄ 1 i

))

≤m+

∑d

i=1

A ̄ 1 ilog

(

1

A ̄ 1 i

)

≤m(1 + log(d/m)).

Putting together the pieces shows that

Rn≤

m(1 + log(d/m)) η

+

ηnd 2

=

√

2 nmd(1 + log (d/m)).

Algorithm 17 plays mirror descent on the convex hull of the actions, which has dimensiond−1. In principle it would be possible to do the same thing on the set of distributions over actions, which has dimension|A|−1. Repeating the analysis leads to a suboptimal regret ofO(m

√

dnlog(d/m)). We encourage the reader to go through this calculation to see where things go wrong.

Like in Section 30.2, the main problem is computation. In each round the algorithm needs to find a distributionPtoverAsuch that

∑

a∈APt(a) =A ̄t. Feasibility follows from the definition ofco(A) while Carath ́eodory’s theorem proves the support ofPtnever needs to be larger thand+ 1. SinceAis finite we can write this problem in terms of linear constraints, but naively the computation complexity is polynomial ink, which is exponential inm. The algorithm also needs to computeA ̄t+1fromA ̄tandYˆt. This is a convex optimization problem, but the computation complexity depends on the representation ofAand may be intractable.

30.4 Follow the perturbed leader

In this section we describe an efficient algorithm for online combinatorial optimization under the assumption that for ally∈[0,∞)dthe optimization problem of finding

a∗= argmina∈A〈a,y〉 (30.2)

Bandit Algorithms

E

]

=E

]

(

(

1

))

(

1

)

+

=

√

√

∑

30.4 Follow the perturbed leader

Get our desktop app

Company

Features

Documentation

Resources