30.4 Follow the perturbed leader
admits an efficient solution. This feels like the minimum one could get away
with: if the static problem is too hard, it seems unlikely that an online
algorithm could be efficient. In fact, an online algorithm with low regret
could be used to approximate the solution to the static problem.
Follow the perturbed leader (FTPL) operates by estimating the cumulative
losses observed so far. In each round the estimates are randomly perturbed and
the algorithm solves Eq. (30.2) using the perturbed estimates. Let
$\hat L_t = \sum_{s=1}^t \hat Y_s$ be the cumulative loss estimate after round
$t$. Then FTPL chooses
\[
A_{t+1} = \operatorname{argmin}_{a \in \mathcal{A}} \langle a, \eta \hat L_t - Z_{t+1} \rangle, \tag{30.3}
\]
where $\eta > 0$ is the learning rate and $Z_{t+1} \in \mathbb{R}^d$ is sampled from a carefully chosen
distribution $Q$. The random perturbations introduce exploration, which for
appropriate perturbation distributions is sufficient to guarantee small regret.
Notice that if $\eta$ is small, then the effect of $Z_{t+1}$ is relatively larger and the
algorithm can be expected to explore more, which is consistent with the role of
the learning rate in mirror descent or exponential weighting studied in previous
chapters.
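To make the update concrete, the following is a minimal sketch of one round of
Eq. (30.3) in Python/NumPy. The function name ftpl_action, the representation
of the action set as rows of a matrix and the exponential choice of $Q$ are
assumptions made only for illustration; for genuinely combinatorial action sets
the argmin would be computed by the same efficient linear optimization oracle
used for the static problem, not by enumeration.

import numpy as np

def ftpl_action(actions, cum_loss_estimate, eta, rng):
    # One round of follow the perturbed leader, Eq. (30.3).
    #   actions           : (k, d) array whose rows enumerate the action set A
    #   cum_loss_estimate : hat L_t, the cumulative loss estimate in R^d
    #   eta               : learning rate, eta > 0
    #   rng               : numpy Generator used to sample Z_{t+1} from Q
    # Sample the perturbation Z_{t+1}; a coordinate-wise exponential
    # distribution is used here purely for illustration.
    z = rng.exponential(scale=1.0, size=cum_loss_estimate.shape)
    # A_{t+1} = argmin_{a in A} <a, eta * hat L_t - Z_{t+1}>
    scores = actions @ (eta * cum_loss_estimate - z)
    return actions[np.argmin(scores)]

For instance, with actions = np.eye(d) the rule reduces to playing the arm
whose perturbed cumulative loss estimate is smallest.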
Before defining the loss estimators and the perturbation distribution we make a
connection between FTPL and mirror descent. Given a Legendre potential $F$ with
$\operatorname{dom}(\nabla F) = \operatorname{int}(\operatorname{co}(\mathcal{A}))$, online stochastic mirror descent chooses $\bar A_{t+1}$ so that
\[
\bar A_{t+1} = \operatorname{argmin}_{a \in \operatorname{co}(\mathcal{A})} \left( \langle a, \eta \hat Y_t \rangle + D_F(a, \bar A_t) \right).
\]
Taking derivatives and using the fact that $\operatorname{dom}(\nabla F) = \operatorname{int}(\operatorname{co}(\mathcal{A}))$, we have
\[
\nabla F(\bar A_{t+1}) = \nabla F(\bar A_t) - \eta \hat Y_t = -\eta \hat L_t.
\]
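For readers who want the intermediate step spelled out: because $\operatorname{dom}(\nabla F) = \operatorname{int}(\operatorname{co}(\mathcal{A}))$ forces $\bar A_{t+1}$ to lie in the interior, the gradient of the objective vanishes there. Using $D_F(a, b) = F(a) - F(b) - \langle \nabla F(b), a - b \rangle$,
\[
0 = \eta \hat Y_t + \nabla F(\bar A_{t+1}) - \nabla F(\bar A_t),
\]
which is the first equality above; unrolling the recursion (with $\nabla F(\bar A_1) = 0$, as when $\bar A_1 = \operatorname{argmin}_{a \in \operatorname{co}(\mathcal{A})} F(a)$) gives the second.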
By duality this implies that $\bar A_{t+1} = \nabla F^*(-\eta \hat L_t)$. Examining Eq. (30.3), we see
that
\[
\bar A_{t+1} = \mathbb{E}[A_{t+1} \mid \mathcal{F}_t] = \mathbb{E}\left[ \operatorname{argmin}_{a \in \mathcal{A}} \langle a, \eta \hat L_t - Z_{t+1} \rangle \,\Big|\, \mathcal{F}_t \right],
\]
where $\mathcal{F}_t = \sigma(Z_1, \ldots, Z_t)$. In order to view FTPL as an instance of mirror descent
we find a Legendre potential $F$ with
\[
\nabla F^*(-\eta \hat L_t) = \mathbb{E}\left[ \operatorname{argmin}_{a \in \mathcal{A}} \langle a, \eta \hat L_t - Z \rangle \right] = \mathbb{E}\left[ \operatorname{argmax}_{a \in \mathcal{A}} \langle a, Z - \eta \hat L_t \rangle \right],
\]
which is equivalent to $\nabla F^*(x) = \mathbb{E}[\operatorname{argmax}_{a \in \operatorname{co}(\mathcal{A})} \langle a, x + Z \rangle]$. To remove clutter
in the notation, define
\[
a(x) = \operatorname{argmax}_{a \in \mathcal{A}} \langle a, x \rangle.
\]
Readers with some familiarity with convex analysis will remember that if
$\phi(x) = \max_{a \in \mathcal{A}} \langle a, x \rangle$ is the support function and $\mathcal{A}$ has a smooth boundary,
then $\nabla \phi(x) = a(x)$. For combinatorial bandits $\mathcal{A}$ is not smooth, but if $Q$ is
absolutely continuous with respect to the Lebesgue measure, then you will show
in Exercise 30.3 that it is nevertheless true that
\[
\nabla \mathbb{E}[\phi(x + Z)] = \mathbb{E}[a(x + Z)].
\]
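As an informal numerical sanity check of this identity (not a substitute for
Exercise 30.3), the following Python/NumPy sketch compares a finite-difference
estimate of $\nabla \mathbb{E}[\phi(x+Z)]$ with a Monte Carlo estimate of $\mathbb{E}[a(x+Z)]$. The
function name check_smoothed_gradient and the standard Gaussian choice of $Q$
(which is absolutely continuous with respect to the Lebesgue measure) are
assumptions for this illustration only.

import numpy as np

def check_smoothed_gradient(actions, x, n=100_000, eps=1e-3, seed=0):
    # Compare grad E[phi(x + Z)] with E[a(x + Z)] for
    # phi(x) = max_{a in A} <a, x>, with A given by the rows of `actions`
    # and Z standard Gaussian (an illustrative, absolutely continuous Q).
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    zs = rng.standard_normal((n, d))          # samples of Z

    def mean_support(y):
        # Monte Carlo estimate of E[phi(y + Z)].
        return np.max((y + zs) @ actions.T, axis=1).mean()

    # Monte Carlo estimate of E[a(x + Z)].
    idx = np.argmax((x + zs) @ actions.T, axis=1)
    mean_argmax = actions[idx].mean(axis=0)

    # Central finite differences for grad E[phi(x + Z)], reusing the same
    # perturbation samples in every coordinate direction.
    grad = np.empty(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        grad[i] = (mean_support(x + e) - mean_support(x - e)) / (2 * eps)
    return grad, mean_argmax

# Example: A = {0, 1}^2, so co(A) is the unit square, whose boundary is not
# smooth, yet the two estimates agree up to Monte Carlo error.
A = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(check_smoothed_gradient(A, x=np.array([0.3, -0.2])))

The agreement here is of course only approximate; the exercise asks for a proof.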