Bandit Algorithms

31.1 Adversarial bandits 355

implementation of Exp4 is linear in the number of experts, which even for modestly sizedmis very large indeed.

Online stochastic mirror descent
The computational issues faced by Exp4 are most easily overcome using the
tools from online convex optimization developed in Chapter 28. The idea is to
use online stochastic mirror descent and the unnormalized negentropy potential.
Without further modification this would be Exp3, which you will show does not
work for nonstationary bandits (Exercise 31.3). The trick is to restrict the action
set to the clipped simplexA=Pk− 1 ∩[α,1]kwhereα∈[0, 1 /k] is a constant to
be tuned subsequently. The clipping ensures the algorithm does not commit too
hard to any single arm. The rationale is that a strong commitment could prevent
this discovery of change points.
LetF: [0,∞)k→Rbe the unnormalized negentropy potential andP 1 ∈Abe
the uniform probability vector. In each roundtthe learner samplesAt∼Ptand
updates its sampling distribution using
Pt+1= argminp∈Aη〈p,Yˆt〉+DF(p,Pt), (31.3)

whereη >0 is the learning rate andYˆti=I{At=i}yti/Ptiis the importance-
weighted estimator of the loss of actioni for roundt. The solution to the
optimization problem of Eq. (31.3) can be computed efficiently using the two-step
process:

P ̃t+1= argminp∈[0,∞)kη〈p,Yˆt〉+DF(p,Pt), Pt+1= argminp∈ADF(p,P ̃t+1).

The first of these subproblems can be evaluated analytically, yieldingP ̃t+1,i=
Ptiexp(−ηYˆti). The second can be solved efficiently using the result in
Exercise 26.11. The algorithm enjoys the following guarantee on its regret:

Theorem31.1. The expected regret of the policy samplingAt∼PtwithPt
defined in Eq.(31.3)is bounded by

Rnm≤αn(k−1) +

mlog(1/α) η +

ηnk 2. Proof Leta∗∈argmina∈Γnm

∑n
t=1ytatbe an optimal sequence of actions in
hindsight constrained to Γnm. Then let 1 =t 1 < t 2 <···< tm< tm+1=n+ 1
so thata∗tis constant on each interval{ti,...,ti+1− 1 }. We abuse notation by
writinga∗i=a∗ti. Then the regret decomposes into

Rnm=E

[n ∑

t=1

(ytAt−yta∗t)

]

=E

[m ∑

i=1

ti+1∑− 1

t=ti

(ytAt−yta∗t)

]

=

∑m

i=1

E

[

E

[ti+1− 1 ∑

t=ti

(ytAt−yta∗i)

∣∣

∣

Pti

]]

.

Bandit Algorithms

]

=E

]

=

E

[

E

∣∣

∣∣

∣

]]

.

Get our desktop app

Company

Features

Documentation

Resources