Bandit Algorithms

28.7 Exercises

(b) Prove that any algorithm satisfying Eq. (28.14) also satisfies

$$R_n \le k + C\sqrt{k\left(1 + \min_{a\in[k]}\sum_{t=1}^n y_{ta}\right)\log\left(\frac{n\vee k}{k}\right)}\,,$$

where $C$ is a suitably large universal constant.

Hint: For tuning the learning rate you might take inspiration from Theorem 18.3.

The algorithm in this exercise is a simplified variant of the algorithm analyzed
by Wei and Luo [2018].

28.13 (Minimax regret for finite-armed adversarial bandits) Let $(y_t)_{t=1}^n$ be a sequence of loss vectors with $y_t \in [0,1]^k$ for all $t$ and let $F(a) = -2\sum_{i=1}^k \sqrt{a_i}$. Consider the instance of FTRL for $k$-armed adversarial bandits that samples $A_t \in [k]$ from $P_t$ defined by

$$P_t = \operatorname*{argmin}_{p \in \mathcal{P}_{k-1}} \, \eta \sum_{s=1}^{t-1} \langle p, \hat{Y}_s \rangle + F(p)\,,$$

where $\hat{Y}_{si} = \mathbb{I}\{A_s = i\}\, y_{si} / P_{si}$ is the importance-weighted estimator of $y_{si}$ and $\eta > 0$ is the learning rate.
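As a quick sanity check (a standard one-line computation, not part of the exercise text), the estimator is conditionally unbiased: since $A_s$ is sampled from $P_s$,

$$\mathbb{E}\bigl[\hat{Y}_{si} \mid P_s\bigr] = \sum_{j=1}^k P_{sj}\,\frac{\mathbb{I}\{j=i\}\, y_{si}}{P_{si}} = P_{si}\cdot\frac{y_{si}}{P_{si}} = y_{si}.$$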


(a) Show that

$$P_{ti} = \left(\lambda + \eta \sum_{s=1}^{t-1} \hat{Y}_{si}\right)^{-2},$$

where $\lambda \in \mathbb{R}$ is the largest value such that $P_t \in \mathcal{P}_{k-1}$.

(b) Show that $P_{t+1,A_t} \le P_{t,A_t}$ for all $t \in [n-1]$.
(c) Show that $\nabla^2 F(x) = \frac{1}{2}\operatorname{diag}(x^{-3/2})$.
(d) Show that $\operatorname{diam}_F(\mathcal{A}) \le 2\sqrt{k}$.
(e) Prove that the regret of this algorithm is bounded by $R_n \le \sqrt{8kn}$.
(f) What happens if you use mirror descent instead of FTRL? Are the resulting algorithms the same? And if not, what can you prove for mirror descent?
(g) Explain how you would implement this algorithm (one possible sketch follows below).
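For part (g), here is a minimal sketch of one way to implement the sampler, assuming the closed form of part (a): $\lambda$ is found by bisection, since the total mass $\sum_i (\lambda + \eta\hat{L}_i)^{-2}$ is strictly decreasing in $\lambda$. The function names, the bisection tolerance, and the tuning $\eta = \sqrt{2/n}$ (one choice consistent with the bound in part (e)) are illustrative assumptions, not the book's code.

```python
import numpy as np

def inf_probabilities(eta_L, tol=1e-12):
    # eta_L[i] = eta * sum_{s<t} Yhat_si, the scaled cumulative loss estimates.
    # By part (a), P_ti = (lam + eta_L[i])**(-2) with lam the unique value
    # making the entries sum to one; the mass is decreasing in lam, so
    # bisection finds it.
    m = eta_L.min()
    lo = 1.0 - m                   # mass >= 1: the minimising arm alone contributes 1
    hi = np.sqrt(len(eta_L)) - m   # mass <= k * (1/sqrt(k))**2 = 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.sum((mid + eta_L) ** -2.0) > 1.0:
            lo = mid               # too much mass: lambda must grow
        else:
            hi = mid
    p = (hi + eta_L) ** -2.0
    return p / p.sum()             # absorb the residual bisection error

def inf_forecaster(losses, eta, rng):
    # losses: (n, k) array of loss vectors y_t in [0,1]^k.
    n, k = losses.shape
    eta_L = np.zeros(k)
    total = 0.0
    for t in range(n):
        p = inf_probabilities(eta_L)
        a = rng.choice(k, p=p)
        total += losses[t, a]
        # importance-weighted update: Yhat_ti = I{A_t = i} * y_ti / P_ti
        eta_L[a] += eta * losses[t, a] / p[a]
    return total

# Illustrative usage with eta = sqrt(2/n):
rng = np.random.default_rng(0)
n, k = 10_000, 5
losses = rng.uniform(size=(n, k))
print(inf_forecaster(losses, eta=np.sqrt(2.0 / n), rng=rng))
```

The bisection brackets come from the closed form: at $\lambda = 1 - \min_i \eta\hat{L}_i$ the minimising arm alone has probability one, while at $\lambda = \sqrt{k} - \min_i \eta\hat{L}_i$ every entry is at most $1/k$, so the root lies between them.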


The algorithm in the above exercise is called the implicitly normalized forecaster and was introduced by Audibert and Bubeck [2009]. At first it went unnoticed that this algorithm was an instance of mirror descent, and the proof was consequently much more complicated. More details are in the book by Bubeck and Cesa-Bianchi [2012].