Bandit Algorithms

28.7 Exercises

(b) Prove that any algorithm satisfying Eq. (28.14) also satisfies

$$R_n \le k + C\sqrt{k\left(1 + \min_{a\in[k]}\sum_{t=1}^n y_{ta}\right)\log\left(\frac{n\vee k}{k}\right)}\,,$$

where $C$ is a suitably large universal constant.

Hint: For tuning the learning rate you might take inspiration from Theorem 18.3.

The algorithm in this exercise is a simplified variant of the algorithm analyzed
by Wei and Luo [2018].

28.13 (Minimax regret for finite-armed adversarial bandits) Let $(y_t)_{t=1}^n$ be a sequence of loss vectors with $y_t \in [0,1]^k$ for all $t$ and let $F(a) = -2\sum_{i=1}^k \sqrt{a_i}$. Consider the instance of FTRL for $k$-armed adversarial bandits that samples $A_t \in [k]$ from $P_t$ defined by

$$P_t = \operatorname*{argmin}_{p \in \mathcal{P}_{k-1}} \, \eta \sum_{s=1}^{t-1} \langle p, \hat{Y}_s \rangle + F(p)\,,$$

where $\hat{Y}_{si} = \mathbb{I}\{A_s = i\}\, y_{si} / P_{si}$ is the importance-weighted estimator of $y_{si}$ and $\eta > 0$ is the learning rate.
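As a quick sanity check (a standard one-line computation, not part of the exercise text), the estimator is conditionally unbiased: since $A_s$ is sampled from $P_s$,

$$\mathbb{E}\bigl[\hat{Y}_{si} \mid P_s\bigr] = \sum_{j=1}^k P_{sj}\,\frac{\mathbb{I}\{j=i\}\, y_{si}}{P_{si}} = P_{si}\cdot\frac{y_{si}}{P_{si}} = y_{si}.$$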


(a) Show that

$$P_{ti} = \left(\lambda + \eta \sum_{s=1}^{t-1} \hat{Y}_{si}\right)^{-2},$$

where $\lambda \in \mathbb{R}$ is the largest value such that $P_t \in \mathcal{P}_{k-1}$.

(b) Show that $P_{t+1,A_t} \le P_{t,A_t}$ for all $t \in [n-1]$.
(c) Show that $\nabla^2 F(x) = \frac{1}{2}\operatorname{diag}(x^{-3/2})$.
(d) Show that $\operatorname{diam}_F(\mathcal{A}) \le 2\sqrt{k}$.
(e) Prove that the regret of this algorithm is bounded by $R_n \le \sqrt{8kn}$.
(f) What happens if you use mirror descent instead of FTRL? Are the resulting algorithms the same? And if not, what can you prove for mirror descent?
(g) Explain how you would implement this algorithm (one possible sketch follows below).
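For part (g), here is a minimal sketch of one way to implement the sampler, assuming the closed form of part (a): $\lambda$ is found by bisection, since the total mass $\sum_i (\lambda + \eta\hat{L}_i)^{-2}$ is strictly decreasing in $\lambda$. The function names, the bisection tolerance, and the tuning $\eta = \sqrt{2/n}$ (one choice consistent with the bound in part (e)) are illustrative assumptions, not the book's code.

```python
import numpy as np

def inf_probabilities(eta_L, tol=1e-12):
    # eta_L[i] = eta * sum_{s<t} Yhat_si, the scaled cumulative loss estimates.
    # By part (a), P_ti = (lam + eta_L[i])**(-2) with lam the unique value
    # making the entries sum to one; the mass is decreasing in lam, so
    # bisection finds it.
    m = eta_L.min()
    lo = 1.0 - m                   # mass >= 1: the minimising arm alone contributes 1
    hi = np.sqrt(len(eta_L)) - m   # mass <= k * (1/sqrt(k))**2 = 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.sum((mid + eta_L) ** -2.0) > 1.0:
            lo = mid               # too much mass: lambda must grow
        else:
            hi = mid
    p = (hi + eta_L) ** -2.0
    return p / p.sum()             # absorb the residual bisection error

def inf_forecaster(losses, eta, rng):
    # losses: (n, k) array of loss vectors y_t in [0,1]^k.
    n, k = losses.shape
    eta_L = np.zeros(k)
    total = 0.0
    for t in range(n):
        p = inf_probabilities(eta_L)
        a = rng.choice(k, p=p)
        total += losses[t, a]
        # importance-weighted update: Yhat_ti = I{A_t = i} * y_ti / P_ti
        eta_L[a] += eta * losses[t, a] / p[a]
    return total

# Illustrative usage with eta = sqrt(2/n):
rng = np.random.default_rng(0)
n, k = 10_000, 5
losses = rng.uniform(size=(n, k))
print(inf_forecaster(losses, eta=np.sqrt(2.0 / n), rng=rng))
```

The bisection brackets come from the closed form: at $\lambda = 1 - \min_i \eta\hat{L}_i$ the minimising arm alone has probability one, while at $\lambda = \sqrt{k} - \min_i \eta\hat{L}_i$ every entry is at most $1/k$, so the root lies between them.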


The algorithm in the above exercise is called the implicitly normalized forecaster and was introduced by Audibert and Bubeck [2009]. At first it went unnoticed that this algorithm was an instance of mirror descent, and the proof was consequently much more complicated. More details are in the book by Bubeck and Cesa-Bianchi [2012].