Bandit Algorithms


30.7 Exercises


30.1 (Mirror descent for combinatorial bandits) Prove Theorem 30.1.

30.2 (Expected supremum norm of Laplace) Let $Z$ be sampled from the
measure on $\mathbb{R}^d$ with density $f(z) = 2^{-d}\exp(-\|z\|_1)$. The purpose of
this exercise is to show that
\[
\mathbb{E}[\|Z\|_\infty] = \sum_{i=1}^d \frac{1}{i}\,. \tag{30.13}
\]


(a) Let $X_1,\ldots,X_d$ be independent standard exponentials. Show that
$\|Z\|_\infty$ and $\max\{X_1,\ldots,X_d\}$ have the same law.
(b) Let $M_j = \max_{i \le j} X_i$. Prove for $j \ge 2$ that
\[
\mathbb{E}[M_j] = \mathbb{E}[M_{j-1}] + \mathbb{E}[\exp(-M_{j-1})]\,.
\]

(c) Prove by induction or otherwise that for all $a, j \in \{1, 2, \ldots\}$,
\[
\mathbb{E}[\exp(-a M_j)] = \frac{a!}{\prod_{b=1}^{a}(j+b)}\,.
\]


(d) Prove the claim in Eq. (30.13).
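
As an informal sanity check of Eq. (30.13), separate from the proof the
exercise asks for, one can compare a Monte Carlo estimate of
$\mathbb{E}[\|Z\|_\infty]$ against the harmonic sum. The sketch below assumes
numpy; the dimension and sample count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 5, 200_000

# Independent Laplace(0, 1) coordinates give exactly the joint density
# f(z) = 2^{-d} exp(-||z||_1) from the exercise statement.
Z = rng.laplace(loc=0.0, scale=1.0, size=(n_samples, d))
empirical = np.abs(Z).max(axis=1).mean()

harmonic = sum(1.0 / i for i in range(1, d + 1))
print(f"Monte Carlo E[||Z||_inf] = {empirical:.4f}, harmonic sum = {harmonic:.4f}")
```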

30.3 (Gradient of expected support function) Let $A \subset \mathbb{R}^d$ be a compact
convex set and $\phi(x) = \max_{a \in A} \langle a, x \rangle$ its support function. Let $Q$ be a measure
on $\mathbb{R}^d$ that is absolutely continuous with respect to the Lebesgue measure and
let $Z \sim Q$. Show that
\[
\nabla \mathbb{E}[\phi(x + Z)] = \mathbb{E}\Big[\operatorname*{argmax}_{a \in A} \langle a, x + Z \rangle\Big]\,.
\]
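
The identity can be checked numerically for a concrete choice of $A$. The
sketch below, assuming numpy, takes $A$ to be the probability simplex, so that
$\phi(x) = \max_i x_i$ and the argmax is a standard basis vector, and compares
a finite-difference gradient of $\mathbb{E}[\phi(x + Z)]$ with the empirical
distribution of the argmax. The Gaussian choice of $Q$ and all constants are
illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_samples, eps = 3, 500_000, 1e-3
x = np.array([0.3, 0.1, -0.2])
Z = rng.normal(size=(n_samples, d))  # Q = standard Gaussian, absolutely continuous

# Right-hand side: E[argmax_{a in A} <a, x + Z>], an average of basis vectors.
idx = (x + Z).argmax(axis=1)
rhs = np.bincount(idx, minlength=d) / n_samples

# Left-hand side: central finite differences of E[phi(x + Z)], reusing the
# same Z (common random numbers) to keep the Monte Carlo noise small.
def smoothed(y):
    return (y + Z).max(axis=1).mean()

lhs = np.array([(smoothed(x + eps * e) - smoothed(x - eps * e)) / (2 * eps)
                for e in np.eye(d)])
print("finite-difference gradient:", lhs)
print("E[argmax]:                 ", rhs)
```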

30.4 (Minimax bound for combinatorial semibandits) Adapt the analysis
in Exercise 28.13 to derive an algorithm for combinatorial bandits with semibandit
feedback for which the regret is $R_n \le C\sqrt{mdn}$ for a universal constant $C > 0$.
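
For context, analyses of this kind are typically built around the
importance-weighted loss estimator for semibandit feedback, where each chosen
coordinate's observed loss is divided by its marginal inclusion probability.
A minimal sketch of that estimator follows; the names and interface are
illustrative assumptions, not taken from Exercise 28.13.

```python
import numpy as np

def semibandit_loss_estimate(action, losses, probs):
    """Unbiased importance-weighted estimate of the full loss vector.

    action: vector in {0,1}^d with exactly m ones (the played action).
    losses: d-vector whose entries are only meaningful where action == 1,
            as dictated by semibandit feedback.
    probs:  marginal inclusion probabilities P(action_i = 1) under the
            algorithm's sampling distribution; assumed positive where used.
    """
    est = np.zeros_like(probs, dtype=float)
    chosen = action.astype(bool)
    # E[action_i * losses_i / probs_i] = losses_i, so est is unbiased.
    est[chosen] = losses[chosen] / probs[chosen]
    return est

# Example: d = 4, m = 2, with marginal inclusion probabilities m/d = 0.5.
a = np.array([1, 0, 1, 0])
y = np.array([0.2, 0.0, 0.7, 0.0])
print(semibandit_loss_estimate(a, y, np.full(4, 0.5)))  # [0.4, 0, 1.4, 0]
```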

30.5 (Lower bound for combinatorial semibandits) Let $m \ge 1$ and assume
that $d = km$ for some $k > 1$. Prove that for any algorithm there exists a combinatorial
semibandit such that $R_n \ge c \min\{nm, \sqrt{mdn}\}$, where $c > 0$ is a universal constant.

Hint The most obvious choice is $A = \{a \in \{0,1\}^d : \|a\|_1 = m\}$, whose
elements are sometimes called $m$-sets. A lower bound does hold for this action set
[Lattimore et al., 2018]. However, an easier path is to impose a little additional
structure and analyze the multitask bandit setting.

30.6 (Follow the perturbed leader for $m$-sets) Use the ideas in Note 5 to
prove that FTPL has $R_n = \tilde{O}(\sqrt{mnd})$ regret when $A = \{a \in \{0,1\}^d : \|a\|_1 = m\}$.

Hint After proving that the off-diagonal elements of the Hessian are negative,
you will also need to tune the learning rate. We do not know of a source for this
result, but the full information case was studied by Cohen and Hazan [2015].
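
A minimal sketch of the action-selection step of FTPL for $m$-sets may help fix
ideas: perturb the cumulative loss estimates with Laplace noise and play the $m$
coordinates with the smallest perturbed losses. The learning rate `eta` and all
names here are illustrative assumptions, not the tuned choice the exercise asks for.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, eta = 10, 3, 0.1
loss_hat = np.zeros(d)  # cumulative loss estimates, updated elsewhere

def ftpl_action(loss_hat):
    # Perturb the cumulative losses with Laplace noise of scale 1/eta and
    # play the m coordinates with the smallest perturbed losses, which is
    # the maximizer of <a, noise - eta * loss_hat> over the m-sets.
    perturbed = loss_hat - rng.laplace(scale=1.0 / eta, size=d)
    action = np.zeros(d)
    action[np.argsort(perturbed)[:m]] = 1.0
    return action

print(ftpl_action(loss_hat))  # an element of A with ||a||_1 = m
```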