30.7 Exercises
30.1 (Mirror descent for combinatorial bandits) Prove Theorem 30.1.
30.2 (Expected supremum norm of Laplace) Let $Z$ be sampled from the measure on $\mathbb{R}^d$ with density $f(z) = 2^{-d}\exp(-\|z\|_1)$. The purpose of this exercise is to show that
$$\mathbb{E}[\|Z\|_\infty] = \sum_{i=1}^d \frac{1}{i}\,. \tag{30.13}$$
(a) Let $X_1,\ldots,X_d$ be independent standard exponentials. Show that $\|Z\|_\infty$ and $\max\{X_1,\ldots,X_d\}$ have the same law.
(b) Let $M_j = \max_{i \le j} X_i$. Prove for $j \ge 2$ that
$$\mathbb{E}[M_j] = \mathbb{E}[M_{j-1}] + \mathbb{E}[\exp(-M_{j-1})]\,.$$
(c) Prove by induction or otherwise that for all $a, j \in \{1, 2, \ldots\}$,
$$\mathbb{E}[\exp(-a M_j)] = \frac{a!}{\prod_{b=1}^{a}(j+b)}\,.$$
(d) Prove the claim in Eq. (30.13).
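For intuition, Eq. (30.13) says the expected maximum absolute coordinate of a vector with i.i.d. standard Laplace coordinates is the $d$-th harmonic number. The following Monte Carlo check is a sketch only, not part of the proof; the sample size and all names are illustrative.

```python
import numpy as np

# Monte Carlo sanity check of Eq. (30.13): E[||Z||_inf] = sum_{i=1}^d 1/i
# when Z has i.i.d. standard Laplace coordinates (density 2^{-d} exp(-||z||_1)).
rng = np.random.default_rng(0)
d, n_samples = 5, 1_000_000  # illustrative choices

Z = rng.laplace(loc=0.0, scale=1.0, size=(n_samples, d))
empirical = np.abs(Z).max(axis=1).mean()
harmonic = sum(1.0 / i for i in range(1, d + 1))

print(f"empirical E[||Z||_inf] ~ {empirical:.4f}")
print(f"harmonic sum           = {harmonic:.4f}")
```

For $d = 5$ both numbers should be close to $H_5 \approx 2.2833$.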
30.3 (Gradient of expected support function) Let $\mathcal{A} \subset \mathbb{R}^d$ be a compact convex set and $\phi(x) = \max_{a \in \mathcal{A}} \langle a, x \rangle$ its support function. Let $Q$ be a measure on $\mathbb{R}^d$ that is absolutely continuous with respect to the Lebesgue measure and let $Z \sim Q$. Show that
$$\nabla \mathbb{E}[\phi(x + Z)] = \mathbb{E}\Big[\operatorname*{argmax}_{a \in \mathcal{A}} \langle a, x + Z \rangle\Big]\,.$$
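The identity can be illustrated numerically. The sketch below assumes a concrete choice of $\mathcal{A}$, the corners of the standard simplex (so $\phi(x) = \max_i x_i$ and the argmax is a standard basis vector), and Gaussian $Q$; these choices and all names are illustrative. Common random numbers make the finite-difference estimate of the left-hand side stable.

```python
import numpy as np

# Numerical illustration of Exercise 30.3 with A = corners of the simplex,
# so phi(x) = max_i x_i and argmax_{a in A} <a, x+Z> = e_{argmax_i (x_i+Z_i)}.
rng = np.random.default_rng(1)
d, n_samples, eps = 3, 200_000, 1e-4

x = np.array([0.3, -0.1, 0.2])
Z = rng.normal(size=(n_samples, d))  # any absolutely continuous Q works

# Right-hand side: E[argmax] is the vector of argmax frequencies.
idx = (x + Z).argmax(axis=1)
rhs = np.bincount(idx, minlength=d) / n_samples

# Left-hand side: finite-difference gradient of E[phi(x+Z)], reusing the
# same perturbations Z (common random numbers) for every evaluation.
def smoothed_phi(x):
    return (x + Z).max(axis=1).mean()

lhs = np.array([
    (smoothed_phi(x + eps * e) - smoothed_phi(x - eps * e)) / (2 * eps)
    for e in np.eye(d)
])

print("finite-difference gradient:", lhs.round(3))
print("argmax frequencies:        ", rhs.round(3))
```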
30.4 (Minimax bound for combinatorial semibandits) Adapt the analysis in Exercise 28.13 to derive an algorithm for combinatorial bandits with semibandit feedback for which the regret is $R_n \le C\sqrt{mdn}$ for a universal constant $C > 0$.
30.5 (Lower bound for combinatorial semibandits) Let $m \ge 1$ and suppose that $d = km$ for some $k > 1$. Prove that for any algorithm there exists a combinatorial semibandit such that $R_n \ge c \min\{nm, \sqrt{mdn}\}$, where $c > 0$ is a universal constant.
Hint The most obvious choice is $\mathcal{A} = \{a \in \{0,1\}^d : \|a\|_1 = m\}$, whose elements are sometimes called $m$-sets. A lower bound does hold for this action set [Lattimore et al., 2018]. However, an easier path is to impose a little additional structure and analyze the multitask bandit setting.
30.6 (Follow the perturbed leader for $m$-sets) Use the ideas in Note 5 to prove that FTPL has $R_n = \tilde{O}(\sqrt{mnd})$ regret when $\mathcal{A} = \{a \in \{0,1\}^d : \|a\|_1 = m\}$.
Hint After proving that the off-diagonal elements of the Hessian are negative, you will also need to tune the learning rate. We do not know of a source for this result, but the full-information case was studied by Cohen and Hazan [2015].
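To see the algorithm concretely, below is a minimal sketch of FTPL on $m$-sets with Laplace perturbations in the full-information setting studied by Cohen and Hazan [2015]; the bandit version asked for in the exercise would additionally need loss estimation. The learning rate, loss sequence and all names are illustrative assumptions.

```python
import numpy as np

# Minimal full-information FTPL sketch on A = {a in {0,1}^d : ||a||_1 = m}.
# Each round: draw fresh Laplace noise and play the m-set minimising the
# perturbed cumulative loss, which for m-sets is just a top-m selection.
rng = np.random.default_rng(2)
d, m, n, eta = 10, 3, 5000, 0.05  # eta would need tuning as in the hint

losses = rng.uniform(size=(n, d))  # illustrative per-coordinate losses
cum_loss = np.zeros(d)
total_loss = 0.0

for t in range(n):
    Z = rng.laplace(size=d)
    # Linear optimisation over A: pick the m smallest perturbed losses.
    a = np.zeros(d)
    a[np.argsort(eta * cum_loss - Z)[:m]] = 1.0
    total_loss += losses[t] @ a
    cum_loss += losses[t]

best = np.sort(cum_loss)[:m].sum()  # best m-set in hindsight
print("regret:", total_loss - best)
```

The key structural point exploited here is that linear optimisation over $m$-sets reduces to sorting, so each round costs $O(d \log d)$ time.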