33.6 Exercises 390
results in Section 33.2. Glynn and Juneja [2015] gives further pointers to this
literature, while connecting it to the bandit literature.
Best-arm identification has also been considered in the adversarial setting
[Jamieson and Talwalkar, 2016, Li et al., 2018a, Abbasi-Yadkori et al., 2018].
33.6 Exercises
33.1(Simple regret lower bound) Show there exists a universal constant
C >0 such that for alln≥k >1 and all policiesπthere exists aν∈ Esuch
thatRsimplen (π,ν)≥C
√
k/n.
33.2(Suboptimality of uniform exploration) Show there exists a universal
constantC >0 such that for alln≥k >1 there exists aν∈ Esuch that
Rsimplen (UE,ν)≥C
√
klog(k)/n.
33.3 LetL >0 andD⊂[0,∞)k\{ 0 }be nonempty. Show that
inf
{
‖α‖ 1 :α∈[0,∞)k,inf
d∈D
〈α,d〉≥L
}
=
(
sup
α∈Pk− 1
inf
d∈D
〈α,d〉
)− 1
L.
33.4(Best-arm identification for Gaussian bandits) Letσ 12 ,...,σ^2 kbe
fixed andE={(N(μi,σ^2 i))ki=1:μ∈Rk}be the set of Gaussian bandits with given
variances. Letν∈ Ebe a bandit withμ 1 (ν)> μi(ν) for alli >1. Abbreviate
μ=μ(ν) and ∆ = ∆(ν).
(a) For anyα∈[0,∞)kshow that
inf
̃ν∈Ealt(ν)
∑k
i=1
αiD(νi,ν ̃i) =
1
2 mini> 1
α 1 αi∆^2 i
α 1 σ^2 i+αiσ 12.
(b) Show that ifk= 2, thenc∗(ν) = 2(σ 1 +σ 2 )^2 /∆^22.
(c) Show thatc∗(ν)≥
2 σ 12
∆min
+
∑k
i=2
2 σ^2 i
∆^2 i
.
(d) Show that
c∗(ν)≤
2 σ 12
∆min
+
∑k
i=2
2 σ^2 i
∆^2 i
+ 2
√√
√√ 2 σ 12
∆^2 min
∑k
i=2
2 σi^2
∆^2 i
. (33.9)
(e) Show that ifσ^2 i/∆^2 i=σ^21 /∆^2 minfor alli, then equality holds in Eq. (33.9).
33.5(Probably approximately correct algorithms) This exercise is about
designing (ε,δ)-PAC algorithms.