Bandit Algorithms

33.6 Exercises 390

results in Section 33.2. Glynn and Juneja [2015] gives further pointers to this literature, while connecting it to the bandit literature. Best-arm identification has also been considered in the adversarial setting [Jamieson and Talwalkar, 2016, Li et al., 2018a, Abbasi-Yadkori et al., 2018].

33.6 Exercises

33.1(Simple regret lower bound) Show there exists a universal constant C >0 such that for alln≥k >1 and all policiesπthere exists aν∈ Esuch thatRsimplen (π,ν)≥C

√

k/n.

33.2(Suboptimality of uniform exploration) Show there exists a universal constantC >0 such that for alln≥k >1 there exists aν∈ Esuch that Rsimplen (UE,ν)≥C

√

klog(k)/n.

33.3 LetL >0 andD⊂[0,∞)k\{ 0 }be nonempty. Show that

inf

{

‖α‖ 1 :α∈[0,∞)k,inf d∈D

〈α,d〉≥L

}

=

(

sup α∈Pk− 1

inf d∈D

〈α,d〉

)− 1

L.

33.4(Best-arm identification for Gaussian bandits) Letσ 12 ,...,σ^2 kbe fixed andE={(N(μi,σ^2 i))ki=1:μ∈Rk}be the set of Gaussian bandits with given variances. Letν∈ Ebe a bandit withμ 1 (ν)> μi(ν) for alli >1. Abbreviate μ=μ(ν) and ∆ = ∆(ν).

(a) For anyα∈[0,∞)kshow that

inf ̃ν∈Ealt(ν)

∑k

i=1

αiD(νi,ν ̃i) =

1

2 mini> 1

α 1 αi∆^2 i α 1 σ^2 i+αiσ 12.

(b) Show that ifk= 2, thenc∗(ν) = 2(σ 1 +σ 2 )^2 /∆^22.

(c) Show thatc∗(ν)≥

2 σ 12 ∆min

+

∑k

i=2

2 σ^2 i ∆^2 i

.

(d) Show that

c∗(ν)≤

2 σ 12 ∆min

+

∑k

i=2

2 σ^2 i ∆^2 i

+ 2

√√

√√ 2 σ 12 ∆^2 min

∑k

i=2

2 σi^2 ∆^2 i

. (33.9)

(e) Show that ifσ^2 i/∆^2 i=σ^21 /∆^2 minfor alli, then equality holds in Eq. (33.9).

33.5(Probably approximately correct algorithms) This exercise is about designing (ε,δ)-PAC algorithms.

Bandit Algorithms

33.6 Exercises

√

√

{

}

=

(

)− 1

L.

1

+

.

+

+ 2

√√

. (33.9)

Get our desktop app

Company

Features

Documentation

Resources