Bandit Algorithms

(Jeff_L) #1
16.5 Exercises 202

E=Mkandπbe a consistent policy forE. Prove that the asymptotic upper
bound on the regret proven in Exercise 10.5 is tight.

16.4(Unknown subgaussian constant) Let


M=

{


P: there exists aσ^2 ≥0 such thatPisσ^2 -subgaussian

}


.


(a) Find a distributionPsuch thatP /∈M.
(b)Suppose thatP∈Mhas meanμ∈R. Prove thatdinf(P,μ∗,M) = 0 for all
μ∗> μ.
(c)LetE={(Pi) :Pi∈Mfor all 1≤i≤k}. Prove that ifk >1, then for all
consistent policiesπ,


lim infn→∞ Rn(π,ν)
log(n)

=∞ for allν∈E.

(d)Letf:N→[0,∞) be any increasing function withlimn→∞f(n)/log(n) =∞.
Prove there exists a policyπsuch that


lim sup
n→∞

Rn(π,ν)
f(n)

= 0 for allν∈E,

whereEis as in the previous part.
(e) Conclude there exists a consistent policy forE.

16.5(Minimax lower bound) Use Lemma 16.3 to prove Theorem 15.2, possibly
with different constants.


16.6(Refining the lower-order terms) Letk= 2 and forν∈ EN^2 let
∆(ν) =max{∆ 1 (ν),∆ 2 (ν)}. Suppose thatπis a policy such that for allν∈EN^2
with ∆(ν)≤1 it holds that


Rn(π,ν)≤

Clog(n)
∆(ν)

. (16.6)


(a) Give an example of a policy satisfying Eq. (16.6).
(b)Assume that i = 2 is suboptimal forν and α ∈ (0,1) be such that
Eνπ[T 2 (n)] = 2∆(^1 ν) 2 log(α). Letν′be the alternative environment where
μ 1 (ν′) =μ 1 (ν) andμ 2 (ν′) =μ 1 (ν) + 2∆(ν). Show that


exp(−D(Pνπ,Pν′π)) =^1
α

.


(c) LetAbe the event thatT 2 (n)≥n/2. Show that

Pνπ(A)≤

2 Clog(n)
n∆(ν)^2

and Pν′π(A)≥

1


2 α


2 Clog(n)
n∆(ν)^2

.


(d) Show that


Rn(π,ν′)≥

n∆(ν)
2

(


1


2 α


2 Clog(n)
n∆(ν)^2

)


.

Free download pdf