Bandit Algorithms

18.7 Exercises

18.3 In this exercise you will prove an analogue of Theorem 12.1 for Exp4-IX.
In the contextual setting the random regret is

\[
\hat{R}_n = \max_{m \in [M]} \sum_{t=1}^n \left( E_t^{(m)} x_t - X_t \right).
\]

Design an algorithm accepting parameter $\delta \in (0,1)$ such that

\[
\mathbb{P}\left( \hat{R}_n \geq C \left( \sqrt{nk \log(M)} + \sqrt{\frac{nk}{\log(M)}} \log\left(\frac{1}{\delta}\right) \right) \right) \leq \delta.
\]
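
For orientation, here is a minimal Python sketch of an Exp4-IX round loop. It is a sketch under stated assumptions, not the book's pseudocode: the helpers `get_advice` and `get_reward` are hypothetical stand-ins for the environment, rewards lie in $[0,1]$, and the fixed tuning $\eta = \sqrt{2\log(M)/(nk)}$, $\gamma = \eta/2$ mirrors the Exp3-IX convention. The $\delta$-dependent tuning achieving the high-probability bound is exactly what the exercise asks you to design.

```python
import numpy as np

def exp4_ix(n, k, M, get_advice, get_reward, rng=None):
    """Sketch of an Exp4-IX round loop (helper signatures are hypothetical).

    get_advice(t) -> (M, k) array whose rows are distributions over actions.
    get_reward(t, a) -> reward x_{t,a} in [0, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    eta = np.sqrt(2 * np.log(M) / (n * k))  # fixed learning rate (Exp4 tuning)
    gamma = eta / 2                         # implicit-exploration bias (Exp3-IX convention)
    log_w = np.zeros(M)                     # log-weights over the M experts
    total = 0.0
    for t in range(n):
        E = get_advice(t)                   # expert advice matrix E_t
        Q = np.exp(log_w - log_w.max())
        Q /= Q.sum()                        # exponential-weights distribution over experts
        P = Q @ E                           # induced distribution over the k actions
        A = rng.choice(k, p=P)
        X = get_reward(t, A)                # bandit feedback only
        total += X
        # IX loss estimate: gamma in the denominator biases the estimate down,
        # which is what yields high-probability (rather than expected) regret bounds
        y_hat = np.zeros(k)
        y_hat[A] = (1.0 - X) / (P[A] + gamma)
        # propagate estimated losses to experts and update the weights
        log_w -= eta * (E @ y_hat)
    return total
```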

18.4 Prove Theorem 18.3.


Hint: The key idea is to modify the analysis of Exp3 to handle decreasing
learning rates. Of course you can do this directly yourself, or you can peek ahead
to Chapter 28 and specifically Exercises 28.10 and 28.11.
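
As a hedged orientation for the hint (the constants and the exact schedule are for you to verify, not taken from Theorem 18.3 itself), the standard anytime analysis replaces the fixed learning rate of Theorem 18.1 with a decreasing one:

\[
\eta_t = \sqrt{\frac{\log(M)}{tk}}
\quad \Longrightarrow \quad
R_n \;\lesssim\; \frac{\log(M)}{\eta_n} + \frac{k}{2} \sum_{t=1}^n \eta_t \;\leq\; 2\sqrt{nk\log(M)},
\]

where the last inequality uses $\sum_{t=1}^n 1/\sqrt{t} \leq 2\sqrt{n}$.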

18.5 Let $x_1, \dots, x_n$ be a sequence of reward vectors chosen in advance by
an adversary with $x_t \in [0,1]^k$. Furthermore, let $o_1, \dots, o_n$ be a sequence of
observations, also chosen in advance by an adversary with $o_t \in [O]$ for some
fixed $O \in \mathbb{N}^+$. Then let $H$ be the set of functions $\phi \colon [O]^m \to [k]$ where $m \in \mathbb{N}^+$.
In each round the learner observes $o_t$ and should choose an action $A_t$ based on
$o_1, A_1, X_1, \dots, o_{t-1}, A_{t-1}, X_{t-1}, o_t$, and the regret is

\[
R_n = \max_{\phi \in H} \sum_{t=1}^n \left( x_{t\phi(o_t, o_{t-1}, \dots, o_{t-m+1})} - x_{tA_t} \right),
\]

where $o_t = 1$ for $t \leq 0$. This means the learner is competing with the best
predictor in hindsight that uses only the last $m$ observations. Prove there exists
an algorithm such that

\[
\mathbb{E}[R_n] \leq \sqrt{2nmk\log(O)}.
\]
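
One piece of bookkeeping this exercise needs is mapping the window $(o_t, \dots, o_{t-m+1})$ to a single context index while honoring the convention $o_s = 1$ for $s \leq 0$. The helper below is an illustrative sketch (the name `window_context` is ours, not the book's); each $\phi$ of order $m$ is then a table indexed by these $O^m$ contexts, which is one way to instantiate the experts fed to Exp4.

```python
def window_context(obs, t, m, O):
    """Encode (o_t, o_{t-1}, ..., o_{t-m+1}) as an integer in [0, O**m).

    obs is the list [o_1, ..., o_n] with values in 1..O; the exercise's
    convention o_s = 1 for s <= 0 pads windows reaching before round 1.
    """
    idx = 0
    for j, s in enumerate(range(t, t - m, -1)):  # s = t, t-1, ..., t-m+1
        o_s = obs[s - 1] if 1 <= s <= len(obs) else 1
        idx += (o_s - 1) * O**j                  # j-th base-O digit
    return idx
```

For example, with $O = 3$, $m = 2$ and `obs = [2, 3, 1]`, the call `window_context(obs, 1, 2, 3)` returns 1, encoding the padded window $(o_1, o_0) = (2, 1)$.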

18.6 In this problem we consider nonoblivious experts. Consider the following
modified regret definition:

\[
R'_n = \max_{m \in [M]} \mathbb{E}\left[ \sum_{t=1}^n E_t^{(m)} x_t - \sum_{t=1}^n X_t \right].
\]

Show that:

(a) $R'_n \leq R_n$ regardless of whether the experts are oblivious or not (see the sketch after this list).
(b) Theorem 18.1 remains valid for nonoblivious experts if in Eq. (18.7) we
replace $R_n$ with $R'_n$. In particular, explain how to modify the proof.
(c) Research question: Give a non-trivial bound on $R_n$.
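
For part (a), the only step needed is that a maximum of expectations is never larger than the expectation of the maximum. In the notation of Exercise 18.3, a one-line sketch, assuming (as in the chapter) that $R_n$ is the expectation of the random regret $\hat{R}_n$:

\[
R'_n = \max_{m \in [M]} \mathbb{E}\left[ \sum_{t=1}^n \left( E_t^{(m)} x_t - X_t \right) \right]
\leq \mathbb{E}\left[ \max_{m \in [M]} \sum_{t=1}^n \left( E_t^{(m)} x_t - X_t \right) \right]
= \mathbb{E}[\hat{R}_n] = R_n.
\]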


18.7 Prove Eq. (18.13).


18.8 (Explore-then-commit) Consider a stochastic contextual bandit
environment where $(C_t)_{t=1}^n$ is a sequence of contexts sampled from distribution
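
Since the exercise names explore-then-commit, a minimal per-context sketch may help orient. The finite context set and the helpers `draw_context` and `draw_reward` are our assumptions for illustration, not part of the exercise statement: within each context, every arm is pulled a fixed number of times, after which the algorithm commits to that context's empirically best arm.

```python
import numpy as np

def contextual_etc(n, k, n_contexts, explore_rounds, draw_context, draw_reward):
    """Per-context explore-then-commit sketch (helper names are hypothetical).

    Within each context, pull every arm `explore_rounds` times, then commit
    to the arm with the highest empirical mean reward for that context.
    """
    sums = np.zeros((n_contexts, k))               # cumulative reward per (context, arm)
    counts = np.zeros((n_contexts, k), dtype=int)  # pull counts per (context, arm)
    total = 0.0
    for t in range(n):
        c = draw_context(t)                        # observe context C_t
        if counts[c].min() < explore_rounds:
            a = int(counts[c].argmin())            # still exploring in this context
        else:
            a = int((sums[c] / counts[c]).argmax())  # commit to empirical best
        x = draw_reward(t, c, a)                   # bandit feedback
        sums[c, a] += x
        counts[c, a] += 1
        total += x
    return total
```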
