Bandit Algorithms

(Jeff_L) #1
4.10 Exercises 67

considers so-called coherence risk measures (CVaR, is one example of such a risk
measure), and with an approach where the regret itself is redefined. Value-at-Risk
is considered in the context of a specific bandit policy family by Audibert et al.
[2007, 2009].

4.10 Exercises


4.1(Positivity of the regret) Prove Lemma 4.4.

4.2(Uniqueness of law) Prove Proposition 4.8.

4.3(Definition of canonical probability measure) Prove that the measure
defined in terms of the density in Eq. (4.7) satisfies the conditions (a) and (b) in
Section 4.6.

Hint Use the properties of the Radon-Nikodym derivative in combination with
Fubini’s theorem.
4.4(Regret decomposition and canonical model for large action
spaces) Letνbe a bandit on measurable action space (A,G) andπ 1 ,...,πnbe
a policy satisfying the conditions in Section 4.7.

(a)Show that all quantities in Lemma 4.6 are appropriately defined and
measurable.
(b) Prove Lemma 4.6.
(c) Prove that Proposition 4.8 continues to hold.

4.5(Canonical model for contextual bandit) LetAandCbe finite sets.
A stochastic contextual bandit is like a normal stochastic bandit, but in each
round the learner first observes a contextCt∈ C. They then choose an action
At∈Aand receive a rewardXt∼PAt,Ct.

(a)Suppose that C 1 ,...,Cn is sampled independently from distribu-
tion ξ on C. Construct the canonical probability space that carries
C 1 ,A 1 ,X 1 ,...,Cn,An,Xn.
(b)What changes whenCtis allowed to depend onC 1 ,A 1 ,X 1 ,...,Ct− 1 ,At− 1 ,Xt− 1?

4.6(Bernoulli environment implementation) Implement a Bernoulli bandit
environment in Python using the code snippet below (or adapt to your favorite
language).

class BernoulliBandit:
# accepts a list of K >= 2 floats , each lying in [0 ,1]
def __init__(self , means):
pass

# Function should return the number of arms
Free download pdf