Bandit Algorithms

4.10 Exercises 67

considers so-called coherence risk measures (CVaR, is one example of such a risk measure), and with an approach where the regret itself is redefined. Value-at-Risk is considered in the context of a specific bandit policy family by Audibert et al. [2007, 2009].

4.10 Exercises

4.1(Positivity of the regret) Prove Lemma 4.4.

4.2(Uniqueness of law) Prove Proposition 4.8.

4.3(Definition of canonical probability measure) Prove that the measure defined in terms of the density in Eq. (4.7) satisfies the conditions (a) and (b) in Section 4.6.

Hint Use the properties of the Radon-Nikodym derivative in combination with Fubini’s theorem. 4.4(Regret decomposition and canonical model for large action spaces) Letνbe a bandit on measurable action space (A,G) andπ 1 ,...,πnbe a policy satisfying the conditions in Section 4.7.

(a)Show that all quantities in Lemma 4.6 are appropriately defined and measurable. (b) Prove Lemma 4.6. (c) Prove that Proposition 4.8 continues to hold.

4.5(Canonical model for contextual bandit) LetAandCbe finite sets. A stochastic contextual bandit is like a normal stochastic bandit, but in each round the learner first observes a contextCt∈ C. They then choose an action At∈Aand receive a rewardXt∼PAt,Ct.

(a)Suppose that C 1 ,...,Cn is sampled independently from distribu- tion ξ on C. Construct the canonical probability space that carries C 1 ,A 1 ,X 1 ,...,Cn,An,Xn. (b)What changes whenCtis allowed to depend onC 1 ,A 1 ,X 1 ,...,Ct− 1 ,At− 1 ,Xt− 1?

4.6(Bernoulli environment implementation) Implement a Bernoulli bandit environment in Python using the code snippet below (or adapt to your favorite language).

class BernoulliBandit: # accepts a list of K >= 2 floats , each lying in [0 ,1] def __init__(self , means): pass

# Function should return the number of arms

Bandit Algorithms

4.10 Exercises

Get our desktop app

Company

Features

Documentation

Resources