Bandit Algorithms


… interest (usually the regret) only depends on the law of $A_1, X_1, \ldots, A_n, X_n$, and this is the same for all choices.

Infinite horizon
We never need the canonical bandit model for the case that $n = \infty$. It is comforting to know, however, that there does exist a probability space $(\Omega, \mathcal{F}, P_{\nu\pi})$ and infinite sequences of random variables $X_1, X_2, \ldots$ and $A_1, A_2, \ldots$ satisfying (a) and (b). The result follows directly from the theorem of Ionescu–Tulcea (Theorem 3.3).
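Here (a) and (b) refer to the defining properties of the canonical model from the previous section: for each $t$, the conditional law of the action is prescribed by the policy and the conditional law of the reward by the chosen arm's distribution,
$$\text{(a)} \quad P_{\nu\pi}(A_t \in \cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1}) = \pi_t(\cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1}) \quad \text{almost surely;}$$
$$\text{(b)} \quad P_{\nu\pi}(X_t \in \cdot \mid A_1, X_1, \ldots, A_t) = P_{A_t}(\cdot) \quad \text{almost surely.}$$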

4.7 The canonical bandit model for uncountable action sets


For uncountable action sets a little more machinery is necessary to make things rigorous. The first requirement is that the action set must be a measurable space $(\mathcal{A}, \mathcal{G})$, and the collection of distributions $\nu = (P_a : a \in \mathcal{A})$ that defines a bandit environment must be a probability kernel from $(\mathcal{A}, \mathcal{G})$ to $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$. A policy is a sequence $(\pi_t)_{t=1}^n$, where $\pi_t$ is a probability kernel from $(\Omega_{t-1}, \mathcal{F}_{t-1})$ to $(\mathcal{A}, \mathcal{G})$, where

$$\Omega_t = \prod_{s=1}^t (\mathcal{A} \times \mathbb{R}) \qquad \text{and} \qquad \mathcal{F}_t = \bigotimes_{s=1}^t \left(\mathcal{G} \otimes \mathfrak{B}(\mathbb{R})\right).$$
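For a concrete instance of these requirements (the particular environment is our illustration, not the text's): take $\mathcal{A} = [0,1]$ with $\mathcal{G} = \mathfrak{B}([0,1])$ and $P_a = \mathcal{N}(\mu(a), 1)$ for a measurable mean function $\mu : [0,1] \to \mathbb{R}$. Then $a \mapsto P_a(B)$ is measurable for every $B \in \mathfrak{B}(\mathbb{R})$, so $\nu = (P_a : a \in \mathcal{A})$ is a probability kernel from $(\mathcal{A}, \mathcal{G})$ to $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$; an arbitrary assignment $a \mapsto P_a$ without this joint measurability would not qualify.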


The canonical bandit model is the probability measure $P_{\nu\pi}$ on $(\Omega_n, \mathcal{F}_n)$ obtained by taking the product of the probability kernels $\pi_1, P_1, \ldots, \pi_n, P_n$ and using the theorem of Ionescu–Tulcea (Theorem 3.3), where $P_t$ is the probability kernel from $(\Omega_{t-1} \times \mathcal{A}, \mathcal{F}_{t-1} \otimes \mathcal{G})$ to $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$ given by $P_t(\cdot \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1}, a_t) = P_{a_t}(\cdot)$.
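The product-of-kernels construction is exactly the law of the familiar simulation loop: sample an action from the policy kernel, then a reward from the environment kernel, and repeat. Below is a minimal sketch in Python; the environment ($P_a = \mathcal{N}(-(a - 1/2)^2, 1)$ on $\mathcal{A} = [0,1]$) and the uniform policy are illustrative choices, not taken from the text.

```python
import numpy as np

# Minimal sketch of sampling one trajectory from the canonical model P_{nu,pi}.
# The environment and policy below are illustrative choices, not from the text.
rng = np.random.default_rng(0)

def sample_reward(a):
    # Environment kernel: X_t ~ P_a with P_a = N(-(a - 1/2)^2, 1) on A = [0, 1].
    return rng.normal(-(a - 0.5) ** 2, 1.0)

def sample_action(history):
    # Policy kernel pi_t(. | a_1, x_1, ..., a_{t-1}, x_{t-1}): here it ignores
    # the history and plays uniformly on [0, 1] (any measurable rule works).
    return rng.uniform(0.0, 1.0)

def sample_trajectory(n):
    # Iterating the kernels pi_1, P_1, ..., pi_n, P_n produces a sample whose
    # law is the product measure P_{nu,pi} guaranteed by Ionescu-Tulcea.
    history = []
    for _ in range(n):
        a = sample_action(history)  # A_t ~ pi_t(. | history)
        x = sample_reward(a)        # X_t ~ P_{A_t}
        history.append((a, x))
    return history

print(sample_trajectory(5))
```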

We did not define $P_{\nu\pi}$ in terms of a density because there may not exist a common dominating measure for either $(P_a : a \in \mathcal{A})$ or the policy. When such measures exist, as they usually do, then $P_{\nu\pi}$ may be defined in terms of a density in the same manner as in the previous section.
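To sketch what that looks like (assuming, beyond the text, a $\sigma$-finite measure $\lambda$ dominating every $P_a$ with densities $p_a = \mathrm{d}P_a/\mathrm{d}\lambda$, and a measure $\rho$ on $\mathcal{A}$ dominating every $\pi_t$ with the densities again written $\pi_t$): the density of $P_{\nu\pi}$ with respect to $(\rho \otimes \lambda)^{\otimes n}$ is
$$p_{\nu\pi}(a_1, x_1, \ldots, a_n, x_n) = \prod_{t=1}^n \pi_t(a_t \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1})\, p_{a_t}(x_t),$$
the same product form as in the countable case.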

You will check in Exercise 4.4 that the assumptions on $\nu$ and $\pi$ in this section are sufficient to ensure the quantities in Lemma 4.6 are well defined, and that Proposition 4.8 continues to hold in this setting without modification. Finally, in none of the definitions above do we require that $n$ be finite.

4.8 Notes


1 It is not obvious why the expected value is a good summary of the reward distribution. Decision makers who base their decisions on expected values are called risk-neutral. In the example shown in the figure above, a risk-averse decision maker may actually prefer the distribution labeled as $A$ because it is less likely to yield a very low reward, even if its expected value is smaller.
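One standard way to make such a preference precise (the expected-utility model and the particular numbers are our illustration, not the text's) is via a concave utility $u$ and Jensen's inequality. Suppose $A$ pays $1$ deterministically while $B$ pays $0$ or $2$ with probability $1/2$ each, so both have mean $1$, and take $u(x) = \sqrt{x}$. Then
$$\mathbb{E}[u(X_B)] = \tfrac{1}{2}\big(\sqrt{0} + \sqrt{2}\big) \approx 0.71 < 1 = u(1) = \mathbb{E}[u(X_A)],$$
so a risk-averse (concave-utility) decision maker strictly prefers $A$ despite the equal means; more generally, $\mathbb{E}[u(X)] \le u(\mathbb{E}[X])$ for any concave $u$.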