4.7 The canonical bandit model for uncountable action sets ( ) 64
interest (usually the regret) only depends on the law ofA 1 ,X 1 ,...,An,Xnand
this is the same for all choices.
Infinite horizon
We never need the canonical bandit model for the case thatn=∞. It is comforting
to know, however, that there does exist a probability space (Ω,F,Pνπ) and infinite
sequences of random variablesX 1 ,X 2 ,...andA 1 ,A 2 ,...satisfying (a) and (b).
The result follows directly from the theorem of Ionescu–Tulcea (Theorem 3.3).
4.7 The canonical bandit model for uncountable action sets ( )
For uncountable action sets a little more machinery is necessary to make things
rigorous. The first requirement is that the action set must be a measurable space
(A,G) and the collection of distributionν= (Pa:a∈A) that defines a bandit
environment must be a probability kernel from (A,G) to (R,B(R)). A policy is
a sequence (πt)nt=1whereπtis a probability kernel from (Ωt− 1 ,Ft− 1 ) to (A,G)
where
Ωt=
∏t
s=1
(A×R) and Ft=
⊗t
s=1
(G⊗B(R)).
The canonical bandit model is the probability measurePνπon (Ωn,Fn) obtained
by taking the product of the probability kernelsπ 1 ,P 1 ,...πn,Pnand using Ionescu
Tulcea (Theorem 3.3) wherePtis the probability kernel from (Ωt− 1 ×A,Ft⊗G)
to (R,B(R)) given byPt(·|a 1 ,x 1 ,...,at− 1 ,xt− 1 ,at) =Pat(·).
We did not definePνπin terms of a density because there may not exist a
common dominating measure for either (Pa:a∈ A) or the policy. When
such measures exist, as they usually do, thenPνπmay be defined in terms
of a density in the same manner as the previous section.
You will check in Exercise 4.4 that the assumptions onνandπin this section
are sufficient to ensure the quantities in Lemma 4.6 are well defined and that
Proposition 4.8 continues to hold in this setting without modification. Finally, in
none of the definitions above do we require thatnbe finite.
4.8 Notes
1 It is not obvious why the expected value is a good summary of the reward
distribution. Decision makers who base their decisions on expected values are
called risk-neutral. In the example shown on the figure above, a risk-averse
decision maker may actually prefer the distribution labeled asAbecause