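$P_\theta \ll \mu$ for all $\theta \in \Theta$ and define
$$q(\theta \mid x) = \frac{p_\theta(x)\, q(\theta)}{\int_\Theta p_\psi(x)\, q(\psi)\, d\nu(\psi)},$$
where $p_\theta(x) = dP_\theta/d\mu$ and $q(\theta) = dQ/d\nu$. You may assume that $p_\theta(x)$ is jointly
measurable in $\theta$ and $x$ (see Note 7).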
(a) Let $N = \{x : \int_\Theta p_\psi(x)\, q(\psi)\, d\nu(\psi) = 0\}$ and show that $P_X(N) = 0$.
(b) Define $Q(A \mid x) = \int_A q(\theta \mid x)\, d\nu(\theta)$ for $x \notin N$ and let $Q(A \mid x)$ be an arbitrary
fixed probability measure for $x \in N$. Show that $Q(\cdot \mid X)$ is a regular version
of $P(\theta \in \cdot \mid X)$.
Hint The ‘sections’ lemma may prove useful (Lemma 1.26 in Kallenberg 2002),
along with the properties of the Radon-Nikodym derivative.
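To build intuition before attempting the general proof, the construction can be tried on a finite toy model. The sketch below is only an illustration under assumptions that are not part of the exercise: $\Theta$ is a finite set of Bernoulli parameters, $\nu$ and $\mu$ are counting measures, and the names `likelihood`, `marginal`, `posterior` and `null_set_mass` are hypothetical.

```python
from fractions import Fraction

OBSERVATIONS = (0, 1)


def likelihood(theta, x):
    """Bernoulli density p_theta(x) with respect to counting measure on {0, 1}."""
    return theta if x == 1 else 1 - theta


def marginal(prior, x):
    """Denominator of q(theta | x): the sum over psi of p_psi(x) q(psi)."""
    return sum(likelihood(psi, x) * weight for psi, weight in prior.items())


def posterior(prior, x, fallback):
    """Return q(. | x), or the fixed measure `fallback` when x lies in N."""
    denominator = marginal(prior, x)
    if denominator == 0:
        return dict(fallback)
    return {theta: likelihood(theta, x) * weight / denominator
            for theta, weight in prior.items()}


def null_set_mass(prior):
    """P_X(N): total marginal probability of observations with zero denominator."""
    return sum(marginal(prior, x) for x in OBSERVATIONS if marginal(prior, x) == 0)


# Assumed toy priors: uniform on three parameters, and a point mass at theta = 0.
UNIFORM = {Fraction(0): Fraction(1, 3),
           Fraction(1, 2): Fraction(1, 3),
           Fraction(1): Fraction(1, 3)}
POINT_MASS = {Fraction(0): Fraction(1)}
```

With the point-mass prior the observation $x = 1$ lies in $N$, and `null_set_mass(POINT_MASS)` is zero for the trivial reason that the denominator is itself the marginal density of $X$. The same observation is the starting point for part (a) in the general case.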
34.4 (Measurability of the regret) Let $(\mathcal{E}, \mathcal{G}, Q, P)$ be a Bayesian bandit
environment and $\pi$ a policy. Prove that $R_n(\pi, \nu)$ defined in Eq. (34.7) is
$\mathcal{G}$-measurable as a function of $\nu$.
34.5 (Bayesian optimal regret can be positive) Construct an example
demonstrating that for some priors over finite-armed stochastic bandits the
Bayesian regret is strictly positive: $\inf_\pi \mathrm{BR}_n(\pi, Q) > 0$.
Hint The key is to observe that under appropriate conditions $\mathrm{BR}_n(\pi, Q) = 0$
would mean that $\pi$ needs to know the identity of the optimal action under $\nu$ from
round one, which is impossible when $\nu$ is random and the model is rich enough.
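The idea in the hint can be checked numerically in the simplest possible case. The sketch below assumes a single round ($n = 1$), two arms, and a two-point prior; the environments, the prior weights and the function names are hypothetical choices made for illustration. With one round, a policy is just a probability `p_arm0` of playing arm 0.

```python
# Assumed toy prior: two environments, each given by the mean rewards of arms 0 and 1.
ENVIRONMENTS = [(0.9, 0.1), (0.1, 0.9)]
PRIOR = [0.7, 0.3]


def bayesian_regret_one_round(p_arm0):
    """Bayesian regret over one round of the policy that plays arm 0 w.p. p_arm0."""
    total = 0.0
    for weight, means in zip(PRIOR, ENVIRONMENTS):
        best = max(means)
        expected_reward = p_arm0 * means[0] + (1.0 - p_arm0) * means[1]
        total += weight * (best - expected_reward)
    return total


def min_regret_on_grid(steps=100):
    """Smallest one-round Bayesian regret over a grid of policies."""
    return min(bayesian_regret_one_round(i / steps) for i in range(steps + 1))
```

The regret is linear in `p_arm0`, so the grid minimum is attained at an endpoint. Here it is approximately 0.24, attained by always playing arm 0, which is strictly positive because no single first action is optimal in both environments.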
34.6 (Canonical model) Prove the existence of a probability space carrying
the random variables satisfying the conditions in Section 34.4.
34.7 (Policies as measures over deterministic policies) A policy
$\pi = (\pi_t)_{t=1}^n$ is deterministic if $\pi_t(\cdot \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1})$ is a Dirac measure
for all $t$ and all action/reward sequences $a_1, x_1, \ldots, a_{t-1}, x_{t-1}$. The space of all
deterministic policies is denoted by $\Pi_D$. Choose a $\sigma$-algebra $\mathcal{G}$ on $\Pi_D$ such that
for all policies $\pi$ there exists a probability measure $\mu$ on $(\Pi_D, \mathcal{G})$ such that
$$P_{\pi\nu}(B) = \int_{\Pi_D} P_{\pi'\nu}(B)\, d\mu(\pi').$$
Furthermore, show that for all probability measures $\mu$ on $(\Pi_D, \mathcal{G})$ there exists a
policy $\pi$ such that the above display holds.
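The mixture representation can be verified by brute force in a tiny finite case, which may suggest how to choose $\mu$ in general. The sketch below assumes two rounds, two arms and binary rewards, so that there are only 32 deterministic policies. The Bernoulli means, the randomized policy `PI1`/`PI2`, and the product-form weight in `mixing_weight` are assumptions made for this illustration, not the construction required by the exercise.

```python
import itertools

ARMS = (0, 1)
REWARDS = (0, 1)
FIRST_ROUND = list(itertools.product(ARMS, REWARDS))  # all (a1, x1) pairs

MEANS = (0.3, 0.7)  # assumed Bernoulli mean reward of each arm

# Assumed randomized policy: PI1[a] is the probability of a in round one,
# PI2[(a1, x1)][a] is the probability of a in round two given the history.
PI1 = (0.4, 0.6)
PI2 = {(0, 0): (0.9, 0.1), (0, 1): (0.2, 0.8),
       (1, 0): (0.5, 0.5), (1, 1): (0.0, 1.0)}

# A deterministic policy is a first action and a map from (a1, x1) to a2.
DETERMINISTIC = [
    (a1, dict(zip(FIRST_ROUND, second)))
    for a1 in ARMS
    for second in itertools.product(ARMS, repeat=len(FIRST_ROUND))
]


def reward_prob(arm, reward):
    return MEANS[arm] if reward == 1 else 1.0 - MEANS[arm]


def mixing_weight(policy):
    """Product-form weight: draw each decision of the policy independently."""
    a1, second = policy
    weight = PI1[a1]
    for history in FIRST_ROUND:
        weight *= PI2[history][second[history]]
    return weight


def prob_randomized(a1, x1, a2, x2):
    return PI1[a1] * reward_prob(a1, x1) * PI2[(a1, x1)][a2] * reward_prob(a2, x2)


def prob_deterministic(policy, a1, x1, a2, x2):
    first, second = policy
    if a1 != first or a2 != second[(a1, x1)]:
        return 0.0
    return reward_prob(a1, x1) * reward_prob(a2, x2)


def max_mixture_gap():
    """Largest gap between the two sides of the display over all outcomes."""
    gap = 0.0
    for outcome in itertools.product(ARMS, REWARDS, ARMS, REWARDS):
        mixture = sum(mixing_weight(d) * prob_deterministic(d, *outcome)
                      for d in DETERMINISTIC)
        gap = max(gap, abs(mixture - prob_randomized(*outcome)))
    return gap
```

The check runs over single outcomes only; every event $B$ in this finite model is a finite union of outcomes, so agreement on outcomes extends to all events by additivity.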
34.8 (Sufficiency of deterministic policies) Let $\Pi_D$ be the set of all
deterministic policies and $\Pi$ the space of all policies. Prove that for any $k$-armed
Bayesian bandit environment $(\mathcal{E}, \mathcal{G}, Q, P)$,
$$\inf_{\pi \in \Pi} \mathrm{BR}_n(\pi, Q) = \inf_{\pi \in \Pi_D} \mathrm{BR}_n(\pi, Q).$$
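One possible route, given here only as a sketch, is to combine the mixture representation of Exercise 34.7 with an exchange of integrals. It assumes that the Bayesian regret is the prior average of the regret, that the representation extends from events to expected rewards so that $R_n(\pi, \nu) = \int_{\Pi_D} R_n(\pi', \nu)\, d\mu(\pi')$, and that $(\pi', \nu) \mapsto R_n(\pi', \nu)$ is jointly measurable and nonnegative so that the order of integration may be swapped. Each of these assumptions needs its own justification.

```latex
\begin{align*}
\mathrm{BR}_n(\pi, Q)
  &= \int_{\mathcal{E}} R_n(\pi, \nu)\, dQ(\nu) \\
  &= \int_{\mathcal{E}} \int_{\Pi_D} R_n(\pi', \nu)\, d\mu(\pi')\, dQ(\nu) \\
  &= \int_{\Pi_D} \mathrm{BR}_n(\pi', Q)\, d\mu(\pi') \\
  &\geq \inf_{\pi' \in \Pi_D} \mathrm{BR}_n(\pi', Q).
\end{align*}
```

The reverse inequality is immediate from $\Pi_D \subseteq \Pi$.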
34.9 Prove that the denominator in Eq. (34.6) is almost surely nonzero.