Bandit Algorithms

34.9 Exercises

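$P_\theta \ll \mu$ for all $\theta \in \Theta$ and define

$$q(\theta \mid x) = \frac{p_\theta(x)\, q(\theta)}{\int_\Theta p_\psi(x)\, q(\psi)\, d\nu(\psi)},$$

where $p_\theta(x) = dP_\theta/d\mu$ and $q(\theta) = dQ/d\nu$. You may assume that $p_\theta(x)$ is jointly measurable in $\theta$ and $x$ (see Note 7).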

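(a) Let $N = \{x : \int_\Theta p_\psi(x)\, q(\psi)\, d\nu(\psi) = 0\}$ and show that $P_X(N) = 0$.
(b) Define $Q(A \mid x) = \int_A q(\theta \mid x)\, d\nu(\theta)$ for $x \notin N$ and let $Q(A \mid x)$ be an arbitrary fixed probability measure for $x \in N$. Show that $Q(\cdot \mid X)$ is a regular version of $P(\theta \in \cdot \mid X)$.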
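For intuition only, the definition of $q(\theta \mid x)$ in this exercise can be evaluated directly when $\Theta$ is finite and $\nu$ is counting measure, so that the integral in the denominator becomes a sum. The sketch below makes exactly that assumption. The names `posterior` and `bernoulli` and the Bernoulli model are illustrative and do not come from the text. The function returns `None` when the denominator vanishes, which corresponds to $x \in N$.

```python
from fractions import Fraction


def posterior(prior, likelihood, x):
    """Return q(theta | x) on a finite parameter set.

    prior: dict mapping theta -> q(theta), a density w.r.t. counting measure.
    likelihood: function (theta, x) -> p_theta(x).
    Returns None when the normalising sum is zero, i.e. when x lies in N.
    """
    weights = {theta: likelihood(theta, x) * q for theta, q in prior.items()}
    total = sum(weights.values())
    if total == 0:
        return None
    return {theta: w / total for theta, w in weights.items()}


def bernoulli(theta, x):
    """Density of a Bernoulli(theta) observation x in {0, 1}."""
    return theta if x == 1 else 1 - theta


# Illustrative two-point prior on the Bernoulli parameter.
prior = {Fraction(1, 4): Fraction(1, 2), Fraction(3, 4): Fraction(1, 2)}
post = posterior(prior, bernoulli, 1)

# A prior concentrated on theta = 0 never produces x = 1, so x = 1 lies in N.
degenerate = posterior({Fraction(0): Fraction(1)}, bernoulli, 1)
```

In this toy model the set $N$ is nonempty only for priors that rule out the observation, which is consistent with the claim in part (a) that $N$ has probability zero under the marginal of $X$.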

Hint The ‘sections’ lemma may prove useful (Lemma 1.26 in Kallenberg 2002), along with the properties of the Radon-Nikodym derivative.

34.4 (Measurability of the regret) Let $(\mathcal{E}, \mathcal{G}, Q, P)$ be a Bayesian bandit environment and $\pi$ a policy. Prove that $R_n(\pi, \nu)$ defined in Eq. (34.7) is $\mathcal{G}$-measurable as a function of $\nu$.

34.5 (Bayesian optimal regret can be positive) Construct an example demonstrating that for some priors over finite-armed stochastic bandits the Bayesian regret is strictly positive: $\inf_\pi \mathrm{BR}_n(\pi, Q) > 0$.

Hint The key is to observe that under appropriate conditions $\mathrm{BR}_n(\pi, Q) = 0$ would mean that $\pi$ needs to know the identity of the optimal action under $\nu$ from round one, which is impossible when $\nu$ is random and the model is rich enough.
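The idea in this hint can be checked numerically on a small case. The sketch below is one possible construction, not necessarily the intended one. It assumes a single round ($n = 1$), two Bernoulli arms, and a prior that puts equal mass on two environments with the arm means swapped. The function name and the specific means are illustrative. Since no data is available before the first action, a policy is described entirely by the probability of playing the first arm.

```python
from fractions import Fraction


def bayesian_regret_one_round(envs, prior, p_first_arm):
    """Bayesian regret for n = 1 in a two-armed bandit.

    envs: tuple of environments, each a pair of arm means.
    prior: tuple of prior weights, one per environment.
    p_first_arm: probability that the policy plays the first arm.
    """
    play = (p_first_arm, 1 - p_first_arm)
    total = Fraction(0)
    for means, weight in zip(envs, prior):
        best = max(means)
        # Expected suboptimality gap of the randomised first action.
        expected_gap = sum(p * (best - m) for p, m in zip(play, means))
        total += weight * expected_gap
    return total


# Two environments with the arm means swapped, equally likely a priori.
ENVS = ((Fraction(3, 4), Fraction(1, 4)), (Fraction(1, 4), Fraction(3, 4)))
PRIOR = (Fraction(1, 2), Fraction(1, 2))

# Sweep over policies: the regret does not depend on the policy here.
regrets = [bayesian_regret_one_round(ENVS, PRIOR, Fraction(i, 10)) for i in range(11)]
```

Under these assumptions every policy in the sweep has the same Bayesian regret, and it is strictly positive, because whichever arm is played is suboptimal in one of the two equally likely environments.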

34.6 (Canonical model) Prove the existence of a probability space carrying the random variables satisfying the conditions in Section 34.4.

34.7 (Policies as measures over deterministic policies) A policy $\pi = (\pi_t)_{t=1}^n$ is deterministic if $\pi_t(\cdot \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1})$ is a Dirac measure for all $t$ and all action/reward sequences $a_1, x_1, \ldots, a_{t-1}, x_{t-1}$. The space of all deterministic policies is denoted by $\Pi_D$. Choose a $\sigma$-algebra $\mathcal{G}$ on $\Pi_D$ such that for all policies $\pi$ there exists a probability measure $\mu$ on $(\Pi_D, \mathcal{G})$ such that

$$P_{\pi\nu}(B) = \int_{\Pi_D} P_{\pi'\nu}(B)\, d\mu(\pi').$$

Furthermore, show that for all probability measures $\mu$ on $(\Pi_D, \mathcal{G})$ there exists a policy $\pi$ such that the above display holds.
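A finite toy case may help make the mixture representation concrete. The sketch below assumes two arms, rewards in $\{0, 1\}$, horizon $n = 2$, and a Bernoulli environment; all names and numerical values are illustrative. A deterministic policy is then a first action together with a map from the first (action, reward) pair to a second action, and one candidate for $\mu$ is the product weight that draws each of these choices independently from the randomised policy. The code checks that the mixture reproduces the probability of every history. It says nothing about the measurability questions that the general exercise is about.

```python
from fractions import Fraction
from itertools import product

ARMS = (0, 1)
REWARDS = (0, 1)
CONTEXTS = tuple(product(ARMS, REWARDS))  # possible (a1, x1) pairs


def bern(mean, x):
    """Probability of reward x in {0, 1} under a Bernoulli(mean) arm."""
    return mean if x == 1 else 1 - mean


def history_prob(pi1, pi2, means, h):
    """Probability of history h = (a1, x1, a2, x2) under a randomised policy."""
    a1, x1, a2, x2 = h
    return pi1[a1] * bern(means[a1], x1) * pi2[(a1, x1)][a2] * bern(means[a2], x2)


def deterministic_policies():
    """Enumerate pairs (first action, map from context to second action)."""
    for a1 in ARMS:
        for choices in product(ARMS, repeat=len(CONTEXTS)):
            yield a1, dict(zip(CONTEXTS, choices))


def mixture_weight(pi1, pi2, det):
    """Product weight mu({det}) induced by the randomised policy."""
    a1, f = det
    weight = pi1[a1]
    for c in CONTEXTS:
        weight *= pi2[c][f[c]]
    return weight


def det_history_prob(det, means, h):
    """Probability of history h under a deterministic policy."""
    d_a1, f = det
    a1, x1, a2, x2 = h
    if a1 != d_a1 or f[(a1, x1)] != a2:
        return Fraction(0)
    return bern(means[a1], x1) * bern(means[a2], x2)


def mixture_history_prob(pi1, pi2, means, h):
    """Right-hand side of the display: sum over deterministic policies."""
    return sum(
        mixture_weight(pi1, pi2, d) * det_history_prob(d, means, h)
        for d in deterministic_policies()
    )


# Illustrative randomised policy and Bernoulli environment.
PI1 = [Fraction(1, 3), Fraction(2, 3)]
PI2 = {
    (0, 0): [Fraction(1, 2), Fraction(1, 2)],
    (0, 1): [Fraction(3, 4), Fraction(1, 4)],
    (1, 0): [Fraction(1, 5), Fraction(4, 5)],
    (1, 1): [Fraction(1), Fraction(0)],
}
MEANS = [Fraction(1, 4), Fraction(3, 5)]
HISTORIES = list(product(ARMS, REWARDS, ARMS, REWARDS))
```

The product weight is only one choice of $\mu$ that works in this finite setting; contexts that a given history never visits are summed out and contribute a factor of one.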

34.8 (Sufficiency of deterministic policies) Let $\Pi_D$ be the set of all deterministic policies and $\Pi$ the space of all policies. Prove that for any $k$-armed Bayesian bandit environment $(\mathcal{E}, \mathcal{G}, Q, P)$,

$$\inf_{\pi \in \Pi} \mathrm{BR}_n(\pi, Q) = \inf_{\pi \in \Pi_D} \mathrm{BR}_n(\pi, Q).$$
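One possible line of attack, offered as a sketch rather than a complete proof, is to combine the mixture representation from Exercise 34.7 with an exchange of integrals. The chain below assumes that $\mathrm{BR}_n(\pi, Q) = \int_{\mathcal{E}} R_n(\pi, \nu)\, dQ(\nu)$, that the mixture representation for events extends to the regret so that $R_n(\pi, \nu) = \int_{\Pi_D} R_n(\pi', \nu)\, d\mu(\pi')$, and that the order of integration may be swapped. Each of these steps needs its own justification.

```latex
\begin{align*}
\mathrm{BR}_n(\pi, Q)
  &= \int_{\mathcal{E}} R_n(\pi, \nu)\, dQ(\nu) \\
  &= \int_{\mathcal{E}} \int_{\Pi_D} R_n(\pi', \nu)\, d\mu(\pi')\, dQ(\nu) \\
  &= \int_{\Pi_D} \mathrm{BR}_n(\pi', Q)\, d\mu(\pi') \\
  &\geq \inf_{\pi' \in \Pi_D} \mathrm{BR}_n(\pi', Q).
\end{align*}
```

The reverse inequality is immediate because $\Pi_D \subseteq \Pi$.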

34.9 Prove that the denominator in Eq. (34.6) is almost surely nonzero.
