following way. First define the empirical reward at the start of the $k$th phase by
\[
\hat r_{k,a}(s) = \frac{\sum_{u=1}^{\tau_k - 1} \mathbb{I}\{S_u = s, A_u = a\}\, X_u}{1 \vee T_{\tau_k - 1}(s,a)}\,.
\]
Then let $\tilde r_{k,a}(s)$ be an upper confidence bound given by
\[
\tilde r_{k,a}(s) = \hat r_{k,a}(s) + \sqrt{\frac{L}{2\,(1 \vee T_{\tau_k - 1}(s,a))}}\,,
\]
where $L$ is as in the proof of Theorem 38.6. The modified algorithm operates
exactly like Algorithm 27, but replaces the unknown $r_a(s)$ with $\tilde r_{k,a}(s)$ when
solving the extended MDP. Prove that with probability at least $1 - 3\delta/2$ the
modified policy in the modified setting has regret at most
\[
\hat R_n \le C D(M) S \sqrt{n A \log\left(\frac{nSA}{\delta}\right)}\,,
\]
where $C > 0$ is a universal constant.
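For readers implementing this modification, here is a minimal sketch, assuming tabular reward sums and visit counts are maintained at each phase start, of how the estimates $\hat r_{k,a}(s)$ and the bonuses above might be computed. The array and function names are illustrative only and not part of the exercise.

```python
import numpy as np

def optimistic_rewards(reward_sums, counts, L):
    """Sketch: empirical reward means and upper confidence bounds at a phase start.

    reward_sums[s, a] : sum of rewards X_u observed in (s, a) before time tau_k
    counts[s, a]      : visit counts T_{tau_k - 1}(s, a)
    L                 : the log term from the proof of Theorem 38.6
    (All names here are illustrative assumptions.)
    """
    n = np.maximum(1, counts)                # max(1, T_{tau_k - 1}(s, a))
    r_hat = reward_sums / n                  # empirical reward estimate
    r_tilde = r_hat + np.sqrt(L / (2 * n))   # optimistic estimate used in the extended MDP
    return r_hat, r_tilde
```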
38.25 (Lower bound) In this exercise you will prove the claims needed to complete
the proof of the lower bound.
(a) Prove Claim 38.9.
(b) Prove Claim 38.10.
(c) Prove Claim 38.11.
38.26 (Contextual bandits as MDPs) Consider the MDP $M = (S, A, P, r)$
where $P_a(s) = p$ for some fixed categorical distribution $p$ for any $(s,a) \in S \times A$,
where $\min_{s \in S} p(s) > 0$. Assume that the rewards for action $a$ in state $s$ are
sampled from a distribution supported on $[0,1]$ (see Note 3). An MDP like this
defines nothing but a contextual bandit.
(a) Derive the optimal policy and the average optimal reward.
(b) Show an optimal value function that solves the Bellman optimality equation.
(c) Prove that the diameter of this MDP is $D = \max_{s \in S} 1/p(s)$.
(d) Consider the algorithm that puts one instance of an appropriate version of
UCB into every state (the same idea was explored in the context of adversarial
bandits in Section 18.1); a minimal sketch of this construction appears after this
exercise. Prove that the expected regret of your algorithm will be at most $O(\sqrt{SAn})$.
(e)Does the scaling behavior of the upper bound in Theorem 38.6 match the
actual scaling behavior of the expected regret of UCRL2 in this example?
Why or why not?
(f) Design and run an experiment to confirm your claim.
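The following is a minimal sketch of the construction in part (d): one UCB1 index per state, run on an environment whose next-state distribution $p$ is the same for every state-action pair. The `ContextualMDP` class, the Bernoulli rewards and the choice of UCB1 are assumptions made for illustration only.

```python
import numpy as np

class ContextualMDP:
    """Toy MDP whose next-state distribution p is identical for every (s, a)."""

    def __init__(self, p, means, seed=0):
        self.p = np.asarray(p)            # fixed categorical distribution over states
        self.means = np.asarray(means)    # means[s, a] = mean reward in [0, 1]
        self.rng = np.random.default_rng(seed)

    def reset(self):
        return self.rng.choice(len(self.p), p=self.p)

    def step(self, s, a):
        reward = float(self.rng.random() < self.means[s, a])   # Bernoulli reward
        next_state = self.rng.choice(len(self.p), p=self.p)    # independent of (s, a)
        return reward, next_state

def ucb_per_state(env, n, num_states, num_actions):
    """Run an independent UCB1 instance in every state for n rounds."""
    counts = np.zeros((num_states, num_actions))
    means = np.zeros((num_states, num_actions))
    s, total = env.reset(), 0.0
    for t in range(1, n + 1):
        index = np.full(num_actions, np.inf)        # unplayed actions get priority
        played = counts[s] > 0
        index[played] = means[s, played] + np.sqrt(2 * np.log(t) / counts[s, played])
        a = int(np.argmax(index))
        r, s_next = env.step(s, a)
        counts[s, a] += 1
        means[s, a] += (r - means[s, a]) / counts[s, a]
        total += r
        s = s_next
    return total
```

Running this sketch with, say, a uniform $p$ over the states and comparing the observed regret against the bound of Theorem 38.6 is one possible way to approach parts (e) and (f).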
38.27 (Implementation) This is a thinking and coding exercise to illustrate the
difficulty of learning in Markov decision processes. The RiverSwim environment
is originally due to Strehl and Littman [2008]. The environment has two actions