Hint Since UCRL2 and the environment are both deterministic, you can examine the behavior of the algorithm on the MDP. You should aim to prove that eventually the algorithm alternates between the actions \textit{stay} and \textit{go}.
38.20 (Extended MDP is strongly connected) Let $\widetilde{M}$ be the extended MDP defined in Section 38.5.1. Prove that $P \in \mathcal{C}_t$ implies that $\widetilde{M}$ is strongly connected.
38.21 (Confidence sets) Prove Lemma 38.8.
Hint Use the result of Exercise 5.17 and apply a union bound over all state-
action pairs and the number of samples. Use the Markov property to argue that
the independence assumption in Exercise 5.17 is not problematic.
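One way to carry out the counting (with $\varepsilon_{s,a,u}$ a placeholder for whatever deviation bound Exercise 5.17 yields for a fixed state-action pair after $u$ samples, and $\hat{P}_{a,u}$ the corresponding empirical estimate; this notation is ours, not the book's) is to allocate failure probability $\delta/(SAn)$ to each event, so that
\[
\mathbb{P}\big(\exists (s,a) \in \mathcal{S} \times \mathcal{A},\, \exists u \le n : \|\hat{P}_{a,u}(s) - P_a(s)\|_1 \ge \varepsilon_{s,a,u}\big)
\le \sum_{(s,a)} \sum_{u=1}^{n} \frac{\delta}{SAn} = \delta\,.
\]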
38.22 Let $(a_k)$ and $(A_k)$ be nonnegative numbers so that for any $k \ge 0$, $a_{k+1} \le A_k = 1 \vee (a_1 + \cdots + a_k)$. Prove that for any $m \ge 1$,
\[
\sum_{k=1}^{m} \frac{a_k}{A_{k-1}} \le \left(\sqrt{2} + 1\right) \sqrt{A_m}\,.
\]
Hint The statement is trivial if $\sum_{k=1}^{m-1} a_k \le 1$. If this does not hold, use induction based on $m = n, n+1, \ldots$, where $n$ is the first integer such that $\sum_{k=1}^{n-1} a_k > 1$.
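Before attempting the proof, it can be reassuring to test the inequality numerically. The following is a minimal sketch (the way sequences are sampled and the tolerance are our own choices): it draws random sequences satisfying $a_{k+1} \le A_k$ and checks the claimed bound.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def lhs_rhs(a):
    # A_k = 1 v (a_1 + ... + a_k); A_0 = 1 since the empty sum is 0
    A = np.maximum(1.0, np.cumsum(a))
    A_prev = np.concatenate(([1.0], A[:-1]))
    lhs = float(np.sum(np.asarray(a) / A_prev))
    rhs = float((np.sqrt(2) + 1) * np.sqrt(A[-1]))
    return lhs, rhs

for trial in range(1000):
    m = int(rng.integers(1, 50))
    a, total = [], 0.0
    for _ in range(m):
        a.append(float(rng.uniform(0.0, max(1.0, total))))  # a_k <= A_{k-1}
        total += a[-1]
    lhs, rhs = lhs_rhs(a)
    assert lhs <= rhs + 1e-9, (lhs, rhs)
print("bound held on all sampled sequences")
\end{verbatim}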
38.23 (Alternative phases) The definition of phases in UCRL2 can be modified in many ways with essentially the same regret.
(a) Which step of the proof of Theorem 38.6 would break if we modified UCRL2 to recompute the optimistic policy at the beginning of each round?
(b) Imagine that we modified UCRL2 to start a new phase after every $\sqrt{n}$ steps. Will this variant of UCRL2 enjoy a similar regret bound to the one stated in Theorem 38.6?
(c) Now imagine that UCRL2 is modified to start a new phase in rounds $\tau_{k+1} = \tau_k + f(k)$ with some function $f : \mathbb{N} \to \mathbb{N}$. Propose some choices of $f$ under which the regret bound for the new variant stays essentially the same as before (a sketch for counting phases under a given $f$ follows this exercise).
(d) Discuss the pros and cons of the various phase definitions.
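For part (c) it helps to know how many phases a given $f$ produces within a horizon of $n$ rounds, since the number of phases enters the regret analysis. A minimal sketch (the function name is ours):
\begin{verbatim}
def num_phases(f, n):
    """Phases started within n rounds when tau_{k+1} = tau_k + f(k)."""
    tau, k = 0, 0
    while tau < n:
        tau += f(k)
        k += 1
    return k

n = 10**6
print(num_phases(lambda k: k + 1, n))   # linearly growing phases: ~sqrt(2n)
print(num_phases(lambda k: 2**k, n))    # doubling phases: ~log2(n)
print(num_phases(lambda k: 1, n))       # a new phase every round: n
\end{verbatim}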
38.24 (Unknown rewards) In this exercise you will modify the algorithm to handle the situation where $r$ is unknown and the rewards are stochastic. More precisely, assume there exists a function $r_a(s) \in [0,1]$ for all $a \in \mathcal{A}$ and $s \in \mathcal{S}$. Then in each round the learner observes $S_t$, chooses an action $A_t$ and receives a reward $X_t \in [0,1]$ with
\[
\mathbb{E}[X_t \mid A_t, S_t] = r_{A_t}(S_t)\,.
\]
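Any such modification needs optimistic estimates of the mean rewards. As a hedged illustration (the function name, array layout and confidence width below are our own choices, not necessarily those of the modification described next), empirical means with a Hoeffding-style bonus could be maintained as follows:
\begin{verbatim}
import numpy as np

def reward_ucb(sum_rewards, counts, t, delta=0.1):
    """Upper confidence bounds on r_a(s) for all (s, a) pairs.

    sum_rewards, counts: (S, A) arrays accumulating the observed
    rewards X_t and the visit counts up to round t."""
    n = np.maximum(counts, 1)                 # avoid division by zero
    r_hat = sum_rewards / n                   # empirical mean rewards
    width = np.sqrt(np.log(2 * counts.size * t / delta) / (2 * n))
    return np.minimum(1.0, r_hat + width)     # rewards lie in [0, 1]
\end{verbatim}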
In order to accommodate the unknown reward function we modify UCRL2 in the