38.5 Upper confidence bounds for reinforcement learning

1: Input: $\mathcal{S}$, $\mathcal{A}$, $r$, $\delta \in (0, 1)$
2: $t = 0$
3: for $k = 1, 2, \ldots$ do
4:     $\tau_k = t + 1$
5:     Find $\pi_k$ as the greedy policy with respect to $v_k$ satisfying Eq. (38.16)
6:     do
7:         $t \leftarrow t + 1$, observe $S_t$ and take action $A_t = \pi_k(S_t)$
8:     while $T_t(S_t, A_t) < 2\, T_{\tau_k - 1}(S_t, A_t)$
9: end for

Algorithm 27: UCRL2.
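
To make the phase structure concrete, here is a minimal Python sketch of the loop in Algorithm 27. It relies on assumptions not in the text: a hypothetical environment exposing `reset()` and `step(action)` that returns the next state, and a `solve_optimistic(counts)` routine standing in for line 5 (computing a greedy policy from a solution of Eq. (38.16)). Only the doubling rule that ends a phase is spelled out.

```python
import numpy as np

def ucrl2(env, n_states, n_actions, horizon, solve_optimistic):
    """Phase structure of Algorithm 27 (UCRL2).  Confidence sets and the
    optimistic planning step (line 5) are delegated to `solve_optimistic`."""
    counts = np.zeros((n_states, n_actions), dtype=int)  # T_t(s, a)
    state = env.reset()
    t = 0
    while t < horizon:                     # phases k = 1, 2, ...
        phase_counts = counts.copy()       # T_{tau_k - 1}(s, a) at the phase start
        policy = solve_optimistic(counts)  # greedy policy w.r.t. v_k from Eq. (38.16)
        while True:                        # lines 6-8: do ... while
            t += 1
            action = policy[state]
            next_state = env.step(action)
            counts[state, action] += 1
            # The phase ends once the count of the pair just visited reaches
            # twice its count at the start of the phase (immediately, if that
            # count was zero), matching line 8 of Algorithm 27.
            phase_over = counts[state, action] >= 2 * phase_counts[state, action]
            state = next_state
            if phase_over or t >= horizon:
                break
    return counts
```

The doubling rule keeps the number of phases small (logarithmic in the number of visits to each state-action pair), so Eq. (38.16) only needs to be re-solved rarely.
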

38.5.1 The extended Markov decision process


The confidence set $\mathcal{C}_t$ defines a set of plausible transition probability functions at the start of round $t$. Since the reward function is already known, this corresponds to a set of plausible MDPs. The algorithm plays according to the optimal policy in the plausible MDP with the largest gain. There is some subtlety because the optimal policy is not unique: what is really needed is a policy that is greedy with respect to a value function satisfying the Bellman optimality equation in the plausible MDP with the largest gain. Precisely, at the start of the $k$th phase the algorithm must find a value function $v_k$, gain $\rho_k$ and MDP $M_k = (\mathcal{S}, \mathcal{A}, P_k, r)$ with $P_k \in \mathcal{C}_{\tau_k}$ such that

$$
\begin{aligned}
\rho_k + v_k(s) &= \max_{a \in \mathcal{A}} \big( r_a(s) + \langle P_{k,a}(s), v_k \rangle \big) \quad \text{for all } s \in \mathcal{S}, \text{ and} \\
\rho_k &= \max_{s \in \mathcal{S}} \, \max_{\pi \in \Pi_{\mathrm{DM}}} \, \max_{P \in \mathcal{C}_{\tau_k}} \rho^{\pi}_s(P),
\end{aligned}
\tag{38.16}
$$


where $\rho^{\pi}_s(P)$ is the gain of the deterministic memoryless policy $\pi$ started in state $s$ in the MDP with transition probability function $P$. The algorithm then plays according to $\pi_k$, defined as the greedy policy with respect to $v_k$.
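
As a small illustration, the following Python sketch shows what "greedy with respect to $v_k$" amounts to once an optimistic transition function $P_k$ and value function $v_k$ solving Eq. (38.16) are in hand. The array names and shapes are assumptions, not from the text.

```python
import numpy as np

def greedy_policy(r, P_k, v_k):
    """pi_k(s) = argmax_a ( r_a(s) + <P_{k,a}(s), v_k> ).

    Assumed shapes (not from the text):
      r    : (S, A)    known rewards r_a(s)
      P_k  : (S, A, S) optimistic transitions from Eq. (38.16)
      v_k  : (S,)      value function from Eq. (38.16)
    """
    q = r + P_k @ v_k        # q[s, a] = r_a(s) + <P_{k,a}(s), v_k>
    return q.argmax(axis=1)  # one action per state
```
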
There is quite a lot hidden in these equations. The gain is only guaranteed to be constant when $M_k$ has a finite diameter, but this may not hold for all plausible MDPs. As it happens, however, solutions to Eq. (38.16) are guaranteed to exist and can be found efficiently. To see why this is true we introduce the extended Markov decision process $\tilde{M}_k$, which has state space $\mathcal{S}$ and state-dependent action space $\tilde{\mathcal{A}}_s$ given by

$$
\tilde{\mathcal{A}}_s = \{(a, P) : a \in \mathcal{A},\; P \in \mathcal{C}_{\tau_k}(s, a)\}\,.
$$

The reward function of the extended MDP is $\tilde{r}_{(a,P)}(s) = r_a(s)$ and the transitions are $\tilde{P}_{(a,P)}(s) = P_a(s)$. The action space in the extended MDP allows the agent to choose both $a \in \mathcal{A}$ and a plausible transition vector $P_a(s) \in \mathcal{C}_{\tau_k}(s, a)$. By the definition of the confidence sets, for any pair of states $s, s'$ and action $a \in \mathcal{A}$ there always exists a transition vector $P_a(s) \in \mathcal{C}_{\tau_k}(s, a)$ such that $P_a(s, s') > 0$, which means that $\tilde{M}_k$ is strongly connected. Hence solving the Bellman optimality equation in the extended MDP yields a solution to Eq. (38.16).
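
The Python sketch below shows one way the Bellman optimality operator of the extended MDP can be evaluated. It assumes, as is common for this construction but not stated in this excerpt, that each confidence set $\mathcal{C}_{\tau_k}(s,a)$ is an $\ell_1$-ball of some radius around an empirical estimate `p_hat[s, a]`; under that assumption the maximisation over the second component of the extended action has a simple greedy solution. All names and shapes are assumptions.

```python
import numpy as np

def inner_max(p_hat, radius, v):
    """max of <P, v> over probability vectors P with ||P - p_hat||_1 <= radius.
    Greedy solution: move as much mass as allowed onto the state with the
    largest value, taking it from the states with the smallest values."""
    p = p_hat.copy()
    best = int(np.argmax(v))
    add = min(radius / 2.0, 1.0 - p[best])  # extra mass for the best state
    p[best] += add
    excess = add                            # the same amount is removed elsewhere
    for s in np.argsort(v):                 # lowest-value states first
        if excess <= 0.0:
            break
        if s == best:
            continue
        take = min(p[s], excess)
        p[s] -= take
        excess -= take
    return p

def extended_bellman_backup(r, p_hat, radii, v):
    """One application of the Bellman optimality operator in the extended MDP:
    the maximum runs over extended actions (a, P) with P in C(s, a).

    Assumed shapes: r (S, A), p_hat (S, A, S), radii (S, A), v (S,)."""
    S, A = r.shape
    q = np.empty((S, A))
    for s in range(S):
        for a in range(A):
            q[s, a] = r[s, a] + inner_max(p_hat[s, a], radii[s, a], v) @ v
    return q.max(axis=1)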