38.5 Upper confidence bounds for reinforcement learning

1: Input: $\mathcal{S}$, $\mathcal{A}$, $r$, $\delta \in (0, 1)$
2: $t = 0$
3: for $k = 1, 2, \ldots$ do
4:     $\tau_k = t + 1$
5:     Find $\pi_k$ as the greedy policy with respect to $v_k$ satisfying Eq. (38.16)
6:     do
7:         $t \leftarrow t + 1$, observe $S_t$ and take action $A_t = \pi_k(S_t)$
8:     while $T_t(S_t, A_t) < 2\, T_{\tau_k - 1}(S_t, A_t)$
9: end for

Algorithm 27: UCRL2.
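
To make the phase structure concrete, here is a minimal Python sketch of the loop in Algorithm 27. It relies on assumptions not in the text: a hypothetical environment exposing `reset()` and `step(action)` that returns the next state, and a `solve_optimistic(counts)` routine standing in for line 5 (computing a greedy policy from a solution of Eq. (38.16)). Only the doubling rule that ends a phase is spelled out.

```python
import numpy as np

def ucrl2(env, n_states, n_actions, horizon, solve_optimistic):
    """Phase structure of Algorithm 27 (UCRL2).  Confidence sets and the
    optimistic planning step (line 5) are delegated to `solve_optimistic`."""
    counts = np.zeros((n_states, n_actions), dtype=int)  # T_t(s, a)
    state = env.reset()
    t = 0
    while t < horizon:                     # phases k = 1, 2, ...
        phase_counts = counts.copy()       # T_{tau_k - 1}(s, a) at the phase start
        policy = solve_optimistic(counts)  # greedy policy w.r.t. v_k from Eq. (38.16)
        while True:                        # lines 6-8: do ... while
            t += 1
            action = policy[state]
            next_state = env.step(action)
            counts[state, action] += 1
            # The phase ends once the count of the pair just visited reaches
            # twice its count at the start of the phase (immediately, if that
            # count was zero), matching line 8 of Algorithm 27.
            phase_over = counts[state, action] >= 2 * phase_counts[state, action]
            state = next_state
            if phase_over or t >= horizon:
                break
    return counts
```

The doubling rule keeps the number of phases small (logarithmic in the number of visits to each state-action pair), so Eq. (38.16) only needs to be re-solved rarely.
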

38.5.1 The extended Markov decision process


The confidence set $\mathcal{C}_t$ defines a set of plausible transition probability functions at the start of round $t$. Since the reward function is already known, this corresponds to a set of plausible MDPs. The algorithm plays according to the optimal policy in the plausible MDP with the largest gain. There is some subtlety because the optimal policy is not unique: what is really needed is a policy that is greedy with respect to a value function satisfying the Bellman optimality equation in the plausible MDP with the largest gain. Precisely, at the start of the $k$th phase the algorithm must find a value function $v_k$, gain $\rho_k$ and MDP $M_k = (\mathcal{S}, \mathcal{A}, P_k, r)$ with $P_k \in \mathcal{C}_{\tau_k}$ such that

$$
\begin{aligned}
\rho_k + v_k(s) &= \max_{a \in \mathcal{A}} \big( r_a(s) + \langle P_{k,a}(s), v_k \rangle \big) \quad \text{for all } s \in \mathcal{S}, \text{ and} \\
\rho_k &= \max_{s \in \mathcal{S}} \, \max_{\pi \in \Pi_{\mathrm{DM}}} \, \max_{P \in \mathcal{C}_{\tau_k}} \rho^{\pi}_s(P),
\end{aligned}
\tag{38.16}
$$


where $\rho^{\pi}_s(P)$ is the gain of the deterministic memoryless policy $\pi$ started in state $s$ in the MDP with transition probability function $P$. The algorithm then plays according to $\pi_k$, defined as the greedy policy with respect to $v_k$.
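
As a small illustration, the following Python sketch shows what "greedy with respect to $v_k$" amounts to once an optimistic transition function $P_k$ and value function $v_k$ solving Eq. (38.16) are in hand. The array names and shapes are assumptions, not from the text.

```python
import numpy as np

def greedy_policy(r, P_k, v_k):
    """pi_k(s) = argmax_a ( r_a(s) + <P_{k,a}(s), v_k> ).

    Assumed shapes (not from the text):
      r    : (S, A)    known rewards r_a(s)
      P_k  : (S, A, S) optimistic transitions from Eq. (38.16)
      v_k  : (S,)      value function from Eq. (38.16)
    """
    q = r + P_k @ v_k        # q[s, a] = r_a(s) + <P_{k,a}(s), v_k>
    return q.argmax(axis=1)  # one action per state
```
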
There is quite a lot hidden in these equations. The gain is only guaranteed to be constant when $M_k$ has a finite diameter, but this may not hold for all plausible MDPs. As it happens, however, solutions to Eq. (38.16) are guaranteed to exist and can be found efficiently. To see why this is true we introduce the extended Markov decision process $\tilde{M}_k$, which has state space $\mathcal{S}$ and state-dependent action space $\tilde{\mathcal{A}}_s$ given by

$$
\tilde{\mathcal{A}}_s = \{(a, P) : a \in \mathcal{A},\; P \in \mathcal{C}_{\tau_k}(s, a)\}\,.
$$

The reward function of the extended MDP is $\tilde{r}_{(a,P)}(s) = r_a(s)$ and the transitions are $\tilde{P}_{(a,P)}(s) = P_a(s)$. The action space in the extended MDP allows the agent to choose both $a \in \mathcal{A}$ and a plausible transition vector $P_a(s) \in \mathcal{C}_{\tau_k}(s, a)$. By the definition of the confidence sets, for any pair of states $s, s'$ and action $a \in \mathcal{A}$ there always exists a transition vector $P_a(s) \in \mathcal{C}_{\tau_k}(s, a)$ such that $P_a(s, s') > 0$, which means that $\tilde{M}_k$ is strongly connected. Hence solving the Bellman optimality equation in the extended MDP yields a solution to Eq. (38.16).
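
The Python sketch below shows one way the Bellman optimality operator of the extended MDP can be evaluated. It assumes, as is common for this construction but not stated in this excerpt, that each confidence set $\mathcal{C}_{\tau_k}(s,a)$ is an $\ell_1$-ball of some radius around an empirical estimate `p_hat[s, a]`; under that assumption the maximisation over the second component of the extended action has a simple greedy solution. All names and shapes are assumptions.

```python
import numpy as np

def inner_max(p_hat, radius, v):
    """max of <P, v> over probability vectors P with ||P - p_hat||_1 <= radius.
    Greedy solution: move as much mass as allowed onto the state with the
    largest value, taking it from the states with the smallest values."""
    p = p_hat.copy()
    best = int(np.argmax(v))
    add = min(radius / 2.0, 1.0 - p[best])  # extra mass for the best state
    p[best] += add
    excess = add                            # the same amount is removed elsewhere
    for s in np.argsort(v):                 # lowest-value states first
        if excess <= 0.0:
            break
        if s == best:
            continue
        take = min(p[s], excess)
        p[s] -= take
        excess -= take
    return p

def extended_bellman_backup(r, p_hat, radii, v):
    """One application of the Bellman optimality operator in the extended MDP:
    the maximum runs over extended actions (a, P) with P in C(s, a).

    Assumed shapes: r (S, A), p_hat (S, A, S), radii (S, A), v (S,)."""
    S, A = r.shape
    q = np.empty((S, A))
    for s in range(S):
        for a in range(A):
            q[s, a] = r[s, a] + inner_max(p_hat[s, a], radii[s, a], v) @ v
    return q.max(axis=1)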