38.5 Upper confidence bounds for reinforcement learning

in sequential context. The algorithm that establishes Theorem 38.6 is called
UCRL2 because it is the second version of the ‘upper-confidence bounds for
reinforcement learning’ algorithm. Its pseudocode is shown in Algorithm 27.
At the start of each phase, UCRL2 computes an optimal policy for the
statistically plausible MDP with the largest optimal gain. The details of this
computation are left to the next section. This policy is then implemented until
the number of visits to some state-action pair doubles, at which point a new phase
starts and the process begins again. The use of phases is important, not just for
computational efficiency. Recalculating the optimistic policy in each round may
lead to a dithering behavior in which the algorithm frequently changes its plan
and suffers linear regret (Exercise 38.19).
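To make the phase structure concrete, here is a minimal sketch (not the book's Algorithm 27) of the outer loop, assuming a hypothetical environment interface and a hypothetical planner `optimistic_policy` for the step deferred to the next section; the stopping rule `phase_should_end` is spelled out at the end of this section.

```python
import numpy as np

def ucrl2_sketch(env, S, A, n, optimistic_policy, phase_should_end):
    """Minimal sketch of the UCRL2 phase structure; not the book's exact pseudocode.

    Assumptions (not from the text): `env` exposes reset() and step(a) with states
    numbered 0..S-1 and step returning (next_state, reward); `optimistic_policy(T, N)`
    is a hypothetical planner returning a policy (length-S array of actions) for the
    statistically plausible MDP with the largest gain; `phase_should_end` implements
    the doubling rule defined at the end of this section.
    """
    T = np.zeros((S, A), dtype=int)      # T_t(s, a): visits to each state-action pair
    N = np.zeros((S, A, S), dtype=int)   # transition counts behind the estimates
    s, t = env.reset(), 0
    while t < n:
        T_start = T.copy()               # visit counts at the start of the phase
        pi = optimistic_policy(T, N)     # recomputed once per phase, not every round
        while t < n:
            a = pi[s]
            s_next, _reward = env.step(a)
            T[s, a] += 1
            N[s, a, s_next] += 1
            t += 1
            stop = phase_should_end(T, T_start, s, a)
            s = s_next
            if stop:
                break                    # start a new phase with a fresh optimistic plan
```

Recomputing the policy only at phase boundaries is what rules out the dithering behavior mentioned above.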
To complete the specification of the algorithm, we must define confidence
sets on the unknown quantity, which in this case is the transition matrix. The
confidence sets are centered at the empirical transition probabilities defined by

$$\hat P_{t,a}(s, s') = \frac{\sum_{u=1}^{t} \mathbb{I}\{S_u = s,\, A_u = a,\, S_{u+1} = s'\}}{1 \vee T_t(s,a)}\,,$$

where $T_t(s,a) = \sum_{u=1}^{t} \mathbb{I}\{S_u = s,\, A_u = a\}$ is the number of times action $a$ was
taken in state $s$. As before we let $\hat P_{t,a}(s)$ be the vector whose $s'$th entry is
$\hat P_{t,a}(s, s')$. Given a state-action pair $(s,a)$ define

$$\mathcal C_t(s,a) = \left\{ P \in \mathcal P(\mathcal S) : \big\|P - \hat P_{t-1,a}(s)\big\|_1 \le \sqrt{\frac{S\, L_{t-1}(s,a)}{1 \vee T_{t-1}(s,a)}} \right\}, \tag{38.13}$$


where for $T_t(s,a) > 0$ we set

$$L_t(s,a) = 2 \log\left(\frac{4 S A\, T_t(s,a)\,(1 + T_t(s,a))}{\delta}\right)$$

and for $T_t(s,a) = 0$ we set $L_t(s,a) = 1$. Note that in this case $\mathcal C_{t+1}(s,a) = \mathcal P(\mathcal S)$.
Then define


$$\mathcal C_t = \left\{ P = (P_a(s))_{s,a} : P_a(s) \in \mathcal C_t(s,a) \text{ for all } (s,a) \in \mathcal S \times \mathcal A \right\}. \tag{38.14}$$
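To illustrate the definitions, the following sketch (not from the book) computes the empirical estimate $\hat P_{t-1,a}(s)$ from raw transition counts and tests membership of a candidate distribution in $\mathcal C_t(s,a)$ as in (38.13). The count array `N_prev`, holding transition counts up to round $t-1$, is an assumed data structure.

```python
import numpy as np

def empirical_P(N, s, a):
    """Empirical next-state distribution built from the counts N[s, a, :]."""
    T_sa = N[s, a].sum()               # T(s, a), the number of visits to the pair
    return N[s, a] / max(1, T_sa)      # the 1 v T(s, a) guard from the definition

def L_value(T_sa, S, A, delta):
    """L(s, a) as defined above; equal to 1 when the pair has never been visited."""
    if T_sa == 0:
        return 1.0
    return 2.0 * np.log(4 * S * A * T_sa * (1 + T_sa) / delta)

def in_confidence_set(P, N_prev, s, a, S, A, delta):
    """Test P in C_t(s,a): ||P - \\hat P_{t-1,a}(s)||_1 <= sqrt(S L_{t-1} / (1 v T_{t-1}))."""
    T_sa = N_prev[s, a].sum()          # T_{t-1}(s, a)
    radius = np.sqrt(S * L_value(T_sa, S, A, delta) / max(1, T_sa))
    return np.abs(P - empirical_P(N_prev, s, a)).sum() <= radius
```

When the pair has never been visited the radius is large enough that, as noted above, every plausible distribution is accepted.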

Clearly $T_t(s,a)$ cannot be larger than the total number of rounds $n$, so

$$L_t(s,a) \le L = 2 \log\left(\frac{4 S A\, n(n+1)}{\delta}\right). \tag{38.15}$$


The algorithm operates in phases $k = 1, 2, 3, \ldots$ with the first phase starting in
round $\tau_1 = 1$ and the $(k+1)$th phase starting in round $\tau_{k+1}$ defined inductively
by

$$\tau_{k+1} = 1 + \min\{t : T_t(S_t, A_t) \ge 2\, T_{\tau_k - 1}(S_t, A_t)\}\,,$$

which means that the next phase starts once the number of visits to some
state-action pair at least doubles.
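Written as code, the stopping rule is a direct transcription of this definition and can serve as the `phase_should_end` helper in the earlier sketch.

```python
def phase_should_end(T, T_start, s, a):
    """True when T_t(S_t, A_t) has at least doubled relative to its value at the
    start of the phase; a previously unvisited pair ends the phase immediately,
    since twice zero is zero."""
    return T[s, a] >= 2 * T_start[s, a]
```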
