38.5 Upper confidence bounds for reinforcement learning

in sequential context. The algorithm that establishes Theorem 38.6 is called
UCRL2 because it is the second version of the ‘upper-confidence bounds for
reinforcement learning’ algorithm. Its pseudocode is shown in Algorithm 27.
At the start of each phase, UCRL2 computes an optimal policy for the
statistically plausible MDP with the largest optimal gain. The details of this
computation are left to the next section. This policy is then implemented until
the number of visits to some state-action pair doubles, at which point a new phase
starts and the process begins again. The use of phases is important, not just for
computational efficiency. Recalculating the optimistic policy in each round may
lead to a dithering behavior in which the algorithm frequently changes its plan
and suffers linear regret (Exercise 38.19).
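To make the phase structure concrete, here is a minimal sketch (not the book's Algorithm 27) of the outer loop, assuming a hypothetical environment interface and a hypothetical planner `optimistic_policy` for the step deferred to the next section; the stopping rule `phase_should_end` is spelled out at the end of this section.

```python
import numpy as np

def ucrl2_sketch(env, S, A, n, optimistic_policy, phase_should_end):
    """Minimal sketch of the UCRL2 phase structure; not the book's exact pseudocode.

    Assumptions (not from the text): `env` exposes reset() and step(a) with states
    numbered 0..S-1 and step returning (next_state, reward); `optimistic_policy(T, N)`
    is a hypothetical planner returning a policy (length-S array of actions) for the
    statistically plausible MDP with the largest gain; `phase_should_end` implements
    the doubling rule defined at the end of this section.
    """
    T = np.zeros((S, A), dtype=int)      # T_t(s, a): visits to each state-action pair
    N = np.zeros((S, A, S), dtype=int)   # transition counts behind the estimates
    s, t = env.reset(), 0
    while t < n:
        T_start = T.copy()               # visit counts at the start of the phase
        pi = optimistic_policy(T, N)     # recomputed once per phase, not every round
        while t < n:
            a = pi[s]
            s_next, _reward = env.step(a)
            T[s, a] += 1
            N[s, a, s_next] += 1
            t += 1
            stop = phase_should_end(T, T_start, s, a)
            s = s_next
            if stop:
                break                    # start a new phase with a fresh optimistic plan
```

Recomputing the policy only at phase boundaries is what rules out the dithering behavior mentioned above.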
To complete the specification of the algorithm, we must define confidence
sets on the unknown quantity, which in this case is the transition matrix. The
confidence sets are centered at the empirical transition probabilities defined by

$$\hat P_{t,a}(s, s') = \frac{\sum_{u=1}^{t} \mathbb{I}\{S_u = s,\, A_u = a,\, S_{u+1} = s'\}}{1 \vee T_t(s,a)}\,,$$

where $T_t(s,a) = \sum_{u=1}^{t} \mathbb{I}\{S_u = s,\, A_u = a\}$ is the number of times action $a$ was
taken in state $s$. As before we let $\hat P_{t,a}(s)$ be the vector whose $s'$th entry is
$\hat P_{t,a}(s, s')$. Given a state-action pair $(s,a)$ define

$$\mathcal C_t(s,a) = \left\{ P \in \mathcal P(\mathcal S) : \big\|P - \hat P_{t-1,a}(s)\big\|_1 \le \sqrt{\frac{S\, L_{t-1}(s,a)}{1 \vee T_{t-1}(s,a)}} \right\}, \tag{38.13}$$


where for $T_t(s,a) > 0$ we set

$$L_t(s,a) = 2 \log\left(\frac{4 S A\, T_t(s,a)\,(1 + T_t(s,a))}{\delta}\right)$$

and for $T_t(s,a) = 0$ we set $L_t(s,a) = 1$. Note that in this case $\mathcal C_{t+1}(s,a) = \mathcal P(\mathcal S)$.
Then define


$$\mathcal C_t = \left\{ P = (P_a(s))_{s,a} : P_a(s) \in \mathcal C_t(s,a) \text{ for all } (s,a) \in \mathcal S \times \mathcal A \right\}. \tag{38.14}$$
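To illustrate the definitions, the following sketch (not from the book) computes the empirical estimate $\hat P_{t-1,a}(s)$ from raw transition counts and tests membership of a candidate distribution in $\mathcal C_t(s,a)$ as in (38.13). The count array `N_prev`, holding transition counts up to round $t-1$, is an assumed data structure.

```python
import numpy as np

def empirical_P(N, s, a):
    """Empirical next-state distribution built from the counts N[s, a, :]."""
    T_sa = N[s, a].sum()               # T(s, a), the number of visits to the pair
    return N[s, a] / max(1, T_sa)      # the 1 v T(s, a) guard from the definition

def L_value(T_sa, S, A, delta):
    """L(s, a) as defined above; equal to 1 when the pair has never been visited."""
    if T_sa == 0:
        return 1.0
    return 2.0 * np.log(4 * S * A * T_sa * (1 + T_sa) / delta)

def in_confidence_set(P, N_prev, s, a, S, A, delta):
    """Test P in C_t(s,a): ||P - \\hat P_{t-1,a}(s)||_1 <= sqrt(S L_{t-1} / (1 v T_{t-1}))."""
    T_sa = N_prev[s, a].sum()          # T_{t-1}(s, a)
    radius = np.sqrt(S * L_value(T_sa, S, A, delta) / max(1, T_sa))
    return np.abs(P - empirical_P(N_prev, s, a)).sum() <= radius
```

When the pair has never been visited the radius is large enough that, as noted above, every plausible distribution is accepted.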

Clearly $T_t(s,a)$ cannot be larger than the total number of rounds $n$, so

$$L_t(s,a) \le L = 2 \log\left(\frac{4 S A\, n(n+1)}{\delta}\right). \tag{38.15}$$


The algorithm operates in phases $k = 1, 2, 3, \ldots$ with the first phase starting in
round $\tau_1 = 1$ and the $(k+1)$th phase starting in round $\tau_{k+1}$ defined inductively
by

$$\tau_{k+1} = 1 + \min\{t : T_t(S_t, A_t) \ge 2\, T_{\tau_k - 1}(S_t, A_t)\}\,,$$

which means that the next phase starts once the number of visits to some
state-action pair at least doubles.
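Written as code, the stopping rule is a direct transcription of this definition and can serve as the `phase_should_end` helper in the earlier sketch.

```python
def phase_should_end(T, T_start, s, a):
    """True when T_t(S_t, A_t) has at least doubled relative to its value at the
    start of the phase; a previously unvisited pair ends the phase immediately,
    since twice zero is zero."""
    return T[s, a] >= 2 * T_start[s, a]
```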
