optimal if it achieves the supremum in this definition and such a policy always
exists as long as the MDP is finite. In strongly connected MDPs the two
definitions coincide. For infinite MDPs everything becomes more delicate and
a large portion of the literature on MDPs is devoted to this case.
8 In applications where the asymptotic nature of gain optimality is unacceptable there are criteria that make finer distinctions between the policies. A memoryless policy $\pi^*$ is bias optimal if it is gain optimal and $v_{\pi^*} \geq v_\pi$ for all memoryless policies $\pi$. Even more sensitive criteria exist. Some keywords to search for are Blackwell optimality and $n$-discount optimality.
9 The Cesàro sum of a real-valued sequence $(a_n)_n$ is the asymptotic average of its partial sums. Let $s_n = a_0 + \cdots + a_{n-1}$ be the $n$th partial sum. The Cesàro sum of this sequence is $A = \lim_{n\to\infty} \frac{1}{n}(s_1 + \cdots + s_n)$ when this limit exists. The idea is that Cesàro summation smooths out periodicity, which means that for certain sequences the Cesàro sum exists while $s_n$ does not converge. For example, the alternating sequence $(+1, -1, +1, -1, \ldots)$ is Cesàro summable and its Cesàro sum is easily seen to be $1/2$, while it is not summable in the normal sense. If a sequence is summable, then its sum and its Cesàro sum coincide. The differential value of a policy is defined as a Cesàro sum so that it is well defined even if the underlying Markov chain has periodic states.
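As a quick numerical illustration (a Python sketch of my own, not part of the text; the function name cesaro_averages is made up for this example), the following computes the running averages $\frac{1}{n}(s_1 + \cdots + s_n)$ for the alternating sequence and shows them approaching $1/2$ even though the partial sums $s_n$ keep oscillating between $0$ and $1$.

    def cesaro_averages(a, n_terms):
        # Running Cesàro averages (1/n) * (s_1 + ... + s_n) for n = 1, ..., n_terms,
        # where s_n = a(0) + ... + a(n-1) is the n-th partial sum.
        averages = []
        partial = 0.0   # current partial sum s_n
        acc = 0.0       # accumulated sum s_1 + ... + s_n
        for n in range(1, n_terms + 1):
            partial += a(n - 1)
            acc += partial
            averages.append(acc / n)
        return averages

    def alternating(n):
        return 1.0 if n % 2 == 0 else -1.0   # a_n = (-1)^n

    print(cesaro_averages(alternating, 10)[-1])      # 0.5
    print(cesaro_averages(alternating, 10001)[-1])   # ~0.50005, tending to 1/2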
10 For $\gamma \in (0,1)$ the $\gamma$-discounted average of a sequence $(a_n)_n$ is $A_\gamma = (1-\gamma)\sum_{n=0}^\infty \gamma^n a_n$. An elementary argument shows that if $A_\gamma$ is well defined, then $A_\gamma = (1-\gamma)^2 \sum_{n=1}^\infty \gamma^{n-1} s_n$. Suppose the Cesàro sum $A = \lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^n s_t$ exists. Then, using the fact that $1 = (1-\gamma)^2 \sum_{n=1}^\infty \gamma^{n-1} n$, we have $A_\gamma - A = (1-\gamma)^2 \sum_{n=1}^\infty \gamma^{n-1}(s_n - nA)$. It is not hard to see that $\left|\sum_{n=1}^\infty \gamma^{n-1}(s_n - nA)\right| = O(1/(1-\gamma))$ and thus $A_\gamma - A = O(1-\gamma)$ as $\gamma \to 1$, which means that $\lim_{\gamma\to 1} A_\gamma = A$. The value $\lim_{\gamma\to 1} A_\gamma$ is called the Abel sum of $(a_n)_n$. Put simply, the Abel sum of a sequence is equal to its Cesàro sum when the latter exists. Abel summation is stronger in the sense that there are sequences that are Abel summable but not Cesàro summable. The approach of approximating Cesàro sums through $\gamma$-discounted averages and taking the limit as $\gamma \to 1$ is called the vanishing discount approach and is one of the standard ways
to prove that the (average reward) Bellman equation has a solution (see
Exercises 38.9 and 38.10). As an aside, the systematic study of how to define
the ‘sum’ of a divergent series is a relatively modern endeavour. An enjoyable
historical account is given in the first chapter of the book on the topic by
Hardy [1973].
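To complement the elementary argument above, here is a small numerical check (again a Python sketch of my own, not from the text; the name a and the truncation point N are illustrative) of the summation-by-parts identity $A_\gamma = (1-\gamma)^2 \sum_{n\geq 1} \gamma^{n-1} s_n$: both expressions are computed for the alternating sequence at a few values of $\gamma$ and agree up to truncation error, which is negligible here since $\gamma^N$ is astronomically small.

    def a(n):
        return 1.0 if n % 2 == 0 else -1.0   # a_n = (-1)^n

    N = 200_000   # truncation point for the infinite sums

    for gamma in (0.5, 0.9, 0.99):
        lhs = 0.0   # accumulates sum_{n=0}^{N-1} gamma^n a_n
        rhs = 0.0   # accumulates sum_{m=1}^{N} gamma^{m-1} s_m
        s = 0.0     # running partial sum s_m
        w = 1.0     # gamma^n
        for n in range(N):
            lhs += w * a(n)
            s += a(n)          # now s = s_{n+1}
            rhs += w * s       # gamma^n * s_{n+1} = gamma^{m-1} * s_m with m = n + 1
            w *= gamma
        # The two printed values coincide, verifying the identity numerically.
        print(gamma, (1 - gamma) * lhs, (1 - gamma) ** 2 * rhs)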
11 Given a solution $(\rho, v)$ to Eq. (38.6) we mentioned a procedure for finding a state $\tilde{s} \in \mathcal{S}$ that is recurrent under some optimal policy. This works as follows. Let $C_0 = \{(s,a) : \rho + v(s) = r_a(s) + \langle P_a(s), v\rangle\}$ and $I_0 = \{s : (s,a) \in C_0 \text{ for some } a \in \mathcal{A}\}$. Then define $C_{k+1}$ and $I_{k+1}$ inductively by the following algorithm. First find an $(s,a) \in C_k$ such that $P_a(s, s') > 0$ for some $s' \notin I_k$. If no such pair exists, then halt. Otherwise let $C_{k+1} = C_k \setminus \{(s,a)\}$ and $I_{k+1} = \{s : (s,a) \in C_{k+1} \text{ for some } a \in \mathcal{A}\}$. Now use the complementary slackness conditions of the dual program to Eq. (38.6) to prove that the