
38.2 Optimal policies and the Bellman optimality equation


We now define the notion of an optimal policy and outline the proof that there
exists a deterministic memoryless optimal policy. Along the way we define what is
called the Bellman optimality equation. Methods that solve this equation are the
basis for finding optimal policies in an efficient manner and also play a significant
role in learning algorithms. Throughout we fix a strongly connected Markov
decision process $M$.

The gain of a policy $\pi$ is the long-term average reward expected from using
that policy when starting in state $s$,

$$\rho^\pi_s = \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^n \mathbb{E}_\pi\left[ r_{A_t}(S_t) \,\middle|\, S_1 = s \right],$$

where $\mathbb{E}_\pi$ denotes the expectation over the interaction sequence that arises when policy $\pi$
interacts with the MDP $M$. In general the limit need not exist, so we also introduce

$$\bar{\rho}^\pi_s = \limsup_{n\to\infty} \frac{1}{n} \sum_{t=1}^n \mathbb{E}_\pi\left[ r_{A_t}(S_t) \,\middle|\, S_1 = s \right],$$

which exists for any policy. Of course, whenever $\rho^\pi_s$ exists we have $\rho^\pi_s = \bar{\rho}^\pi_s$. The
optimal gain is a real value

$$\rho^* = \max_{s\in S} \sup_{\pi} \bar{\rho}^\pi_s,$$

where the supremum is taken over all policies. A policy $\pi$ is an optimal policy
if $\rho^\pi = \rho^* \mathbf{1}$, where $\mathbf{1}$ is the all-ones vector, so the gain is $\rho^*$ from every
starting state. For strongly connected MDPs an optimal policy is guaranteed to
exist. This is far from trivial, however, and we will spend the next little while
outlining the proof.

MDPs that are not strongly connected may not have a constant optimal
gain. This makes everything more complicated and we're lucky not to
have to deal with them here.
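
To make the definition of the gain concrete, here is a minimal Monte Carlo sketch in Python, assuming a small invented two-state, two-action MDP and memoryless policies (the arrays `P`, `r`, `pi` and all numbers are illustrative assumptions, not from the text). Finite-horizon averages only approximate the limits defining $\rho^\pi_s$ and $\rho^*$.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP (all numbers invented for illustration).
# P[a, s, s2] = probability of moving from state s to s2 when playing action a.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
# r[a, s] = expected reward for playing action a in state s.
r = np.array([[1.0, 0.0],
              [0.3, 0.7]])

def estimate_gain(pi, s0, n=50_000):
    """Finite-horizon average reward under memoryless policy pi (pi[s, a]),
    started in state s0; approximates the limit defining rho^pi_{s0}."""
    s, total = s0, 0.0
    for _ in range(n):
        a = rng.choice(2, p=pi[s])
        total += r[a, s]
        s = rng.choice(2, p=P[a, s])
    return total / n

# Gain of a fixed stochastic memoryless policy from both start states. The
# induced chain is irreducible here, so the two estimates roughly agree.
pi = np.array([[0.7, 0.3],
               [0.1, 0.9]])
print([round(estimate_gain(pi, s), 3) for s in range(2)])

# Crude estimate of rho^*: brute force over the four deterministic memoryless
# policies, which suffices if a deterministic memoryless optimal policy exists,
# as claimed above for strongly connected MDPs.
print(round(max(estimate_gain(np.eye(2)[list(acts)], 0)
                for acts in product(range(2), repeat=2)), 3))
```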

Before continuing we need some new notation. For a memoryless policy $\pi$ define

$$P_\pi(s,s') = \sum_{a\in A} \pi(a\,|\,s)\, P_a(s,s') \quad \text{and} \quad r_\pi(s) = \sum_{a\in A} \pi(a\,|\,s)\, r_a(s). \tag{38.1}$$
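
As a quick numerical check of (38.1), the sketch below mixes the per-action kernels and rewards with the policy probabilities; it reuses the invented arrays `P`, `r` and `pi` from the sketch above, so everything here is an illustrative assumption.

```python
import numpy as np

# Same invented 2-state, 2-action MDP and memoryless policy as above.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])   # P[a, s, s2]
r = np.array([[1.0, 0.0], [0.3, 0.7]])     # r[a, s]
pi = np.array([[0.7, 0.3], [0.1, 0.9]])    # pi[s, a]

# Equation (38.1): average the action kernels and rewards under pi(. | s).
P_pi = np.einsum('sa,ast->st', pi, P)   # P_pi[s, s2] = sum_a pi(a|s) P_a(s, s2)
r_pi = np.einsum('sa,as->s', pi, r)     # r_pi[s] = sum_a pi(a|s) r_a(s)

print(P_pi, P_pi.sum(axis=1))           # rows sum to 1: a valid transition matrix
print(r_pi)
```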

We view $P_\pi$ as an $S \times S$ transition matrix and $r_\pi$ as a vector in $\mathbb{R}^S$. With this
notation $P_\pi$ is the transition matrix of the homogeneous Markov chain $S_1, S_2, \ldots$
when $A_t \sim \pi(\cdot\,|\,S_t)$. The gain of a memoryless policy $\pi$ satisfies

$$\rho^\pi = \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^n P_\pi^{t-1} r_\pi = P_\pi^* r_\pi, \tag{38.2}$$

where $P_\pi^* = \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^n P_\pi^{t-1}$ is called the stationary transition matrix,
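
A minimal numerical illustration of (38.2), again under the invented policy and MDP from the earlier sketches: the Cesàro average defining $P_\pi^*$ is truncated at a large $n$ and then multiplied by $r_\pi$; the resulting gain vector should roughly match the Monte Carlo estimates above.

```python
import numpy as np

# Chain induced by the invented policy: P_pi and r_pi from the (38.1) sketch.
P_pi = np.array([[0.78, 0.22],
                 [0.56, 0.44]])
r_pi = np.array([0.79, 0.63])

# Truncated Cesaro average (1/n) sum_{t=1}^n P_pi^{t-1}, approximating P_pi^*.
n = 10_000
acc, power = np.zeros_like(P_pi), np.eye(2)
for _ in range(n):
    acc += power
    power = power @ P_pi
P_star = acc / n

print(P_star)          # every row is (approximately) the stationary distribution
print(P_star @ r_pi)   # equation (38.2): a constant gain vector for this chain
```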