
38.2 Optimal policies and the Bellman optimality equation


We now define the notion of an optimal policy and outline the proof that there
exists a deterministic memoryless optimal policy. Along the way we define what is
called the Bellman optimality equation. Methods that solve this equation are the
basis for finding optimal policies in an efficient manner and also play a significant
role in learning algorithms. Throughout we fix a strongly connected Markov
decision process $M$.

The gain of a policy $\pi$ is the long-term average reward expected from using
that policy when starting in state $s$,

$$\rho^\pi_s = \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^n \mathbb{E}_\pi\left[ r_{A_t}(S_t) \,\middle|\, S_1 = s \right],$$

where $\mathbb{E}_\pi$ denotes the expectation over the interaction sequence that arises when policy $\pi$
interacts with the MDP $M$. In general the limit need not exist, so we also introduce

$$\bar{\rho}^\pi_s = \limsup_{n\to\infty} \frac{1}{n} \sum_{t=1}^n \mathbb{E}_\pi\left[ r_{A_t}(S_t) \,\middle|\, S_1 = s \right],$$

which exists for any policy. Of course, whenever $\rho^\pi_s$ exists we have $\rho^\pi_s = \bar{\rho}^\pi_s$. The
optimal gain is a real value

$$\rho^* = \max_{s\in S} \sup_{\pi} \bar{\rho}^\pi_s,$$

where the supremum is taken over all policies. A policy $\pi$ is an optimal policy
if $\rho^\pi = \rho^* \mathbf{1}$, where $\mathbf{1}$ is the all-ones vector, so the gain is $\rho^*$ from every
starting state. For strongly connected MDPs an optimal policy is guaranteed to
exist. This is far from trivial, however, and we will spend the next little while
outlining the proof.

MDPs that are not strongly connected may not have a constant optimal
gain. This makes everything more complicated and we're lucky not to
have to deal with them here.
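
To make the definition of the gain concrete, here is a minimal Monte Carlo sketch in Python, assuming a small invented two-state, two-action MDP and memoryless policies (the arrays `P`, `r`, `pi` and all numbers are illustrative assumptions, not from the text). Finite-horizon averages only approximate the limits defining $\rho^\pi_s$ and $\rho^*$.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP (all numbers invented for illustration).
# P[a, s, s2] = probability of moving from state s to s2 when playing action a.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
# r[a, s] = expected reward for playing action a in state s.
r = np.array([[1.0, 0.0],
              [0.3, 0.7]])

def estimate_gain(pi, s0, n=50_000):
    """Finite-horizon average reward under memoryless policy pi (pi[s, a]),
    started in state s0; approximates the limit defining rho^pi_{s0}."""
    s, total = s0, 0.0
    for _ in range(n):
        a = rng.choice(2, p=pi[s])
        total += r[a, s]
        s = rng.choice(2, p=P[a, s])
    return total / n

# Gain of a fixed stochastic memoryless policy from both start states. The
# induced chain is irreducible here, so the two estimates roughly agree.
pi = np.array([[0.7, 0.3],
               [0.1, 0.9]])
print([round(estimate_gain(pi, s), 3) for s in range(2)])

# Crude estimate of rho^*: brute force over the four deterministic memoryless
# policies, which suffices if a deterministic memoryless optimal policy exists,
# as claimed above for strongly connected MDPs.
print(round(max(estimate_gain(np.eye(2)[list(acts)], 0)
                for acts in product(range(2), repeat=2)), 3))
```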

Before continuing we need some new notation. For a memoryless policy $\pi$ define

$$P_\pi(s,s') = \sum_{a\in A} \pi(a\,|\,s)\, P_a(s,s') \quad \text{and} \quad r_\pi(s) = \sum_{a\in A} \pi(a\,|\,s)\, r_a(s). \tag{38.1}$$
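
As a quick numerical check of (38.1), the sketch below mixes the per-action kernels and rewards with the policy probabilities; it reuses the invented arrays `P`, `r` and `pi` from the sketch above, so everything here is an illustrative assumption.

```python
import numpy as np

# Same invented 2-state, 2-action MDP and memoryless policy as above.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])   # P[a, s, s2]
r = np.array([[1.0, 0.0], [0.3, 0.7]])     # r[a, s]
pi = np.array([[0.7, 0.3], [0.1, 0.9]])    # pi[s, a]

# Equation (38.1): average the action kernels and rewards under pi(. | s).
P_pi = np.einsum('sa,ast->st', pi, P)   # P_pi[s, s2] = sum_a pi(a|s) P_a(s, s2)
r_pi = np.einsum('sa,as->s', pi, r)     # r_pi[s] = sum_a pi(a|s) r_a(s)

print(P_pi, P_pi.sum(axis=1))           # rows sum to 1: a valid transition matrix
print(r_pi)
```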

We view $P_\pi$ as an $S \times S$ transition matrix and $r_\pi$ as a vector in $\mathbb{R}^S$. With this
notation $P_\pi$ is the transition matrix of the homogeneous Markov chain $S_1, S_2, \ldots$
when $A_t \sim \pi(\cdot\,|\,S_t)$. The gain of a memoryless policy $\pi$ satisfies

$$\rho^\pi = \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^n P_\pi^{t-1} r_\pi = P_\pi^* r_\pi, \tag{38.2}$$

where $P_\pi^* = \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^n P_\pi^{t-1}$ is called the stationary transition matrix,
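
A minimal numerical illustration of (38.2), again under the invented policy and MDP from the earlier sketches: the Cesàro average defining $P_\pi^*$ is truncated at a large $n$ and then multiplied by $r_\pi$; the resulting gain vector should roughly match the Monte Carlo estimates above.

```python
import numpy as np

# Chain induced by the invented policy: P_pi and r_pi from the (38.1) sketch.
P_pi = np.array([[0.78, 0.22],
                 [0.56, 0.44]])
r_pi = np.array([0.79, 0.63])

# Truncated Cesaro average (1/n) sum_{t=1}^n P_pi^{t-1}, approximating P_pi^*.
n = 10_000
acc, power = np.zeros_like(P_pi), np.eye(2)
for _ in range(n):
    acc += power
    power = power @ P_pi
P_star = acc / n

print(P_star)          # every row is (approximately) the stationary distribution
print(P_star @ r_pi)   # equation (38.2): a constant gain vector for this chain
```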