38.2 Optimal policies and the Bellman optimality equation

the existence of which you will prove in Exercise 38.7. For each $k \in \mathbb{N}$ define
\[
v^{(k)}_\pi = \sum_{t=1}^{k} P_\pi^{t-1}(r_\pi - \rho_\pi)\,.
\]

For $s \in \mathcal{S}$, $v^{(k)}_\pi(s)$ gives the total expected excess reward collected by $\pi$ when
the process starts in state $s$ and lasts for $k$ time steps. The (differential) value
function of a policy is a function $v_\pi : \mathcal{S} \to \mathbb{R}$ defined as the Cesàro sum of the
sequence $(P_\pi^t(r_\pi - \rho_\pi))_{t \ge 0}$,
\[
v_\pi = \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} v^{(k)}_\pi = \left((I - P_\pi + P_\pi^*)^{-1} - P_\pi^*\right) r_\pi\,. \tag{38.3}
\]

Note that the second equality above is nontrivial (Exercise 38.7). The definition
implies that $v_\pi(s) - v_\pi(s')$ is the `average' long-term advantage of starting in
state $s$ relative to starting in state $s'$ when following policy $\pi$. These quantities
are only defined for memoryless policies, for which they are also guaranteed to exist
(Exercise 38.7). The definition of $P_\pi^*$ implies that $P_\pi^* P_\pi = P_\pi^*$, which in turn
implies that $P_\pi^* v_\pi = 0$. Combining this with Eq. (38.2) and Eq. (38.3) shows that
for any memoryless policy $\pi$,
\[
\rho_\pi + v_\pi = r_\pi + P_\pi v_\pi\,. \tag{38.4}
\]
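
To make these quantities concrete, here is a minimal numerical sketch (not from the
text): the two-state chain, its transition matrix and reward vector are invented for
illustration. It computes $P_\pi^*$, $\rho_\pi$ and $v_\pi$ via Eq. (38.3) and checks
both $P_\pi^* v_\pi = 0$ and the evaluation equation (38.4).

\begin{verbatim}
import numpy as np

# Hypothetical two-state chain induced by a fixed memoryless policy pi.
P_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])   # transition matrix P_pi
r_pi = np.array([1.0, 0.0])     # reward vector r_pi

# For an ergodic chain the Cesaro limit P_pi^* has identical rows, each
# equal to the stationary distribution mu solving mu P_pi = mu.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
mu /= mu.sum()
P_star = np.outer(np.ones(2), mu)

# Gain and differential value function, following Eq. (38.3):
rho_pi = P_star @ r_pi
v_pi = (np.linalg.inv(np.eye(2) - P_pi + P_star) - P_star) @ r_pi

# Sanity checks: P_pi^* v_pi = 0 and Eq. (38.4).
assert np.allclose(P_star @ v_pi, 0.0)
assert np.allclose(rho_pi + v_pi, r_pi + P_pi @ v_pi)
\end{verbatim}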


A value function is a function $v : \mathcal{S} \to \mathbb{R}$ and its span is given by
\[
\operatorname{span}(v) = \max_{s \in \mathcal{S}} v(s) - \min_{s \in \mathcal{S}} v(s)\,.
\]

As with other quantities, value functions are associated with vectors in $\mathbb{R}^{\mathcal{S}}$. A
greedy policy with respect to a value function $v$ is a deterministic memoryless
policy $\pi_v$ given by
\[
\pi_v(s) = \operatorname{argmax}_{a \in \mathcal{A}} \, r_a(s) + \langle P_a(s), v \rangle\,.
\]


There may be many policies that are greedy with respect to a given value function $v$
because of ties in the maximum. Usually the ties do not matter, but for consistency
and to simplify matters, we assume that ties are broken in a systematic fashion. In
particular, this makes $\pi_v$ well defined for any value function.
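
As an illustration, the following sketch computes a greedy policy for a tabular MDP;
the array layout (r[a, s] for rewards, P[a, s, s'] for transitions) is an assumption
made for this example. Tie-breaking is systematic because np.argmax always returns
the lowest maximizing index.

\begin{verbatim}
import numpy as np

def greedy_policy(r, P, v):
    """Return pi_v(s) = argmax_a r_a(s) + <P_a(s), v> for every state s.

    r has shape (A, S) and P has shape (A, S, S); both are hypothetical
    tabular inputs, not a format prescribed by the text.
    """
    q = r + P @ v                # q[a, s] = r_a(s) + <P_a(s), v>
    # np.argmax breaks ties by choosing the smallest action index,
    # a systematic rule that makes pi_v well defined.
    return np.argmax(q, axis=0)
\end{verbatim}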
One way to find an optimal policy is as the greedy policy with respect to a
value function that satisfies the Bellman optimality equation, which is
\[
\rho + v(s) = \max_{a \in \mathcal{A}} \left( r_a(s) + \langle P_a(s), v \rangle \right) \quad \text{for all } s \in \mathcal{S}\,. \tag{38.5}
\]


This is a system of $|\mathcal{S}|$ nonlinear equations with unknowns $\rho \in \mathbb{R}$ and $v \in \mathbb{R}^{\mathcal{S}}$.
The reader will notice that if $v : \mathcal{S} \to \mathbb{R}$ is a solution to Eq. (38.5), then so is
$v + c\mathbf{1}$ for any constant $c \in \mathbb{R}$, and hence the Bellman optimality equation lacks
unique solutions. It is not true that the optimal value function is unique up to
translation, even when $M$ is strongly connected (Exercise 38.11). The $v$-part of
a solution pair $(\rho, v)$ of Eq. (38.5) is called an optimal (differential) value
function.
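
Eq. (38.5) can be solved numerically. The sketch below is one standard approach,
relative value iteration, written for the same hypothetical tabular format as above;
normalising $v(s_0) = 0$ at each step pins down the free additive constant noted in
the previous paragraph. Convergence requires additional conditions (for example,
aperiodicity), so this is a sketch rather than a general-purpose solver.

\begin{verbatim}
import numpy as np

def relative_value_iteration(r, P, tol=1e-10, max_iter=100_000):
    """Approximately solve Eq. (38.5) for a tabular MDP.

    r has shape (A, S), P has shape (A, S, S). Returns (rho, v) with the
    normalisation v[0] = 0.
    """
    v = np.zeros(r.shape[1])
    for _ in range(max_iter):
        # Bellman operator: (Tv)(s) = max_a r_a(s) + <P_a(s), v>.
        Tv = np.max(r + P @ v, axis=0)
        # With v normalised so that v[0] = 0, the per-step shift of the
        # values estimates the optimal gain rho.
        rho = Tv[0]
        v_next = Tv - Tv[0]
        if np.max(np.abs(v_next - v)) < tol:
            return rho, v_next
        v = v_next
    return rho, v
\end{verbatim}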
