38.2 Optimal policies and the Bellman optimality equation
Theorem 38.2. The following hold:
(a) There exists a pair $(\rho, v)$ that satisfies the Bellman optimality equation.
(b) If $(\rho, v)$ satisfies the Bellman optimality equation, then $\rho = \rho^*$ and $\pi_v$ is optimal.
(c) There exists a deterministic memoryless optimal policy.
Proof sketch. The proof of Part (a) is too long to include here, but we guide you through it in Exercise 38.10. For Part (b), let $(\rho, v)$ satisfy the Bellman optimality equation and $\pi^* = \pi_v$ be the greedy policy with respect to $v$. Then, by Eq. (38.2),
\[
\rho^{\pi^*}
= \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^n P_{\pi^*}^{t-1} r_{\pi^*}
= \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^n P_{\pi^*}^{t-1} \bigl(\rho \mathbf{1} + v - P_{\pi^*} v\bigr)
= \rho \mathbf{1}\,.
\]
Next let $\pi$ be an arbitrary Markov policy. We show that $\bar\rho_\pi \le \rho \mathbf{1}$. The result is then completed using the result of Exercise 38.2, where you will prove that for any policy $\pi$ there exists a Markov policy with the same expected rewards. Denote by $\pi_t$ the memoryless policy used at time $t = 1, 2, \ldots$ when following the Markov policy $\pi$, and for $t \ge 1$ let $P_\pi(t) = P_{\pi_1} \cdots P_{\pi_t}$, while for $t = 0$ let $P_\pi(0) = I$. Thus, $P_\pi(t)(s, s')$ is the probability of ending up in state $s'$ while following $\pi$ from state $s$ for $t$ time steps. It follows that $\bar\rho_\pi = \limsup_{n\to\infty} \frac{1}{n} \sum_{t=1}^n P_\pi(t-1)\, r_{\pi_t}$. Fix $t \ge 1$. Using the fact that $\pi^*$ is the greedy policy with respect to $v$ gives
\[
\begin{aligned}
P_\pi(t-1)\, r_{\pi_t}
&= P_\pi(t-1)\bigl(r_{\pi_t} + P_{\pi_t} v - P_{\pi_t} v\bigr) \\
&\le P_\pi(t-1)\bigl(r_{\pi^*} + P_{\pi^*} v - P_{\pi_t} v\bigr) \\
&= P_\pi(t-1)\bigl(\rho \mathbf{1} + v - P_{\pi_t} v\bigr) \\
&= \rho \mathbf{1} + P_\pi(t-1)\, v - P_\pi(t)\, v\,.
\end{aligned}
\]
Taking the average of both sides over $t \in [n]$, the value terms on the right-hand side telescope to $\rho \mathbf{1} + \frac{1}{n}\bigl(v - P_\pi(n) v\bigr)$, and taking the limit shows that $\bar\rho_\pi \le \rho \mathbf{1}$, finishing the proof. Part (c) follows immediately from the first two parts.
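The following minimal sketch illustrates Parts (b) and (c) numerically on an invented two-state, two-action MDP; everything in it (`P`, `r`, `evaluate`, `greedy`, and the numbers themselves) is a hypothetical example chosen for illustration, not material from the text. It enumerates the deterministic memoryless policies, solves each policy's evaluation equation, and singles out the policy whose pair $(\rho, v)$ also solves the Bellman optimality equation, namely the one that is greedy with respect to its own $v$.

```python
import numpy as np
from itertools import product

# Hypothetical two-state, two-action MDP (invented for illustration):
# P[a, s, s'] are transition probabilities, r[a, s] expected rewards in [0, 1].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[1.0, 0.0],
              [0.5, 0.8]])
A, S, _ = P.shape

def evaluate(pi):
    """Solve the policy evaluation equation rho*1 + v = r_pi + P_pi v,
    normalised so that v(0) = 0, for a deterministic memoryless policy pi."""
    P_pi = P[pi, np.arange(S)]            # P_pi[s, s'] = P_{pi(s)}(s, s')
    r_pi = r[pi, np.arange(S)]            # r_pi[s]    = r_{pi(s)}(s)
    M = np.zeros((S, S))                  # unknowns: (rho, v(1), ..., v(S-1))
    M[:, 0] = 1.0
    M[:, 1:] = np.eye(S)[:, 1:] - P_pi[:, 1:]
    x = np.linalg.solve(M, r_pi)
    return x[0], np.concatenate(([0.0], x[1:]))   # (rho, v)

def greedy(v):
    """Deterministic memoryless policy that is greedy with respect to v."""
    return (r + P @ v).argmax(axis=0)

gains = {}
for pi in product(range(A), repeat=S):
    pi = np.array(pi)
    rho, v = evaluate(pi)
    gains[tuple(pi)] = rho
    # If pi is greedy with respect to its own v, the evaluation pair
    # (rho, v) also satisfies the Bellman *optimality* equation.
    if np.array_equal(greedy(v), pi):
        print("optimality equation solved by rho =", rho, ", v =", v)
        print("greedy policy pi_v =", pi)

print("gains of all deterministic memoryless policies:", gains)
```

On this particular example exactly one of the four deterministic memoryless policies passes the greedy self-consistency check, and its gain is the largest of the printed gains, in line with Parts (b) and (c).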
The theorem shows that there exist solutions to the Bellman optimality equation
and that the greedy policy with respect to the resulting value function is an
optimal policy. We need one more result about solutions to the Bellman optimality
equation, the proof of which you will provide in Exercise 38.13.
Lemma 38.3. Suppose that $(\rho, v)$ satisfies the Bellman optimality equation. Then $\operatorname{span}(v) \le D(M)$.
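As a quick numerical sanity check of the lemma (not a proof), here is a sketch on the same invented MDP used above. The diameter $D(M)$ is computed as the worst-case minimal expected hitting time between distinct states; `min_expected_hitting_time` is a hypothetical helper, and the value of $v$ is the one found in the earlier sketch for this example.

```python
import numpy as np

# Same hypothetical two-state, two-action MDP as in the earlier sketch.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
S = P.shape[1]

# Solution of the Bellman optimality equation for this MDP, found in the
# earlier sketch (normalised so that v(0) = 0): rho = 34/35, v = (0, -2/7).
v = np.array([0.0, -2.0 / 7.0])

def min_expected_hitting_time(target):
    """Value iteration for the smallest expected time to reach `target`
    over all policies (a stochastic shortest path with unit costs)."""
    h = np.zeros(S)
    for _ in range(100_000):
        h_new = 1.0 + (P @ h).min(axis=0)
        h_new[target] = 0.0
        if np.max(np.abs(h_new - h)) < 1e-12:
            break
        h = h_new
    return h

# Diameter: worst case over ordered pairs of distinct states.
D = max(min_expected_hitting_time(g)[s]
        for g in range(S) for s in range(S) if s != g)

print("span(v) =", v.max() - v.min())   # 2/7 ~ 0.29 for this example
print("D(M)    =", D)                   # ~1.67 here, so span(v) <= D(M)
```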
The map $T : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}$ defined by $(Tv)(s) = \max_{a \in \mathcal{A}} r_a(s) + \langle P_a(s), v\rangle$ is called the Bellman operator. The Bellman optimality equation can be written as $\rho \mathbf{1} + v = Tv$.
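For a tabular model the operator is one line of array code. The sketch below (same invented MDP as before; `bellman_operator` and the relative value iteration loop are illustrative choices, not something prescribed by the text) iterates $v \leftarrow Tv - (Tv)(0)\mathbf{1}$ and checks that the resulting pair satisfies $\rho \mathbf{1} + v = Tv$ up to numerical tolerance.

```python
import numpy as np

# Same hypothetical tabular MDP: P[a, s, s'] transition kernel, r[a, s] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[1.0, 0.0],
              [0.5, 0.8]])

def bellman_operator(v):
    """(Tv)(s) = max_a r_a(s) + <P_a(s), v>, computed for all states at once."""
    return (r + P @ v).max(axis=0)

# Relative value iteration: v <- Tv - (Tv)(0) * 1.  At a fixed point,
# rho = (Tv)(0) and the pair (rho, v) satisfies rho * 1 + v = Tv.
v = np.zeros(P.shape[1])
rho = 0.0
for _ in range(100_000):
    Tv = bellman_operator(v)
    rho, v_new = Tv[0], Tv - Tv[0]
    if np.max(np.abs(v_new - v)) < 1e-12:
        break
    v = v_new

print("rho =", rho, ", v =", v)
print("Bellman residual:", np.max(np.abs(rho + v - bellman_operator(v))))
```

Relative value iteration is only one way to produce such a pair; it is used here because it is short, and on this small ergodic example its convergence is unproblematic. The greedy policy $\pi_v$ can then be read off as the maximising action in each state.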