This means that $(S_t)_t$ is a Markov chain over the state space $\mathcal{S}$ with kernel $(P_x)_{x \in \mathcal{S}}$, and $U_t = u(S_t)$ is the utility of the state $S_t$ visited at time $t$. Let $\mathbb{P}_x$ be a measure satisfying the above conditions as well as $\mathbb{P}_x(S_1 = x) = 1$. As usual, let $\mathbb{E}_x$ denote the expectation with respect to $\mathbb{P}_x$. Define the value function $v : \mathcal{S} \to \mathbb{R}$ by
\[
v(x) = \sup_{\tau \in \mathcal{R}_1} \mathbb{E}_x[U_\tau], \tag{35.2}
\]
where $\mathcal{R}_1$ is the set of stopping times adapted to the filtration $\mathbb{F} = (\sigma(S_1, \ldots, S_t))_t$. The result provided by the next theorem is sufficient for our requirements.
Theorem 35.3. Assume for all $x \in \mathcal{S}$ that $U_\infty = \lim_{n \to \infty} U_n$ exists $\mathbb{P}_x$-a.s. and $\sup_{n \geq 1} |U_n|$ is $\mathbb{P}_x$-integrable. Then $v$ satisfies the Wald–Bellman equation,
\[
v(x) = \max\left\{ u(x), \int_{\mathcal{S}} v(y) \, P_x(dy) \right\} \qquad \text{for all } x \in \mathcal{S}.
\]
Furthermore, $\lim_{n \to \infty} v(S_n) = U_\infty$ $\mathbb{P}_x$-a.s., and the supremum in Eq. (35.2) is achieved by any stopping time $\tau$ such that for all $t$:
(a) $\tau \leq t$ on the event that $U_t > \int_{\mathcal{S}} v(y) \, P_{S_t}(dy)$.
(b) $\tau > t$ on the event that $U_t < \int_{\mathcal{S}} v(y) \, P_{S_t}(dy)$ and $\tau \geq t$.
A natural choice of stopping time satisfying conditions (a) and (b) in Theorem 35.3 is $\tau = \min\{t \geq 1 : v(S_t) = U_t\}$. The conditions allow for a possible indifference region where both stopping and continuing lead to the same expected utility.
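To make the Wald–Bellman recursion and this stopping rule concrete, here is a minimal sketch for a finite state space, where the integral reduces to a matrix–vector product. The function names, the restriction to finitely many states and the use of fixed-point iteration started from $u$ are illustrative assumptions rather than part of the theorem, which is stated for general measurable state spaces.
\begin{verbatim}
import numpy as np

def solve_wald_bellman(P, u, tol=1e-10, max_iter=100_000):
    # Fixed-point iteration v <- max(u, P v), started from v = u.
    # P is an (n, n) row-stochastic matrix with P[x, y] = P_x({y}),
    # and u is the (n,) vector of utilities u(x).  Starting from u the
    # iterates increase monotonically and, under conditions like those
    # of Theorem 35.3, their limit is the value function v.
    v = np.asarray(u, dtype=float).copy()
    for _ in range(max_iter):
        v_next = np.maximum(u, P @ v)
        if np.max(np.abs(v_next - v)) < tol:
            break
        v = v_next
    return v_next

def natural_stopping_time(trajectory, v, u):
    # The rule tau = min{t >= 1 : v(S_t) = U_t} from the text,
    # applied to a sampled trajectory of states (time is 1-indexed).
    for t, x in enumerate(trajectory, start=1):
        if np.isclose(v[x], u[x]):
            return t
    return len(trajectory)  # no stop triggered within this sample
\end{verbatim}
Iterating from $v_0 = u$ gives a non-decreasing sequence of finite-horizon stopping values, which is why the limit should be read as the value function only under integrability assumptions of the kind made in Theorem 35.3.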
The proof of Theorem 35.2 is straightforward (Exercise 35.2). Measurability
issues make the proof of Theorem 35.3 more technical (Exercise 35.3). Pointers
to the literature are given in the notes and a solution to the exercise is available.
35.3 1-armed bandits
The 1-armed bandit problem is a special case where the Bayesian optimal policy has a simple form that can often be computed efficiently. Before reading on, you might like to refresh your memory by looking at Exercises 4.9 and 8.2. Let $(\mathcal{E}, \mathcal{G}, Q, P)$ be a 2-armed Bayesian bandit environment where $P_{\nu 2} = \delta_{\mu_2}$ is a Dirac at a fixed constant $\mu_2 \in \mathbb{R}$ for all $\nu \in \mathcal{E}$. Because the mean of the second arm is known in advance, we call this a Bayesian 1-armed bandit problem. In Part (a) of Exercise 4.9 you showed that when the horizon is known it suffices to consider only retirement policies, which choose the first arm until some random time and then play the second arm until the end of the game. Since we care about Bayesian optimal policies, the result of Exercise 34.8 allows us to restrict our attention to deterministic retirement policies.
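To illustrate the structure of such policies, the sketch below implements a generic deterministic retirement policy. The act/observe interface and the user-supplied predicate should_retire are assumptions made for the sake of the example; choosing the retirement time optimally is exactly where the optimal stopping results above come in.
\begin{verbatim}
class RetirementPolicy:
    # Deterministic retirement policy: pull arm 1 until the predicate
    # `should_retire` fires on the observed reward history, then pull
    # arm 2 (whose mean mu_2 is known) for the rest of the game.
    # `should_retire` is a stand-in for the optimal stopping rule.

    def __init__(self, should_retire):
        self.should_retire = should_retire
        self.history = []        # rewards observed from arm 1 so far
        self.retired = False

    def act(self):
        if not self.retired and self.should_retire(self.history):
            self.retired = True  # retirement is irreversible
        return 2 if self.retired else 1

    def observe(self, reward):
        if not self.retired:     # only arm-1 plays are informative
            self.history.append(reward)
\end{verbatim}
For example, RetirementPolicy(lambda h: len(h) >= 10 and sum(h) / len(h) < mu_2) retires once the empirical mean of the first arm falls below $\mu_2$ after ten plays; this particular rule is only a hypothetical illustration, not the Bayesian optimal one.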