35.3 1-armed bandits
These facts allow us to frame the Bayesian 1-armed bandit problem in terms of
optimal stopping. Define a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ carrying random elements
$\nu \in \mathcal{E}$ and $Z = (Z_t)_{t=1}^n$ where
(a) $\mathbb{P}(\nu \in \cdot) = Q(\cdot)$.
(b) $\mathbb{P}(Z \in \cdot \mid \nu) = P_{\nu 1}^n(\cdot)$, which means that after conditioning on $\nu$, the sequence
$(Z_t)_{t=1}^n$ is independent and identically distributed according to $P_{\nu 1}$.
Given a deterministic retirement policy $\pi = (\pi_t)_{t=1}^n$, define the random variable
\[
\tau = \min\{t : \pi_t(2 \mid 1, Z_1, \ldots, 1, Z_{t-1}) = 1\},
\]
where the minimum of the empty set is taken to be $n + 1$. Clearly $\tau$ is an $\mathbb{F}$-stopping
time, where $\mathbb{F} = (\mathcal{F}_t)_{t=1}^{n+1}$ with $\mathcal{F}_t = \sigma(Z_1, \ldots, Z_{t-1})$. In fact, this correspondence
between deterministic retirement policies and $\mathbb{F}$-stopping times is a bijection. The
Bayesian expected reward when following the policy $\pi$ associated with stopping time
$\tau$ is
\[
\mathbb{E}\left[\sum_{t=1}^{\tau - 1} Z_t + \sum_{t=\tau}^{n} \mu_2\right] = \mathbb{E}[U_\tau],
\]
where $U_t = \sum_{s=1}^{t-1} Z_s + (n - t + 1)\mu_2$. Since minimizing the Bayesian regret is
equivalent to maximizing the Bayesian expected cumulative reward, the problem
of finding the Bayesian optimal policy has been reduced to an optimal stopping
problem.
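To make the reduction concrete, the following sketch estimates the Bayesian value $\mathbb{E}[U_\tau]$ of a given retirement rule by simulation. It assumes, purely for illustration, that the unknown arm is Bernoulli with a Beta prior; the function `bayes_value_of_retirement_rule` and the particular threshold rule at the end are hypothetical and not part of the text.

```python
import numpy as np

def bayes_value_of_retirement_rule(stop_rule, n, mu2, a0=1.0, b0=1.0,
                                   n_sims=10_000, seed=None):
    """Estimate the Bayesian expected reward E[U_tau] of a retirement rule.

    `stop_rule(t, rewards)` must decide to retire using only Z_1,...,Z_{t-1},
    so that the induced tau is an F-stopping time.  The unknown arm is
    Bernoulli(nu) with nu ~ Beta(a0, b0); this model is an assumption made
    for illustration only.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_sims):
        nu = rng.beta(a0, b0)              # draw the environment from the prior
        z = rng.binomial(1, nu, size=n)    # potential rewards of arm 1
        tau = n + 1                        # never retiring corresponds to tau = n + 1
        for t in range(1, n + 1):
            if stop_rule(t, z[: t - 1]):   # decision uses Z_1,...,Z_{t-1} only
                tau = t
                break
        # U_tau = rewards of arm 1 before retiring + mu2 for each remaining round
        total += z[: tau - 1].sum() + (n - tau + 1) * mu2
    return total / n_sims

# Example: a (generally suboptimal) rule that retires once the empirical
# mean of the observed rewards drops below mu2.
mu2 = 0.5
rule = lambda t, zs: t > 1 and zs.mean() < mu2
print(bayes_value_of_retirement_rule(rule, n=20, mu2=mu2, seed=0))
```

Note that the two sides of the displayed identity agree pathwise, since $\sum_{t=\tau}^{n} \mu_2 = (n - \tau + 1)\mu_2$, which is why the sketch simply accumulates $U_\tau$ directly.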
Proposition 35.4. If $Z_1$ is integrable, then the Bayesian regret is minimized by
the retirement policy associated with the stopping time $\tau = \min\{t \ge 1 : U_t = E_t\}$,
where
\[
E_t =
\begin{cases}
U_t, & \text{if } t = n + 1; \\
\max\{U_t, \mathbb{E}[E_{t+1} \mid \mathcal{F}_t]\}, & \text{otherwise.}
\end{cases}
\]
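For example, when $n = 1$ the recursion can be unravelled by hand. Here $U_1 = \mu_2$ and $U_2 = Z_1$, so $E_2 = Z_1$ and, since $\mathcal{F}_1$ is trivial, $E_1 = \max\{\mu_2, \mathbb{E}[Z_1]\}$. The stopping time in Proposition 35.4 therefore retires immediately ($\tau = 1$) when $\mathbb{E}[Z_1] \le \mu_2$ and otherwise plays the unknown arm in the single available round ($\tau = 2$), which is simply the rule that plays the arm with the larger prior mean.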
The interpretation of $E_t$ is that it is the total expected optimal value
conditioned on the information available at the start of round $t$. The proposition
is an immediate corollary of Theorem 35.2 and the fact that integrability of
$Z_1$ is equivalent to integrability of $(U_t)_{t=1}^{n+1}$. The optimal stopping time in
Proposition 35.4 can be rewritten in a more convenient form. For $1 \le t \le n + 1$
define $W_t = E_t - \sum_{s=1}^{t-1} Z_s$, which can be seen as the optimal value to go for the
last $n - t + 1$ rounds. The definition of $E_t$ shows that $W_{n+1} = 0$ and for $t \le n$,
\[
\begin{aligned}
W_t &= \max\left((n - t + 1)\mu_2,\; \mathbb{E}[E_{t+1} \mid \mathcal{F}_t] - \sum_{s=1}^{t-1} Z_s\right) \\
    &= \max\left((n - t + 1)\mu_2,\; \mathbb{E}[Z_t + W_{t+1} \mid \mathcal{F}_t]\right),
\end{aligned}
\tag{35.3}
\]
where the second equality follows because $E_{t+1} = W_{t+1} + \sum_{s=1}^{t} Z_s$ and $\sum_{s=1}^{t-1} Z_s$ is $\mathcal{F}_t$-measurable.
Hence the optimal stopping time can be rewritten as
\[
\tau^* = \min\{t : U_t = E_t\} = \min\{t : W_t = (n - t + 1)\mu_2\}.
\]
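In any model where the conditional expectation in (35.3) can be computed exactly, the recursion yields the Bayesian optimal policy by backward induction over posterior states. The sketch below does this for a Beta-Bernoulli model, where the posterior after $t - 1$ plays of the unknown arm with $s$ successes is $\mathrm{Beta}(a_0 + s, b_0 + t - 1 - s)$; the model and the function name `one_armed_bandit_dp` are illustrative assumptions, not part of the text.

```python
import numpy as np

def one_armed_bandit_dp(n, mu2, a0=1.0, b0=1.0):
    """Backward induction for the recursion (35.3) in a Beta-Bernoulli model.

    W[t-1][s] is the optimal value-to-go W_t for rounds t,...,n when arm 1 has
    been played in each of the first t-1 rounds and produced s successes, so
    the posterior on the unknown mean is Beta(a0 + s, b0 + t - 1 - s).
    The Beta-Bernoulli model is an assumption made for illustration only.
    """
    # W[t-1] has entries for s = 0,...,t-1; the last table is W_{n+1} = 0.
    W = [np.zeros(t) for t in range(1, n + 2)]
    stop = [np.zeros(t, dtype=bool) for t in range(1, n + 1)]
    for t in range(n, 0, -1):
        for s in range(t):
            p = (a0 + s) / (a0 + b0 + t - 1)     # posterior mean of the unknown arm
            retire = (n - t + 1) * mu2           # value of retiring to arm 2 now
            play = p * (1 + W[t][s + 1]) + (1 - p) * W[t][s]  # E[Z_t + W_{t+1} | F_t]
            W[t - 1][s] = max(retire, play)
            stop[t - 1][s] = retire >= play      # W_t = (n - t + 1) * mu2 here
    return W, stop

W, stop = one_armed_bandit_dp(n=10, mu2=0.5)
print(W[0][0])     # Bayesian optimal value W_1 (equals E_1, since no rewards yet)
print(stop[0][0])  # whether it is optimal to retire immediately
```

The table `stop` encodes the stopping region $\{(t, s) : W_t = (n - t + 1)\mu_2\}$, so the induced policy plays arm 1 until the posterior state first enters this region and retires to arm 2 thereafter, in line with $\tau^*$ above.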