Bandit Algorithms

35.3 1 -armed bandits 414

These facts allow us to frame the Bayesian 1-armed bandit problem in terms of optimal stopping. Define a probability space (Ω,F,P) carrying random elements ν∈EandZ= (Zt)nt=1where

(a)P(ν∈·) =Q(·).
(b)P(Z∈·|ν) =Pνn 1 (·), which means that after conditioning onν, the sequence
(Zt)nt=1is independent and identically distributed according toPν 2.

Given a deterministic retirement policyπ= (πt)nt=1define the random variable

τ= min{t:πt(2| 1 ,Z 1 ,..., 1 ,Zt− 1 ) = 1},

where the minimum of an empty set in this case isn+ 1. Clearlyτis aF-stopping
time, whereF= (Ft)nt=1+1withFt=σ(Z 1 ,...,Zt− 1 ). In fact, this correspondence
between deterministic retirement policies andF-stopping times is a bijection. The
Bayesian expected reward when following policyπassociated with stopping time
τis

E

[τ− 1 ∑

t=1

Zt+

∑n

t=τ

μ 2

]

=E[Uτ],

whereUt=

∑t− 1 s=1Zs+ (n−t+ 1)μ^2. Since minimizing the Bayesian regret is equivalent to maximizing the Bayesian expected cumulative reward, the problem of finding the Bayesian optimal policy has been reduced to an optimal stopping problem.

Proposition35.4.IfZ 1 is integrable, then the Bayesian regret is minimized by
the retirement policy associated with stopping timeτ=min{t≥1 :Ut=Et},
where

Et=

{

Ut, ift=n+ 1 ; max{Ut,E[Et+1|Ft]}, otherwise.

The interpretation ofEt is that it is the total expected optimal value conditioned on the information available at the start of roundt. The proposition is an immediate corollary of Theorem 35.2 and the fact that integrability of Z 1 is equivalent to integrability of (Ut)nt=1+1. The optimal stopping time in Proposition 35.4 can be rewritten in a more convenient form. For 1≤t≤n+ 1 defineWt=Et−

∑t− 1 s=1Zs, which can be seen as the optimal value to go for the lastn−t+ 1 rounds. The definition ofEtshows thatWn+1= 0 and fort≤n,

Wt= max

(

(n−t+ 1)μ 2 ,E[Et+1|Ft]−

t∑− 1

s=1

Zs

)

= max ((n−t+ 1)μ 2 ,E[Zt+Wt+1|Ft]). (35.3)

Hence the optimal stopping time can be rewritten as

τ∗= min{t:Ut=Et}= min{t:Wt= (n−t+ 1)μ 2 }.

Bandit Algorithms

]

{

(

)

Get our desktop app

Company

Features

Documentation

Resources