Bandit Algorithms

35.8 Exercises

35.2 (Finite horizon optimal stopping) Prove Theorem 35.2.

Hint Prove that $(E_t)_{t=1}^n$ is an $\mathbb{F}$-adapted supermartingale and that for a stopping time $\tau$ satisfying the conditions of the theorem, the process $(M_t)_{t=1}^n$ defined by $M_t = E_{t \wedge \tau}$ is a martingale. Then apply the optional stopping theorem (Theorem 3.8).
35.3 (Infinite horizon optimal stopping) Prove Theorem 35.3.

Hint This is a technical exercise. Modify the proof of [Peskir and Shiryaev, 2006, Theorem 1.11] to the situation where stopping times are allowed to be infinite, using the almost-sure convergence of $(U_t)_t$ as $t \to \infty$.
35.4 (Equivalence of Gittins definitions) Prove that the definitions of the Gittins index given in Eq. (35.7) and Eq. (35.8) are equivalent.

35.5 Prove Lemma 35.7.

Hint Find a way to apply Theorem 35.3.
35.6 Consider the discounted bandit with Markov payoffs described in Example 35.8. Show that there is a one-to-one correspondence $\varphi$ between the policies for this problem and the discounted Bayesian bandit with a $\mathrm{Beta}(1,1)$ prior on the mean reward of each arm, such that the total expected discounted reward (value) is invariant under $\varphi$.


35.7 Prove Lemma 35.10.

Hint Use the Hardy–Littlewood inequality, which for infinite sequences states that for any real, increasing sequences $(x_n)_{n=1}^\infty$, $(y_n)_{n=1}^\infty$ and any bijection $\sigma : \mathbb{N}^+ \to \mathbb{N}^+$ it holds that
$$\sum_{n=1}^\infty x_n y_n \ge \sum_{n=1}^\infty x_n y_{\sigma(n)}\,.$$
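As a quick sanity check (not part of the proof), the finite analogue of this inequality, the classical rearrangement inequality, can be verified numerically by exhausting all permutations of a short increasing sequence. The helper name `rearrangement_gap` is illustrative, not from the text:

```python
import itertools
import random

def rearrangement_gap(x, y, sigma):
    """Return sum_n x[n]*y[n] - sum_n x[n]*y[sigma[n]] for finite sequences."""
    aligned = sum(a * b for a, b in zip(x, y))
    permuted = sum(x[n] * y[sigma[n]] for n in range(len(x)))
    return aligned - permuted

# For increasing x and y the identity permutation maximises the sum,
# so the gap is non-negative for every bijection sigma of the indices.
x = sorted(random.Random(0).sample(range(100), 6))
y = sorted(random.Random(1).sample(range(100), 6))
assert all(rearrangement_gap(x, y, list(p)) >= 0
           for p in itertools.permutations(range(6)))
```

The infinite-sequence version in the hint follows the same pattern, with the extra care about convergence that the exercise requires.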
35.8 Prove Lemma 35.5.

35.9 (Correctness of Varaiya’s algorithm) Prove the correctness of Varaiya’s algorithm as explained in Section 35.5.


35.10 In this exercise you will implement some Bayesian (near-)optimal 1-armed bandit algorithms.

(a) Reproduce the experimental results in Section 35.3.
(b) Implement an approximation of the optimal policy for 1-armed Gaussian bandits and compare its performance to the stopping rule $\tau_\alpha$ defined below for a variety of different choices of $\alpha > 0$.


$$\tau_\alpha = \min\left\{ t \ge 2 : \hat\mu_{t-1} + \sqrt{\frac{2\max\{0, \log(\alpha n/t)\}}{t-1}} \le \mu_2 \right\}\,.$$
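For part (b), a minimal simulation sketch of the stopping rule is given below. It assumes unit-variance Gaussian rewards for arm 1 and that the second arm's mean $\mu_2$ is known, as in the 1-armed setting; the function name `run_tau_alpha` and the particular means used in the comparison are illustrative choices, not from the text:

```python
import math
import random

def run_tau_alpha(mu1, mu2, n, alpha, rng):
    """Simulate the stopping rule tau_alpha on a 1-armed Gaussian bandit:
    play arm 1 (unknown mean mu1, unit variance) until the rule fires,
    then commit to the known arm with mean mu2. Returns total reward."""
    total, reward_sum = 0.0, 0.0
    for t in range(1, n + 1):
        if t >= 2:
            mu_hat = reward_sum / (t - 1)  # empirical mean of arm 1 so far
            bonus = math.sqrt(2.0 * max(0.0, math.log(alpha * n / t)) / (t - 1))
            if mu_hat + bonus <= mu2:
                # Rule fired: play the known arm for the remaining rounds.
                return total + (n - t + 1) * mu2
        r = rng.gauss(mu1, 1.0)
        reward_sum += r
        total += r
    return total

# Compare average regret (arm 2 is optimal here) for a few choices of alpha.
rng = random.Random(7)
n, trials = 1000, 100
for alpha in (0.5, 1.0, 2.0):
    avg_regret = sum(n * 0.5 - run_tau_alpha(0.0, 0.5, n, alpha, rng)
                     for _ in range(trials)) / trials
    print(f"alpha = {alpha}: average regret = {avg_regret:.1f}")
```

A larger $\alpha$ makes the confidence bonus bigger and hence the rule more reluctant to stop, trading exploration cost against the risk of committing to arm 2 too early.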
