Bandit Algorithms

8.4 Exercises

a policy by

\[
A_t =
\begin{cases}
1 & \text{if } \hat{\mu}_1(t-1) + \sqrt{\dfrac{2 \log f(t)}{T_1(t-1)}} \geq 0, \\
2 & \text{otherwise.}
\end{cases}
\tag{8.5}
\]


Suppose that ν = (P₁, P₂) where P₁ = N(μ₁, 1) and P₂ = N(0, 1). Prove that
for the modified policy,

\[
\limsup_{n \to \infty} \frac{R_n(\nu)}{\log(n)} \leq
\begin{cases}
0 & \text{if } \mu_1 \geq 0, \\
\dfrac{2}{\mu_1^2} & \text{if } \mu_1 < 0.
\end{cases}
\]
Hint Follow the analysis for UCB, but carefully adapt the proof by using the
fact that the index of the second arm is always 0.

The strategy proposed in the above exercise is based on the idea that
optimism is only needed to overcome uncertainty in the estimates of an arm's
quality; for 1-armed bandits the mean of the second arm is known in advance,
so its index can simply be fixed at that known value.
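
For concreteness, here is a minimal Python sketch of the decision rule (8.5). The function names and the choice f(t) = 1 + t log²(t) (matching the confidence function used by the asymptotically optimal UCB index in this chapter) are illustrative assumptions, not part of the exercise.

```python
import math


def f(t):
    # Assumed choice: f(t) = 1 + t * log(t)^2, as used by the asymptotically
    # optimal UCB index in this chapter.
    return 1 + t * math.log(max(t, 1)) ** 2


def modified_action(t, mu_hat_1, T_1):
    """Decision rule (8.5): play arm 1 while its UCB index is at least 0,
    the known mean of arm 2; otherwise play arm 2."""
    if T_1 == 0:
        return 1  # arm 1 must be played once before its index is defined
    index_1 = mu_hat_1 + math.sqrt(2 * math.log(f(t)) / T_1)
    return 1 if index_1 >= 0 else 2
```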

8.4 (One-armed bandits (iii)) The purpose of this question is to compare
UCB and the modified version in (8.5).

(a) Implement a simulator for the 1-armed bandit problem and two algorithms:
UCB and the modified version analysed in Exercise 8.3.
(b) Use your simulator to estimate the expected regret of each algorithm for a
horizon of n = 1000 and μ₁ ∈ [−1, 1].
(c) Plot your results with μ₁ on the x-axis and the estimated expected regret on
the y-axis. Don't forget to label the axes and include error bars and a legend.
(d) Explain the results. Why do the curves look the way they do?
(e) In your plot, for what values of μ₁ does the worst-case expected regret for each
algorithm occur? What is the worst-case expected regret for each algorithm?
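
A possible starting point for parts (a)–(c) is sketched below, assuming Gaussian rewards as in Exercise 8.3 and the same f as above. The function names, the number of repetitions and the grid over μ₁ are illustrative choices rather than anything prescribed by the exercise; the printed means and standard errors can be plotted against μ₁ with error bars (e.g. with matplotlib) for part (c).

```python
import math

import numpy as np


def f(t):
    # Assumed: f(t) = 1 + t * log(t)^2, the same function used by the
    # asymptotically optimal UCB index earlier in the chapter.
    return 1 + t * math.log(max(t, 1)) ** 2


def run_one_armed(mu1, n, modified, rng):
    """Simulate one run on nu = (N(mu1, 1), N(0, 1)) and return the regret.

    Arm 0 below is the unknown arm with mean mu1; arm 1 has known mean 0.
    modified=False runs plain UCB; modified=True runs the rule (8.5).
    """
    sums = np.zeros(2)    # sum of observed rewards per arm
    counts = np.zeros(2)  # number of plays per arm
    for t in range(1, n + 1):
        if counts[0] == 0:
            a = 0  # both policies try the unknown arm once first
        elif modified:
            index0 = sums[0] / counts[0] + math.sqrt(2 * math.log(f(t)) / counts[0])
            a = 0 if index0 >= 0 else 1
        elif counts[1] == 0:
            a = 1
        else:
            index = sums / counts + np.sqrt(2 * math.log(f(t)) / counts)
            a = int(np.argmax(index))
        reward = rng.normal(mu1 if a == 0 else 0.0, 1.0)
        sums[a] += reward
        counts[a] += 1
    # The expected regret is sum_i Delta_i * E[T_i(n)]; averaging this
    # quantity over runs gives a lower-variance estimate than summing rewards.
    mu_star = max(mu1, 0.0)
    gaps = np.array([mu_star - mu1, mu_star])
    return float(gaps @ counts)


def estimate_regret(mu1, n=1000, repetitions=200, modified=False, seed=0):
    """Return the estimated expected regret and its standard error."""
    rng = np.random.default_rng(seed)
    runs = [run_one_armed(mu1, n, modified, rng) for _ in range(repetitions)]
    return float(np.mean(runs)), float(np.std(runs) / math.sqrt(repetitions))


if __name__ == "__main__":
    # Sweep mu1 over [-1, 1]; plot the results with error bars for part (c).
    for mu1 in np.linspace(-1.0, 1.0, 21):
        ucb_mean, ucb_se = estimate_regret(mu1, modified=False)
        mod_mean, mod_se = estimate_regret(mu1, modified=True)
        print(f"mu1={mu1:+.2f}  UCB {ucb_mean:7.2f} ± {ucb_se:.2f}   "
              f"modified {mod_mean:7.2f} ± {mod_se:.2f}")
```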


8.5 (Different subgaussian constants) Let σ² ∈ [0, ∞)^k be known and
suppose that the reward is X_t ∼ N(μ_{A_t}, σ²_{A_t}). Design an algorithm (that depends
on σ²) for which the asymptotic regret is

\[
\limsup_{n \to \infty} \frac{R_n}{\log(n)} = \sum_{i : \Delta_i > 0} \frac{2 \sigma_i^2}{\Delta_i}.
\]
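
One natural candidate is UCB with the exploration bonus rescaled by the known standard deviation of each arm; proving that it achieves the stated asymptotic regret is the content of the exercise. The sketch below is only a suggestion, with illustrative function names and the same assumed f as above.

```python
import math

import numpy as np


def f(t):
    # Assumed: the same f(t) = 1 + t * log(t)^2 as in the chapter's UCB index.
    return 1 + t * math.log(max(t, 1)) ** 2


def variance_aware_ucb_action(t, sums, counts, sigma2):
    """Pick an arm by the index  mu_hat_i + sqrt(2 * sigma_i^2 * log f(t) / T_i).

    sums, counts and sigma2 are arrays of length k (reward sums, play counts
    and the known variances); an unplayed arm gets an infinite index so that
    every arm is tried at least once.
    """
    index = np.full(len(counts), np.inf)
    played = counts > 0
    index[played] = (sums[played] / counts[played]
                     + np.sqrt(2.0 * sigma2[played] * math.log(f(t)) / counts[played]))
    return int(np.argmax(index))
```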
