Bandit Algorithms

11.7 Exercises

and define a 2-armed adversarial bandit in terms of its losses by
\[
y_{t1} = \begin{cases} 0 & \text{if } t \le n/2 \\ 1 & \text{otherwise} \end{cases}
\qquad\text{and}\qquad
y_{t2} = \begin{cases} \alpha & \text{if } t \le n/2 \\ 0 & \text{otherwise.} \end{cases}
\]
For simplicity you may assume that $n$ is even.

(a) Define a sequence of real-valued functions $q_1, \dots, q_n$ on the domain $[1/4, 1/2]$ inductively by $q_0(\alpha) = 1/2$ and
\[
q_{s+1}(\alpha) = \frac{q_s(\alpha) \exp(-\eta\alpha/q_s(\alpha))}{1 - q_s(\alpha) + q_s(\alpha) \exp(-\eta\alpha/q_s(\alpha))}.
\]
Show for $t \le 1 + n/2$ that $P_{t2} = q_{T_2(t-1)}(\alpha)$, where $T_2(t) = \sum_{s=1}^{t} \mathbb{1}(A_s = 2)$.
(b) Show that for sufficiently large $n$ there exists an $\alpha \in [1/4, 1/2]$ and $s \in \mathbb{N}$ such that
\[
q_s(\alpha) = \frac{1}{8n}
\qquad\text{and}\qquad
\sum_{u=1}^{s-1} \frac{1}{q_u(\alpha)} \le \frac{n}{8}.
\]
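The recursion from part (a) is easy to iterate numerically, which helps build intuition for part (b). The sketch below is an illustration, not a solution: the horizon $n$, learning rate $\eta$ and the fixed $\alpha = 1/4$ are assumptions, and exact equality $q_s(\alpha) = 1/(8n)$ would require tuning $\alpha$, which the code does not attempt. It simply iterates until $q_s$ first falls below $1/(8n)$ while accumulating the reciprocals.

```python
import numpy as np

def q_sequence(alpha, eta, stop):
    """Iterate q_{s+1} = q e / (1 - q + q e) with e = exp(-eta*alpha/q),
    starting from q_0 = 1/2, until q_s <= stop.
    Returns (s, q_s, sum_{u=0}^{s-1} 1/q_u); the sum differs from the
    sum over u = 1,...,s-1 in part (b) only by the 1/q_0 = 2 term."""
    q, inv_sum, s = 0.5, 0.0, 0
    while q > stop:
        inv_sum += 1.0 / q                 # accumulate 1/q_u before stepping
        e = np.exp(-eta * alpha / q)
        q = q * e / (1.0 - q + q * e)      # the recursion from part (a)
        s += 1
    return s, q, inv_sum

n = 100_000                                # assumed horizon, as in part (f)
eta = np.sqrt(2 * np.log(2) / (2 * n))     # an assumed 'sensible' learning rate
s, q_s, inv_sum = q_sequence(0.25, eta, stop=1 / (8 * n))
```

The sequence is strictly decreasing (since $\exp(-\eta\alpha/q) < 1$), so the loop terminates; once $\eta\alpha/q \gg 1$ the decay becomes very fast.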


(c) Prove that $P(T_2(n/2) \ge s + 1) \ge 1/65$.
(d) Prove that $P(\hat R_n \ge n/4) \ge (1 - n\exp(-\eta n)/2)/65$.
(e) The previous part shows that the regret is linear with constant probability for sufficiently large $n$. On the other hand, a dubious application of Markov's inequality and Theorem 11.1 shows that
\[
P(\hat R_n \ge n/4) \le \frac{4\,\mathbb{E}[\hat R_n]}{n} = O(n^{-1/2}).
\]
Explain the apparent contradiction.
(f) Find a choice of $\alpha \in [1/4, 1/2]$ that satisfies the conditions of the theorem for $n = 10^5$ and some sensible learning rate. Then demonstrate empirically that Exp3 indeed suffers from linear regret with constant probability on this problem.
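The experiment in part (f) can be sketched as follows. This is a minimal illustration rather than a full solution: for speed it uses a much shorter horizon than $n = 10^5$, and the choices $\alpha = 1/4$ and the learning rate are assumptions, so the observed fraction of runs with linear regret will vary with these parameters.

```python
import numpy as np

def exp3_regret(n, alpha, eta, rng):
    """Run Exp3 on the two-armed loss sequence defined above and return the
    regret relative to the best fixed arm in hindsight."""
    L_hat = np.zeros(2)          # importance-weighted cumulative loss estimates
    total_loss = 0.0
    for t in range(1, n + 1):
        # losses: y_t1 = 0 then 1; y_t2 = alpha then 0
        y = np.array([0.0, alpha]) if t <= n // 2 else np.array([1.0, 0.0])
        w = np.exp(-eta * (L_hat - L_hat.min()))   # stabilised Exp3 weights
        p = w / w.sum()
        a = rng.choice(2, p=p)
        total_loss += y[a]
        L_hat[a] += y[a] / p[a]  # unbiased estimator for the played arm only
    # best fixed arm: arm 1 loses n/2 in total, arm 2 loses alpha * n/2
    best_fixed = min(n - n // 2, alpha * (n // 2))
    return total_loss - best_fixed

n, alpha = 2000, 0.25
eta = np.sqrt(2 * np.log(2) / (2 * n))
regrets = [exp3_regret(n, alpha, eta, np.random.default_rng(seed))
           for seed in range(20)]
frac_linear = sum(r >= n / 4 for r in regrets) / len(regrets)
```

Plotting a histogram of `regrets` over many seeds makes the two modes (small regret vs. linear regret) visible when the parameters satisfy the conditions of the exercise.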

11.6 (Gumbel trick) Let $a_1, \dots, a_k$ be positive real values and $U_1, \dots, U_k$ be a sequence of independent and identically distributed uniform random variables on $[0,1]$. Then let $G_i = -\log(-\log(U_i))$, which follows a standard Gumbel distribution. Prove that
\[
P\Big(\log(a_i) + G_i = \max_{j \in [k]} \big(\log(a_j) + G_j\big)\Big) = \frac{a_i}{\sum_{j=1}^{k} a_j}.
\]
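The identity can be checked empirically before proving it. A quick sketch, where the values in `a` and the sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([1.0, 2.0, 3.0])
m = 200_000
U = rng.uniform(size=(m, a.size))
G = -np.log(-np.log(U))                  # standard Gumbel perturbations
winners = np.argmax(np.log(a) + G, axis=1)
freq = np.bincount(winners, minlength=a.size) / m
target = a / a.sum()                     # (1/6, 2/6, 3/6)
```

With 200,000 samples the empirical frequencies `freq` match $a_i / \sum_j a_j$ to about two decimal places.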


11.7 (Exp3 as follow the perturbed leader) Let $(Z_{ti})_{t,i}$ be a collection of independent and identically distributed random variables. The follow the perturbed leader algorithm chooses
\[
A_t = \operatorname*{argmax}_{i \in [k]} \left( Z_{ti} - \eta \sum_{s=1}^{t-1} \hat Y_{si} \right).
\]
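By the Gumbel trick of Exercise 11.6, if the $Z_{ti}$ are standard Gumbel then the perturbed-leader choice is distributed exactly as Exp3's sampling distribution $P_{ti} \propto \exp(-\eta \sum_{s<t} \hat Y_{si})$. A quick numerical check of this correspondence (the loss estimates and $\eta$ below are arbitrary stand-ins for a single round):

```python
import numpy as np

rng = np.random.default_rng(1)
k, eta = 4, 0.1
L_hat = rng.uniform(0.0, 10.0, size=k)           # stand-in cumulative loss estimates
m = 200_000
Z = -np.log(-np.log(rng.uniform(size=(m, k))))   # standard Gumbel perturbations
choices = np.argmax(Z - eta * L_hat, axis=1)     # follow the perturbed leader
freq = np.bincount(choices, minlength=k) / m
p_exp3 = np.exp(-eta * L_hat)
p_exp3 /= p_exp3.sum()                           # Exp3 sampling distribution
```

The empirical arm frequencies `freq` agree with the softmax probabilities `p_exp3` up to Monte Carlo error.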
