Bandit Algorithms

11.7 Exercises

and define a 2-armed adversarial bandit in terms of its losses by
\[
y_{t1} = \begin{cases} 0 & \text{if } t \le n/2 \\ 1 & \text{otherwise} \end{cases}
\qquad\text{and}\qquad
y_{t2} = \begin{cases} \alpha & \text{if } t \le n/2 \\ 0 & \text{otherwise.} \end{cases}
\]
For simplicity you may assume that $n$ is even.

(a) Define a sequence of real-valued functions $q_1, \dots, q_n$ on the domain $[1/4, 1/2]$ inductively by $q_0(\alpha) = 1/2$ and
\[
q_{s+1}(\alpha) = \frac{q_s(\alpha) \exp(-\eta\alpha/q_s(\alpha))}{1 - q_s(\alpha) + q_s(\alpha) \exp(-\eta\alpha/q_s(\alpha))}.
\]
Show for $t \le 1 + n/2$ that $P_{t2} = q_{T_2(t-1)}(\alpha)$, where $T_2(t) = \sum_{s=1}^{t} \mathbb{1}(A_s = 2)$.
(b) Show that for sufficiently large $n$ there exists an $\alpha \in [1/4, 1/2]$ and $s \in \mathbb{N}$ such that
\[
q_s(\alpha) = \frac{1}{8n}
\qquad\text{and}\qquad
\sum_{u=1}^{s-1} \frac{1}{q_u(\alpha)} \le \frac{n}{8}.
\]
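The recursion from part (a) is easy to iterate numerically, which helps build intuition for part (b). The sketch below is an illustration, not a solution: the horizon $n$, learning rate $\eta$ and the fixed $\alpha = 1/4$ are assumptions, and exact equality $q_s(\alpha) = 1/(8n)$ would require tuning $\alpha$, which the code does not attempt. It simply iterates until $q_s$ first falls below $1/(8n)$ while accumulating the reciprocals.

```python
import numpy as np

def q_sequence(alpha, eta, stop):
    """Iterate q_{s+1} = q e / (1 - q + q e) with e = exp(-eta*alpha/q),
    starting from q_0 = 1/2, until q_s <= stop.
    Returns (s, q_s, sum_{u=0}^{s-1} 1/q_u); the sum differs from the
    sum over u = 1,...,s-1 in part (b) only by the 1/q_0 = 2 term."""
    q, inv_sum, s = 0.5, 0.0, 0
    while q > stop:
        inv_sum += 1.0 / q                 # accumulate 1/q_u before stepping
        e = np.exp(-eta * alpha / q)
        q = q * e / (1.0 - q + q * e)      # the recursion from part (a)
        s += 1
    return s, q, inv_sum

n = 100_000                                # assumed horizon, as in part (f)
eta = np.sqrt(2 * np.log(2) / (2 * n))     # an assumed 'sensible' learning rate
s, q_s, inv_sum = q_sequence(0.25, eta, stop=1 / (8 * n))
```

The sequence is strictly decreasing (since $\exp(-\eta\alpha/q) < 1$), so the loop terminates; once $\eta\alpha/q \gg 1$ the decay becomes very fast.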


(c) Prove that $P(T_2(n/2) \ge s + 1) \ge 1/65$.
(d) Prove that $P(\hat R_n \ge n/4) \ge (1 - n\exp(-\eta n)/2)/65$.
(e) The previous part shows that the regret is linear with constant probability for sufficiently large $n$. On the other hand, a dubious application of Markov's inequality and Theorem 11.1 shows that
\[
P(\hat R_n \ge n/4) \le \frac{4\,\mathbb{E}[\hat R_n]}{n} = O(n^{-1/2}).
\]
Explain the apparent contradiction.
(f) Find a choice of $\alpha \in [1/4, 1/2]$ that satisfies the conditions of the theorem for $n = 10^5$ and some sensible learning rate. Then demonstrate empirically that Exp3 indeed suffers from linear regret with constant probability on this problem.
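The experiment in part (f) can be sketched as follows. This is a minimal illustration rather than a full solution: for speed it uses a much shorter horizon than $n = 10^5$, and the choices $\alpha = 1/4$ and the learning rate are assumptions, so the observed fraction of runs with linear regret will vary with these parameters.

```python
import numpy as np

def exp3_regret(n, alpha, eta, rng):
    """Run Exp3 on the two-armed loss sequence defined above and return the
    regret relative to the best fixed arm in hindsight."""
    L_hat = np.zeros(2)          # importance-weighted cumulative loss estimates
    total_loss = 0.0
    for t in range(1, n + 1):
        # losses: y_t1 = 0 then 1; y_t2 = alpha then 0
        y = np.array([0.0, alpha]) if t <= n // 2 else np.array([1.0, 0.0])
        w = np.exp(-eta * (L_hat - L_hat.min()))   # stabilised Exp3 weights
        p = w / w.sum()
        a = rng.choice(2, p=p)
        total_loss += y[a]
        L_hat[a] += y[a] / p[a]  # unbiased estimator for the played arm only
    # best fixed arm: arm 1 loses n/2 in total, arm 2 loses alpha * n/2
    best_fixed = min(n - n // 2, alpha * (n // 2))
    return total_loss - best_fixed

n, alpha = 2000, 0.25
eta = np.sqrt(2 * np.log(2) / (2 * n))
regrets = [exp3_regret(n, alpha, eta, np.random.default_rng(seed))
           for seed in range(20)]
frac_linear = sum(r >= n / 4 for r in regrets) / len(regrets)
```

Plotting a histogram of `regrets` over many seeds makes the two modes (small regret vs. linear regret) visible when the parameters satisfy the conditions of the exercise.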

11.6 (Gumbel trick) Let $a_1, \dots, a_k$ be positive real values and $U_1, \dots, U_k$ be a sequence of independent and identically distributed uniform random variables on $[0,1]$. Then let $G_i = -\log(-\log(U_i))$, which follows a standard Gumbel distribution. Prove that
\[
P\Big(\log(a_i) + G_i = \max_{j \in [k]} \big(\log(a_j) + G_j\big)\Big) = \frac{a_i}{\sum_{j=1}^{k} a_j}.
\]
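The identity can be checked empirically before proving it. A quick sketch, where the values in `a` and the sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([1.0, 2.0, 3.0])
m = 200_000
U = rng.uniform(size=(m, a.size))
G = -np.log(-np.log(U))                  # standard Gumbel perturbations
winners = np.argmax(np.log(a) + G, axis=1)
freq = np.bincount(winners, minlength=a.size) / m
target = a / a.sum()                     # (1/6, 2/6, 3/6)
```

With 200,000 samples the empirical frequencies `freq` match $a_i / \sum_j a_j$ to about two decimal places.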


11.7 (Exp3 as follow the perturbed leader) Let $(Z_{ti})_{t,i}$ be a collection of independent and identically distributed random variables. The follow the perturbed leader algorithm chooses
\[
A_t = \operatorname*{argmax}_{i \in [k]} \left( Z_{ti} - \eta \sum_{s=1}^{t-1} \hat Y_{si} \right).
\]
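By the Gumbel trick of Exercise 11.6, if the $Z_{ti}$ are standard Gumbel then the perturbed-leader choice is distributed exactly as Exp3's sampling distribution $P_{ti} \propto \exp(-\eta \sum_{s<t} \hat Y_{si})$. A quick numerical check of this correspondence (the loss estimates and $\eta$ below are arbitrary stand-ins for a single round):

```python
import numpy as np

rng = np.random.default_rng(1)
k, eta = 4, 0.1
L_hat = rng.uniform(0.0, 10.0, size=k)           # stand-in cumulative loss estimates
m = 200_000
Z = -np.log(-np.log(rng.uniform(size=(m, k))))   # standard Gumbel perturbations
choices = np.argmax(Z - eta * L_hat, axis=1)     # follow the perturbed leader
freq = np.bincount(choices, minlength=k) / m
p_exp3 = np.exp(-eta * L_hat)
p_exp3 /= p_exp3.sum()                           # Exp3 sampling distribution
```

The empirical arm frequencies `freq` agree with the softmax probabilities `p_exp3` up to Monte Carlo error.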
