Bandit Algorithms

11.7 Exercises

On most computers rand() will return a pseudo-random number, and since there are only finitely many floating point numbers the resulting distribution will not really be uniform on [0,1]. Thinking about these issues is a worthy endeavour, and sometimes it really matters. For this exercise, however, you may ignore these issues.
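For instance, assuming that CPython's random.random() returns integer multiples of $2^{-53}$ (an implementation detail, not something the exercise relies on), the finiteness is easy to observe:

```python
import random

# A pseudo-random "uniform" sample can take only finitely many distinct
# values. CPython's random.random(), for example, returns multiples of
# 2**-53 (an implementation detail assumed here), so at most 2**53
# distinct floats ever appear -- far from a true continuum on [0, 1].
samples = [random.random() for _ in range(1000)]
assert all((x * 2**53).is_integer() for x in samples)
print(f"all {len(samples)} samples are integer multiples of 2**-53")
```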

11.2 (Linear regret for deterministic policies) Show that for any deterministic policy $\pi$ there exists an environment $x \in [0,1]^{n \times k}$ such that $R_n(\pi, x) \ge n(1 - 1/k)$. What does your result say about the policies designed in Part II?
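The usual construction may help to visualise the claim: because a deterministic policy's action in round $t$ is a fixed function of its past observations, an adversary can simulate the policy in advance, giving reward 0 to the arm it will choose and 1 to every other arm. The sketch below is an illustration only; the function names and signatures are made up.

```python
import numpy as np

def adversarial_rewards(policy, n, k):
    """Build x in [0,1]^{n x k} that defeats a deterministic policy.

    `policy(history)` maps the list of (action, reward) pairs observed so
    far to the next action in {0, ..., k-1}; determinism is what lets the
    adversary simulate it in advance.
    """
    x = np.ones((n, k))
    history = []
    for t in range(n):
        a = policy(history)
        x[t, a] = 0.0             # the arm the policy will pick pays nothing
        history.append((a, 0.0))  # the policy therefore observes reward 0
    return x

def regret(policy, x):
    n, k = x.shape
    history, reward = [], 0.0
    for t in range(n):
        a = policy(history)
        reward += x[t, a]
        history.append((a, x[t, a]))
    return x.sum(axis=0).max() - reward

# Example: "play arm (t mod k)" is deterministic, so on the constructed
# environment its regret is at least n * (1 - 1/k).
n, k = 1000, 5
round_robin = lambda history: len(history) % k
x = adversarial_rewards(round_robin, n, k)
print(regret(round_robin, x), n * (1 - 1 / k))
```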


11.3 (Alternative regret definition) Suppose we had defined the regret by
$$
R^{\mathrm{track}}_n(\pi, x) = \mathbb{E}\left[\sum_{t=1}^n \max_{i \in [k]} x_{ti} - \sum_{t=1}^n x_{t A_t}\right].
$$


At first sight this definition seems like the right thing because it measures what you actually care about. Unfortunately, however, it gives the adversary too much power. Show that for any policy $\pi$ (randomised or not) there exists an $x \in [0,1]^{n \times k}$ such that


$$
R^{\mathrm{track}}_n(\pi, x) \ge n\left(1 - \frac{1}{k}\right).
$$
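As an informal illustration of why the adversary is too powerful under this definition (not a substitute for the proof), consider environments in which exactly one arm, placed at random, pays 1 in each round: the per-round maximum is always 1, yet no policy can track which arm it is, so the expected tracking regret is about $n(1 - 1/k)$; an averaging argument then yields a specific $x$ achieving it. A simulation sketch with made-up helper names:

```python
import numpy as np

rng = np.random.default_rng(0)

def tracking_regret(policy, x):
    """R^track_n = sum_t max_i x_{ti} - sum_t x_{t, A_t} for one run."""
    n, k = x.shape
    history, reward = [], 0.0
    for t in range(n):
        a = policy(history, k)
        reward += x[t, a]
        history.append((a, x[t, a]))
    return x.max(axis=1).sum() - reward

def uniform_policy(history, k):
    return rng.integers(k)

# Environments where exactly one (randomly placed) arm pays 1 per round:
# the per-round maximum is always 1, but the policy cannot track it, so
# the average tracking regret is close to n * (1 - 1/k).
n, k, trials = 500, 5, 50
estimates = []
for _ in range(trials):
    x = np.zeros((n, k))
    x[np.arange(n), rng.integers(k, size=n)] = 1.0
    estimates.append(tracking_regret(uniform_policy, x))
print(np.mean(estimates), n * (1 - 1 / k))
```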


11.4 (Unbiased estimators are importance-weighted) Let $P \in \mathcal{P}_{k-1}$ be a probability vector and $A \sim P$. Suppose $\hat X : [k] \times \mathbb{R} \to \mathbb{R}$ is a function such that for all $x \in \mathbb{R}^k$,
$$
\mathbb{E}[\hat X(A, x_A)] = \sum_{i=1}^k P_i \hat X(i, x_i) = x_1.
$$
Show there exists an $a \in \mathbb{R}^k$ such that $\langle a, P \rangle = 0$ and $\hat X(i, x) = a_i + \mathbb{I}\{i = 1\}\,\frac{x}{P_1}$.
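A quick Monte Carlo sanity check (an illustration only, with arm 1 of the exercise mapped to index 0 in the code) that an estimator of this importance-weighted form is indeed unbiased for $x_1$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Check that Xhat(i, xi) = a_i + I{i = 1} * xi / P_1 with <a, P> = 0
# satisfies E[Xhat(A, x_A)] = x_1 when A ~ P.
k = 4
P = np.array([0.1, 0.2, 0.3, 0.4])
a = np.array([2.0, -1.0, 0.0, 0.0])   # <a, P> = 0.2 - 0.2 = 0
x = np.array([0.7, 0.1, 0.9, 0.3])    # arbitrary reward vector

def xhat(i, xi):
    return a[i] + (xi / P[0] if i == 0 else 0.0)

# Exact expectation and a Monte Carlo estimate should both equal x[0].
exact = sum(P[i] * xhat(i, x[i]) for i in range(k))
A = rng.choice(k, size=200_000, p=P)
mc = np.mean([xhat(i, x[i]) for i in A])
print(exact, mc, x[0])
```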


11.5 (Variance of Exp3) In this exercise you will show that if $\eta \in [n^{-p}, 1]$ for some $p \in (0,1)$, then for sufficiently large $n$ there exists a bandit on which Exp3 has a constant probability of suffering linear regret. We work with losses, so that given a bandit $y \in [0,1]^{n \times k}$ the learner samples $A_t$ from $P_t$ given by
$$
P_{ti} = \frac{\exp\left(-\eta \sum_{s=1}^{t-1} \hat Y_{si}\right)}{\sum_{j=1}^k \exp\left(-\eta \sum_{s=1}^{t-1} \hat Y_{sj}\right)},
$$
where $\hat Y_{ti} = A_{ti} y_{ti} / P_{ti}$. Let $\alpha \in [1/4, 1/2]$ be a constant to be tuned subsequently.
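For concreteness, here is a minimal sketch of the loss-based Exp3 rule above, reading $A_{ti}$ as the indicator $\mathbb{I}\{A_t = i\}$; the function names and the smoke-test bandit are illustrative, not part of the exercise.

```python
import numpy as np

def exp3_losses(y, eta, rng=None):
    """Run loss-based Exp3 on a loss matrix y in [0,1]^{n x k}.

    Samples A_t from P_t proportional to exp(-eta * cumulative estimated
    losses) and uses the importance-weighted estimate
    Yhat_{ti} = I{A_t = i} * y_{ti} / P_{ti}.
    Returns the chosen actions and the total incurred loss.
    """
    rng = rng or np.random.default_rng()
    n, k = y.shape
    L_hat = np.zeros(k)                 # cumulative estimated losses
    actions, total_loss = [], 0.0
    for t in range(n):
        logits = -eta * L_hat
        logits -= logits.max()          # numerical stability
        P = np.exp(logits)
        P /= P.sum()
        a = rng.choice(k, p=P)
        actions.append(a)
        total_loss += y[t, a]
        L_hat[a] += y[t, a] / P[a]      # importance-weighted loss estimate
    return actions, total_loss

# Tiny smoke test on a bandit where arm 0 has the smallest losses.
rng = np.random.default_rng(2)
n, k = 2000, 3
y = rng.uniform(0.4, 1.0, size=(n, k))
y[:, 0] = rng.uniform(0.0, 0.3, size=n)
_, loss = exp3_losses(y, eta=np.sqrt(np.log(k) / (n * k)), rng=rng)
print(loss, y[:, 0].sum())              # Exp3's loss vs the best arm's loss
```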
