Bandit Algorithms

11.7 Exercises

On most computers rand() will return a pseudo-random number, and since there are only finitely many floating point numbers the resulting distribution will not really be uniform on [0,1]. Thinking about these issues is a worthy endeavour, and sometimes it really matters. For this exercise, however, you may ignore these issues.
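For instance, assuming that CPython's random.random() returns integer multiples of $2^{-53}$ (an implementation detail, not something the exercise relies on), the finiteness is easy to observe:

```python
import random

# A pseudo-random "uniform" sample can take only finitely many distinct
# values. CPython's random.random(), for example, returns multiples of
# 2**-53 (an implementation detail assumed here), so at most 2**53
# distinct floats ever appear -- far from a true continuum on [0, 1].
samples = [random.random() for _ in range(1000)]
assert all((x * 2**53).is_integer() for x in samples)
print(f"all {len(samples)} samples are integer multiples of 2**-53")
```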

11.2 (Linear regret for deterministic policies) Show that for any deterministic policy $\pi$ there exists an environment $x \in [0,1]^{n \times k}$ such that $R_n(\pi, x) \ge n(1 - 1/k)$. What does your result say about the policies designed in Part II?
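The usual construction may help to visualise the claim: because a deterministic policy's action in round $t$ is a fixed function of its past observations, an adversary can simulate the policy in advance, giving reward 0 to the arm it will choose and 1 to every other arm. The sketch below is an illustration only; the function names and signatures are made up.

```python
import numpy as np

def adversarial_rewards(policy, n, k):
    """Build x in [0,1]^{n x k} that defeats a deterministic policy.

    `policy(history)` maps the list of (action, reward) pairs observed so
    far to the next action in {0, ..., k-1}; determinism is what lets the
    adversary simulate it in advance.
    """
    x = np.ones((n, k))
    history = []
    for t in range(n):
        a = policy(history)
        x[t, a] = 0.0             # the arm the policy will pick pays nothing
        history.append((a, 0.0))  # the policy therefore observes reward 0
    return x

def regret(policy, x):
    n, k = x.shape
    history, reward = [], 0.0
    for t in range(n):
        a = policy(history)
        reward += x[t, a]
        history.append((a, x[t, a]))
    return x.sum(axis=0).max() - reward

# Example: "play arm (t mod k)" is deterministic, so on the constructed
# environment its regret is at least n * (1 - 1/k).
n, k = 1000, 5
round_robin = lambda history: len(history) % k
x = adversarial_rewards(round_robin, n, k)
print(regret(round_robin, x), n * (1 - 1 / k))
```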


11.3 (Alternative regret definition) Suppose we had defined the regret by
$$
R^{\mathrm{track}}_n(\pi, x) = \mathbb{E}\left[\sum_{t=1}^n \max_{i \in [k]} x_{ti} - \sum_{t=1}^n x_{t A_t}\right].
$$


At first sight this definition seems like the right thing because it measures what you actually care about. Unfortunately, however, it gives the adversary too much power. Show that for any policy $\pi$ (randomised or not) there exists an $x \in [0,1]^{n \times k}$ such that


$$
R^{\mathrm{track}}_n(\pi, x) \ge n\left(1 - \frac{1}{k}\right).
$$
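As an informal illustration of why the adversary is too powerful under this definition (not a substitute for the proof), consider environments in which exactly one arm, placed at random, pays 1 in each round: the per-round maximum is always 1, yet no policy can track which arm it is, so the expected tracking regret is about $n(1 - 1/k)$; an averaging argument then yields a specific $x$ achieving it. A simulation sketch with made-up helper names:

```python
import numpy as np

rng = np.random.default_rng(0)

def tracking_regret(policy, x):
    """R^track_n = sum_t max_i x_{ti} - sum_t x_{t, A_t} for one run."""
    n, k = x.shape
    history, reward = [], 0.0
    for t in range(n):
        a = policy(history, k)
        reward += x[t, a]
        history.append((a, x[t, a]))
    return x.max(axis=1).sum() - reward

def uniform_policy(history, k):
    return rng.integers(k)

# Environments where exactly one (randomly placed) arm pays 1 per round:
# the per-round maximum is always 1, but the policy cannot track it, so
# the average tracking regret is close to n * (1 - 1/k).
n, k, trials = 500, 5, 50
estimates = []
for _ in range(trials):
    x = np.zeros((n, k))
    x[np.arange(n), rng.integers(k, size=n)] = 1.0
    estimates.append(tracking_regret(uniform_policy, x))
print(np.mean(estimates), n * (1 - 1 / k))
```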


11.4 (Unbiased estimators are importance-weighted) Let $P \in \mathcal{P}_{k-1}$ be a probability vector and $A \sim P$. Suppose $\hat X : [k] \times \mathbb{R} \to \mathbb{R}$ is a function such that for all $x \in \mathbb{R}^k$,
$$
\mathbb{E}[\hat X(A, x_A)] = \sum_{i=1}^k P_i \hat X(i, x_i) = x_1.
$$
Show there exists an $a \in \mathbb{R}^k$ such that $\langle a, P \rangle = 0$ and $\hat X(i, x) = a_i + \mathbb{I}\{i = 1\}\,\frac{x}{P_1}$.
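A quick Monte Carlo sanity check (an illustration only, with arm 1 of the exercise mapped to index 0 in the code) that an estimator of this importance-weighted form is indeed unbiased for $x_1$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Check that Xhat(i, xi) = a_i + I{i = 1} * xi / P_1 with <a, P> = 0
# satisfies E[Xhat(A, x_A)] = x_1 when A ~ P.
k = 4
P = np.array([0.1, 0.2, 0.3, 0.4])
a = np.array([2.0, -1.0, 0.0, 0.0])   # <a, P> = 0.2 - 0.2 = 0
x = np.array([0.7, 0.1, 0.9, 0.3])    # arbitrary reward vector

def xhat(i, xi):
    return a[i] + (xi / P[0] if i == 0 else 0.0)

# Exact expectation and a Monte Carlo estimate should both equal x[0].
exact = sum(P[i] * xhat(i, x[i]) for i in range(k))
A = rng.choice(k, size=200_000, p=P)
mc = np.mean([xhat(i, x[i]) for i in A])
print(exact, mc, x[0])
```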


11.5 (Variance of Exp3) In this exercise you will show that if $\eta \in [n^{-p}, 1]$ for some $p \in (0,1)$, then for sufficiently large $n$ there exists a bandit on which Exp3 has a constant probability of suffering linear regret. We work with losses, so that given a bandit $y \in [0,1]^{n \times k}$ the learner samples $A_t$ from $P_t$ given by
$$
P_{ti} = \frac{\exp\left(-\eta \sum_{s=1}^{t-1} \hat Y_{si}\right)}{\sum_{j=1}^k \exp\left(-\eta \sum_{s=1}^{t-1} \hat Y_{sj}\right)},
$$
where $\hat Y_{ti} = A_{ti} y_{ti} / P_{ti}$. Let $\alpha \in [1/4, 1/2]$ be a constant to be tuned subsequently.
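For concreteness, here is a minimal sketch of the loss-based Exp3 rule above, reading $A_{ti}$ as the indicator $\mathbb{I}\{A_t = i\}$; the function names and the smoke-test bandit are illustrative, not part of the exercise.

```python
import numpy as np

def exp3_losses(y, eta, rng=None):
    """Run loss-based Exp3 on a loss matrix y in [0,1]^{n x k}.

    Samples A_t from P_t proportional to exp(-eta * cumulative estimated
    losses) and uses the importance-weighted estimate
    Yhat_{ti} = I{A_t = i} * y_{ti} / P_{ti}.
    Returns the chosen actions and the total incurred loss.
    """
    rng = rng or np.random.default_rng()
    n, k = y.shape
    L_hat = np.zeros(k)                 # cumulative estimated losses
    actions, total_loss = [], 0.0
    for t in range(n):
        logits = -eta * L_hat
        logits -= logits.max()          # numerical stability
        P = np.exp(logits)
        P /= P.sum()
        a = rng.choice(k, p=P)
        actions.append(a)
        total_loss += y[t, a]
        L_hat[a] += y[t, a] / P[a]      # importance-weighted loss estimate
    return actions, total_loss

# Tiny smoke test on a bandit where arm 0 has the smallest losses.
rng = np.random.default_rng(2)
n, k = 2000, 3
y = rng.uniform(0.4, 1.0, size=(n, k))
y[:, 0] = rng.uniform(0.0, 0.3, size=n)
_, loss = exp3_losses(y, eta=np.sqrt(np.log(k) / (n * k)), rng=rng)
print(loss, y[:, 0].sum())              # Exp3's loss vs the best arm's loss
```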
