Bandit Algorithms

(Jeff_L) #1
19.6 Exercises 236

19.2(Elliptical potentials) LetV 0 =λIandx 1 ,...,xn∈Rdbe a sequence
of vectors with‖xt‖ 2 ≤Lfor allt∈[n]. Then letVt=V 0 +


∑t
s=1xsx

>sand show
that the number of times‖xt‖Vt−− 11 ≥1 is at most

3 d
log(2)

log

(


1 +


L^2


λlog(2)

)


.


The proof of Theorem 19.2 depended on part (a) of Assumption 19.1, which
asserts that the mean rewards are bounded by 1. Suppose we replace this
assumption with the relaxation that there exists aB >0 such that

max
t∈[n]

sup
a,b∈At

〈a−b,θ∗〉≤B.

Then the previous exercise allows you to bound the number of rounds when
‖xt‖Vt−− 11 ≥1 and in these rounds the naive bound ofrt≤Bis used. For the
remaining rounds the analysis of Theorem 19.2 goes through unaltered. As a
consequence we see that the dependence onBis an additive constant term
that does not grow with the horizon.

19.3(Lipschitz reward functions) Consider the finite-armed stochastic
contextual setting of Section 19.1 and assume thatC= [0,1] and that the reward
functionsr(·,i) :C →[0,1] areL-Lipschitz:


|r(x,i)−r(y,i)|≤L|x−y| for allx,y∈[0,1],i∈[k].

(a)Construct an algorithm whose regretRnafternrounds isO((Lklogk)^1 /^3 n^2 /^3 ).


(b) Show that the minimax optimal regret is of the order Ω((Lk)^1 /^3 n^2 /^3 ).


(c)Generalize the result to the case whenC = [0,1]dand in the definition
of Lipschitzness we use the Euclidean norm. Show the dependence on the
dimension in the lower and upper bounds. Discuss the influence of the choice
of the norm.

Hint Consider discretizingC. Alternatively, use Exp4.
Free download pdf