Bandit Algorithms


29.6 Bibliographic remarks


Linear bandits on the sphere with parameter noise have been studied by Carpentier
and Munos [2012]. However, they consider the case where the action set is the
sphere and the components of the noise are independent, so that the reward is
$X_t = \langle A_t, \theta + \eta_t \rangle$, where the coordinates of $\eta_t \in \mathbb{R}^d$ are independent with unit
variance. In this case the predictable variation is $\mathbb{V}[X_t \mid A_t] = \sum_{i=1}^d A_{ti}^2 = 1$ for
all actions $A_t$, and the parameter noise is equivalent to the standard model. We
are not aware of any systematic studies of parameter noise in the stochastic
setting. With only a few exceptions, the impact on the regret of the action set
and the adversary's choices is not well understood beyond the case where $\mathcal{A}$ is an
$\ell_p$-ball. A variety of lower bounds illustrating the complications are given by
Shamir [2015]. Perhaps the most informative is the observation that obtaining
$O(\sqrt{dn})$ regret is not possible when $\mathcal{A} = \{a + x : \|x\|_2 \leq 1\}$ is a shifted unit ball
with $a = (2, 0, \ldots, 0)$, which also follows from our reduction in Section 29.2.
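The equivalence noted above can be checked numerically: fixing any action on the unit sphere and sampling parameter noise with independent unit-variance coordinates, the conditional variance of the reward is $\|a\|_2^2 = 1$. The following sketch illustrates this; the dimension, seed, and sample size are arbitrary choices, not part of the cited result.

```python
import numpy as np

# Illustrative sanity check: on the unit sphere the predictable variation
# V[X_t | A_t] = sum_i A_ti^2 = 1 for every action, so parameter noise with
# independent unit-variance coordinates matches the standard noise model.
# Dimension, seed and sample size are arbitrary.
rng = np.random.default_rng(0)
d = 5

# Fix an arbitrary action on the unit sphere S^{d-1}.
a = rng.normal(size=d)
a /= np.linalg.norm(a)

theta = rng.normal(size=d)            # unknown parameter
eta = rng.normal(size=(100_000, d))   # parameter noise, unit-variance coordinates
x = (theta + eta) @ a                 # rewards X_t = <A_t, theta + eta_t>

# The empirical conditional variance should be close to ||a||_2^2 = 1.
print(x.var())
```

Repeating this with a different action on the sphere gives the same variance, which is exactly why the parameter-noise model collapses to the standard one in this special case.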

29.7 Exercises


29.1 (Reductions) Let $\mathcal{A} \subset \mathbb{R}^d$ be an action set and $\mathcal{L} = \{y \in \mathbb{R}^d :
\sup_{a \in \mathcal{A}} |\langle a, y \rangle| \leq 1\}$. Take an adversarial linear bandit algorithm that enjoys
a worst-case guarantee $B_n$ on its $n$-round expected regret $R_n$ when the adversary
is restricted to playing $\theta_t \in \mathcal{L}$. Show that if this algorithm is used in a stochastic
linear bandit problem with parameter noise, where $\theta_t \sim \nu$ and $\mathrm{Supp}(\nu) \subseteq \mathcal{L}$, then
the expected regret $R_n'$ is still bounded by $B_n$.
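For intuition on why the support condition matters, note that when $\mathcal{A}$ is the unit ball, $\mathcal{L}$ is also the unit ball, and every $\theta_t$ supported on $\mathcal{L}$ produces mean losses $\langle a, \theta_t \rangle \in [-1, 1]$ by Cauchy–Schwarz. The sketch below checks this numerically; it is an illustration of the setup, not a solution to the exercise, and the dimension and sample counts are arbitrary.

```python
import numpy as np

# Illustrative check (not a solution): with A = B_2^d the dual set L is
# also the unit ball, so any theta_t supported on L gives mean losses
# <a, theta_t> in [-1, 1] by Cauchy-Schwarz. Dimension/seed arbitrary.
rng = np.random.default_rng(1)
d = 3

def sample_ball(n):
    """Sample n points from the unit ball by rejection from the cube."""
    pts = []
    while len(pts) < n:
        x = rng.uniform(-1.0, 1.0, size=d)
        if np.linalg.norm(x) <= 1.0:
            pts.append(x)
    return np.array(pts)

actions = sample_ball(200)   # a in A = B_2^d
thetas = sample_ball(200)    # theta_t ~ nu with Supp(nu) inside L
inner = actions @ thetas.T   # all pairwise mean losses <a, theta_t>
print(inner.min(), inner.max())
```

Because every realisable loss sequence under the parameter-noise model is one the restricted adversary could have played, the adversarial guarantee transfers, which is the heart of the reduction the exercise asks for.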

29.2 (Follow the regularized leader for stochastic bandits) Consider
a stochastic linear bandit with $\mathcal{A} = B_2^d$ and loss $Y_t = \langle A_t, \theta \rangle + \eta_t$, where $(\eta_t)_{t=1}^n$
are independent with zero mean and $Y_t \in [-1, 1]$ almost surely. Adapt the proof of
Theorem 28.11 to show that with appropriate tuning the algorithm in Section 28.4
satisfies $R_n \leq C d \sqrt{n \log(n)}$ for a universal constant $C > 0$.

Hint Repeat the analysis in the proof of Theorem 28.11, update the learning
rate and check the bounds on the norms of the estimators.
29.3 (Follow the regularized leader for stochastic bandits (ii))
Repeat the previous exercise using exponential weights or continuous exponential
weights with Kiefer–Wolfowitz exploration, where:

(a) $\mathcal{A}$ is finite.
(b) $\mathcal{A}$ is convex.
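As a warm-up for part (a), the sketch below runs loss-based exponential weights (Exp3-style, with importance-weighted loss estimates) over a finite action set. It deliberately ignores the linear structure and the Kiefer–Wolfowitz exploration design the exercise actually requires, treating each action as an independent arm; the action set, parameter vector, horizon, and learning rate are all illustrative choices.

```python
import numpy as np

# Exp3-style exponential weights over a finite action set, treating each
# action as an arm. This ignores the linear structure and the
# Kiefer-Wolfowitz design the exercise calls for (warm-up sketch only).
# All parameters (d, n, theta, learning rate) are illustrative.
rng = np.random.default_rng(2)
d, n = 2, 5000
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # finite action set
theta = np.array([0.6, -0.2])  # mean losses are <a, theta>
K = len(A)

lr = np.sqrt(2.0 * np.log(K) / (n * K))  # standard Exp3 tuning
L_hat = np.zeros(K)                      # importance-weighted loss totals
total_loss = 0.0
for t in range(n):
    w = np.exp(-lr * (L_hat - L_hat.min()))  # shift for numerical stability
    p = w / w.sum()
    i = rng.choice(K, p=p)
    y = A[i] @ theta + 0.1 * rng.normal()    # noisy observed loss
    total_loss += y
    L_hat[i] += y / p[i]                     # unbiased loss estimate

best_loss = n * (A @ theta).min()            # best fixed action in hindsight
regret = total_loss - best_loss
print(regret / n)                            # per-round regret shrinks with n
```

The full solution replaces the per-arm estimates with least-squares loss estimators built from an exploration distribution, which is where the Kiefer–Wolfowitz theorem enters.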

29.4 (Unrealizable linear bandits) Let $\mathcal{A} \subset \mathbb{R}^d$ and $(\eta_t)_{t=1}^n$ be a sequence
of independent zero-mean random variables, and assume the loss is

$Y_t = \ell(A_t) + \eta_t,$