Bandit Algorithms


29.6 Bibliographic remarks


Linear bandits on the sphere with parameter noise have been studied by Carpentier
and Munos [2012]. However, they consider the case where the action set is the
sphere and the components of the noise are independent, so that the reward is
$X_t = \langle A_t, \theta + \eta_t \rangle$, where the coordinates of $\eta_t \in \mathbb{R}^d$ are independent with unit
variance. In this case the predictable variation is $\mathbb{V}[X_t \mid A_t] = \sum_{i=1}^d A_{ti}^2 = 1$ for
all actions $A_t$, and the parameter noise is equivalent to the standard model. We
are not aware of any systematic studies of parameter noise in the stochastic
setting. With only a few exceptions, the impact on the regret of the action set
and the adversary's choices is not well understood beyond the case where $\mathcal{A}$ is an
$\ell_p$-ball. A variety of lower bounds illustrating the complications are given by
Shamir [2015]. Perhaps the most informative is the observation that obtaining
$O(\sqrt{dn})$ regret is not possible when $\mathcal{A} = \{a + x : \|x\|_2 \leq 1\}$ is a shifted unit ball
with $a = (2, 0, \ldots, 0)$, which also follows from our reduction in Section 29.2.
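The equivalence noted above can be checked numerically: fixing any action on the unit sphere and sampling parameter noise with independent unit-variance coordinates, the conditional variance of the reward is $\|a\|_2^2 = 1$. The following sketch illustrates this; the dimension, seed, and sample size are arbitrary choices, not part of the cited result.

```python
import numpy as np

# Illustrative sanity check: on the unit sphere the predictable variation
# V[X_t | A_t] = sum_i A_ti^2 = 1 for every action, so parameter noise with
# independent unit-variance coordinates matches the standard noise model.
# Dimension, seed and sample size are arbitrary.
rng = np.random.default_rng(0)
d = 5

# Fix an arbitrary action on the unit sphere S^{d-1}.
a = rng.normal(size=d)
a /= np.linalg.norm(a)

theta = rng.normal(size=d)            # unknown parameter
eta = rng.normal(size=(100_000, d))   # parameter noise, unit-variance coordinates
x = (theta + eta) @ a                 # rewards X_t = <A_t, theta + eta_t>

# The empirical conditional variance should be close to ||a||_2^2 = 1.
print(x.var())
```

Repeating this with a different action on the sphere gives the same variance, which is exactly why the parameter-noise model collapses to the standard one in this special case.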

29.7 Exercises


29.1 (Reductions) Let $\mathcal{A} \subset \mathbb{R}^d$ be an action set and $\mathcal{L} = \{y \in \mathbb{R}^d :
\sup_{a \in \mathcal{A}} |\langle a, y \rangle| \leq 1\}$. Take an adversarial linear bandit algorithm that enjoys
a worst-case guarantee $B_n$ on its $n$-round expected regret $R_n$ when the adversary
is restricted to playing $\theta_t \in \mathcal{L}$. Show that if this algorithm is used in a stochastic
linear bandit problem with parameter noise, where $\theta_t \sim \nu$ and $\mathrm{Supp}(\nu) \subseteq \mathcal{L}$, then
the expected regret $R_n'$ is still bounded by $B_n$.
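For intuition on why the support condition matters, note that when $\mathcal{A}$ is the unit ball, $\mathcal{L}$ is also the unit ball, and every $\theta_t$ supported on $\mathcal{L}$ produces mean losses $\langle a, \theta_t \rangle \in [-1, 1]$ by Cauchy–Schwarz. The sketch below checks this numerically; it is an illustration of the setup, not a solution to the exercise, and the dimension and sample counts are arbitrary.

```python
import numpy as np

# Illustrative check (not a solution): with A = B_2^d the dual set L is
# also the unit ball, so any theta_t supported on L gives mean losses
# <a, theta_t> in [-1, 1] by Cauchy-Schwarz. Dimension/seed arbitrary.
rng = np.random.default_rng(1)
d = 3

def sample_ball(n):
    """Sample n points from the unit ball by rejection from the cube."""
    pts = []
    while len(pts) < n:
        x = rng.uniform(-1.0, 1.0, size=d)
        if np.linalg.norm(x) <= 1.0:
            pts.append(x)
    return np.array(pts)

actions = sample_ball(200)   # a in A = B_2^d
thetas = sample_ball(200)    # theta_t ~ nu with Supp(nu) inside L
inner = actions @ thetas.T   # all pairwise mean losses <a, theta_t>
print(inner.min(), inner.max())
```

Because every realisable loss sequence under the parameter-noise model is one the restricted adversary could have played, the adversarial guarantee transfers, which is the heart of the reduction the exercise asks for.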

29.2 (Follow the regularized leader for stochastic bandits) Consider
a stochastic linear bandit with $\mathcal{A} = B_2^d$ and loss $Y_t = \langle A_t, \theta \rangle + \eta_t$, where $(\eta_t)_{t=1}^n$
are independent with zero mean and $Y_t \in [-1, 1]$ almost surely. Adapt the proof of
Theorem 28.11 to show that with appropriate tuning the algorithm in Section 28.4
satisfies $R_n \leq C d \sqrt{n \log(n)}$ for a universal constant $C > 0$.

Hint Repeat the analysis in the proof of Theorem 28.11, update the learning
rate and check the bounds on the norms of the estimators.
29.3 (Follow the regularized leader for stochastic bandits (ii))
Repeat the previous exercise using exponential weights or continuous exponential
weights with Kiefer–Wolfowitz exploration, where:

(a) $\mathcal{A}$ is finite.
(b) $\mathcal{A}$ is convex.
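As a warm-up for part (a), the sketch below runs loss-based exponential weights (Exp3-style, with importance-weighted loss estimates) over a finite action set. It deliberately ignores the linear structure and the Kiefer–Wolfowitz exploration design the exercise actually requires, treating each action as an independent arm; the action set, parameter vector, horizon, and learning rate are all illustrative choices.

```python
import numpy as np

# Exp3-style exponential weights over a finite action set, treating each
# action as an arm. This ignores the linear structure and the
# Kiefer-Wolfowitz design the exercise calls for (warm-up sketch only).
# All parameters (d, n, theta, learning rate) are illustrative.
rng = np.random.default_rng(2)
d, n = 2, 5000
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # finite action set
theta = np.array([0.6, -0.2])  # mean losses are <a, theta>
K = len(A)

lr = np.sqrt(2.0 * np.log(K) / (n * K))  # standard Exp3 tuning
L_hat = np.zeros(K)                      # importance-weighted loss totals
total_loss = 0.0
for t in range(n):
    w = np.exp(-lr * (L_hat - L_hat.min()))  # shift for numerical stability
    p = w / w.sum()
    i = rng.choice(K, p=p)
    y = A[i] @ theta + 0.1 * rng.normal()    # noisy observed loss
    total_loss += y
    L_hat[i] += y / p[i]                     # unbiased loss estimate

best_loss = n * (A @ theta).min()            # best fixed action in hindsight
regret = total_loss - best_loss
print(regret / n)                            # per-round regret shrinks with n
```

The full solution replaces the per-arm estimates with least-squares loss estimators built from an exploration distribution, which is where the Kiefer–Wolfowitz theorem enters.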

29.4 (Unrealizable linear bandits) Let $\mathcal{A} \subset \mathbb{R}^d$ and $(\eta_t)_{t=1}^n$ be a sequence
of independent zero-mean random variables, and assume the loss is

$Y_t = \ell(A_t) + \eta_t,$