Bandit Algorithms

28.4 Linear bandits on the unit ball

Surprisingly, follow the regularized leader with a carefully chosen potential improves on this bound by a factor of $\sqrt{d}$.
For the remainder of this section let $\|\cdot\| = \|\cdot\|_2$ be the standard Euclidean norm and $\mathcal{A} = B_2^d$ be the standard unit ball. In order to instantiate follow the regularized leader we need a potential, a sampling rule, an unbiased estimator and a learning rate. We start with the sampling rule and estimator. Recall that in round $t$ we need to choose a distribution on $\mathcal{A}$ with mean $\bar{A}_t$ and sufficient variability that the variance of the estimator is not too large. Let $E_t \in \{0, 1\}$ satisfy $\mathbb{E}[E_t \mid A_1, \ldots, A_{t-1}] = 1 - \|\bar{A}_t\|$ and let $U_t$ be drawn independently and uniformly from $\{\pm e_1, \ldots, \pm e_d\}$. The algorithm chooses
\[
A_t = E_t U_t + (1 - E_t)\,\frac{\bar{A}_t}{\|\bar{A}_t\|}\,.
\]
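As a concrete illustration, here is a minimal Python sketch of this sampling rule, assuming NumPy; the function and variable names are mine, not from the text:

```python
import numpy as np

def sample_action(a_bar, rng):
    """Draw A_t from a distribution on the unit ball with mean a_bar.

    With probability 1 - ||a_bar|| play a uniformly random signed coordinate
    vector, otherwise play a_bar rescaled to the boundary of the ball.
    Returns the action A_t and the exploration indicator E_t.
    """
    d = len(a_bar)
    norm = np.linalg.norm(a_bar)
    explore = rng.random() < 1.0 - norm                # E_t ~ Bernoulli(1 - ||a_bar||)
    if explore:
        u = np.zeros(d)
        u[rng.integers(d)] = rng.choice([-1.0, 1.0])   # U_t uniform on {±e_1, ..., ±e_d}
        return u, True
    return a_bar / norm, False
```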


Then the sampling distribution is just the law of $A_t$, which clearly satisfies $\mathbb{E}_t[A_t] = \bar{A}_t$. For the estimator we use a variant of the importance-weighted estimator from the last chapter:
\[
\hat{Y}_t = \frac{d\, E_t A_t \langle A_t, y_t\rangle}{1 - \|\bar{A}_t\|}\,. \tag{28.12}
\]
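Continuing the sketch, the estimator in Eq. (28.12) might be implemented along the following lines (again the names are illustrative only):

```python
def loss_estimate(a_t, explore, y_t, a_bar):
    """Importance-weighted estimate from Eq. (28.12).

    Returns d * E_t * A_t * <A_t, y_t> / (1 - ||a_bar||); the estimate is the
    zero vector whenever the exploration indicator E_t is zero.
    """
    d = len(a_t)
    if not explore:
        return np.zeros(d)
    return d * a_t * np.dot(a_t, y_t) / (1.0 - np.linalg.norm(a_bar))
```

Averaging loss_estimate over many draws of sample_action for a fixed a_bar and y_t gives a quick numerical sanity check of the unbiasedness claimed below.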


The reader can check for themselves that this estimator is unbiased. Next we inspect the contents of our magician's hat and select the potential
\[
F(a) = -\log(1 - \|a\|) - \|a\|\,.
\]
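One reason this potential is convenient is that its gradient has a simple closed form. A direct calculation (a useful warm-up for Exercise 28.5 below) gives
\[
\nabla F(a) = \frac{a}{\|a\|}\left(\frac{1}{1-\|a\|} - 1\right) = \frac{a}{1-\|a\|}\,,
\]
so the first-order condition in the follow-the-regularized-leader optimization can be solved explicitly, which is what lies behind Eq. (28.13) below.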


There is one more modification. Rather than instantiating follow the regularized leader with action set $\mathcal{A}$, we use $\tilde{\mathcal{A}} = \{x \in \mathbb{R}^d : \|x\|_2 \le r\}$, where $r < 1$ is a radius to be tuned subsequently. The reason for this modification is to control the variance of the estimator in Eq. (28.12), which blows up as $\bar{A}_t$ gets close to the boundary. Note that the exploration means that the algorithm always plays $A_t$ on the boundary of $\mathcal{A}$, but follow the regularized leader always chooses $\bar{A}_t \in \tilde{\mathcal{A}}$. You will show in Exercise 28.5 that


\[
\bar{A}_t = \Pi\left(\frac{-\eta \hat{L}_{t-1}}{1 + \eta\|\hat{L}_{t-1}\|}\right) \quad \text{with} \quad \hat{L}_{t-1} = \sum_{s=1}^{t-1} \hat{Y}_s\,, \tag{28.13}
\]
where $\Pi(x)$ is the projection operator onto $\tilde{\mathcal{A}}$ with respect to $\|\cdot\|_2$.
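To make the whole procedure concrete, here is a minimal sketch of the resulting algorithm, combining the update (28.13) with the sampling rule and estimator sketched above; the loop structure, function names and the `losses` iterable are my own illustrative choices, not the book's pseudocode:

```python
def ftrl_unit_ball(losses, eta, r, rng=None):
    """Follow the regularized leader on the unit ball.

    losses: iterable of loss vectors y_t with ||y_t||_2 <= 1.
    eta: learning rate; r: radius of the shrunken action set A-tilde.
    Returns the total loss incurred by the sampled actions.
    """
    rng = np.random.default_rng() if rng is None else rng
    l_hat = None                                   # running sum of estimates, hat L_{t-1}
    total = 0.0
    for y_t in losses:
        if l_hat is None:
            l_hat = np.zeros(len(y_t))
        # Closed-form FTRL iterate from Eq. (28.13), then Euclidean projection
        # onto the radius-r ball (a radial rescaling, since the ball is centred).
        a_bar = -eta * l_hat / (1.0 + eta * np.linalg.norm(l_hat))
        norm = np.linalg.norm(a_bar)
        if norm > r:
            a_bar *= r / norm
        a_t, explore = sample_action(a_bar, rng)
        total += float(np.dot(a_t, y_t))
        l_hat += loss_estimate(a_t, explore, y_t, a_bar)
    return total
```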


Theorem 28.11. Assume that $(y_t)_{t=1}^n$ is a sequence of losses such that $\|y_t\|_2 \le 1$ for all $t$. Suppose that Algorithm 16 is run using the sampling rule, estimator and potential described above, with learning rate $\eta = \sqrt{\log(n)/(3dn)}$ and $r = 1 - 2\eta d$. Then $R_n \le 2\sqrt{3nd\log(n)}$.

You might notice that in some regimes this is smaller than the lower bound
for stochastic linear bandits (Theorem 24.2). There is no contradiction
because the adversarial and stochastic linear bandit models are actually
quite different. More details are in Chapter 29.
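For example, with the tuning from Theorem 28.11 the sketch above would be instantiated roughly as follows (the synthetic loss vectors are purely illustrative):

```python
n, d = 10_000, 5
rng = np.random.default_rng(0)
eta = np.sqrt(np.log(n) / (3 * d * n))          # learning rate from Theorem 28.11
r = 1.0 - 2.0 * eta * d                         # radius of the shrunken ball
ys = [v / np.linalg.norm(v) for v in rng.normal(size=(n, d))]   # unit-norm losses
total = ftrl_unit_ball(ys, eta, r, rng)
```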