Bandit Algorithms

(Jeff_L) #1
27.3 Continuous exponential weights 304

which implies that

|ηYˆt(a)|≤

η
γmaxν∈Aν


Q− (^1) (π)ν=η
γmaxν∈A‖ν‖
(^2) Q− (^1) (π). (27.3)
From Theorem 21.1 (Kiefer–Wolfowitz) we know that there exists a sampling
distributionπsuch thatmaxν∈A‖ν‖^2 Q− (^1) (π)≤d(and in fact the relation holds
with equality). By choosingγ=ηdand plugging into (27.2) we get
Rn≤
logk
η



  • 3ηdn= 2




3 dnlog(k),

where the last equality is derived by choosingη=


log(k)
3 dn.

27.3 Continuous exponential weights


The dependence onlog(k) in the regret guarantee provided by Theorem 27.1
is objectionable when the number of arms is extremely large or infinite. One
approach is find a finite subsetC ⊆Afor which

sup
a∈A

minb∈Csup
y∈L

|〈a−b,y〉|≤ 1 /n.

A standard calculation shows (Exercise 27.6) shows thatCcan always be chosen
so thatlog|C|≤dlog(6dn). Then it is easy to check that Exp3 onCsuffers regret
relative to the best action inAof at mostRn=O(d


nlog(nd)). The problem
with this approach is thatCis exponentially large ind, which makes this algorithm
intractable in most situations. WhenAis convex a more computationally tractable
approach is to use thecontinuous exponential weightsalgorithm.

For this section we assume thatAis convex and has positive Lebesgue
measure. The latter condition can be relaxed with some care (Exercise 27.10).

Letπbe a probability measure supported onA. The continuous exponential
weights policy samplesAtfromPt= (1−γ)P ̃t+γπwhereP ̃tis a measure
supported onAdefined by

P ̃t(B) =


Bexp

(


−η

∑t− 1
s=1Yˆs(a)

)


da

Aexp

(


−η

∑t− 1
s=1Yˆs(a)

)


da

. (27.4)


We will shortly see that the analysis in the previous section can be copied almost
verbatim to prove a regret bound for this strategy. But what has been bought here?
Rather than sampling from a discrete distribution on a large number of arms, we
now have to sample from a probability measure on a convex set. Sampling from
arbitrary probability measures is itself a challenging problem, but under certain
conditions there are polynomial time algorithms for this problem. The factors that
Free download pdf