Bandit Algorithms

27.3 Continuous exponential weights 304

which implies that

|ηYˆt(a)|≤

η γmaxν∈Aν

Q− (^1) (π)ν=η
γmaxν∈A‖ν‖
(^2) Q− (^1) (π). (27.3)
From Theorem 21.1 (Kiefer–Wolfowitz) we know that there exists a sampling
distributionπsuch thatmaxν∈A‖ν‖^2 Q− (^1) (π)≤d(and in fact the relation holds
with equality). By choosingγ=ηdand plugging into (27.2) we get
Rn≤
logk
η

3ηdn= 2

√

3 dnlog(k),

where the last equality is derived by choosingη=

√

log(k) 3 dn.

27.3 Continuous exponential weights

The dependence onlog(k) in the regret guarantee provided by Theorem 27.1 is objectionable when the number of arms is extremely large or infinite. One approach is find a finite subsetC ⊆Afor which

sup a∈A

minb∈Csup y∈L

|〈a−b,y〉|≤ 1 /n.

A standard calculation shows (Exercise 27.6) shows thatCcan always be chosen so thatlog|C|≤dlog(6dn). Then it is easy to check that Exp3 onCsuffers regret relative to the best action inAof at mostRn=O(d

√

nlog(nd)). The problem with this approach is thatCis exponentially large ind, which makes this algorithm intractable in most situations. WhenAis convex a more computationally tractable approach is to use thecontinuous exponential weightsalgorithm.

For this section we assume thatAis convex and has positive Lebesgue measure. The latter condition can be relaxed with some care (Exercise 27.10).

Letπbe a probability measure supported onA. The continuous exponential weights policy samplesAtfromPt= (1−γ)P ̃t+γπwhereP ̃tis a measure supported onAdefined by

P ̃t(B) =

∫

Bexp

(

−η

∑t− 1 s=1Yˆs(a)

)

da ∫ Aexp

(

−η

∑t− 1 s=1Yˆs(a)

)

da

. (27.4)

We will shortly see that the analysis in the previous section can be copied almost verbatim to prove a regret bound for this strategy. But what has been bought here? Rather than sampling from a discrete distribution on a large number of arms, we now have to sample from a probability measure on a convex set. Sampling from arbitrary probability measures is itself a challenging problem, but under certain conditions there are polynomial time algorithms for this problem. The factors that

Bandit Algorithms

√

√

27.3 Continuous exponential weights

√

∫

(

)

(

)

. (27.4)

Get our desktop app

Company

Features

Documentation

Resources