Bandit Algorithms


In finite time the situation is a little messy, but if one pushes these ideas to
the limit, then for many classes of bandits one can define a precise notion of
instance-dependent optimality.

16.1 Asymptotic bounds


We need to define exactly what is meant by a reasonable policy. If one is only
concerned with asymptotics, then a rather conservative definition suffices.

Definition 16.1. A policy $\pi$ is called consistent over a class of bandits $\mathcal{E}$ if for
all $\nu \in \mathcal{E}$ and $p > 0$ it holds that
$$
\lim_{n \to \infty} \frac{R_n(\pi, \nu)}{n^p} = 0. \tag{16.1}
$$

The class of consistent policies over $\mathcal{E}$ is denoted by $\Pi_{\mathrm{cons}}(\mathcal{E})$.

Theorem 7.1 shows that UCB is consistent over $\mathcal{E}^k_{\mathrm{SG}}(1)$. The strategy that
always chooses the first action is not consistent on any class $\mathcal{E}$ unless the first
arm is optimal for every $\nu \in \mathcal{E}$.
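
To see why logarithmic regret implies consistency, here is a one-line sketch, where
$C_\nu > 0$ is a generic instance-dependent constant standing in for the explicit
bound of Theorem 7.1: if $R_n(\pi, \nu) \le C_\nu \log(n)$ for all $n$, then
$$
\frac{R_n(\pi, \nu)}{n^p} \le \frac{C_\nu \log(n)}{n^p} \longrightarrow 0
\quad \text{as } n \to \infty \text{ for every } p > 0,
$$
so (16.1) holds on every instance.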

Consistency is an asymptotic notion. A policy could be consistent and yet
play $A_t = 1$ for all $t \le 10^{100}$. For this reason an assumption of consistency
is insufficient to derive nonasymptotic lower bounds. In Section 16.2 we
introduce a finite-time version of consistency that allows us to prove finite-
time instance-dependent lower bounds.

Recall that a class $\mathcal{E}$ of stochastic bandits is unstructured if $\mathcal{E} = \mathcal{M}_1 \times \cdots \times \mathcal{M}_k$
with $\mathcal{M}_1, \ldots, \mathcal{M}_k$ sets of distributions. The main theorem of this chapter is a
generic lower bound that applies to any unstructured class of stochastic bandits.
After the proof we will see some applications to specific classes. Let $\mathcal{M}$ be a set
of distributions with finite means and let $\mu : \mathcal{M} \to \mathbb{R}$ be the function that maps
$P \in \mathcal{M}$ to its mean. Let $\mu^* \in \mathbb{R}$ and $P \in \mathcal{M}$ have $\mu(P) < \mu^*$ and define
$$
d_{\inf}(P, \mu^*, \mathcal{M}) = \inf_{P' \in \mathcal{M}} \left\{ D(P, P') : \mu(P') > \mu^* \right\}.
$$
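
For intuition, consider a worked example, taking $\mathcal{M}$ to be the set of unit-variance
Gaussians (a choice made here purely for illustration). Since
$D(\mathcal{N}(\mu, 1), \mathcal{N}(\mu', 1)) = (\mu - \mu')^2/2$,
$$
d_{\inf}(\mathcal{N}(\mu, 1), \mu^*, \mathcal{M})
= \inf_{\mu' > \mu^*} \frac{(\mu' - \mu)^2}{2}
= \frac{(\mu^* - \mu)^2}{2},
$$
with the infimum approached, but not attained, as $\mu' \downarrow \mu^*$.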

Theorem 16.2. Let $\mathcal{E} = \mathcal{M}_1 \times \cdots \times \mathcal{M}_k$ and $\pi \in \Pi_{\mathrm{cons}}(\mathcal{E})$ be a consistent policy
over $\mathcal{E}$. Then for all $\nu = (P_i)_{i=1}^k \in \mathcal{E}$ it holds that
$$
\liminf_{n \to \infty} \frac{R_n}{\log(n)} \ge c^*(\nu, \mathcal{E})
= \sum_{i : \Delta_i > 0} \frac{\Delta_i}{d_{\inf}(P_i, \mu^*, \mathcal{M}_i)}, \tag{16.2}
$$

where $\Delta_i$ is the suboptimality gap of the $i$th arm in $\nu$ and $\mu^*$ is the mean of the
optimal arm.
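
To make the constant in (16.2) tangible, here is a minimal sketch (not from the
book; the function name and the restriction to unit-variance Gaussian bandits are
illustrative assumptions) that evaluates $c^*(\nu, \mathcal{E})$ using the closed form
$d_{\inf}(P_i, \mu^*, \mathcal{M}_i) = \Delta_i^2/2$ derived above, so each suboptimal arm
contributes $\Delta_i / (\Delta_i^2/2) = 2/\Delta_i$ to the sum:

def gaussian_lower_bound_constant(means):
    # c*(nu, E) for a unit-variance Gaussian bandit with the given arm means:
    # each suboptimal arm contributes Delta_i / d_inf(P_i, mu*, M_i)
    # = Delta_i / (Delta_i^2 / 2) = 2 / Delta_i to the sum in (16.2).
    mu_star = max(means)
    gaps = (mu_star - mu for mu in means)
    return sum(2.0 / gap for gap in gaps if gap > 0)

# Example: means (0.5, 0.0) give Delta_2 = 0.5, so any consistent policy has
# liminf_n R_n / log(n) >= 2 / 0.5 = 4.
print(gaussian_lower_bound_constant([0.5, 0.0]))  # prints 4.0

Note how the constant blows up as the gaps shrink: arms whose means are close to
optimal are hard to distinguish, so a consistent policy must spend more rounds on
them.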

Proof. Let $\mu_i$ be the mean of the $i$th arm in $\nu$ and $d_i = d_{\inf}(P_i, \mu^*, \mathcal{M}_i)$. The