Bandit Algorithms

25.1 An asymptotic lower bound for fixed action sets

\[
\liminf_{n\to\infty} \frac{\lambda_{\min}(\bar G_n)}{\log(n)} > 0\,.
\]
Furthermore, for any $a \in \mathcal{A}$ it holds that:
\[
\limsup_{n\to\infty}\, \log(n)\, \|a\|^2_{\bar G_n^{-1}} \le \frac{\Delta_a^2}{2}\,.
\]

The reader should recognize $\|a\|^2_{\bar G_n^{-1}}$ as the key term in the width of the
confidence interval for the least squares estimator (Chapter 20). This is quite
intuitive. The theorem is saying that any consistent algorithm must prove
statistically that all suboptimal arms are indeed suboptimal by making the
size of the confidence interval smaller than the suboptimality gap. Before the
proof of this result we give a corollary that characterizes the asymptotic regret
that must be endured by any consistent policy.
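
To make this quantity concrete, here is a minimal numpy sketch (not from the text; the played actions and the arm $a$ are hypothetical, and the empirical Gram matrix $G_n = \sum_{t=1}^n A_t A_t^\top$ stands in for its expectation $\bar G_n$) computing the elliptical norm that controls the confidence width:

```python
import numpy as np

# Hypothetical history: rows are the actions A_1, ..., A_n played so far,
# and `a` is a fixed suboptimal action whose confidence width we inspect.
played = np.array([[1.0, 0.0],
                   [1.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 1.0]])
a = np.array([0.0, 1.0])

G = played.T @ played                 # empirical Gram matrix G_n
width_sq = a @ np.linalg.solve(G, a)  # ||a||^2_{G_n^{-1}}
n = played.shape[0]

# Theorem 25.1 says that for a consistent policy, asymptotically,
# log(n) * ||a||^2_{Gbar_n^{-1}} must not exceed Delta_a^2 / 2.
print(np.log(n) * width_sq)
```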

Corollary 25.2. Let $\mathcal{A} \subset \mathbb{R}^d$ be a finite set that spans $\mathbb{R}^d$ and let $\theta \in \mathbb{R}^d$ be such
that there is a unique optimal action. Then for any consistent policy
\[
\liminf_{n\to\infty} \frac{R_n(\mathcal{A},\theta)}{\log(n)} \ge c(\mathcal{A},\theta)\,,
\]
where $c(\mathcal{A},\theta)$ is defined as
\[
c(\mathcal{A},\theta) = \inf_{\alpha \in [0,\infty)^{\mathcal{A}}} \sum_{a \in \mathcal{A}} \alpha(a)\,\Delta_a
\quad \text{subject to } \|a\|^2_{H_\alpha^{-1}} \le \frac{\Delta_a^2}{2} \text{ for all } a \in \mathcal{A} \text{ with } \Delta_a > 0\,,
\]
with $H_\alpha = \sum_{a \in \mathcal{A}} \alpha(a)\, a a^\top$.
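
The constant $c(\mathcal{A},\theta)$ is the value of a convex program: the objective is linear in $\alpha$ and the map $\alpha \mapsto \|a\|^2_{H_\alpha^{-1}}$ is convex, so it can be approximated numerically. Below is a minimal sketch (not from the text) using cvxpy; the function name `c_lower_bound` is ours, and expressing $\|a\|^2_{H_\alpha^{-1}}$ with the `matrix_frac` atom is one possible formulation among several.

```python
import numpy as np
import cvxpy as cp

def c_lower_bound(actions, theta):
    """Approximate c(A, theta) from Corollary 25.2.

    actions: (k, d) array whose rows are the actions in A.
    theta:   (d,) parameter vector.
    """
    rewards = actions @ theta
    gaps = rewards.max() - rewards               # suboptimality gaps Delta_a
    k, d = actions.shape

    alpha = cp.Variable(k, nonneg=True)          # allocation alpha(a) >= 0
    # H_alpha = sum_a alpha(a) a a^T is affine in alpha.
    H = sum(alpha[i] * np.outer(actions[i], actions[i]) for i in range(k))

    constraints = []
    for i in range(k):
        if gaps[i] > 0:
            # matrix_frac(a, H) = a^T H^{-1} a = ||a||^2_{H_alpha^{-1}}
            constraints.append(cp.matrix_frac(actions[i], H) <= gaps[i] ** 2 / 2)

    # The canonicalized problem is a semidefinite program, which cvxpy's
    # bundled SCS solver can handle.
    problem = cp.Problem(cp.Minimize(gaps @ alpha), constraints)
    problem.solve()
    return problem.value, alpha.value
```

An equivalent formulation replaces each `matrix_frac` constraint by the Schur-complement linear matrix inequality $\begin{pmatrix} H_\alpha & a \\ a^\top & \Delta_a^2/2 \end{pmatrix} \succeq 0$.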

The lower bound is complemented by a matching upper bound that we will
not prove.

Theorem 25.3. Let $\mathcal{A} \subset \mathbb{R}^d$ be a finite set that spans $\mathbb{R}^d$. Then there exists a
policy such that
\[
\limsup_{n\to\infty} \frac{R_n(\mathcal{A},\theta)}{\log(n)} \le c(\mathcal{A},\theta)\,,
\]
where $c(\mathcal{A},\theta)$ is defined as in Corollary 25.2.
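
As a quick sanity check on this characterization (a worked special case, not taken from the text), suppose the actions are the standard basis vectors $e_1, \ldots, e_d$, so the problem reduces to a finite-armed bandit. Then $H_\alpha = \operatorname{diag}(\alpha)$ and the constraints decouple:
\[
\|e_i\|^2_{H_\alpha^{-1}} = \frac{1}{\alpha(e_i)} \le \frac{\Delta_i^2}{2}
\iff \alpha(e_i) \ge \frac{2}{\Delta_i^2}\,,
\qquad\text{hence}\qquad
c(\mathcal{A},\theta) = \sum_{i : \Delta_i > 0} \frac{2}{\Delta_i}\,,
\]
which recovers the familiar asymptotic rate for finite-armed bandits with unit-variance Gaussian noise.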


Proof of Theorem 25.1 The proof of the first part is omitted (see the reference
below for details); it follows along lines similar to the argument below, essentially
that if $\bar G_n$ is not sufficiently large in every direction, then some alternative
parameter is not sufficiently identifiable. Let $a^* = \operatorname{argmax}_{a \in \mathcal{A}} \langle a, \theta\rangle$ be the optimal
action, which we assumed to be unique. Let $\theta' \in \mathbb{R}^d$ be an alternative parameter
to be chosen subsequently, and let $P$ and $P'$ be the measures on the sequence
of outcomes $A_1, X_1, \ldots, A_n, X_n$ induced by the interaction between the policy
and the bandit determined by $\theta$ and $\theta'$ respectively. Let $E[\cdot]$ and $E'[\cdot]$ be the
expectation operators of $P$ and $P'$ respectively. By Theorem 14.2 and Lemma 15.1
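Recall the two results invoked here. The divergence decomposition, under the unit-variance Gaussian noise model used for these lower bounds, gives
\[
D(P, P') = \frac{1}{2}\, E\!\left[\sum_{t=1}^n \langle A_t, \theta - \theta'\rangle^2\right] = \frac{1}{2}\, \|\theta - \theta'\|^2_{\bar G_n}\,,
\]
while the Bretagnolle–Huber inequality states that for any event $B$,
\[
P(B) + P'(B^c) \ge \frac{1}{2} \exp\bigl(-D(P, P')\bigr)\,.
\]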