Bandit Algorithms

33.2 Best-arm identification with a fixed confidence

and $\hat\mu \approx \mu$, then
\[
Z_t = \inf_{\tilde\nu \in \mathcal E_{\mathrm{alt}}(\hat\nu(t))} \sum_{i=1}^k \frac{T_i(t)\,(\hat\mu_i(t) - \mu_i(\tilde\nu))^2}{2}
\approx t \inf_{\tilde\nu \in \mathcal E_{\mathrm{alt}}(\nu)} \sum_{i=1}^k \frac{\alpha_i^*(\nu)\,(\mu_i(\nu) - \mu_i(\tilde\nu))^2}{2}
= \frac{t}{c^*(\nu)}\,.
\]


Provided the approximation is reasonably accurate, the algorithm should halt once
\[
\frac{t}{c^*(\nu)} \ge \beta_t(\delta) = (1 + o(1)) \log(1/\delta)\,,
\]
which occurs once $t \ge (1 + o(1))\, c^*(\nu) \log(1/\delta)$.
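For unit-variance Gaussian arms the inner infimum has a well-known closed form: the worst alternative for a challenger arm $j$ pools the empirically best arm $b$ and arm $j$ at their count-weighted mean, contributing $\frac{T_b T_j}{T_b + T_j}\cdot\frac{(\hat\mu_b - \hat\mu_j)^2}{2}$. A minimal Python sketch of the resulting statistic and stopping rule follows; the threshold `log(1/delta) + log(t*(t+1))` is an illustrative choice of $\beta_t(\delta)$ in the spirit of Lemma 33.8 below, not a value prescribed by the text.

```python
import math

def glr_statistic(mu_hat, counts):
    """Z_t for unit-variance Gaussian arms: minimum over challenger arms j of
    the closed-form infimum T_b*T_j/(T_b+T_j) * (mu_hat_b - mu_hat_j)^2 / 2,
    where b is the empirically best arm."""
    b = max(range(len(mu_hat)), key=lambda i: mu_hat[i])
    z = math.inf
    for j in range(len(mu_hat)):
        if j == b:
            continue
        tb, tj = counts[b], counts[j]
        gap = mu_hat[b] - mu_hat[j]
        z = min(z, tb * tj / (tb + tj) * gap ** 2 / 2)
    return z

def should_stop(mu_hat, counts, delta, t):
    # Illustrative threshold beta_t(delta) = log(1/delta) + log(t*(t+1));
    # an assumption for this sketch, not prescribed by the text.
    return glr_statistic(mu_hat, counts) >= math.log(1 / delta) + math.log(t * (t + 1))
```

For example, with empirical means $(1.0, 0.5)$ and $100$ samples per arm the statistic is $6.25$, below the $\delta = 0.05$ threshold of roughly $13.6$, so sampling continues; with $1000$ samples per arm it is $62.5$ and the rule fires.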

33.2.3 Concentration


The first concentration theorem follows from Corollary 5.5 and a union bound.

Lemma 33.8. Let $(X_t)_{t=1}^\infty$ be a sequence of independent Gaussian random variables with mean $\mu$ and unit variance. Let $\hat\mu_n = \frac{1}{n}\sum_{t=1}^n X_t$. Then
\[
\mathbb P\left(\exists\, n \in \mathbb N^+ : \frac{n}{2}(\hat\mu_n - \mu)^2 \ge \log(1/\delta) + \log(n(n+1))\right) \le \delta\,.
\]
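The bound is easy to sanity-check by simulation. The sketch below uses hypothetical parameter choices ($\delta = 0.1$, horizon truncated at $n = 1000$, $2000$ repetitions) and estimates the probability of the anytime event; the empirical frequency should come out comfortably below $\delta$.

```python
import math
import random

def anytime_violation(mu=0.0, horizon=1000, delta=0.1, rng=random):
    """True if (n/2)*(mu_hat_n - mu)^2 >= log(1/delta) + log(n*(n+1))
    holds for some n up to the (truncated) horizon."""
    total = 0.0
    for n in range(1, horizon + 1):
        total += rng.gauss(mu, 1.0)
        mu_hat = total / n
        if n / 2 * (mu_hat - mu) ** 2 >= math.log(1 / delta) + math.log(n * (n + 1)):
            return True
    return False

random.seed(0)
runs = 2000
freq = sum(anytime_violation(delta=0.1) for _ in range(runs)) / runs
# The lemma guarantees the true probability is at most delta = 0.1.
print(f"empirical violation frequency: {freq:.4f}")
```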

Proposition 33.9. Let $g : \mathbb N \to \mathbb R$ be increasing and for each $i \in [k]$ let $S_{i1}, S_{i2}, \ldots$ be an infinite sequence of random variables such that for all $\delta \in (0,1)$,
\[
\mathbb P\left(\exists\, s \in \mathbb N : S_{is} \ge g(s) + \log(1/\delta)\right) \le \delta\,.
\]
Then, provided that $(S_i)_{i=1}^k$ are independent and $x \ge 0$,
\[
\mathbb P\left(\exists\, s \in \mathbb N^k : \sum_{i=1}^k S_{is_i} \ge k\, g\!\left(\sum_{i=1}^k s_i\right) + x\right) \le \left(\frac{x}{k}\right)^k \exp(k - x)\,.
\]

Proof. For $i \in [k]$ let $W_i = \max\{w \in [0,1] : S_{is} < g(s) + \log(1/w) \text{ for all } s \in \mathbb N\}$, where we define $\log(1/0) = \infty$. Note that the $W_i$ are well-defined. Then, for any $s \in \mathbb N^k$,
\[
\sum_{i=1}^k S_{is_i} \le \sum_{i=1}^k g(s_i) + \sum_{i=1}^k \log(1/W_i) \le k\, g\!\left(\sum_{i=1}^k s_i\right) + \sum_{i=1}^k \log(1/W_i)\,.
\]
By assumption, $(W_i)_{i=1}^k$ are independent and satisfy $\mathbb P(W_i \le x) \le x$ for all $x \in [0,1]$. The proof is completed by using Exercise 5.16.
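Since $\mathbb P(W_i \le x) \le x$, each $\log(1/W_i)$ is stochastically dominated by a standard exponential, so the final step amounts to a tail bound for a $\mathrm{Gamma}(k,1)$ random variable. For integer $k$ that tail has a closed form, so the bound in Proposition 33.9 can be checked numerically. The sketch below assumes this Gamma reduction (my framing of the step, not a quotation of Exercise 5.16) and verifies the domination for a few values of $k$ and $x \ge k$.

```python
import math

def gamma_tail(k, x):
    """Exact P(Gamma(k, 1) >= x) for integer shape k (Erlang tail formula)."""
    return math.exp(-x) * sum(x ** j / math.factorial(j) for j in range(k))

def chernoff_bound(k, x):
    """The bound (x/k)^k * exp(k - x) appearing in Proposition 33.9."""
    return (x / k) ** k * math.exp(k - x)

# The bound should dominate the exact tail whenever x >= k.
for k in (1, 2, 3, 5):
    for x in (1.0 * k, 2.0 * k, 5.0 * k):
        assert gamma_tail(k, x) <= chernoff_bound(k, x) + 1e-12
```

For instance, at $k = 3$ and $x = 10$ the exact tail is $61e^{-10} \approx 0.0028$ while the bound evaluates to about $0.034$.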