Bandit Algorithms

33.2 Best-arm identification with a fixed confidence 383

|i∗(ν)|= 1it holds that

lim δ→ 0

Eνπ[τ] log(1/δ)

=c∗(ν).

Note thatπandτboth depend onδinside the limit in the statement of the theorem. The following lemma guarantees the soundness of (π,τ).

Lemma33.7.Letf: [k,∞)→Rbe given byf(x) =exp(k−x)(x/k)kand βt(δ) =klog(t^2 +t) +f−^1 (δ). Then forτ=min{t:Zt≥βt(δ)}it holds that P(i∗(ˆν(τ)) 6 =i∗(ν))≤δ.

The inversef−^1 (δ) is well defined becausefis strictly decreasing on [k,∞) withf(k) = 1 andlimx→∞f(x) = 0. In fact, the inverse has a closed form solution in terms of the Lambert W function. By staring at the form offone can check thatlimδ→ 0 f−^1 (δ)/log(1/δ) = 1 or equivalently that f−^1 (δ) = (1 +o(1)) log(1/δ).

Proof of Lemma 33.7 Observe thati∗(ˆν(τ)) is a singleton because fortwith |i∗(νˆ(t))|>1 it holds thatZt= 0. This means we need not worry about how ties are broken in the argmax of the algorithm. Abbreviateμ=μ(ν) and ∆ = ∆(ν) and assume without loss of generality that ∆ 1 = 0. By the definition ofτ, if ν∈Ealt(ˆν(τ)), then

1

2

∑k

i=1

Ti(τ)(ˆμi(τ)−μi)^2 > βτ(δ).

Using the definition ofEalt(ˆν(τ)) yields

P(16∈i∗(ˆν(τ))) =P(ν∈Ealt(ˆν(τ)))≤P

(

1

2

∑k

i=1

Ti(τ)(ˆμi(τ)−μi)^2 > βτ(δ)

)

.

Then apply Lemma 33.8 and Proposition 33.9 from Section 33.2.3.

Below we sketch the proof of Theorem 33.6. A more complete outline is given in Exercise 33.7.

Proof sketch of Theorem 33.6 Lemma 33.7 shows that (π,τ) are sound. It remains to control the expectation of the stopping time. The intuition is straightforward. As more samples are collected we expect thatαˆ(t)≈α∗(ν)

Bandit Algorithms

1

2

(

1

2

)

.

Get our desktop app

Company

Features

Documentation

Resources