Bandit Algorithms

33.2 Best-arm identification with a fixed confidence 380

from the high probability Pinsker’s inequality (Theorem 14.2) and the stopping time version of Lemma 15.1 (see Exercise 15.7). The second inequality holds becausePνπ(τ=∞) = 0 andi∗(ν)∩i∗(ν′) =∅and Ec={τ=∞}∪{τ <∞andAτ+1∈i∗(ν′)} ⊆{τ=∞}∪{τ <∞andAτ+1∈/i∗(ν)}. Rearranging Eq. (33.4) shows that ∑k

i=1

Eνπ[Ti(τ)] D(νi,νi′)≥log

(

4

δ

)

, (33.5)

which implies thatEνπ[τ]>0. Using this, the definition ofc∗(ν) and Eq. (33.5),

Eνπ[τ] c∗(ν) =Eνπ[τ] supα∈Pk− 1 ν′∈Einfalt(ν)

∑k

i=1

αiD(νi,νi′)

≥Eνπ[τ]ν′∈Einf alt(ν)

∑k

i=1

Eνπ[Ti(τ)] Eνπ[τ] D(νi,ν

′ i) (33.6)

=ν′∈Einf alt(ν)

∑k

i=1

Eνπ[Ti(τ)] D(νi,ν′i)≥log

(

4

δ

)

,

where the last inequality follows from Eq. (33.5). Rearranging completes the proof.
Note that the special case thatc∗(ν)−^1 = 0, the assumption thatEνπ[τ]<∞
would lead to a contradiction.

Theorem 33.5 does not depend onEbeing unstructured. The assumption that the bandits are finite-armed could also be relaxed with appropriate measureability assumptions.

In a moment we will prove that the bound in Theorem 33.5 is asymptotically optimal asδ →0 whenE=ENk(1), a result which holds more generally for Bernoulli bandits or when the distributions come from an exponential family. Before this we devote a little time to understanding the constantc∗(ν). Suppose thatα∗(ν)∈Pk− 1 satisfies

c∗(ν)−^1 = inf ν′∈Ealt(ν)

∑k

i=1

α∗i(ν) D(νi,ν′i).

A few observations about this optimization problem:

(a)The value ofα∗(ν) is unique whenE=ENk(1) andν∈Ehas a unique optimal
arm. Uniqueness continues to hold whenEis an unstructured bandit with
distributions from an exponential family.
(b)The inequality in Eq. (33.6) is tightest whenEνπ[Ti(τ)]/Eνπ[τ] =α∗i(ν),
which shows a policy can only match the lower bound by playing armi
exactly in proportion toα∗i(ν) in the limit asδtends to zero.

Bandit Algorithms

(

4

)

, (33.5)

(

4

)

,

Get our desktop app

Company

Features

Documentation

Resources