Bandit Algorithms

(Jeff_L) #1
33.2 Best-arm identification with a fixed confidence 380

from the high probability Pinsker’s inequality (Theorem 14.2) and the stopping
time version of Lemma 15.1 (see Exercise 15.7). The second inequality holds
becausePνπ(τ=∞) = 0 andi∗(ν)∩i∗(ν′) =∅and
Ec={τ=∞}∪{τ <∞andAτ+1∈i∗(ν′)}
⊆{τ=∞}∪{τ <∞andAτ+1∈/i∗(ν)}.
Rearranging Eq. (33.4) shows that
∑k

i=1

Eνπ[Ti(τ)] D(νi,νi′)≥log

(


4


δ

)


, (33.5)


which implies thatEνπ[τ]>0. Using this, the definition ofc∗(ν) and Eq. (33.5),


Eνπ[τ]
c∗(ν) =Eνπ[τ] supα∈Pk− 1 ν′∈Einfalt(ν)

∑k

i=1

αiD(νi,νi′)

≥Eνπ[τ]ν′∈Einf
alt(ν)

∑k

i=1

Eνπ[Ti(τ)]
Eνπ[τ] D(νi,ν


i) (33.6)

=ν′∈Einf
alt(ν)

∑k

i=1

Eνπ[Ti(τ)] D(νi,ν′i)≥log

(


4


δ

)


,


where the last inequality follows from Eq. (33.5). Rearranging completes the proof.
Note that the special case thatc∗(ν)−^1 = 0, the assumption thatEνπ[τ]<∞
would lead to a contradiction.


Theorem 33.5 does not depend onEbeing unstructured. The assumption
that the bandits are finite-armed could also be relaxed with appropriate
measureability assumptions.

In a moment we will prove that the bound in Theorem 33.5 is asymptotically
optimal asδ →0 whenE=ENk(1), a result which holds more generally for
Bernoulli bandits or when the distributions come from an exponential family.
Before this we devote a little time to understanding the constantc∗(ν). Suppose
thatα∗(ν)∈Pk− 1 satisfies

c∗(ν)−^1 = inf
ν′∈Ealt(ν)

∑k

i=1

α∗i(ν) D(νi,ν′i).

A few observations about this optimization problem:


(a)The value ofα∗(ν) is unique whenE=ENk(1) andν∈Ehas a unique optimal
arm. Uniqueness continues to hold whenEis an unstructured bandit with
distributions from an exponential family.
(b)The inequality in Eq. (33.6) is tightest whenEνπ[Ti(τ)]/Eνπ[τ] =α∗i(ν),
which shows a policy can only match the lower bound by playing armi
exactly in proportion toα∗i(ν) in the limit asδtends to zero.

Free download pdf