33.2 Best-arm identification with a fixed confidence 380
from the high probability Pinsker’s inequality (Theorem 14.2) and the stopping
time version of Lemma 15.1 (see Exercise 15.7). The second inequality holds
becausePνπ(τ=∞) = 0 andi∗(ν)∩i∗(ν′) =∅and
Ec={τ=∞}∪{τ <∞andAτ+1∈i∗(ν′)}
⊆{τ=∞}∪{τ <∞andAτ+1∈/i∗(ν)}.
Rearranging Eq. (33.4) shows that
∑k
i=1
Eνπ[Ti(τ)] D(νi,νi′)≥log
(
4
δ
)
, (33.5)
which implies thatEνπ[τ]>0. Using this, the definition ofc∗(ν) and Eq. (33.5),
Eνπ[τ]
c∗(ν) =Eνπ[τ] supα∈Pk− 1 ν′∈Einfalt(ν)
∑k
i=1
αiD(νi,νi′)
≥Eνπ[τ]ν′∈Einf
alt(ν)
∑k
i=1
Eνπ[Ti(τ)]
Eνπ[τ] D(νi,ν
′
i) (33.6)
=ν′∈Einf
alt(ν)
∑k
i=1
Eνπ[Ti(τ)] D(νi,ν′i)≥log
(
4
δ
)
,
where the last inequality follows from Eq. (33.5). Rearranging completes the proof.
Note that the special case thatc∗(ν)−^1 = 0, the assumption thatEνπ[τ]<∞
would lead to a contradiction.
Theorem 33.5 does not depend onEbeing unstructured. The assumption
that the bandits are finite-armed could also be relaxed with appropriate
measureability assumptions.
In a moment we will prove that the bound in Theorem 33.5 is asymptotically
optimal asδ →0 whenE=ENk(1), a result which holds more generally for
Bernoulli bandits or when the distributions come from an exponential family.
Before this we devote a little time to understanding the constantc∗(ν). Suppose
thatα∗(ν)∈Pk− 1 satisfies
c∗(ν)−^1 = inf
ν′∈Ealt(ν)
∑k
i=1
α∗i(ν) D(νi,ν′i).
A few observations about this optimization problem:
(a)The value ofα∗(ν) is unique whenE=ENk(1) andν∈Ehas a unique optimal
arm. Uniqueness continues to hold whenEis an unstructured bandit with
distributions from an exponential family.
(b)The inequality in Eq. (33.6) is tightest whenEνπ[Ti(τ)]/Eνπ[τ] =α∗i(ν),
which shows a policy can only match the lower bound by playing armi
exactly in proportion toα∗i(ν) in the limit asδtends to zero.