33.2 Best-arm identification with a fixed confidence
If the inequality is tight, then we might guess that a reasonable stopping rule is the first round $t$ when
\[
\inf_{\nu' \in \mathcal{E}_{\mathrm{alt}}(\nu)} \sum_{i=1}^k T_i(t)\, \mathrm{D}(\nu_i, \nu'_i) \gtrsim \log\left(\frac{1}{\delta}\right).
\]
There are two problems: (a) $\nu$ is unknown, so the expression cannot be evaluated; and (b) we have replaced the expected number of pulls with the actual number of pulls. Still, let us persevere. To deal with the first problem we can try replacing $\nu$ by the Gaussian bandit environment with mean vector $\hat\mu(t)$, which we denote by $\hat\nu(t)$. Then let
\[
Z_t = \inf_{\nu' \in \mathcal{E}_{\mathrm{alt}}(\hat\nu(t))} \sum_{i=1}^k T_i(t)\, \mathrm{D}(\hat\nu_i(t), \nu'_i)
= \frac{1}{2} \inf_{\mu' \in \mathcal{E}_{\mathrm{alt}}(\hat\mu(t))} \sum_{i=1}^k T_i(t)\,\big(\hat\mu_i(t) - \mu'_i\big)^2,
\]
where the second equality uses the fact that $\mathrm{D}(\nu_i, \nu'_i) = (\mu_i - \mu'_i)^2/2$ for unit-variance Gaussian distributions.
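For unit-variance Gaussians this infimum can be evaluated in closed form: the minimizing alternative pulls the means of the empirically best arm $i^*$ and a single challenger $j$ together at their count-weighted average, which gives
\[
Z_t = \min_{j \neq i^*} \frac{T_{i^*}(t)\,T_j(t)}{2\big(T_{i^*}(t) + T_j(t)\big)} \big(\hat\mu_{i^*}(t) - \hat\mu_j(t)\big)^2.
\]
The following short sketch computes this quantity (the function name is ours, not the book's):

    import numpy as np

    def glr_stat(mu_hat, T):
        """Z_t for unit-variance Gaussian arms: smallest count-weighted
        squared gap between the empirically best arm and a challenger."""
        i_star = int(np.argmax(mu_hat))
        z = np.inf
        for j in range(len(mu_hat)):
            if j == i_star:
                continue
            w = T[i_star] * T[j] / (T[i_star] + T[j])  # harmonic weighting of counts
            z = min(z, 0.5 * w * (mu_hat[i_star] - mu_hat[j]) ** 2)
        return z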
We will show that there exists a choice of $\beta_t(\delta) \approx \log(t/\delta)$ such that if
$\tau = \min\{t : Z_t \geq \beta_t(\delta)\}$, then the empirically optimal arm at time $\tau$ is the best
arm with probability at least $1 - \delta$. But will this policy stop early enough? As
we remarked earlier, if the policy is to match the lower bound it should play
arm $i$ approximately in proportion to $\alpha^*_i(\nu)$. This suggests estimating $\alpha^*(\nu)$ by
$\hat\alpha(t) = \alpha^*(\hat\nu(t))$ and then playing the arm for which $t\hat\alpha_i(t) - T_i(t)$ is maximized. If
$\hat\alpha(t)$ is inaccurate, then perhaps the samples collected will not allow the algorithm
to improve its estimates. To overcome this last challenge the policy includes
enough forced exploration to ensure that eventually $\hat\alpha(t)$ converges to $\alpha^*(\nu)$ with
high probability. Combining all these ideas leads to the Track-and-Stop policy
(Algorithm 20).
1: Input $\delta$ and $\beta_t(\delta)$
2: Choose each arm once
3: while $Z_t \leq \beta_t(\delta)$ do
4:     if $\min_{i \in [k]} T_i(t-1) \leq \sqrt{t}$ then
5:         Choose $A_t = \operatorname{argmin}_{i \in [k]} T_i(t-1)$
6:     else
7:         Choose $A_t = \operatorname{argmax}_{i \in [k]} \big(t\hat\alpha^*_i(t-1) - T_i(t-1)\big)$
8:     end if
9:     Observe $X_t$, update statistics, increment $t$
10: end while
11: return $A_{\tau+1} = i^*(\hat\nu(t))$ and $\tau = t$

Algorithm 20: Track-and-stop.
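To make the moving parts concrete, here is a minimal Python sketch of the loop, assuming unit-variance Gaussian rewards, reusing glr_stat from the previous snippet for $Z_t$, and using $\beta_t(\delta) = \log(t/\delta)$ as a stand-in for the threshold of the theorem below. The helper optimal_allocation, which should return $\alpha^*(\hat\nu(t))$, is a hypothetical black box: computing it exactly means solving the max-min optimization problem from the lower bound.

    import numpy as np

    def track_and_stop(sample, k, delta, optimal_allocation, max_rounds=10**6):
        """Sketch of Algorithm 20. `sample(i)` draws a reward from arm i;
        `optimal_allocation(mu_hat)` returns an estimate of alpha*(nu),
        a probability vector over the k arms (hypothetical helper)."""
        T = np.zeros(k)        # pull counts T_i(t)
        S = np.zeros(k)        # reward sums, so mu_hat = S / T
        for i in range(k):     # choose each arm once
            S[i] += sample(i); T[i] += 1
        t = k
        while t < max_rounds:
            mu_hat = S / T
            if glr_stat(mu_hat, T) >= np.log(t / delta):  # Z_t >= beta_t(delta): stop
                return int(np.argmax(mu_hat)), t          # recommend i*(nu_hat), tau = t
            if T.min() <= np.sqrt(t):                     # forced exploration
                a = int(np.argmin(T))
            else:                                         # track t * alpha_hat - T
                a = int(np.argmax(t * optimal_allocation(mu_hat) - T))
            S[a] += sample(a); T[a] += 1
            t += 1
        return int(np.argmax(S / T)), t

Note the two branches mirror lines 4 to 8 of Algorithm 20: the forced-exploration branch guarantees every arm is pulled roughly $\sqrt{t}$ times, which is what lets $\hat\alpha(t)$ converge to $\alpha^*(\nu)$.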
Theorem 33.6. Let $(\pi, \tau)$ be the policy and stopping rule of Algorithm 20. There
exists a choice of $\beta_t(\delta)$ such that Algorithm 20 is sound and for all $\nu \in \mathcal{E}$ with