Bandit Algorithms

33.2 Best-arm identification with a fixed confidence

If the inequality is tight, then we might guess that a reasonable stopping rule is the first round $t$ when
$$\inf_{\nu' \in \mathcal{E}_{\mathrm{alt}}(\nu)} \sum_{i=1}^k T_i(t)\, D(\nu_i, \nu_i') \gtrsim \log\left(\frac{1}{\delta}\right)\,.$$


There are two problems: (a) $\nu$ is unknown, so the expression cannot be evaluated, and (b) we have replaced the expected number of pulls with the actual number of pulls. Still, let us persevere. To deal with the first problem we can try replacing $\nu$ by the Gaussian bandit environment with mean vector $\hat\mu(t)$, which we denote by $\hat\nu(t)$. Then let


$$Z_t = \inf_{\nu' \in \mathcal{E}_{\mathrm{alt}}(\hat\nu(t))} \sum_{i=1}^k T_i(t)\, D(\hat\nu_i(t), \nu_i') = \frac{1}{2} \inf_{\mu' \in \mathcal{E}_{\mathrm{alt}}(\hat\mu(t))} \sum_{i=1}^k T_i(t)\left(\hat\mu_i(t) - \mu_i'\right)^2\,.$$
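For unit-variance Gaussian arms the infimum defining $Z_t$ has a simple closed form: the nearest alternative in which arm $j$ overtakes the empirically best arm $\hat\imath$ moves both means to their $T$-weighted average, contributing $\tfrac{1}{2}\,\tfrac{T_{\hat\imath}(t)T_j(t)}{T_{\hat\imath}(t)+T_j(t)}\,(\hat\mu_{\hat\imath}(t)-\hat\mu_j(t))^2$, and $Z_t$ is the minimum of this over $j \ne \hat\imath$. The following is a minimal sketch of that computation; the function name `gaussian_Z` and the NumPy interface are our choices, not part of the text.

```python
import numpy as np

def gaussian_Z(mu_hat, T):
    """Closed-form Z_t for unit-variance Gaussian arms.

    mu_hat : empirical means, shape (k,)
    T      : number of pulls of each arm, shape (k,)
    """
    best = int(np.argmax(mu_hat))            # empirically best arm
    Z = np.inf
    for j in range(len(mu_hat)):
        if j == best:
            continue
        gap = mu_hat[best] - mu_hat[j]
        # cost of pushing arms `best` and j to their T-weighted average
        Z = min(Z, 0.5 * T[best] * T[j] / (T[best] + T[j]) * gap ** 2)
    return Z
```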

We will show that there exists a choice of $\beta_t(\delta) \approx \log(t/\delta)$ such that if $\tau = \min\{t : Z_t \ge \beta_t(\delta)\}$, then the empirically optimal arm at time $\tau$ is the best arm with probability at least $1-\delta$. But will this policy stop early enough? As we remarked earlier, if the policy is to match the lower bound it should play arm $i$ approximately in proportion to $\alpha^*_i(\nu)$. This suggests estimating $\alpha^*(\nu)$ by $\hat\alpha(t) = \alpha^*(\hat\nu(t))$ and then playing the arm for which $t\hat\alpha_i(t) - T_i(t)$ is maximized. If $\hat\alpha(t)$ is inaccurate, then perhaps the samples collected will not allow the algorithm to improve its estimates. To overcome this last challenge the policy includes enough forced exploration to ensure that eventually $\hat\alpha(t)$ converges to $\alpha^*(\nu)$ with high probability. Combining all these ideas leads to the Track-and-Stop policy (Algorithm 20).


1: Input: $\delta$ and $\beta_t(\delta)$
2: Choose each arm once
3: while $Z_t \le \beta_t(\delta)$ do
4:   if $\min_{i \in [k]} T_i(t-1) \le \sqrt{t}$ then
5:     Choose $A_t = \operatorname{argmin}_{i \in [k]} T_i(t-1)$
6:   else
7:     Choose $A_t = \operatorname{argmax}_{i \in [k]} \left(t\,\hat\alpha^*_i(t-1) - T_i(t-1)\right)$
8:   end if
9:   Observe $X_t$, update statistics, increment $t$
10: end while
11: return $A_{n+1} = i^*(\hat\nu(t))$ and $\tau = t$
Algorithm 20: Track-and-Stop.
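As an illustration of the structure of Algorithm 20 (not a reference implementation), the loop might be written as follows for unit-variance Gaussian arms. The sampling interface `env(i)`, the threshold function `beta(t, delta)` and the helper `optimal_allocation(mu_hat)` (which would solve the max-min problem defining $\alpha^*(\hat\nu(t))$) are hypothetical names of ours; `gaussian_Z` is the sketch given earlier.

```python
import numpy as np

def track_and_stop(env, k, delta, beta, optimal_allocation, max_rounds=10**6):
    """Sketch of the Track-and-Stop loop for unit-variance Gaussian arms."""
    S = np.zeros(k)                  # sum of rewards per arm
    T = np.zeros(k)                  # number of pulls per arm
    for i in range(k):               # choose each arm once
        S[i] += env(i)
        T[i] += 1
    t = k
    while t < max_rounds:
        mu_hat = S / T
        if gaussian_Z(mu_hat, T) > beta(t, delta):
            break                                   # stopping rule: Z_t >= beta_t(delta)
        if T.min() <= np.sqrt(t):
            A = int(np.argmin(T))                   # forced exploration
        else:
            alpha_hat = optimal_allocation(mu_hat)  # estimate of alpha*(nu)
            A = int(np.argmax(t * alpha_hat - T))   # tracking step
        S[A] += env(A)
        T[A] += 1
        t += 1
    return int(np.argmax(S / T)), t                 # recommended arm and stopping time tau
```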

Theorem 33.6. Let $(\pi, \tau)$ be the policy and stopping rule of Algorithm 20. There exists a choice of $\beta_t(\delta)$ such that Algorithm 20 is sound and for all $\nu \in \mathcal{E}$ with
