Bandit Algorithms

33.3 Best-arm identification with a budget 385

33.3 Best-arm identification with a budget

In the fixed-budget variant of best-arm identification the learner is given the horizonnand should choose a policyπ= (πt)nt=1+1with the objective of minimizing the probability thatAn+1is suboptimal.

The constraint on the horizon rather than the confidence level makes this setting a bit more nuanced than the fixed confidence setting and the results are not as clean.

A naive option would be to use the uniform exploration policy, but as discussed in Section 33.1 this approach leads to poor results when the suboptimality gaps are not close. To overcome this problem the sequential halving algorithm divides the budget intoL=dlog 2 (k)ephases. In the first phase the algorithm chooses each arm equally often. The bottom half of the arms are then eliminated and the process is repeated.

1:Input nandk 2:SetL=dlog 2 (k)eandA 1 = [k]. 3:for`= 1,...,Ldo 4: LetT`=

⌊

n L|A`|

⌋

.

5: Choose each arm inA`exactlyT`times 6: For eachi∈A`computeμˆ`ias the empirical mean of armibased on the lastT`samples 7: LetA`+1contain the topd|A`|/ 2 earms inA` 8:end for 9:returnAn+1as the arm inAL+1 Algorithm 21:Sequential halving.

Theorem33.10.Ifν∈ESGk(1)has mean vectorμ=μ(ν)andμ 1 ≥···≥μkand πis sequential halving, then

Pνπ(∆An+1>0)≤3 log 2 (k) exp

(

−

n 16 H 2 (μ) log 2 (k)

)

,

whereH 2 (μ) = maxi∈[k]∆i (^2) i.
The assumption on the ordering of the means is only needed for the clean
definition ofH 2 , which would otherwise be defined by permuting the arms. The
algorithm is completely symmetric. In Exercise 33.9 we guide you through the
proof of Theorem 33.10.
The quantity H 2 (μ) looks a bit unusual, but arises naturally in the
analysis. It is related to a more familiar quantity as follows. DefineH 1 (μ) =

Bandit Algorithms

33.3 Best-arm identification with a budget

⌊

⌋

.

(

−

)

,

Get our desktop app

Company

Features

Documentation

Resources