Bandit Algorithms

(Jeff_L) #1
33.3 Best-arm identification with a budget 385

33.3 Best-arm identification with a budget


In the fixed-budget variant of best-arm identification the learner is given the
horizonnand should choose a policyπ= (πt)nt=1+1with the objective of minimizing
the probability thatAn+1is suboptimal.

The constraint on the horizon rather than the confidence level makes this
setting a bit more nuanced than the fixed confidence setting and the results
are not as clean.

A naive option would be to use the uniform exploration policy, but as discussed
in Section 33.1 this approach leads to poor results when the suboptimality gaps
are not close. To overcome this problem the sequential halving algorithm divides
the budget intoL=dlog 2 (k)ephases. In the first phase the algorithm chooses
each arm equally often. The bottom half of the arms are then eliminated and the
process is repeated.

1:Input nandk
2:SetL=dlog 2 (k)eandA 1 = [k].
3:for`= 1,...,Ldo
4: LetT`=


n
L|A`|


.


5: Choose each arm inA`exactlyT`times
6: For eachi∈A`computeμˆ`ias the empirical mean of armibased on the
lastT`samples
7: LetA`+1contain the topd|A`|/ 2 earms inA`
8:end for
9:returnAn+1as the arm inAL+1
Algorithm 21:Sequential halving.

Theorem33.10.Ifν∈ESGk(1)has mean vectorμ=μ(ν)andμ 1 ≥···≥μkand
πis sequential halving, then

Pνπ(∆An+1>0)≤3 log 2 (k) exp

(



n
16 H 2 (μ) log 2 (k)

)


,


whereH 2 (μ) = maxi∈[k]∆i (^2) i.
The assumption on the ordering of the means is only needed for the clean
definition ofH 2 , which would otherwise be defined by permuting the arms. The
algorithm is completely symmetric. In Exercise 33.9 we guide you through the
proof of Theorem 33.10.
The quantity H 2 (μ) looks a bit unusual, but arises naturally in the
analysis. It is related to a more familiar quantity as follows. DefineH 1 (μ) =

Free download pdf