Bandit Algorithms


9 The Upper Confidence Bound Algorithm: Minimax Optimality


We proved that the variants of UCB analyzed in the last two chapters have a
worst-case regret of $R_n = O(\sqrt{kn \log(n)})$. The factor of $\sqrt{\log(n)}$
can be removed by modifying the confidence level of the algorithm. The directly
named Minimax Optimal Strategy in the Stochastic case (MOSS) algorithm was the
first to make this modification and is presented below. MOSS again depends on
prior knowledge of the horizon $n$, a requirement that may be relaxed as we
explain in the notes.

The term minimax is used because, except for constant factors, the worst-case
bound proven in this chapter cannot be improved on by any algorithm.
The lower bounds are deferred to Part IV.

9.1 The MOSS algorithm


Algorithm 7 shows the pseudocode of MOSS, which is again an instance of the
UCB family. The main novelty is that the confidence level is chosen based on the
number of plays of the individual arms, as well as $n$ and $k$.

1: Input $n$ and $k$
2: Choose each arm once
3: Subsequently choose
$$A_t = \operatorname{argmax}_i \left( \hat\mu_i(t-1) + \sqrt{\frac{4}{T_i(t-1)} \log^+\!\left(\frac{n}{k\,T_i(t-1)}\right)} \right),$$
where $\log^+(x) = \log \max\{1, x\}$.

Algorithm 7: MOSS.
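The index in Algorithm 7 is straightforward to implement. The following Python sketch is not from the book: the function names, the unit-variance Gaussian reward model, and the simulation loop are assumptions made here purely for illustration.

```python
import numpy as np

def moss_index(mu_hat, T, n, k):
    """MOSS index: empirical mean plus the exploration bonus
    sqrt( (4 / T_i) * log+( n / (k * T_i) ) ), with log+(x) = log(max(1, x))."""
    log_plus = np.log(np.maximum(1.0, n / (k * T)))
    return mu_hat + np.sqrt(4.0 / T * log_plus)

def run_moss(means, n, rng=None):
    """Run MOSS for n rounds on a (hypothetical) Gaussian bandit with unit
    variance and the given arm means; return the realised pseudo-regret."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(means)
    T = np.zeros(k)          # number of plays of each arm
    S = np.zeros(k)          # sum of rewards received from each arm
    regret = 0.0
    best = max(means)
    for t in range(n):
        if t < k:
            a = t            # play each arm once
        else:
            a = int(np.argmax(moss_index(S / T, T, n, k)))
        S[a] += rng.normal(means[a], 1.0)   # 1-subgaussian reward
        T[a] += 1
        regret += best - means[a]
    return regret
```

For instance, `run_moss([0.5, 0.4, 0.3], 10_000)` returns the realised pseudo-regret of a single run on a three-armed instance.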

Theorem 9.1. For any 1-subgaussian bandit, the regret of Algorithm 7 satisfies
$$R_n \le 38\sqrt{kn} + \sum_{i=1}^k \Delta_i.$$
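To get a feel for the scale of the guarantee, one can evaluate the bound of Theorem 9.1 on a hypothetical instance; the horizon, number of arms, and gaps below are chosen only for illustration.

```python
import numpy as np

# Hypothetical instance: 5 arms, horizon 10 000, illustrative gaps.
n, k = 10_000, 5
Delta = np.array([0.0, 0.1, 0.2, 0.3, 0.5])
bound = 38 * np.sqrt(k * n) + Delta.sum()   # Theorem 9.1
print(f"worst-case regret bound: {bound:.1f}")   # prints about 8498.2
```

Since the theorem is a worst-case statement, the realised regret of a single simulated run like the one sketched above is usually far below this figure.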