Bandit Algorithms


9 The Upper Confidence Bound Algorithm: Minimax Optimality


We proved that the variants of UCB analyzed in the last two chapters have a
worst-case regret of $R_n = O(\sqrt{kn \log(n)})$. The factor of $\sqrt{\log(n)}$
can be removed by modifying the confidence level of the algorithm. The directly
named Minimax Optimal Strategy in the Stochastic case (MOSS) algorithm was the
first to make this modification and is presented below. MOSS again depends on
prior knowledge of the horizon $n$, a requirement that may be relaxed as we
explain in the notes.

The term minimax is used because, except for constant factors, the worst-case
bound proven in this chapter cannot be improved on by any algorithm.
The lower bounds are deferred to Part IV.

9.1 The MOSS algorithm


Algorithm 7 shows the pseudocode of MOSS, which is again an instance of the
UCB family. The main novelty is that the confidence level is chosen based on the
number of plays of the individual arms, as well as $n$ and $k$.

1: Input $n$ and $k$
2: Choose each arm once
3: Subsequently choose
$$A_t = \operatorname{argmax}_i \left( \hat\mu_i(t-1) + \sqrt{\frac{4}{T_i(t-1)} \log^+\!\left(\frac{n}{k\,T_i(t-1)}\right)} \right),$$
where $\log^+(x) = \log \max\{1, x\}$.

Algorithm 7: MOSS.
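The index in Algorithm 7 is straightforward to implement. The following Python sketch is not from the book: the function names, the unit-variance Gaussian reward model, and the simulation loop are assumptions made here purely for illustration.

```python
import numpy as np

def moss_index(mu_hat, T, n, k):
    """MOSS index: empirical mean plus the exploration bonus
    sqrt( (4 / T_i) * log+( n / (k * T_i) ) ), with log+(x) = log(max(1, x))."""
    log_plus = np.log(np.maximum(1.0, n / (k * T)))
    return mu_hat + np.sqrt(4.0 / T * log_plus)

def run_moss(means, n, rng=None):
    """Run MOSS for n rounds on a (hypothetical) Gaussian bandit with unit
    variance and the given arm means; return the realised pseudo-regret."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(means)
    T = np.zeros(k)          # number of plays of each arm
    S = np.zeros(k)          # sum of rewards received from each arm
    regret = 0.0
    best = max(means)
    for t in range(n):
        if t < k:
            a = t            # play each arm once
        else:
            a = int(np.argmax(moss_index(S / T, T, n, k)))
        S[a] += rng.normal(means[a], 1.0)   # 1-subgaussian reward
        T[a] += 1
        regret += best - means[a]
    return regret
```

For instance, `run_moss([0.5, 0.4, 0.3], 10_000)` returns the realised pseudo-regret of a single run on a three-armed instance.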

Theorem 9.1. For any 1-subgaussian bandit, the regret of Algorithm 7 satisfies
$$R_n \le 38\sqrt{kn} + \sum_{i=1}^k \Delta_i.$$
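To get a feel for the scale of the guarantee, one can evaluate the bound of Theorem 9.1 on a hypothetical instance; the horizon, number of arms, and gaps below are chosen only for illustration.

```python
import numpy as np

# Hypothetical instance: 5 arms, horizon 10 000, illustrative gaps.
n, k = 10_000, 5
Delta = np.array([0.0, 0.1, 0.2, 0.3, 0.5])
bound = 38 * np.sqrt(k * n) + Delta.sum()   # Theorem 9.1
print(f"worst-case regret bound: {bound:.1f}")   # prints about 8498.2
```

Since the theorem is a worst-case statement, the realised regret of a single simulated run like the one sketched above is usually far below this figure.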