Bandit Algorithms

This material will be published by Cambridge University Press as Bandit
Algorithms by Tor Lattimore and Csaba Szepesvari. This pre-publication version
is free to view and download for personal use only. Not for re-distribution, sale,
or use in derivative works. © Tor Lattimore and Csaba Szepesvari 2017.
The latest version is available at http://banditalgs.com. Feedback on any
aspect is very welcome: [email protected]

8 The Upper Confidence Bound Algorithm: Asymptotic Optimality


The algorithm analyzed in the previous chapter is not anytime: its confidence
level depends on the horizon n, which must be known in advance. This shortcoming
is resolved via a slight modification of the algorithm and a refinement of the
analysis. The improved analysis leads to constant factors in the dominant
logarithmic term that match a lower bound provided later in Chapter 16.

8.1 Asymptotically optimal UCB


The algorithm studied is shown in Algorithm 6. It differs from the one analyzed
in the previous chapter (Algorithm 3) only in the choice of the confidence level,
which is dictated by the regret analysis.

1: Input: k
2: Choose each arm once
3: Subsequently choose
$$A_t = \operatorname{argmax}_i \left( \hat{\mu}_i(t-1) + \sqrt{\frac{2 \log f(t)}{T_i(t-1)}} \right), \quad \text{where } f(t) = 1 + t \log^2(t)$$

Algorithm 6: Asymptotically optimal UCB.
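
As a concrete illustration, here is a minimal Python sketch of Algorithm 6 run on a Gaussian bandit with unit variance. The function name `ucb_asymptotic`, the Gaussian reward model, and the specific means are assumptions made for the example; the algorithm itself is stated for generic 1-subgaussian rewards.

```python
import math
import random


def f(t):
    # Confidence-level schedule of Algorithm 6: f(t) = 1 + t * log^2(t)
    return 1.0 + t * math.log(t) ** 2


def ucb_asymptotic(means, n, seed=0):
    """Run Algorithm 6 for n rounds on a Gaussian bandit (illustrative model).

    Returns the number of plays T_i(n) of each arm.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k    # T_i(t-1): number of times arm i has been played
    sums = [0.0] * k    # cumulative reward collected from arm i

    for t in range(1, n + 1):
        if t <= k:
            a = t - 1   # line 2: choose each arm once
        else:
            # line 3: empirical mean plus exploration bonus
            # sqrt(2 log f(t) / T_i(t-1))
            a = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(f(t)) / counts[i]),
            )
        x = rng.gauss(means[a], 1.0)  # 1-subgaussian reward
        counts[a] += 1
        sums[a] += x
    return counts


if __name__ == "__main__":
    # The optimal (first) arm should receive the vast majority of plays.
    print(ucb_asymptotic([0.5, 0.0], n=10_000))
```

Note that the index depends on t only through f(t), so the algorithm needs no knowledge of the horizon; this is exactly what makes it anytime.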

The regret bound for Algorithm 6 is more complicated than the bound for
Algorithm 3 (see Theorem 7.1). The dominant terms in the two results have the
same order, but the gain here is that the leading constant, which governs the
asymptotic rate of growth of the regret, is smaller.

Theorem 8.1. For any 1-subgaussian bandit, the regret of Algorithm 6 satisfies

$$R_n \le \sum_{i : \Delta_i > 0} \inf_{\varepsilon \in (0, \Delta_i)} \Delta_i \left( 1 + \frac{5}{\varepsilon^2} + \frac{2\left( \log f(n) + \sqrt{\pi \log f(n)} + 1 \right)}{(\Delta_i - \varepsilon)^2} \right). \tag{8.1}$$
