8 The Upper Confidence Bound Algorithm: Asymptotic Optimality
The algorithm analyzed in the previous chapter is not anytime. This shortcoming
is resolved via a slight modification and a refinement of the analysis. The improved
analysis leads to constant factors in the dominant logarithmic term that match a
lower bound provided later in Chapter 16.
8.1 Asymptotically optimal UCB
The algorithm studied is shown in Algorithm 6. It differs from the one analyzed
in the previous chapter (Algorithm 3) only in the choice of the confidence level,
which is dictated by the analysis of its regret.
1: Input: $k$
2: Choose each arm once
3: Subsequently choose
$$
A_t = \operatorname{argmax}_i \left( \hat\mu_i(t-1) + \sqrt{\frac{2\log f(t)}{T_i(t-1)}} \right), \quad \text{where } f(t) = 1 + t\log^2(t)
$$
Algorithm 6: Asymptotically optimal UCB.
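As a concrete illustration (not from the book), here is a minimal Python sketch of Algorithm 6. The function names and the pull(arm) reward interface are our assumptions; rewards are modeled as 1-subgaussian (e.g., unit-variance Gaussian).

    import math
    import random

    def ucb_asymptotic(k, n, pull):
        """Sketch of Algorithm 6 (asymptotically optimal UCB).

        k    -- number of arms
        n    -- horizon
        pull -- pull(i) returns a reward for arm i (assumed 1-subgaussian)
        """
        counts = [0] * k    # T_i(t-1): number of plays of arm i so far
        means = [0.0] * k   # empirical means mu_hat_i(t-1)

        for t in range(1, n + 1):
            if t <= k:
                arm = t - 1  # choose each arm once first
            else:
                f_t = 1 + t * math.log(t) ** 2  # f(t) = 1 + t log^2(t)
                # UCB index: empirical mean plus exploration bonus
                arm = max(
                    range(k),
                    key=lambda i: means[i]
                    + math.sqrt(2 * math.log(f_t) / counts[i]),
                )
            reward = pull(arm)
            counts[arm] += 1
            means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
        return means, counts

    # Usage on a toy 3-armed Gaussian bandit (mean vector is hypothetical):
    if __name__ == "__main__":
        mus = [0.5, 0.3, 0.0]
        means, counts = ucb_asymptotic(
            k=3, n=10_000, pull=lambda i: random.gauss(mus[i], 1.0)
        )
        print(counts)  # the optimal arm should receive most of the plays

On a run of this sketch, the suboptimal arms are typically played only on the order of log(n) times, consistent with the bound below.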
The regret bound for Algorithm 6 is more complicated than the bound for
Algorithm 3 (see Theorem 7.1). The dominant terms in the two results have the
same order, but the gain here is that the leading constant, which governs the
asymptotic rate of growth of the regret, is smaller.
Theorem 8.1. For any 1-subgaussian bandit, the regret of Algorithm 6 satisfies
$$
R_n \le \sum_{i:\Delta_i>0} \inf_{\varepsilon\in(0,\Delta_i)} \Delta_i \left( 1 + \frac{5}{\varepsilon^2} + \frac{2\bigl(\log f(n) + \sqrt{\pi\log f(n)} + 1\bigr)}{(\Delta_i-\varepsilon)^2} \right). \tag{8.1}
$$
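Since $f(n) = 1 + n\log^2(n)$, we have $\log f(n) = (1+o(1))\log n$, so for any fixed $\varepsilon$ the third term inside the brackets dominates as $n \to \infty$. The following short consequence of (8.1) (our summary: first take the limit for fixed $\varepsilon$, then let $\varepsilon \to 0$) makes the asymptotic rate explicit:
$$
\limsup_{n\to\infty} \frac{R_n}{\log n}
\le \sum_{i:\Delta_i>0} \frac{2\Delta_i}{(\Delta_i-\varepsilon)^2}
\;\xrightarrow{\varepsilon\to 0}\; \sum_{i:\Delta_i>0} \frac{2}{\Delta_i},
$$
which is the leading constant matched by the lower bound in Chapter 16.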