we use Lemma 8.2 to get
\begin{align*}
\mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\left\{\hat\mu_i(t-1) + \sqrt{\frac{2\log f(t)}{T_i(t-1)}} \geq \mu_1 - \varepsilon \text{ and } A_t = i\right\}\right]
&\leq \mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\left\{\hat\mu_i(t-1) + \sqrt{\frac{2\log f(n)}{T_i(t-1)}} \geq \mu_1 - \varepsilon \text{ and } A_t = i\right\}\right] \\
&\leq \mathbb{E}\left[\sum_{s=1}^n \mathbb{I}\left\{\hat\mu_{is} + \sqrt{\frac{2\log f(n)}{s}} \geq \mu_1 - \varepsilon\right\}\right] \\
&= \mathbb{E}\left[\sum_{s=1}^n \mathbb{I}\left\{\hat\mu_{is} - \mu_i + \sqrt{\frac{2\log f(n)}{s}} \geq \Delta_i - \varepsilon\right\}\right] \\
&\leq 1 + \frac{2}{(\Delta_i - \varepsilon)^2}\left(\log f(n) + \sqrt{\pi \log f(n)} + 1\right)\,.
\end{align*}
The first part of the theorem follows by substituting the results of the previous
two displays into (8.4). The second part follows by choosing $\varepsilon = \log^{-1/4}(n)$ and
taking the limit as $n$ tends to infinity.
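To make the quantity being analysed concrete: the policy plays, in round $t$, an arm maximising the index $\hat\mu_i(t-1) + \sqrt{2\log f(t)/T_i(t-1)}$ with $f(t) = 1 + t\log^2(t)$, treating the index of an untried arm as infinite. The following is a minimal sketch of such a rule, not the book's pseudocode verbatim; the Gaussian reward model, the arm means and the horizon are illustrative assumptions.

```python
# Minimal sketch of a UCB rule using the index from the display above,
# with f(t) = 1 + t * log(t)^2.  Gaussian arms, their means and the
# horizon below are illustrative assumptions, not part of the theorem.
import numpy as np

def f(t):
    # Confidence-level function used in the analysis: f(t) = 1 + t log^2(t).
    return 1.0 + t * np.log(t) ** 2

def ucb(means, n, seed=0):
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k, dtype=int)   # T_i(t-1): number of plays of arm i
    sums = np.zeros(k)                # running reward sums, giving hat(mu)_i(t-1)
    regret = 0.0
    for t in range(1, n + 1):
        if t <= k:
            a = t - 1                 # untried arms have infinite index: play each once
        else:
            index = sums / counts + np.sqrt(2.0 * np.log(f(t)) / counts)
            a = int(np.argmax(index))
        x = rng.normal(means[a], 1.0) # unit-variance Gaussian rewards are 1-subgaussian
        counts[a] += 1
        sums[a] += x
        regret += max(means) - means[a]
    return regret

# Pseudo-regret should grow roughly like (2 / Delta_i) * log(n) per suboptimal arm.
print(ucb([0.5, 0.0], n=10_000))
```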
8.2 Notes
1 The improvement to the constants comes from making the confidence interval
slightly smaller, which is made possible by a more careful analysis. The main
trick is the observation that we do not need to show that $\hat\mu_{1s} \geq \mu_1$ for all $s$
with high probability, but instead that $\hat\mu_{1s} \geq \mu_1 - \varepsilon$ for small $\varepsilon$.
2 The choice of $f(t) = 1 + t\log^2(t)$ looks quite odd. As we pointed out in the
proof, things would not have gone through had we chosen $f(t) = t$. With a
slightly messier calculation we could have chosen $f(t) = t\log^\alpha(t)$ for any $\alpha > 0$.
If the rewards are actually Gaussian, then a more careful concentration analysis
allows one to choose $f(t) = t$ or even some slightly slower-growing function
[Katehakis and Robbins, 1995, Lattimore, 2016a, Garivier et al., 2016b].
3 The asymptotic regret is often indicative of finite-time performance. The reader
is advised to be cautious, however. The lower-order terms obscured by the
asymptotics can be dominant in all practical regimes, and problems can exhibit
phase transitions where the asymptotic behavior differs markedly from that in
intermediate regimes; the short calculation after these notes illustrates the first point.
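To see how far the finite-time bound can sit above its asymptotic leading term, one can evaluate the last display of the proof at $\varepsilon = \log^{-1/4}(n)$ and compare it with the leading term $(2/\Delta_i^2)\log n$. The gap $\Delta$ and the horizons in the snippet below are arbitrary illustrative choices.

```python
# Rough numerical illustration of Note 3: the finite-time bound from the last
# display of the proof, evaluated at eps = log(n)^(-1/4), versus its asymptotic
# leading term (2 / Delta^2) * log(n).  Delta and the horizons are illustrative.
import math

def finite_time_bound(n, delta):
    eps = math.log(n) ** (-0.25)
    log_f = math.log(1.0 + n * math.log(n) ** 2)   # log f(n) with f(n) = 1 + n log^2(n)
    return 1.0 + (2.0 / (delta - eps) ** 2) * (log_f + math.sqrt(math.pi * log_f) + 1.0)

delta = 1.0                                        # must exceed eps for the bound to apply
for n in (10**3, 10**5, 10**7):
    print(n, round(finite_time_bound(n, delta), 1), round(2.0 * math.log(n) / delta**2, 1))
```

Even at these moderate horizons the lower-order terms contribute a large fraction of the bound, which is the caution raised in Note 3.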
8.3 Bibliographic remarks
Lai and Robbins [1985] designed policies for which Eq. (8.2) holds. They also
proved a lower bound showing that no ‘reasonable’ policy can improve on this
bound for any problem, where ‘reasonable’ means that they suffer subpolynomial