
we use Lemma 8.2 to get


\begin{align*}
\mathbb{E}\left[\sum_{t=1}^{n} \mathbb{I}\left\{\hat\mu_i(t-1) + \sqrt{\frac{2\log f(t)}{T_i(t-1)}} \ge \mu_1 - \varepsilon \text{ and } A_t = i\right\}\right]
&\le \mathbb{E}\left[\sum_{t=1}^{n} \mathbb{I}\left\{\hat\mu_i(t-1) + \sqrt{\frac{2\log f(n)}{T_i(t-1)}} \ge \mu_1 - \varepsilon \text{ and } A_t = i\right\}\right] \\
&\le \mathbb{E}\left[\sum_{s=1}^{n} \mathbb{I}\left\{\hat\mu_{is} + \sqrt{\frac{2\log f(n)}{s}} \ge \mu_1 - \varepsilon\right\}\right] \\
&= \mathbb{E}\left[\sum_{s=1}^{n} \mathbb{I}\left\{\hat\mu_{is} - \mu_i + \sqrt{\frac{2\log f(n)}{s}} \ge \Delta_i - \varepsilon\right\}\right] \\
&\le 1 + \frac{2}{(\Delta_i - \varepsilon)^2}\left(\log f(n) + \sqrt{\pi\log f(n)} + 1\right)\,.
\end{align*}

The first part of the theorem follows by substituting the results of the previous
two displays into (8.4). The second part follows by choosing ε = log^{−1/4}(n) and
taking the limit as n tends to infinity.
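To see why this choice of ε works, here is a sketch of the limiting calculation, assuming (as suggested by the proof) that the earlier display contributes a term of order 1/ε² while the display above contributes the log f(n) term:
\begin{align*}
\varepsilon_n = \log^{-1/4}(n)
\quad\Longrightarrow\quad
\frac{1}{\varepsilon_n^2} = \sqrt{\log n} = o(\log n)
\quad\text{and}\quad
\frac{2\log f(n)}{(\Delta_i - \varepsilon_n)^2} = \frac{2\log n}{\Delta_i^2}\,(1 + o(1))\,,
\end{align*}
since log f(n) = log(1 + n log^2(n)) = (1 + o(1)) log n and ε_n → 0. Dividing the regret bound by log(n) and letting n tend to infinity therefore leaves only the 2/Δ_i^2 contributions.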


8.2 Notes


1 The improvement to the constants comes from making the confidence interval
slightly smaller, which is made possible by a more careful analysis. The main
trick is the observation that we do not need to show that μ̂_{1s} ≥ μ_1 for all s
with high probability, but instead that μ̂_{1s} ≥ μ_1 − ε for small ε.
2 The choice of f(t) = 1 + t log^2(t) looks quite odd. As we pointed out in the
proof, things would not have gone through had we chosen f(t) = t. With a
slightly messier calculation we could have chosen f(t) = t log^α(t) for any α > 0.
If the rewards are actually Gaussian, then a more careful concentration analysis
allows one to choose f(t) = t or even some slightly slower-growing function
[Katehakis and Robbins, 1995, Lattimore, 2016a, Garivier et al., 2016b]. A small
code sketch of the index with this choice of f appears after these notes.
3 The asymptotic regret is often indicative of finite-time performance. The reader
is advised to be cautious, however. The lower-order terms obscured by the
asymptotics can be dominant in all practical regimes, and problems can exhibit
phase transitions where the asymptotic behavior differs markedly from that in
intermediate regimes.
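The sketch referred to in Note 2: a minimal Python simulation of the index policy μ̂_i(t−1) + sqrt(2 log f(t) / T_i(t−1)) with f(t) = 1 + t log^2(t). The unit-variance Gaussian rewards, the function names and the pseudo-regret bookkeeping are illustrative assumptions, not the book's code.

import math
import random

def f(t):
    # Exploration function discussed in Note 2: f(t) = 1 + t log^2(t).
    return 1 + t * math.log(t) ** 2

def index(mean, count, t):
    # The index mu_hat_i(t-1) + sqrt(2 log f(t) / T_i(t-1)) from the display above.
    return mean + math.sqrt(2 * math.log(f(t)) / count)

def run_ucb(arm_means, n, rng=random):
    # Simulation with unit-variance Gaussian rewards (an assumption made for
    # illustration; the analysis only needs 1-subgaussian noise).
    k = len(arm_means)
    counts = [0] * k      # T_i(t-1): number of plays of arm i so far
    means = [0.0] * k     # mu_hat_i(t-1): empirical mean of arm i so far
    best = max(arm_means)
    regret = 0.0          # pseudo-regret: sum of gaps of the chosen arms
    for t in range(1, n + 1):
        if t <= k:
            a = t - 1     # play each arm once to initialise the means
        else:
            a = max(range(k), key=lambda i: index(means[i], counts[i], t))
        reward = rng.gauss(arm_means[a], 1.0)
        counts[a] += 1
        means[a] += (reward - means[a]) / counts[a]
        regret += best - arm_means[a]
    return regret

print(run_ucb([0.5, 0.0], n=10_000))  # grows roughly like (2 / 0.5^2) log n for large n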

8.3 Bibliographic remarks


Lai and Robbins [1985] designed policies for which Eq. (8.2) holds. They also
proved a lower bound showing that no ‘reasonable’ policy can improve on this
bound for any problem, where ‘reasonable’ means that they suffer subpolynomial