Both Theorem 28.4 and Theorem 28.5 were presented for the oblivious case where $(y_t)_{t=1}^n$ are chosen in advance. This assumption was not used, however, and in fact the bounds continue to hold when $y_t$ is chosen strategically as a function of $a_1, y_1, \ldots, y_{t-1}, a_t$. This is analogous to how the basic regret bound for exponential weights continues to hold in the face of strategic losses. Be cautioned, however, that this result does not carry over immediately to the application of mirror descent to bandits, as discussed in Note 8 at the end of the chapter.
28.3 Online learning for bandits
We now consider the application of mirror descent and follow the regularized leader to bandit problems. As in the previous chapter, the adversary chooses a sequence of vectors $y_1, \ldots, y_n$ with $y_t \in \mathcal{L} \subset \mathbb{R}^d$. In each round the learner chooses $A_t \in \mathcal{A} \subset \mathbb{R}^d$, where $\mathcal{A}$ is convex, and observes $\langle A_t, y_t \rangle$. The regret relative to action $a$ is
\[
R_n(a) = \mathbb{E}\left[\sum_{t=1}^n \langle A_t - a, y_t \rangle\right].
\]
The regret is $R_n = \max_{a \in \mathcal{A}} R_n(a)$. The application of mirror descent and follow the regularized leader to linear bandits requires two new ideas. First, because the learner only observes $\langle A_t, y_t \rangle$, the loss vectors need to be estimated from data, and it is these estimators that the algorithm uses. Second, because estimating $y_t$ is only possible using randomization, the algorithm cannot play the action suggested by mirror descent; instead it plays a distribution over actions with the same mean as the proposed action. Since the losses are linear, the expected additional regret from playing according to this distribution vanishes. The algorithm is summarized in Algorithm 16. We have switched to capital letters because the actions are now randomized.
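To make these two ideas concrete, here is a minimal sketch of the finite-armed special case, where the actions are the standard basis vectors, the iterate is itself a distribution over arms, the potential is the unnormalized negentropy and the loss vector is estimated by importance weighting. This is an illustration under those assumptions, not Algorithm 16 itself, and the function and variable names are ours.

```python
import numpy as np

def simplex_bandit_mirror_descent(loss_matrix, eta, rng=None):
    """Illustrative mirror descent loop for a finite-armed linear bandit.

    The actions are standard basis vectors, so the iterate a_bar is a
    distribution over arms and sampling an arm from it gives an action
    with conditional mean a_bar.  The unobserved entries of y_t are
    handled by importance weighting, which keeps the estimator unbiased.
    With the negentropy potential the mirror descent step reduces to an
    exponential-weights (multiplicative) update.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, k = loss_matrix.shape
    a_bar = np.full(k, 1.0 / k)               # mirror descent iterate
    total_loss = 0.0
    for t in range(n):
        y_t = loss_matrix[t]
        arm = rng.choice(k, p=a_bar)          # play A_t with mean a_bar
        total_loss += y_t[arm]                # only <A_t, y_t> is observed
        y_hat = np.zeros(k)                   # unbiased estimate of y_t
        y_hat[arm] = y_t[arm] / a_bar[arm]
        a_bar = a_bar * np.exp(-eta * y_hat)  # negentropy: exponential weights
        a_bar /= a_bar.sum()                  # normalize back to the simplex
    return total_loss
```

For other potentials and general convex action sets the same template applies, but computing the mirror descent step and choosing a sampling distribution with mean $\bar A_t$ require more care.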
Theorem 28.10. Suppose that Algorithm 16 is run with Legendre potential $F$, convex action set $\mathcal{A} \subset \mathbb{R}^d$ and learning rate $\eta > 0$ such that:

(a) The loss estimators are unbiased: $\mathbb{E}[\hat Y_t \mid \bar A_t] = y_t$ for all $t \in [n]$.
(b) The action set is a subset of the closure of the domain of $F$: $\mathcal{A} \subseteq \operatorname{cl}(\operatorname{dom}(F))$.
Then the regret for either variant of Algorithm 16 is bounded by
\[
R_n(a) \le \mathbb{E}\left[\frac{F(a) - F(\bar A_1)}{\eta} + \sum_{t=1}^n \langle \bar A_t - \bar A_{t+1}, \hat Y_t \rangle - \frac{1}{\eta} \sum_{t=1}^n D(\bar A_{t+1}, \bar A_t)\right].
\]
Furthermore, letting
\[
\tilde A_{t+1} = \operatorname*{argmin}_{a \in \operatorname{dom}(F)} \eta \langle a, \hat Y_t \rangle + D_F(a, \bar A_t)
\]