Bandit Algorithms

28.3 Online learning for bandits 316

Both Theorem 28.4 and Theorem 28.5 were presented for the oblivious case where (yt)nt=1are chosen in advance. This assumption was not used, however, and in fact the bounds continue to hold whenytis chosen strategically as a function ofa 1 ,y 1 ,...,yt− 1 ,at. This is analogous to how the basic regret bound for exponential weights continues to hold in the face of strategic losses. But be cautioned, this result does not carry immediately to the application of mirror descent to bandits as discussed at the end in Note 8.

28.3 Online learning for bandits

We now consider the application of mirror descent and follow the regularized leader to bandit problems. Like in the previous chapter the adversary chooses a sequence of vectorsy 1 ,...,ynwithyt∈ L ⊂Rd. In each round the learner choosesAt∈A⊂RdwhereAis convex and observes〈At,yt〉. The regret relative to actionais

Rn(a) =E

[n ∑

t=1

〈At−a,yt〉

]

.

The regret isRn=maxa∈ARn(a). The application of mirror descent and follow the regularized leader to linear bandits requires two new ideas. First, because the learner only observes〈At,yt〉, the loss vectors need to be estimated from data and it is these estimators that will be used by the algorithm. Because estimation of ytis only possible using randomization, the algorithm cannot play the suggested action of mirror descent, but instead plays a distribution over actions with the same mean as the proposed action. Since the losses are linear, the expected additional regret by playing according to the distribution vanishes. The algorithm is summarized in Algorithm 16. We have switched to capital letters because the actions are now randomized.

Theorem28.10.Suppose that Algorithm 16 is run with Legendre potentialF, convex action setA⊂Rdand learning rateη > 0 such that:

(a) The loss estimators are unbiased:E[Yˆt|A ̄t] =ytfor allt∈[n]. (b)The action set is a subset of the closure of the domain ofF:A⊆cl(dom(F)).

Then the regret for either variant of Algorithm 16 is bounded by

Rn(a)≤E

[

F(a)−F(A ̄ 1 ) η

+

∑n

t=1

〈A ̄t−A ̄t+1,Yˆt〉−

1

η

∑n

t=1

D(A ̄t+1,A ̄t)

]

.

Furthermore, letting

A ̃t+1= argmina∈dom(F)η〈a,Yˆt〉+DF(a,A ̄t)

Bandit Algorithms

28.3 Online learning for bandits

]

.

[

+

1

]

.

Get our desktop app

Company

Features

Documentation

Resources