29.5 Notes
overcome this annoyance is to apply the adversarial analysis on the event that $|Y_t| \le C$ for some constant $C > 0$ that is sufficiently large that this event occurs with high probability. For example, if $\eta_t$ is a standard Gaussian and $\sup_{a \in \mathcal{A}} |\langle a, \theta\rangle| \le 1$, then $C$ may be chosen to be $1 + \sqrt{4\log(n)}$, and the failure event that there exists a $t$ such that $|\langle A_t, \theta\rangle + \eta_t| \ge C$ has probability at most $1/n$ by Theorem 5.3 and a union bound.
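To spell out the union-bound calculation (a sketch; we use the Gaussian tail bound $\mathbb{P}(|\eta_t| \ge x) \le e^{-x^2/2}$, valid for a standard Gaussian and all $x \ge 0$): since $|\langle A_t, \theta\rangle| \le 1$, the failure event at round $t$ requires $|\eta_t| \ge \sqrt{4\log(n)}$, and hence
\begin{align*}
\mathbb{P}\left(\exists t \le n : |\langle A_t, \theta\rangle + \eta_t| \ge C\right)
\le \sum_{t=1}^n \mathbb{P}\left(|\eta_t| \ge \sqrt{4\log(n)}\right)
\le n e^{-2\log(n)} = \frac{1}{n}\,.
\end{align*}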
2 The mirror descent analysis of adversarial linear bandits also works for stochastic bandits. Recall that mirror descent samples $A_t$ from a distribution with conditional mean $\bar{A}_t$, and suppose that $\hat{\theta}_t$ is a conditionally unbiased estimator of $\theta$. Then the regret for a stochastic linear bandit with optimal action $a^*$ can be rewritten as
\begin{align*}
R_n
= \mathbb{E}\left[\sum_{t=1}^n \langle A_t - a^*, \theta\rangle\right]
= \mathbb{E}\left[\sum_{t=1}^n \langle \bar{A}_t - a^*, \theta\rangle\right]
= \mathbb{E}\left[\sum_{t=1}^n \langle \bar{A}_t - a^*, \hat{\theta}_t\rangle\right],
\end{align*}
which is in the standard format needed for the analysis of mirror descent/FTRL; the second equality uses the tower rule and the third the conditional unbiasedness of $\hat{\theta}_t$. In the stochastic setting, however, the covariance of the least squares estimator $\hat{\theta}_t$ will not be the same as in the adversarial setting, which leads to different results. When $\hat{\theta}_t$ is biased, the bias term can be incorporated into the above formula and then bounded separately.
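As a small numerical illustration, here is a sketch of the standard importance-weighted least-squares estimator $\hat{\theta}_t = Q^{-1} A_t Y_t$ with $Q = \mathbb{E}[A_t A_t^\top]$, which is conditionally unbiased whenever $Q$ is invertible. The action set, sampling distribution, and variable names below are illustrative assumptions, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite action set (rows) and an exploration distribution over it.
actions = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
pi = np.array([0.4, 0.4, 0.2])            # sampling distribution of A_t
theta = np.array([0.5, -0.3])             # unknown parameter

# Q = E[A A^T] under pi; it must be invertible for the estimator to exist.
Q = sum(p * np.outer(a, a) for p, a in zip(pi, actions))
Q_inv = np.linalg.inv(Q)

def theta_hat(a, y):
    """Importance-weighted least-squares estimate from a single round."""
    return Q_inv @ a * y

# Monte Carlo check that E[theta_hat] = theta (conditional unbiasedness).
m = 200_000
est = np.zeros(2)
for _ in range(m):
    i = rng.choice(len(actions), p=pi)
    y = actions[i] @ theta + rng.normal()  # Y_t = <A_t, theta> + eta_t
    est += theta_hat(actions[i], y)
print(est / m)  # approximately [0.5, -0.3]
```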
3 Consider a stochastic bandit with $\mathcal{A} = B_2^d$ the unit ball and $Y_t = \langle A_t, \theta\rangle + \eta_t$, where $|Y_t| \le 1$ almost surely and $\|\theta\|_2 \le 1$. Adapting the analysis of the algorithm in Section 28.4 leads to a bound of $R_n = O(d\sqrt{n\log(n)})$. Essentially the only change is the variance calculation, which increases by roughly a factor of $d$. The details of this calculation are left to you in Exercise 29.2. When $\mathcal{A}$ is finite, the analysis of Exp3 with Kiefer–Wolfowitz exploration (Theorem 27.1) leads to an algorithm for which $R_n = O(\sqrt{dn\log(k)})$; a numerical sketch of computing such a design follows this note. For convex $\mathcal{A}$ you can use continuous exponential weights (Section 27.3).
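The Kiefer–Wolfowitz exploration distribution mentioned above can be approximated numerically. Below is a minimal Frank–Wolfe sketch for a finite action set; the function name, stopping rule, and example data are our assumptions, not from the book.

```python
import numpy as np

def kw_design(actions, eps=1e-3, max_iter=10_000):
    """Frank-Wolfe approximation of the Kiefer-Wolfowitz optimal design.

    actions: (k, d) array whose rows span R^d.
    Returns pi with max_a ||a||^2_{V(pi)^{-1}} <= (1 + eps) * d.
    """
    k, d = actions.shape
    pi = np.full(k, 1.0 / k)                      # start from the uniform design
    for _ in range(max_iter):
        V = actions.T @ (pi[:, None] * actions)  # V(pi) = sum_a pi_a a a^T
        Vinv = np.linalg.inv(V)
        g = np.einsum('ij,jk,ik->i', actions, Vinv, actions)  # g_a = a^T V^{-1} a
        j = int(np.argmax(g))
        if g[j] <= (1 + eps) * d:                 # optimal design has max_a g_a = d
            break
        gamma = (g[j] / d - 1) / (g[j] - 1)       # line-search step for log det V
        pi = (1 - gamma) * pi
        pi[j] += gamma
    return pi

# Example: a design over five random actions in R^3.
rng = np.random.default_rng(1)
print(kw_design(rng.normal(size=(5, 3))).round(3))
```

The stopping rule uses the Kiefer–Wolfowitz theorem, which says the optimal design makes $\max_a \|a\|_{V(\pi)^{-1}}^2$ exactly $d$.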
4 You might wonder whether or not an adversarial bandit algorithm is well behaved for stochastic bandits where the model is almost linear. Suppose the loss is nearly linear in the sense that
\begin{align*}
Y_t = \ell(A_t) + \eta_t\,,
\end{align*}
where $\ell(A_t) = \langle A_t, \theta\rangle + \varepsilon(A_t)$ and $\varepsilon : \mathcal{A} \to \mathbb{R}$ is some function with small supremum norm. Because $\varepsilon(A_t)$ depends on the chosen action, it is not possible to write $Y_t = \langle A_t, \theta_t\rangle$ for $\theta_t$ independent of $A_t$. When $\mathcal{A} = B_2^d$ is the unit ball, you will show in Exercise 29.4 that an appropriately tuned instantiation of follow the regularized leader satisfies $R_n = O(d\sqrt{n\log(n)} + \varepsilon n)$, where $\varepsilon = \sup_{a \in \mathcal{A}} \varepsilon(a)$. We currently do not have such a result for other action sets.
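To see where the $\varepsilon n$ term comes from (a sketch, assuming $|\varepsilon(a)| \le \varepsilon$ for all $a \in \mathcal{A}$, with constants absorbed into the $O(\cdot)$), split the regret into a linear part and a misspecification part:
\begin{align*}
R_n = \mathbb{E}\left[\sum_{t=1}^n \big(\ell(A_t) - \ell(a^*)\big)\right]
= \mathbb{E}\left[\sum_{t=1}^n \langle A_t - a^*, \theta\rangle\right]
+ \mathbb{E}\left[\sum_{t=1}^n \big(\varepsilon(A_t) - \varepsilon(a^*)\big)\right]
\le R_n^{\text{lin}} + 2\varepsilon n\,.
\end{align*}
The content of Exercise 29.4 is that the linear part $R_n^{\text{lin}}$ remains $O(d\sqrt{n\log(n)})$ even though the algorithm only observes the misspecified losses.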