29.5 Notes
overcome this annoyance is to apply the adversarial analysis on the event that $|Y_t| \le C$ for some constant $C > 0$ that is sufficiently large that this event occurs with high probability. For example, if $\eta_t$ is a standard Gaussian and $\sup_{a \in \mathcal{A}} |\langle a, \theta\rangle| \le 1$, then $C$ may be chosen to be $1 + \sqrt{4\log(n)}$, and the failure event that there exists a $t$ such that $|\langle A_t, \theta\rangle + \eta_t| \ge C$ has probability at most $1/n$ by Theorem 5.3 and a union bound.
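To spell out the union-bound calculation (a sketch; we use the Gaussian tail bound $\mathbb{P}(|\eta_t| \ge x) \le e^{-x^2/2}$, valid for a standard Gaussian and all $x \ge 0$): since $|\langle A_t, \theta\rangle| \le 1$, the failure event at round $t$ requires $|\eta_t| \ge \sqrt{4\log(n)}$, and hence
\begin{align*}
\mathbb{P}\left(\exists t \le n : |\langle A_t, \theta\rangle + \eta_t| \ge C\right)
\le \sum_{t=1}^n \mathbb{P}\left(|\eta_t| \ge \sqrt{4\log(n)}\right)
\le n e^{-2\log(n)} = \frac{1}{n}\,.
\end{align*}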
2 The mirror descent analysis of adversarial linear bandits also works for stochastic bandits. Recall that mirror descent samples $A_t$ from a distribution with conditional mean $\bar{A}_t$, and suppose that $\hat{\theta}_t$ is a conditionally unbiased estimator of $\theta$. Then the regret for a stochastic linear bandit with optimal action $a^*$ can be rewritten as
\begin{align*}
R_n
= \mathbb{E}\left[\sum_{t=1}^n \langle A_t - a^*, \theta\rangle\right]
= \mathbb{E}\left[\sum_{t=1}^n \langle \bar{A}_t - a^*, \theta\rangle\right]
= \mathbb{E}\left[\sum_{t=1}^n \langle \bar{A}_t - a^*, \hat{\theta}_t\rangle\right],
\end{align*}
which is in the standard format needed for the analysis of mirror descent/FTRL; the second equality uses the tower rule and the third the conditional unbiasedness of $\hat{\theta}_t$. In the stochastic setting, however, the covariance of the least squares estimator $\hat{\theta}_t$ will not be the same as in the adversarial setting, which leads to different results. When $\hat{\theta}_t$ is biased, the bias term can be incorporated into the above formula and then bounded separately.
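As a small numerical illustration, here is a sketch of the standard importance-weighted least-squares estimator $\hat{\theta}_t = Q^{-1} A_t Y_t$ with $Q = \mathbb{E}[A_t A_t^\top]$, which is conditionally unbiased whenever $Q$ is invertible. The action set, sampling distribution, and variable names below are illustrative assumptions, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite action set (rows) and an exploration distribution over it.
actions = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
pi = np.array([0.4, 0.4, 0.2])            # sampling distribution of A_t
theta = np.array([0.5, -0.3])             # unknown parameter

# Q = E[A A^T] under pi; it must be invertible for the estimator to exist.
Q = sum(p * np.outer(a, a) for p, a in zip(pi, actions))
Q_inv = np.linalg.inv(Q)

def theta_hat(a, y):
    """Importance-weighted least-squares estimate from a single round."""
    return Q_inv @ a * y

# Monte Carlo check that E[theta_hat] = theta (conditional unbiasedness).
m = 200_000
est = np.zeros(2)
for _ in range(m):
    i = rng.choice(len(actions), p=pi)
    y = actions[i] @ theta + rng.normal()  # Y_t = <A_t, theta> + eta_t
    est += theta_hat(actions[i], y)
print(est / m)  # approximately [0.5, -0.3]
```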
3 Consider a stochastic bandit with $\mathcal{A} = B_2^d$ the unit ball and $Y_t = \langle A_t, \theta\rangle + \eta_t$, where $|Y_t| \le 1$ almost surely and $\|\theta\|_2 \le 1$. Adapting the analysis of the algorithm in Section 28.4 leads to a bound of $R_n = O(d\sqrt{n\log(n)})$. Essentially the only change is the variance calculation, which increases by roughly a factor of $d$. The details of this calculation are left to you in Exercise 29.2. When $\mathcal{A}$ is finite, the analysis of Exp3 with Kiefer–Wolfowitz exploration (Theorem 27.1) leads to an algorithm for which $R_n = O(\sqrt{dn\log(k)})$; a numerical sketch of computing such a design follows this note. For convex $\mathcal{A}$ you can use continuous exponential weights (Section 27.3).
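The Kiefer–Wolfowitz exploration distribution mentioned above can be approximated numerically. Below is a minimal Frank–Wolfe sketch for a finite action set; the function name, stopping rule, and example data are our assumptions, not from the book.

```python
import numpy as np

def kw_design(actions, eps=1e-3, max_iter=10_000):
    """Frank-Wolfe approximation of the Kiefer-Wolfowitz optimal design.

    actions: (k, d) array whose rows span R^d.
    Returns pi with max_a ||a||^2_{V(pi)^{-1}} <= (1 + eps) * d.
    """
    k, d = actions.shape
    pi = np.full(k, 1.0 / k)                      # start from the uniform design
    for _ in range(max_iter):
        V = actions.T @ (pi[:, None] * actions)  # V(pi) = sum_a pi_a a a^T
        Vinv = np.linalg.inv(V)
        g = np.einsum('ij,jk,ik->i', actions, Vinv, actions)  # g_a = a^T V^{-1} a
        j = int(np.argmax(g))
        if g[j] <= (1 + eps) * d:                 # optimal design has max_a g_a = d
            break
        gamma = (g[j] / d - 1) / (g[j] - 1)       # line-search step for log det V
        pi = (1 - gamma) * pi
        pi[j] += gamma
    return pi

# Example: a design over five random actions in R^3.
rng = np.random.default_rng(1)
print(kw_design(rng.normal(size=(5, 3))).round(3))
```

The stopping rule uses the Kiefer–Wolfowitz theorem, which says the optimal design makes $\max_a \|a\|_{V(\pi)^{-1}}^2$ exactly $d$.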
4 You might wonder whether or not an adversarial bandit algorithm is well behaved for stochastic bandits where the model is almost linear. Suppose the loss is nearly linear in the sense that
\begin{align*}
Y_t = \ell(A_t) + \eta_t\,,
\end{align*}
where $\ell(A_t) = \langle A_t, \theta\rangle + \varepsilon(A_t)$ and $\varepsilon : \mathcal{A} \to \mathbb{R}$ is some function with small supremum norm. Because $\varepsilon(A_t)$ depends on the chosen action, it is not possible to write $Y_t = \langle A_t, \theta_t\rangle$ for $\theta_t$ independent of $A_t$. When $\mathcal{A} = B_2^d$ is the unit ball, you will show in Exercise 29.4 that an appropriately tuned instantiation of follow the regularized leader satisfies $R_n = O(d\sqrt{n\log(n)} + \varepsilon n)$, where $\varepsilon = \sup_{a \in \mathcal{A}} \varepsilon(a)$. We currently do not have such a result for other action sets.
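To see where the $\varepsilon n$ term comes from (a sketch, assuming $|\varepsilon(a)| \le \varepsilon$ for all $a \in \mathcal{A}$, with constants absorbed into the $O(\cdot)$), split the regret into a linear part and a misspecification part:
\begin{align*}
R_n = \mathbb{E}\left[\sum_{t=1}^n \big(\ell(A_t) - \ell(a^*)\big)\right]
= \mathbb{E}\left[\sum_{t=1}^n \langle A_t - a^*, \theta\rangle\right]
+ \mathbb{E}\left[\sum_{t=1}^n \big(\varepsilon(A_t) - \varepsilon(a^*)\big)\right]
\le R_n^{\text{lin}} + 2\varepsilon n\,.
\end{align*}
The content of Exercise 29.4 is that the linear part $R_n^{\text{lin}}$ remains $O(d\sqrt{n\log(n)})$ even though the algorithm only observes the misspecified losses.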