
special cases, however, the projection step can be written in closed form or
efficiently approximated.
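A concrete special case: for mirror descent on the probability simplex with the unnormalized negentropy potential, the Bregman projection is simply a renormalization. The following minimal sketch illustrates this closed form; the function name md_step_negentropy, the step size and the toy loss vector are illustrative assumptions, not part of the text.

```python
import numpy as np

def md_step_negentropy(x, grad, eta):
    """One mirror descent step on the probability simplex with the
    unnormalized negentropy potential F(x) = sum_a (x_a log x_a - x_a).
    The mirror map is log(x), so the unconstrained update is multiplicative
    and the Bregman (KL) projection back onto the simplex is a plain
    renormalization -- the kind of closed form alluded to above."""
    y = x * np.exp(-eta * grad)          # step in the dual (mirror) space
    return y / y.sum()                   # closed-form projection onto the simplex

# Toy usage: repeated steps against a fixed linear loss <x, g>.
x = np.full(5, 0.2)
g = np.array([0.3, 0.1, 0.9, 0.5, 0.2])
for _ in range(100):
    x = md_step_negentropy(x, g, eta=0.1)
print(x.round(3))                        # mass concentrates where the loss is smallest
```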
2 We saw that mirror descent with a carefully chosen potential function achieves $O(\sqrt{dn\log(n)})$ regret on the $\ell_2$-ball. On the $\ell_\infty$-ball (hypercube) the optimal regret is $O(d\sqrt{n})$. Interestingly, as $n$ tends to infinity the optimal dependence on the dimension for $\mathcal{A} = B_p^d = \{x \in \mathbb{R}^d : \|x\|_p \le 1\}$ is either $d$ or $\sqrt{d}$, with a complete classification given by Bubeck et al. [2018].
3 Adversarial linear bandits with $\mathcal{A} = \mathcal{P}_{k-1}$ are essentially equivalent to $k$-armed adversarial bandits. There exists a potential such that the resulting algorithm satisfies $R_n = O(\sqrt{kn})$, which matches the lower bound up to constant factors and shaves a factor of $\sqrt{\log k}$ from the upper bounds presented in Chapters 11 and 12. For more details see Exercise 28.13.
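The note does not name the potential, but one choice known from the literature to achieve $O(\sqrt{kn})$ on the simplex is the $1/2$-Tsallis entropy $F(x) = -2\sum_a \sqrt{x_a}$. The sketch below is a hedged illustration under that assumption rather than the construction of Exercise 28.13: the solver tsallis_ftrl_dist, the fixed learning rate $\eta = 1/\sqrt{n}$ and the Bernoulli environment are our inventions, and the constants are not tuned.

```python
import numpy as np

def tsallis_ftrl_dist(L_hat, eta, iters=100):
    """FTRL step on the probability simplex with the 1/2-Tsallis potential
    F(x) = -2 * sum_a sqrt(x_a).  The first-order conditions give
    x_a = 1 / (eta * (L_hat_a + lam))^2, with lam found by bisection so
    that the coordinates sum to one."""
    k = len(L_hat)
    shift = L_hat - L_hat.min()          # write L_hat_a + lam = shift_a + s with s > 0
    lo, hi = 1e-12, np.sqrt(k) / eta     # sum >= 1 at lo, sum <= 1 at hi
    for _ in range(iters):
        s = 0.5 * (lo + hi)
        if np.sum(1.0 / (eta * (shift + s)) ** 2) > 1.0:
            lo = s
        else:
            hi = s
    x = 1.0 / (eta * (shift + s)) ** 2
    return x / x.sum()                   # normalize away residual bisection error

# Bandit loop with importance-weighted loss estimates (Bernoulli losses assumed).
rng = np.random.default_rng(0)
k, n = 5, 10_000
eta = 1.0 / np.sqrt(n)                   # assumed tuning; constants not optimized
mean_losses = rng.uniform(0.2, 0.8, size=k)
L_hat, pulls = np.zeros(k), np.zeros(k, dtype=int)
for t in range(n):
    x = tsallis_ftrl_dist(L_hat, eta)
    a = rng.choice(k, p=x)
    y = rng.binomial(1, mean_losses[a])  # only the played arm's loss is observed
    L_hat[a] += y / x[a]                 # unbiased importance-weighted estimate
    pulls[a] += 1
print(mean_losses.round(2), pulls)       # most pulls should go to the smallest mean
```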
4 Most of the bounds proven for adversarial bandits have a worst-case flavor. The tools in this chapter can often be applied to prove adaptive bounds. In Exercise 28.12 you will analyze a simple algorithm for $k$-armed adversarial bandits for which
\[
R_n = O\left(\sqrt{k\left(1 + \min_{a\in[k]}\sum_{t=1}^n y_{ta}\right)\log\left(\frac{n}{k}\right)}\right).
\]
Bounds of this kind are called first-order bounds [Allenberg et al., 2006, Abernethy et al., 2012, Neu, 2015b, Wei and Luo, 2018]. The $\log(n/k)$ term can be improved to $\log(k)$ using a more sophisticated algorithm/analysis.
5 Both mirror descent and follow the regularized leader depend on the potential
function. Currently there is no characterization of exactly what this potential
should be or how to find it. At least in the full information setting there are
quite general universality results showing that if a certain regret is achievable
by some algorithm, then that same regret is nearly achievable by mirror descent
with some potential [Srebro et al., 2011]. In practice this result is not useful for
constructing new potential functions, however. There have been some attempts
to develop ‘universal’ potential functions that exhibit nice behavior for any
action set [Bubeck et al., 2015b, and others]. These can be useful, but as yet
we do not know precisely what properties are crucial, especially in the bandit
case.
6 When the horizon is unknown the learning rate cannot be tuned ahead of time.
One option is to apply the doubling trick. A more elegant solution is to use a
decreasing schedule of learning rates. This requires an adaptation of the proofs
of Theorems 28.4 and 28.5, which we outline in Exercises 28.9 and 28.10. This
is one situation where mirror descent and follow the regularized leader are not
the same and where the latter algorithm is usually to be preferred.
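To make the decreasing-schedule option concrete, here is a minimal anytime sketch for $k$-armed adversarial bandits: follow the regularized leader with the negentropy potential (Exp3-style weights) and the schedule $\eta_t = \sqrt{\log(k)/(kt)}$, which needs no knowledge of the horizon. The schedule, the constants and the helper exp3_anytime are assumptions for illustration; Exercises 28.9 and 28.10 develop the precise tuning and analysis.

```python
import numpy as np

def exp3_anytime(loss_fn, k, n, rng):
    """FTRL with the negentropy potential (Exp3-style weights) and the
    decreasing schedule eta_t = sqrt(log(k) / (k * t)); the horizon n is
    only used to decide when to stop, not to tune the learning rate.
    loss_fn(t, a) returns the loss of arm a in round t (values in [0, 1])."""
    L_hat = np.zeros(k)                          # cumulative importance-weighted losses
    total = 0.0
    for t in range(1, n + 1):
        eta = np.sqrt(np.log(k) / (k * t))       # decreasing learning rate
        w = np.exp(-eta * (L_hat - L_hat.min())) # shift for numerical stability
        x = w / w.sum()
        a = rng.choice(k, p=x)
        y = loss_fn(t, a)
        total += y
        L_hat[a] += y / x[a]                     # unbiased importance-weighted estimate
    return total

rng = np.random.default_rng(1)
means = np.array([0.5, 0.45, 0.6, 0.55])
print(exp3_anytime(lambda t, a: float(rng.random() < means[a]), k=4, n=5000, rng=rng))
```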
7 In much of the literature the potential is chosen in such a way that mirror descent
and follow the regularized leader are the same algorithm. For historical reasons
the name mirror descent is more commonly used in the bandit community. We
encourage the reader to keep both algorithms in mind, since the analysis of
