Bandit Algorithms

19.6 Exercises

This exercise is inspired by the work of Perchet and Rigollet [2013] who focus
on improving the regret bound by adaptive discretization when a certain
margin condition holds. There are many variations of the problem of the
previous exercise. For starters, the domain of contexts could be more general;
one may consider higher-order smoothness, or continuous action and context
spaces. What is the role of the context distribution? In some applications,
the context distribution can be estimated for free so perhaps one could as
well assume that the context distribution is known. How to take a known
context distribution into account? (To whet your appetite, if the context
distribution is concentrated on a handful of contexts, the discretization
should perhaps respect which contexts the distribution is concentrated on.)
Instead of discretization, one may also consider function approximation. A
particularly interesting approach that goes beyond mere discretization is due
to Combes et al. [2017] (see also Magureanu et al. [2014]). In these papers
the approach is to derive an asymptotic, instance dependent lower bound,
which is then used to guide the algorithm (much like in the track-and-stop
algorithm in Section 33.2). Here, an open problem is to design algorithms that
are simultaneously near minimax optimal and asymptotically optimal. As
described in Part II, this problem is now settled for finite-armed stochastic
bandits, the only setting in the whole bandit literature for which this can
be said.
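
To make the discretization idea concrete, here is a minimal sketch (not the adaptive scheme of Perchet and Rigollet [2013], which uses successive elimination and refines the partition): the context space, taken to be [0, 1], is split into uniform bins and an independent UCB1 policy is run within each bin. The class name, bin count and reward model below are all illustrative choices, not anything prescribed by the exercise.

```python
import math
import random


class BinnedUCB:
    """Contextual bandit via uniform discretization of the context space [0, 1].

    Illustrative sketch only: each of the `n_bins` intervals keeps its own
    UCB1 statistics, so information is never shared across bins.
    """

    def __init__(self, n_arms, n_bins):
        self.n_arms = n_arms
        self.n_bins = n_bins
        # Per-bin pull counts and empirical means for each arm.
        self.counts = [[0] * n_arms for _ in range(n_bins)]
        self.means = [[0.0] * n_arms for _ in range(n_bins)]
        self.t = [0] * n_bins  # number of rounds seen in each bin

    def _bin(self, context):
        # Map a context in [0, 1] to the index of its interval.
        return min(int(context * self.n_bins), self.n_bins - 1)

    def choose(self, context):
        b = self._bin(context)
        self.t[b] += 1
        # Play each arm once in this bin before using the UCB index.
        for a in range(self.n_arms):
            if self.counts[b][a] == 0:
                return a

        # Standard UCB1 index computed from this bin's statistics only.
        def index(a):
            bonus = math.sqrt(2 * math.log(self.t[b]) / self.counts[b][a])
            return self.means[b][a] + bonus

        return max(range(self.n_arms), key=index)

    def update(self, context, arm, reward):
        b = self._bin(context)
        self.counts[b][arm] += 1
        n = self.counts[b][arm]
        # Incremental update of the empirical mean.
        self.means[b][arm] += (reward - self.means[b][arm]) / n


# Toy usage with a hypothetical reward model: arm 0 is better for small
# contexts, arm 1 for large ones, so the optimal arm varies with the context.
if __name__ == "__main__":
    policy = BinnedUCB(n_arms=2, n_bins=8)
    for _ in range(10_000):
        x = random.random()
        a = policy.choose(x)
        p = 0.7 - 0.4 * x if a == 0 else 0.3 + 0.4 * x
        r = 1.0 if random.random() < p else 0.0
        policy.update(x, a, r)
```

Choosing the number of bins trades off approximation error (coarse bins lump together contexts with different optimal arms) against estimation error (fine bins see few samples each); the adaptive and distribution-aware schemes discussed above are ways of making that trade-off automatically.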
