18.5 Notes

of the regret, but also increases the regret. This is similar to what we have
observed in stochastic bandits. Tuning an algorithm for a restricted environment
class usually allows faster learning, but the resulting algorithms can fail when
interacting with an environment that does not belong to the restricted class.
2 The Exp4 algorithm serves as a tremendous building block for other bandit
problems: by defining your own experts, the algorithm can be adapted to many
settings. An example is the application of Exp4
to nonstationary bandits that we explore in Chapter 31, which is one of the
rare cases where Exp4 can be computed efficiently with a combinatorially large
number of experts. When Exp4 does not have an efficient implementation, it
often provides a good starting place to derive regret bounds without worrying
about computation (for an example, see Exercise 18.5).
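To make the idea of defining your own experts concrete, here is a minimal sketch of the Exp4 loop in Python. The interface (an advice matrix per round), the reward-based importance-weighted estimator and the fixed learning rate eta are simplifications of mine rather than the book's exact formulation; Chapter 18 gives the precise algorithm and how to tune the learning rate.

    import numpy as np

    def exp4(advice_fn, reward_fn, n, M, k, eta, rng=None):
        """Minimal Exp4 sketch (simplified reward-based variant).

        advice_fn(t) -> (M, k) row-stochastic matrix: expert advice in round t.
        reward_fn(t, a) -> reward in [0, 1] of playing action a in round t.
        """
        rng = np.random.default_rng() if rng is None else rng
        q = np.full(M, 1.0 / M)          # exponential-weights distribution over experts
        total_reward = 0.0
        for t in range(n):
            E = advice_fn(t)             # advice matrix E_t supplied by the experts
            p = q @ E                    # induced distribution over the k actions
            a = rng.choice(k, p=p)
            x = reward_fn(t, a)
            total_reward += x
            x_hat = np.zeros(k)          # importance-weighted estimate of the reward vector
            x_hat[a] = x / p[a]
            q = q * np.exp(eta * (E @ x_hat))   # credit each expert with its estimated reward
            q = q / q.sum()
        return total_reward

The only problem-specific ingredient is advice_fn, which is where "your own experts" enter.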
3 The bandits with expert advice framework is clearly more general than
contextual bandits. With the terminology of the bandits with expert advice
framework, the contextual bandit problem arises when the experts are given
by static C → [k] maps.
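As a small illustration of this reduction, a set of static maps can be packaged as experts for the Exp4 sketch above; the helper below and its name are hypothetical.

    import numpy as np

    def advice_from_static_maps(maps, k, context):
        """Expert m puts all of its advice mass on the action maps[m](context)."""
        E = np.zeros((len(maps), k))
        for m, phi in enumerate(maps):
            E[m, phi(context)] = 1.0
        return E

Calling the sketch above with advice_fn = lambda t: advice_from_static_maps(maps, k, contexts[t]) is then Exp4 applied to the contextual bandit problem.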


4 A significant challenge is that a naive implementation of Exp4 has running
time O(M + k) per round, which can be enormous if either M or k is large. In
general there is no solution to this problem, but in some cases the computation
can be reduced significantly. One situation where this is possible is when the
learner has access to an optimization oracle that for any context/reward
sequence returns the expert that would collect the most reward in this sequence
(this is equivalent to solving the offline problem Eq. (18.8)). In Chapter 30
we show how to use an offline optimization oracle to learn efficiently in
combinatorial bandit problems. The idea is to solve a randomly perturbed
optimization problem (leading to the so-called follow-the-perturbed leader class
of algorithms) and then show that the randomness in the outputs provides
sufficient exploration. However, as we shall see there, these algorithms will
have some extra information, which makes estimating the rewards possible.
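The next block is only a rough sketch of the perturbation idea just described, assuming an offline oracle that returns the best policy for a given list of context/reward-estimate pairs. The Gaussian perturbation, the crude reward estimates and all names are placeholders of mine; the actual follow-the-perturbed-leader algorithms and the extra information they need for unbiased reward estimation are the subject of Chapter 30.

    import numpy as np

    def ftpl_with_oracle(oracle, contexts, play, n, k, sigma, rng=None):
        """Rough follow-the-perturbed-leader loop built on an offline oracle.

        oracle(history) -> a policy phi maximizing the total reward on `history`,
            a list of (context, reward_vector) pairs.
        play(t, a) -> observed reward of playing action a in round t.
        """
        rng = np.random.default_rng() if rng is None else rng
        history = []                              # past (context, estimated reward vector) pairs
        for t in range(n):
            # perturb the optimization problem by appending a fake round with random rewards
            fake = (contexts[t], sigma * rng.standard_normal(k))
            phi = oracle(history + [fake])        # the perturbed leader
            a = phi(contexts[t])
            x = play(t, a)
            x_hat = np.zeros(k)
            x_hat[a] = x    # placeholder estimate; the real algorithms need extra information here
            history.append((contexts[t], x_hat))
        return history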
5 In the stochastic contextual bandit problem it is assumed that the
context/reward pairs form a sequence of independent and identically distributed
random variables. Let Φ be a set of functions from C to [k] and suppose the
learner has access to an optimization oracle capable of finding
\[
\operatorname{argmax}_{\phi \in \Phi} \sum_{s=1}^{t} x_{s\phi(c_s)}
\]
for any sequence of reward vectors x_1, ..., x_t and contexts c_1, ..., c_t. A simple
and efficient algorithm that exploits such an oracle is based on explore-then-commit,
which has O(n^{2/3}) regret (Exercise 18.8); a sketch of this idea is given at the end
of this note. There is a more sophisticated algorithm that is still polynomial-time
and for which the regret is about the same as the result in Theorem 18.1
[Agarwal et al., 2014]. The algorithm computes importance-weighted estimates of the
rewards in each round. These are used to estimate the regret of all the experts.
Based on this, a distribution over the experts (with a small support) is computed by
solving a feasibility
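To illustrate the explore-then-commit approach mentioned earlier in this note (Exercise 18.8): explore uniformly at random for m rounds, build importance-weighted estimates of the reward vectors, call the optimization oracle once on the estimated data and commit to the returned policy. The sketch below is only an illustration under these assumptions, with names and estimator details of mine; choosing m of order n^{2/3} is what yields the O(n^{2/3}) regret.

    import numpy as np

    def etc_with_oracle(oracle, contexts, play, n, k, m, rng=None):
        """Explore-then-commit on top of the optimization oracle (illustrative).

        oracle(cs, xs) -> a map phi in Phi maximizing sum_s xs[s][phi(cs[s])].
        play(t, a) -> reward of playing action a in round t.
        m -> number of uniform-exploration rounds.
        """
        rng = np.random.default_rng() if rng is None else rng
        cs, xs = [], []
        for t in range(m):
            a = int(rng.integers(k))
            x = play(t, a)
            x_hat = np.zeros(k)
            x_hat[a] = k * x              # importance weight 1 / (1/k) = k
            cs.append(contexts[t])
            xs.append(x_hat)
        phi = oracle(cs, xs)              # best policy on the estimated data
        total_reward = 0.0
        for t in range(m, n):             # commit to phi for the remaining rounds
            total_reward += play(t, phi(contexts[t]))
        return total_reward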